Google Summer of Code 2017 – Work Product Submission

 

TV Commercial Classification and Optimization of Visual Recognition Pipeline for News Videos

Organization: Red Hen Lab / CCExtractor

My project was officially under the authority of CCExtractor, but my work spanned the interests and scope of both organizations. I contributed code for both organizations.

Table of Contents:

  1. Introduction
  2. TV Commercial Classification
  3. Visual Recognition Pipeline
  4. CCExtractor Improvements
  5. Repository Links
  6. Known Issues / Future Work
  7. Future Collaboration
  8. References / Licences

Introduction

My project’s contributions have been of three different types:-

  1. Creating a TV Commercial Classification system
  2. Optimizing Red Hen’s Visual Recognition Pipeline
  3. Adding various OCR and international support improvements to CCExtractor

All of these are directly helpful to the processing of the NewsScape dataset.

TV Commercial Classification

Red Hen already had a visual-feature-based classification and labeling system for news videos (optimizing it was the second part of my project), but the pipeline did not yet handle TV commercials. I wanted to add a multimodal TV commercial classification system to the pipeline’s capabilities.

Problem and Dataset Description

The problem at hand is classifying an input advertisement video into its product category. This classification of commercials is into the following 23 categories:-

  • 01_alcoholic-drinks-tobacco
  • 02_automotive
  • 03_business-equipment-services
  • 04_consumer-public-services
  • 05_culture-leisure-sport
  • 06_fast-food-outlets-restaurants
  • 07_health-pharmaceuticals
  • 08_household-maintenance-pet-products
  • 09_industrial-agriculture
  • 10_non-alcoholic-drinks
  • 11_publishing-media
  • 12_transport-travel-tourism
  • 13_apparel-clothing-footwear
  • 14_banking
  • 15_confectionery-snacks
  • 16_cosmetics-beauty-products
  • 17_dairy-products-eggs
  • 18_grocery-other-foods
  • 19_home-electronics-home-appliances
  • 20_hygiene-personal-care-products
  • 21_internet
  • 22_public-awareness
  • 23_retail-distribution-rental-companies

The dataset I used is a subset of the Coloribus archive, a thoroughly collected and daily updated advertising archive that is the biggest online collection of creative advertising pieces from all over the world. It is a highly structured database containing information about brands, agencies, people involved, awards and other relevant data, combined with an advanced full-text search engine. I downloaded a few samples from each category and used them to train a convolutional neural network.

Usage Instructions

python process_ad_video.py <path_to_video>

This generates an output of the form “label:probability” for each of the top 5 classes in a list.
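
As a rough illustration of that output format, the sketch below picks the five most probable entries from a vector of class probabilities. The function name top_k_labels and the toy probability values are assumptions made for this example; the actual logic lives in process_ad_video.py and may differ.

```python
def top_k_labels(labels, probs, k=5):
    """Return the k most probable [label, probability] pairs, best first."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [[label, prob] for label, prob in ranked[:k]]


# Toy example with made-up probabilities for six of the categories.
labels = [
    "02_automotive",
    "04_consumer-public-services",
    "06_fast-food-outlets-restaurants",
    "13_apparel-clothing-footwear",
    "22_public-awareness",
    "23_retail-distribution-rental-companies",
]
probs = [0.05, 0.06, 0.50, 0.09, 0.10, 0.20]
top5 = top_k_labels(labels, probs)
print(top5)
```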

Dependencies

  • Python (https://www.python.org/downloads/) : The language of the project. The code has been tested with Python 2.7.8 and 2.7.12. It should work with any recent version of Python 2. Python 3 and any other versions are your own adventure.
  • Caffe (https://github.com/BVLC/caffe) : Neural network framework. Needs to be compiled with Python support (pycaffe). The code has been tested with Caffe rc3, and should work with the GitHub version.
  • FFMpeg (https://github.com/FFmpeg/FFmpeg) : For video processing. The code has been tested with v2.8.2 and v3.1.0, and should work with the GitHub version.

Installation

Simply clone the repository while ensuring all dependencies are correctly installed.

Example Output

Example keyframes, along with the top 5 class probabilities for each input video, are shown below:-

File: kentucky-fried-chickenkfc-chamber-test-360-45192.mp4
Category: 06_fast-food-outlets-restaurants

kfc-example
The top 5 [labels : probabilities] in order are: [['06_fast-food-outlets-restaurants', 0.45], ['23_retail-distribution-rental-companies', 0.10], ['22_public-awareness', 0.09], ['13_apparel-clothing-footwear', 0.08], ['04_consumer-public-services', 0.054]]
File: nike-skates-360-99472.mp4
Category: 13_apparel-clothing-footwear

nike-example
The top 5 [labels : probabilities] in order are: [['13_apparel-clothing-footwear', 0.51708157997511861], ['22_public-awareness', 0.18499881764358048], ['04_consumer-public-services', 0.16289945019604854], ['01_alcoholic-drinks-tobacco', 0.061124437368357168], ['06_fast-food-outlets-restaurants', 0.023659305755164235]]
File: uefa-together-we-play-strong-360-61902.mp4
Category: 05_culture-leisure-sport

uefa-example
The top 5 [labels : probabilities] in order are: [['05_culture-leisure-sport', 0.89002430691967704], ['15_confectionery-snacks', 0.043071130712228789], ['12_transport-travel-tourism', 0.025212416302817134], ['02_automotive', 0.023489738645691184], ['22_public-awareness', 0.0077147928995272754]]
File: corona-the-oceans-360-34498.mp4
Category: 01_alcoholic-drinks-tobacco

corona-example
The top 5 [labels : probabilities] in order are: [['22_public-awareness', 0.5100584880878537], ['01_alcoholic-drinks-tobacco', 0.31181776194454913], ['02_automotive', 0.094240187919956253], ['04_consumer-public-services', 0.024657759607296301], ['05_culture-leisure-sport', 0.02131677328555057]]

Performance Details

I fine-tuned a standard 7-layer deep neural network based on a modified AlexNet architecture, using the Caffe deep learning framework with fairly standard parameters for such a task. The dataset consisted of ad videos in 23 categories, and nearly 8000 training points were supplied to the network during fine-tuning.

Accuracy

The process gives us a best top-1 accuracy of 55.3%, a best top-3 accuracy of 83.2%, and a best top-5 accuracy of 89.75%.

These results are well above chance. A random classifier (uniform guessing over the 23 classes, assuming a balanced test set) would give a top-1 accuracy of 1/23 ≈ 4.35%, a top-3 accuracy of 3/23 ≈ 13.04%, and a top-5 accuracy of 5/23 ≈ 21.74%.
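
The chance figures follow from uniform guessing over 23 classes, assuming a class-balanced test set: a random top-k guess contains the true label with probability k/23. The sketch below reproduces that arithmetic and shows one way a top-k accuracy could be computed from ranked predictions. The helper names and the toy data are illustrative and are not taken from the repository.

```python
NUM_CLASSES = 23


def chance_top_k(k, num_classes=NUM_CLASSES):
    """Probability that k distinct uniform guesses contain the true class."""
    return float(k) / num_classes


def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return float(hits) / len(true_labels)


for k in (1, 3, 5):
    print("chance top-%d: %.2f%%" % (k, 100 * chance_top_k(k)))

# Toy check: two samples, class indices ranked from most to least likely.
ranked = [[6, 23, 22], [22, 1, 2]]
truth = [6, 1]
print(top_k_accuracy(ranked, truth, 1), top_k_accuracy(ranked, truth, 3))
```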

More Documentation on GitHub

Training details and code are available at the following GitHub repo:-
https://github.com/Abhinav95/tv-ad-classification

Visual Recognition Pipeline

The Visual Recognition Pipeline was developed during GSoC 2016 to extract useful visual features from news videos using convolutional neural networks (CNNs), in order to establish a richer automated understanding of the news.

A video demo of the kind of information that can be extracted follows below:-

The aim is to generate such useful annotations from the visual modality for the entire NewsScape dataset.

Building on Shruti Gullapuram’s GSoC Project

The majority of my work this summer focused on optimization strategies for the visual recognition pipeline developed by Shruti Gullapuram during GSoC 2016. The goal was to reduce the overall processing time for an input news video, which stood at around 2.5 to 3 hours per hour of video at the beginning of my project.

Detailed documentation, along with the kind of labels, the process followed and example outputs, of what Shruti achieved during last GSoC can be seen at the following link:-

https://shrutigullapuram.wordpress.com/2016/08/22/gsoc-work-product-submission

The way the pipeline works is shown below:-

news-shot-2

Usage Instructions

This section details the installation and the usage of the system.

Dependencies

Installation

All the required external files and classifier models can be found here:
https://www.dropbox.com/sh/hv811iqnupcusp8/AAA-nn4mYD2LIP2-deK1VUSWa?dl=0
The paths to all external files required by the code can be modified in path_params.py according to the user’s convenience.

Normal Usage

On a CPU, without any additional dependencies, run:-

python ShotClass-01.py <path_to_video>

On a GPU enabled machine (with nvidia-smi), run:-

python ShotClass-02.py <path_to_video>

In both cases, the code generates two output files with the same base name as the video: a .sht file (in the Red Hen piped format) and a .json file in the JSON Lines format.
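
JSON Lines stores one JSON object per line, so the .json output can be read back line by line. A minimal sketch follows; the field names "frame" and "label" and their values are hypothetical placeholders, not the pipeline's actual schema.

```python
import io
import json


def read_json_lines(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for an output file; keys and values are made up.
sample = io.StringIO(
    u'{"frame": 0, "label": "studio"}\n'
    u'\n'
    u'{"frame": 25, "label": "graphic"}\n'
)
records = list(read_json_lines(sample))
print(len(records), records[1]["label"])
```

For a real output file, the same function can be given the object returned by open() on the .json path.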

Usage on HPC

You can process news videos on Case HPC in two ways:

  1. Process a list of videos using -l flag:
    Run  ./manager.sh -l .txt
    .txt contains YYYY-MM-DD_HOUR_NETWORKNAME.mp4 (only basenames of files)
  2. Process a particular day’s worth of news videos using -d flag:
    Run  ./manager.sh -d YYYY/MM/DD

Edit the variable VIDEO_DST in manager.sh to change the path of the processed video files.

Another alternative form of usage on HPC is of the form:-

python ShotClass-03.py <path_to_video>

This variant uses the GPU when one is available and also segments the work logically using SLURM: separate jobs are submitted for different parts of the pipeline so that the available resources are used as effectively as possible.
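
To make the idea of segmented SLURM jobs concrete, here is a hedged sketch that only builds sbatch command lines for two stages, the second of which requests a GPU and waits for the first to succeed. The stage script names and the job ID are hypothetical, and this is not the logic of ShotClass-03.py or manager.sh; only the sbatch flags themselves (--parsable, --gres, --dependency) are standard SLURM options.

```python
def build_sbatch_command(script, needs_gpu=False, after_job_id=None):
    """Build an sbatch argument list for one pipeline stage (nothing is executed)."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints the job ID in a script-friendly form
    if needs_gpu:
        cmd.append("--gres=gpu:1")  # request one GPU for this stage
    if after_job_id is not None:
        # Start only after the earlier stage has finished successfully.
        cmd.append("--dependency=afterok:%s" % after_job_id)
    cmd.append(script)
    return cmd


# Hypothetical stage scripts: keyframe extraction on CPU, classification on GPU.
extract_cmd = build_sbatch_command("extract_keyframes.sh")
classify_cmd = build_sbatch_command(
    "classify_shots.sh", needs_gpu=True, after_job_id=12345
)
print(" ".join(extract_cmd))
print(" ".join(classify_cmd))
```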

GPU Support

Earlier, the pipeline was supposed to be run as one single job on a CPU compute node, and even if the requested node was a GPU node, the capabilities of the GPU would not be used by it. I added the capability to use a GPU if available. If no GPU is available (or detected by the code), we fall back to default CPU execution.
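
The fallback behaviour can be sketched as follows. This is an illustration under assumptions, not the code from the pull request: it treats a successful run of nvidia-smi as evidence of a usable GPU, and the function names are made up for the example. The pycaffe calls in the trailing comment (caffe.set_device, caffe.set_mode_gpu, caffe.set_mode_cpu) are left commented out so that the sketch runs without Caffe installed.

```python
import os
import subprocess


def gpu_available(probe_command=("nvidia-smi",)):
    """Return True if the probe command runs and exits with status 0."""
    try:
        with open(os.devnull, "w") as devnull:
            status = subprocess.call(
                list(probe_command), stdout=devnull, stderr=devnull
            )
        return status == 0
    except OSError:
        # The executable is missing, so assume there is no usable GPU.
        return False


def choose_mode(detector=gpu_available):
    """Pick "gpu" when a GPU is detected, otherwise fall back to "cpu"."""
    return "gpu" if detector() else "cpu"


mode = choose_mode()
print("running in %s mode" % mode)
# With pycaffe installed, the mode would then be applied roughly like this:
#   import caffe
#   if mode == "gpu":
#       caffe.set_device(0)
#       caffe.set_mode_gpu()
#   else:
#       caffe.set_mode_cpu()
```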

The changes to make this happen can be seen at:-
https://github.com/gshruti95/news-shot-classification/pull/2

This is especially useful in speeding up the feature extraction and classification steps that involve deep neural networks (namely anything that involves running a Caffe model). GPU execution speeds up these steps by a large constant factor, which could be anywhere between 20 and 200 times depending on the computational power of the CPU and GPU in question. On my computer, a speedup of nearly 40 times can be observed.

Optimizations for Runtime

A core aim of my project was to cut down the runtime of the system. I briefly explain a few changes I made in order to do this:-

GPU Benchmarks

In an earlier section, I have shown how I added GPU support for more efficient processing. This section takes a look at some of the benchmarks that I have made.

Running 100 iterations 10 times and averaging the times, with a batch size of 10 at test time, gives the following benchmarks for my Nvidia GTX 1050 Ti 4GB GPU and for an Intel Xeon CPU node:-

Model      Avg GPU Mode time (s)   Avg CPU Mode time (s)
AlexNet    7.34                    283.66
ResNet     49.54                   1812.31

We can clearly see a large speedup, roughly 37 to 39 times, when running the model on a GPU as compared to a CPU node.
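
The speedup factors can be worked out directly from the benchmark table; the short sketch below does only that arithmetic.

```python
# Average times in seconds, copied from the benchmark table above.
benchmarks = {
    "AlexNet": {"gpu": 7.34, "cpu": 283.66},
    "ResNet": {"gpu": 49.54, "cpu": 1812.31},
}

speedups = {}
for model in sorted(benchmarks):
    times = benchmarks[model]
    speedups[model] = times["cpu"] / times["gpu"]
    print("%s: GPU is %.1fx faster than CPU" % (model, speedups[model]))
```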

CPU Performance Increase with Intel Caffe

Case HPC has CPU nodes with Intel Xeon processors, and they heavily outnumber the available GPU nodes on the cluster. Thus, it made sense to attempt to optimize the runtime of the pipeline on a CPU node as well.

I set up the pipeline to work with the Intel optimized version of Caffe which led to big gains in efficiency on Intel architecture based CPU nodes. The speedup factor for training was around 10 times and was also pretty good for testing.

A detailed report showing the compiler level optimizations can be seen at:-
https://software.intel.com/sites/default/files/managed/bd/7d/Caffe_optimized_for_IA.pdf

Further Documentation on Red Hen Website

https://sites.google.com/site/distributedlittleredhen/home/the-cognitive-core-research-topics-in-red-hen/visual-recognition-pipeline

CCExtractor Improvements

I have also been contributing to CCExtractor (I am the maintainer for the OCR portions of the code). I pushed a few improvements that improved the caption extraction results by a lot for certain kinds of files. I briefly explain those in this section.

Improving French OCR Results

There were a few DVB files whose subtitle bitmaps had a transparent background. We were getting complete garbage for these, and at first I had no idea why. On investigating, I realized that the transparent background was the cause, which is a well-documented problem for Tesseract OCR. To overcome this, I converted the subtitle bitmap into a non-transparent grayscale image before performing OCR. This instantly improved the results.
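
The idea of flattening a transparent bitmap can be illustrated in a few lines. The actual fix is in CCExtractor's C code; the Python sketch below only shows the general technique, and the white background and the Rec. 601 luma weights are assumptions chosen for the example.

```python
def flatten_to_gray(pixel, background=255):
    """Composite one RGBA pixel over a solid background; return 0-255 luminance."""
    r, g, b, a = pixel
    alpha = a / 255.0
    # Rec. 601 luma weights for converting RGB to gray.
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return int(round(alpha * gray + (1.0 - alpha) * background))


def flatten_bitmap(bitmap, background=255):
    """Apply flatten_to_gray to every pixel of a row-major RGBA bitmap."""
    return [[flatten_to_gray(p, background) for p in row] for row in bitmap]


# 1x3 toy bitmap: opaque black text pixel, fully transparent pixel,
# and a half-transparent white pixel.
bitmap = [[(0, 0, 0, 255), (0, 0, 0, 0), (255, 255, 255, 128)]]
print(flatten_bitmap(bitmap))
```

After flattening, every pixel has a definite gray level, so the OCR engine no longer has to interpret an alpha channel.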

https://github.com/CCExtractor/ccextractor/pull/759

This one was one of those problems that happened to be a one line fix, but involved a huge amount of analysis, reading and experimentation to get done right. It took me the good part of a week to get this to work as it is now, and hopefully this yields near perfect OCR results for all use cases, including those with transparent DVB subtitles.

Deduplication in Brazilian ISDB

There was an issue with the Brazilian ISDB subtitle decoder which caused broken timestamps (Timestamps are the primary key for the Red Hen dataset, and without them, the data can’t be part of the searchable archive). There was also an issue with duplication with certain ISDB streams that had an incorrectly defined rollup mode in the stream. I changed the logic for calling the deduplication code in such cases.

The issue conversation and fix can be seen at :-
https://github.com/CCExtractor/ccextractor/issues/739

Fixing end timestamp in transcripts from DVB

There was a bug that caused transcripts generated from any DVB file to have broken timestamps (they lacked the UNIX offset and so produced timestamps in the year 1970, which is far from ideal). I fixed this by adding the delay parameter at the required place.
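
Why a missing offset produces dates in 1970 can be seen from a small example. This is not the CCExtractor code; the function name, the parameter name and the offset value 1500000000 are made up for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def absolute_time(relative_ms, unix_offset_s=0):
    """Convert a stream-relative time in ms to a UTC datetime.

    unix_offset_s is the recording start in seconds since the UNIX epoch.
    """
    return EPOCH + timedelta(seconds=unix_offset_s, milliseconds=relative_ms)


# Without the offset, five seconds into the stream lands in 1970.
print(absolute_time(5000))
# With a made-up recording start of 1500000000 seconds since the epoch.
print(absolute_time(5000, unix_offset_s=1500000000))
```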

The changes made can be seen at:-
https://github.com/CCExtractor/ccextractor/pull/755

Repository Links

All code that I wrote is publicly available on GitHub at the following repositories:-

  1. For the TV Commercial Classification system:-
    https://github.com/Abhinav95/tv-ad-classification
  2. For the latest version of CCExtractor that includes my changes:-
    https://github.com/CCExtractor/ccextractor
  3. For the latest version of the Visual Recognition Pipeline that includes my changes:-
    https://github.com/gshruti95/news-shot-classification

Known Issues/Future Work

In this section, I outline a couple of current issues that need some thought to be fixed in order to get our system working for certain kinds of captions. I also outline the next logical steps to be taken to build on my work.

Tickertape Caption Merging

I had previously worked on extracting text from tickertape style captions, and the process worked fairly well. However, generating a timed transcript from these was a problem because of the OCR system not being able to determine legible text from the endpoints of the recognized text for a particular frame. This led to a problem in merging the text across successive frames, because of lots of garbage text at the terminal points. This is not a problem that could be solved by a simple Levenshtein distance computation.
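
The difficulty can be seen with a deliberately simple merging strategy. The sketch below is not the approach used in CCExtractor; it joins the text of two successive frames on an exact suffix/prefix overlap, and the sample strings are invented. A single misread character at the edge of a frame is enough to defeat it.

```python
def overlap_length(previous, current):
    """Length of the longest suffix of previous that is also a prefix of current."""
    for size in range(min(len(previous), len(current)), 0, -1):
        if previous.endswith(current[:size]):
            return size
    return 0


def merge_ticker(previous, current):
    """Append only the part of current that previous does not already end with."""
    return previous + current[overlap_length(previous, current):]


clean = merge_ticker("BREAKING NEWS: MARKETS RI", "NEWS: MARKETS RISE TODAY")
print(clean)

# One misread character at the right edge breaks the exact match,
# so the overlapping text is duplicated instead of merged.
noisy = merge_ticker("BREAKING NEWS: MARKETS R|", "NEWS: MARKETS RISE TODAY")
print(noisy)
```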

I have set up the basic framework for tickertape OCR here:-
https://github.com/CCExtractor/ccextractor/commit/3278b31a8f571f5188d376142b1981a3a99ccff2

However, some thought needs to be given on how to get an accurate timed transcript as the final step of this method.

Deduplication with Incorrect OCR Results

A deduplication issue similar to the tickertape problem described above was seen in French DVB recordings of the TF2 channel, in which subtitles appeared word by word. The obtained text was fine in most instances but sometimes mangled around the ends, which made it very troublesome to merge the text into an accurate timed transcript.

Putting the System into Production

The core aim of doing this project was to put the visual recognition pipeline into production such that useful visual annotations are generated for the NewsScape dataset. This requires usage of available computational resources in the most optimal manner, which we try to achieve by means of the job manager, Intel distribution of Caffe and the GPU support.

The system will soon be actively submitting jobs on the Case (and possibly Erlangen) HPC clusters and saving the visual annotations, which will be accessible through the Edge search engine in the future.

Future Collaboration

I have really enjoyed my collaboration with Red Hen, in a direct capacity this year and an indirect capacity last year. It was a great experience meeting everyone in person too at ICMC 2017 at Osnabrück, Germany. I will most certainly carry on working with the organization.

Red Hen is also applying to be a mentoring organization for Google Code-in this year. I have been a Code-in mentor for CCExtractor in the past, and I would most definitely enjoy a similar role for Red Hen in the near future as well.

References/Licenses

  • Places205-AlexNet model: B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems 27 (NIPS) spotlight, 2014. http://places.csail.mit.edu/downloadCNN.html
  • Reference Caffenet model: AlexNet trained on ILSVRC 2012, with a minor variation from the version as described in ImageNet classification with deep convolutional neural networks by Krizhevsky et al. in NIPS 2012. Model trained by J. Donahue.
  • GoogleNet model: http://arxiv.org/abs/1409.4842
    Szegedy et al., Going Deeper with Convolutions, CoRR 2014
    Used BVLC Googlenet model, trained by S. Guadarrama.
  • YOLOv2 Model : YOLO 9000: Better, Faster, Stronger
    https://pjreddie.com/darknet/yolo/
  • CWRU HPC: This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University.
    https://sites.google.com/a/case.edu/hpc-upgraded-cluster/
  • Coloribus archive: Coloribus is a thoroughly collected and daily updated advertising archive, the biggest online collection of creative advertising pieces from all over the world. Highly structured database containing information about brands, agencies, people involved, awards and other very relevant data, combined with advanced full-text search engine.
  • Red Hen Lab NewsScape Dataset: This work made use of the NewsScape dataset and the facilities of the Distributed Little Red Hen Lab, co-directed by Francis Steen and Mark Turner. http://redhenlab.org

Google Summer of Code Update – July 2017

I’ve been working for Red Hen Lab and CCExtractor as part of GSoC 2017. This year, my project was divided into three major components:-

  1. TV Commercial Classification – Details and code here (Done during June)
  2. Improving the visual recognition pipeline
  3. Optimizing the performance of the visual recognition pipeline and deploying it

Apart from this, I have also been working on analyzing and fixing issues with the CCExtractor code in order to better cater to some of Red Hen’s new international use cases.

In this post, I will describe what I have been up to this month, what things I have been spending my time working on, and some interesting decisions that I have had to make based on what I have observed in my experiments.

My Current Setup

Red Hen’s GSoC students have been working extensively on machine learning and deep learning problems, which typically require a GPU for computational tractability. The Case HPC is where most of the important number crunching happens. However, being a shared resource, the HPC (and especially its GPU nodes) was not always readily available. I recently purchased a new laptop with a 4GB Nvidia GTX 1050 Ti GPU, which I have been using to work locally (as well as on my home institute cluster with more GPUs). Most of the benchmarking work that I have done has been on my computer; however, the observations should scale roughly linearly with the computational power of other GPUs for our particular use case of image classification (i.e. a more powerful GPU should take a proportionally shorter time and a less powerful GPU a proportionally longer time for the same workload).

Benchmarking ResNet for News Shot Classification

AlexNet (CaffeNet in our implementation) is the deep neural network architecture used for the shot characterization into 5 classes in the pipeline. This is a 7 layer deep architecture.

ResNet is a newer, deeper and more accurate model developed by Microsoft Research. It achieves state-of-the-art results on large scale visual recognition and is in essence a modern day upgrade to the 2014 CaffeNet architecture that we use.

Upon training ResNet on the same task with the same training/testing data split to identify news shot categories, the saturation accuracy of ResNet was superior. It achieves a peak accuracy of 93.7% whereas the AlexNet model achieves a peak accuracy of 88.3%. The ‘real’ accuracy (on exhaustive testing) for AlexNet is close to 86.5% and that for ResNet is close to 91%.

The training graph for the two architectures (done on a K40 GPU and a 1050 Ti GPU respectively) looks like this:-

resnet-vs-alexnet

ResNet takes slightly longer to converge, likely because the deeper architecture requires more time for the gradients to flow across the network. But it achieves an overall accuracy roughly 5-6% higher than the AlexNet model after training is complete. This is to be expected because the ResNet model is newer and closer to the state of the art. However, its extreme depth in terms of layers leads to a much, much higher time taken to train and test (predict new samples using) the model.

Running 100 iterations 10 times and averaging the times, with a batch size of 10 at test time, gives the following benchmarks for my Nvidia GTX 1050 Ti 4GB GPU and for an Intel Xeon CPU node:-

Model      Avg GPU Mode time (s)   Avg CPU Mode time (s)
AlexNet    7.34                    283.66
ResNet     49.54                   1812.31

My GPU seems to be around 3 times slower than a GTX 1080 judging by these benchmarks.

The final decision that I took on the basis of these observations was to stick to the existing AlexNet framework for the sake of speed on CPU nodes. An accuracy increase of 5-6% for the shot type categorization from an existing ~86% was good, but the tradeoff of 6-7 times the runtime was perhaps not ideal for our use case of processing over 300,000 hours worth of video.
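
The tradeoff behind this decision can be checked with a few lines of arithmetic, using only the timings and the approximate exhaustive-test accuracies quoted in this post.

```python
# Test-time averages (seconds) and approximate exhaustive-test accuracies (%)
# as quoted in this post.
alexnet = {"gpu_s": 7.34, "cpu_s": 283.66, "accuracy": 86.5}
resnet = {"gpu_s": 49.54, "cpu_s": 1812.31, "accuracy": 91.0}

gpu_slowdown = resnet["gpu_s"] / alexnet["gpu_s"]
cpu_slowdown = resnet["cpu_s"] / alexnet["cpu_s"]
accuracy_gain = resnet["accuracy"] - alexnet["accuracy"]

print("ResNet vs AlexNet on GPU: %.2fx slower" % gpu_slowdown)
print("ResNet vs AlexNet on CPU: %.2fx slower" % cpu_slowdown)
print("Accuracy gain: %.1f percentage points" % accuracy_gain)
```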

Brazilian Timestamps and Deduplication

There was an issue with the Brazilian ISDB subtitle decoder which caused broken timestamps (timestamps are the primary key for the Red Hen dataset, and without them, the data can’t be part of the searchable archive). There was also an issue with duplication in certain ISDB streams that had an incorrectly defined rollup mode.
https://github.com/CCExtractor/ccextractor/issues/739

French OCR – Fixing an Issue with Image Transparency

https://github.com/CCExtractor/ccextractor/pull/759

This one was one of those problems that happened to be a one line fix, but involved a huge amount of analysis, reading and experimentation to get done right. It took me the good part of a week to get this to work as it is now, and hopefully this yields near perfect OCR results for all use cases, including those with transparent DVB subtitles.

Adding GPU specific usage to the Visual Pipeline

Earlier, the pipeline was supposed to be run as one single job on a CPU compute node, and even if the requested node was a GPU node, the capabilities of the GPU would not be used by it. I added the capability to use a GPU if available. If no GPU is available (or detected by the code), we fall back to default CPU execution.

The changes to make this happen can be seen at:-
https://github.com/gshruti95/news-shot-classification/pull/2

This is especially useful in speeding up the feature extraction and classification steps that involve deep neural networks (namely anything that involves running a Caffe model). GPU execution speeds up these steps by a large constant factor, which could be anywhere between 20 and 200 times depending on the computational power of the CPU and GPU in question. On my computer, a speedup of nearly 40 times can be observed (benchmarks above).

Upgrading the YOLO person detector in the Visual Pipeline to YOLOv2

The current pipeline has the YOLO object detector specifically for person detection. However, YOLO was upgraded to YOLOv2 this year and has been accompanied by significant accuracy gains. I upgraded the version of YOLO used in the pipeline to the latest one, while retaining the same output format in the SHT and the JSON files.

I integrated the original C code based on Darknet for YOLOv2 to the pipeline, and person detection results are slightly better than before.

Singularity Container for Portable HPC Execution of the Visual Pipeline

Singularity is an HPC-friendly alternative to the popular Docker framework. Red Hen has been using Singularity heavily this GSoC. One particularly useful use case is running the same Singularity image on multiple HPC clusters (e.g. Case HPC, or one of the many clusters at Erlangen HPC).

I have written a basic singularity image for the state of the code at this moment, which creates a container and then downloads all required models and dependencies, and sets up the container for usage of the visual recognition pipeline.

Upcoming Work

In the final third of GSOC, I will work on reducing the overall runtime of the pipeline by as much as possible. I will also work on writing an HPC job manager script which logically segments the different parts of the pipeline and submits and tracks different jobs for each based on available resources (GPU/CPU). Another thing to do is set up the pipeline with CPU optimized Intel Caffe which will allow automatic parallel processing on CPU nodes on HPC. After doing this and testing for sanity, the pipeline should be ready to be put into production on the entire NewsScape dataset, perhaps on multiple HPCs.

Google Summer of Code – Work Product Submission

HardsubX : Burned-in Subtitle Extraction Subsystem for CCExtractor

Project Description

My project was to add to CCExtractor the capability of extracting burned-in (hard) subtitles from videos. Until now, CCExtractor worked only by extracting caption data present in specific structures in the stream, and skipped the actual video data (pixels) completely. However, a lot of videos have hard subtitles burned into them; extracting these is a computer vision problem, and something CCExtractor was previously unable to do.

Compilation

CCExtractor can be compiled with HardsubX support as follows:-
make ENABLE_HARDSUBX=yes
This needs to be run from the ccextractor/linux directory.

Usage

The -hardsubx flag needs to be specified to the ccextractor executable in order to enable burned-in subtitle extraction. Other options along with a description of them are as follows:-

  • -ocr_mode : Set the OCR mode to either frame-wise, word-wise or letter-wise.
    e.g. -ocr_mode frame (default), -ocr_mode word
  • -subcolor : Specify the color of the subtitles, which may be one of white, yellow, green, cyan, blue, magenta, red. Alternatively, a custom hue value between 0 and 360 may also be supplied, taking reference from a standard hue chart.
    e.g. -subcolor white or -subcolor 270 (for violet)
  • -min_sub_duration : Specify the minimum duration in seconds that a subtitle line must exist on the screen. Lower values give better timed results, but increase processing time. The default value is 0.5.
    e.g. -min_sub_duration 1.0 (for a duration of 1 second)
  • -detect_italics : Specify whether italics are to be detected from the OCR text. Italic detection automatically enforces the OCR mode to be word-wise.
  • -conf_thresh : Specify the classifier confidence threshold between 1 and 100. Try and use a threshold which works for you if you get a lot of garbage text.
    e.g. -conf_thresh 50
  • -whiteness_thresh : For white subtitles only, specify the luminance threshold between 1 and 100. This threshold is very content dependent, and adjusting values may give you better results. Recommended values are in the range 80 to 100. The default value is 95.
    e.g. -whiteness_thresh 98

A composite example command is as follows:-
ccextractor video.mp4 -hardsubx -subcolor white -detect_italics -whiteness_thresh 90 -conf_thresh 60
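
To give a feel for what selecting subtitles by hue means, here is a small sketch that tests whether a pixel's hue lies near a target hue. It is an illustration only and not HardsubX's actual C implementation: the hue table uses the conventional color-wheel angles, and the 20 degree tolerance is an arbitrary value chosen for the example.

```python
import colorsys

# Conventional hue angles in degrees for the named colors. White has no hue
# and is handled through a luminance threshold instead, so it is left out.
NAMED_HUES = {
    "red": 0, "yellow": 60, "green": 120,
    "cyan": 180, "blue": 240, "magenta": 300,
}


def hue_of(rgb):
    """Hue of an 8-bit RGB pixel in degrees, in the range [0, 360)."""
    r, g, b = [channel / 255.0 for channel in rgb]
    return colorsys.rgb_to_hsv(r, g, b)[0] * 360.0


def matches_hue(rgb, target_hue, tolerance=20.0):
    """True if the pixel hue is within tolerance degrees of the target (wraps at 360)."""
    diff = abs(hue_of(rgb) - target_hue) % 360.0
    return min(diff, 360.0 - diff) <= tolerance


print(matches_hue((250, 240, 30), NAMED_HUES["yellow"]))
print(matches_hue((30, 40, 250), NAMED_HUES["yellow"]))
```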

Example Outputs

Video 1

1
00:00:00,000 --> 00:00:06,919
Well, they’re both spectacles,

2
00:00:06,921 --> 00:00:08,879
NBA basketball as well as football here.

3
00:00:08,881 --> 00:00:12,879
It’s a spectacle for the fans, they enjoy it and see

4
00:00:12,881 --> 00:00:13,919
something different.

5
00:00:13,921 --> 00:00:19,919
And I think a player like Messi could be compared to Curry

6
00:00:19,921 --> 00:00:21,919
in the USA,

7
00:00:21,921 --> 00:00:24,879
because they have created something special

8
00:00:24,881 --> 00:00:26,999
something not seen before,

9
00:00:27,001 --> 00:00:30,919
and something that makes people excited, ecstatic.

Video 2

1
00:00:00,000 --> 00:00:08,999
This is one of my favorite places.

2
00:00:09,001 --> 00:00:18,165
I love that on one hand, if you look far enough,
there are the untouched snowy steppes

3
00:00:18,167 --> 00:00:24,874
and if you turn around there is
a powerful contrast.

4
00:00:24,876 --> 00:00:30,332
The large industrial complex,
a city in a snowy desert.

5
00:00:30,334 --> 00:00:35,749
It’s awe-inspiring that all of this
was created by the hands of men.
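
The outputs above are in the SubRip (SRT) layout: a cue number, a line with start and end times written as HH:MM:SS,mmm and separated by an arrow, then the text. A minimal sketch of producing one such cue from millisecond times is given below; the helper names are illustrative and are not taken from the HardsubX source.

```python
def srt_timestamp(total_ms):
    """Format a non-negative time in milliseconds as HH:MM:SS,mmm."""
    hours, rest = divmod(int(total_ms), 3600 * 1000)
    minutes, rest = divmod(rest, 60 * 1000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def srt_block(index, start_ms, end_ms, text):
    """Build one numbered SRT cue."""
    return "%d\n%s --> %s\n%s\n" % (
        index, srt_timestamp(start_ms), srt_timestamp(end_ms), text
    )


print(srt_block(2, 6921, 8879, "NBA basketball as well as football here."))
```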

List of new files

hardsubx.c – The ‘main’ file of the subsystem, in which the required structures and context are initialized.
hardsubx.h – The header file which contains definitions that are used across the code of the subsystem.
hardsubx_classifier.c – Handles various levels of subtitle text recognition, italic detection, and confidence of the output text.
hardsubx_decoder.c – Decodes video frames using the FFMpeg library, reads frames into context structures, passes structures to appropriate OCR functions and encodes correctly timed subtitles.
hardsubx_imgops.c – Handles conversion between color spaces necessary to process subtitles of a certain color.
hardsubx_utility.c – Contains various utility functions used in the subsystem.

Additional Work

In addition to my main project, I also worked on improving existing CCExtractor features and fixing issues. I worked on the following features/issues:-
1. Reducing memory consumption by 180 MB
2. Case fixing for teletext subtitles
3. Adding color detection for DVB subtitles
4. Fixing a crash with DVB subtitles
5. Adding OCR support for potentially 99 different DVB languages
6. Adding parameters for DVB and OCR language selection

User Documentation

Installation instructions can be found at docs/HARDSUBX.txt.
General usage instructions can be found in the help screen in params.c.

Link to Commits

All my commits to the mainstream master branch can be seen at:-
https://github.com/CCExtractor/ccextractor/commits?author=Abhinav95

My changes to the cross-platform GUI can be seen at:-
https://github.com/kisselef/ccextractor-gui-qt/pull/4/files

All the merged pull requests which I have made to the mainstream master branch can be seen at:-
https://github.com/CCExtractor/ccextractor/pulls?q=is%3Apr%20author%3AAbhinav95%20is%3Amerged

Link to blog posts

All the blog posts which I have written about my project can be seen at:-
https://abhinavshukla95.wordpress.com/category/gsoc/

Known Issues and Future Work

  • Lower resolution videos (e.g. 360p) do not work well with the current set of parameters. I tried upscaling the video but without success. This is likely due to quantization artifacts in lower quality video which prevent text recognition in the same way.
  • The entire system is a rule based classifier. The current state of the art in text recognition uses advanced techniques like Neural Networks and MRFs, but the integration of those into the current C code base would have been really difficult as most libraries are in different languages, which is why I chose to stick with the Leptonica C library (already a CCExtractor dependency) and a simple image processing approach.
  • There are problems in videos which have background hues similar to the color of the subtitle text to be detected. Relaxing the threshold value in the code caused too much garbage text to be detected whereas tightening the threshold caused no text to be reliably detected. This issue is highly content dependent.

Future Contribution to CCExtractor

I really enjoyed working with the CCExtractor organization throughout the summer. I will now maintain my project on my own time, outside of GSOC too. Also, for anyone willing to contribute to the existing code, and HardsubX in particular, feel free to fork the main repository at https://github.com/CCExtractor/ccextractor and send a patch. You will really like being a part of the community.

Google Summer of Code, Week 12 – GUI Integration and DVB Languages

My work for this week focused on integrating the new features that I developed throughout GSOC into the GUI, and also adding support for multiple OCR languages to DVB subtitles.

GUI Integration

The most common users of CCExtractor use the GUI, and may not necessarily be experienced with the command line. Hence, it was important to integrate the new features into the GUI so that they may be available to a much wider audience.

The GUI is essentially a program which provides a graphical interface to the user who can supply the required options with ease, and the program in turn calls the ccextractor executable after parsing the options that the user supplied and converting them into a command string. The GUI interacts with the executable and works just like the normal executable would from the command line, but it gives easy to interpret visual feedback to the user, bundled in an easy to use application.

In addition to the SourceForge Windows GUI, CCExtractor also has a cross-platform Qt GUI which was developed by Oleg Kisselef in GSOC 2015 (https://github.com/kisselef/ccextractor-gui-qt). I needed to add parsing support for the options which I had added to the main program and create an interface which would allow those options to be sent to the executable.

It was fairly easy to add the options to the ‘Options’ window. A main checkbox needs to be checked in order to access the other parameters (equivalent to how -hardsubx needs to be specified before any related options on the command line). After I had created the UI, I had to map the elements in the GUI (radio buttons, sliders, checkboxes) to events in the application which would pass on the appropriate commands to the executable. I also verified that the options worked as intended, and constrained them to take only valid values, and also be initialized to the default and recommended values.

The resultant additions to the Qt GUI on Linux look like this:-

[Image: HardsubX options in the Qt GUI on Linux]

DVB Languages

Another thing which I did this week was to add OCR support for DVB subtitles in potentially 99 different languages using Tesseract’s .traineddata files. Before this point, only English was supported.

Adding this feature was like solving a partially solved jigsaw puzzle. I just had to complete some of the existing code to search for Tesseract language packs and make sure that it looked for the necessary files in the correct locations.

Initially, I had also added special cases for certain languages like Chinese (simplified) which seemed to come with non standard language codes in the video stream. However, instead of hard-coding a particular case like this, it was deemed better to let the user specify the non standard names, if at all necessary.

I added the -ocrlang and -dvblang parameters. -dvblang allows the user to select which language’s caption stream will be processed. In the event that there were multiple caption streams in the video, only the one specified by the parameter would be processed. -ocrlang allows the user to manually select the name of the Tesseract .traineddata file. This option is helpful if you want to OCR a caption stream of one language with the data of another language. e.g. -dvblang chs -ocrlang chi_tra will decode the Chinese (simplified) caption stream but perform OCR using the Chinese (Traditional) trained data. This option is also useful when the Tesseract .traineddata files don’t come with standard ISO names.

Google Summer of Code, Weeks 10 and 11 – Different Types of Subtitle Detection

In some cases, the detected text may be filled with noise and unwanted artifacts. Hence, there was a need to improve the text classifier in order to try and improve the quality of the detected captions. I set up three different levels of Tesseract subtitle line classifiers, and added confidence based thresholds as possible parameters which could potentially improve the quality of the OCR results.

The Three Modes – Frame, Word and Character

There are three different modes at which I used Tesseract to process a particular frame:-

  • Frame: In this mode, the entire frame is processed at once, and the entire UTF8 text detected by Tesseract is written to the caption file.
  • Word: In this mode, every word detected in the frame is individually processed. It can be later thresholded based on confidence or whether it is a dictionary word, or an expletive word etc.
  • Letter: In this mode, every letter detected in the frame is individually processed. It can also be later thresholded. This mode exists mainly because Tesseract makes it possible; for any practical purposes, the first two should serve fine.

I created a parameter called -ocr_mode which allows the user to specify the level at which the OCR will be performed and interpreted. The default is the ‘frame’ mode.

Tesseract Confidence Ratings

The Tesseract Engine supplies confidence ratings along with its OCR predictions. I chose to use these confidence values to improve the quality of text recognition performed by the system. I created an optional parameter called -conf_thresh which allows the user to put a threshold on the confidence rating of the text classification by Tesseract (having a default value of 0, i.e. all classifications accepted). Only classification results which had a confidence above the threshold were processed and written as captions.

The confidence thresholding works for each of the three OCR modes as described above. For the ‘frame’ mode, the confidence used is the mean text confidence, for the ‘word’ mode, the per-word confidence, and for the ‘letter’ mode, the per character confidence. These results are then thresholded and only the good ones remain.
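As a rough illustration of the ‘word’ mode thresholding, here is a minimal sketch. The struct ocr_word type and the filter_words function are hypothetical names made up for this example and are not part of CCExtractor or Tesseract. It assumes confidences on a 0-100 scale and keeps words at or above the threshold, so that a threshold of 0 accepts everything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one OCR result (not a CCExtractor or Tesseract type). */
struct ocr_word {
    const char *text;
    float conf; /* assumed to be on a 0-100 scale */
};

/* Join every word whose confidence is at least conf_thresh into out,
 * separated by single spaces. Returns the number of words kept.
 * Stops early if the next word would not fit in out. */
size_t filter_words(const struct ocr_word *words, size_t n,
                    float conf_thresh, char *out, size_t out_size)
{
    size_t kept = 0, used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(words[i].text);

        if (words[i].conf < conf_thresh)
            continue; /* low confidence: drop this word */

        /* Room needed: optional separator + word + terminating NUL. */
        if (used + len + (kept ? 1 : 0) + 1 > out_size)
            break;

        if (kept)
            out[used++] = ' ';
        memcpy(out + used, words[i].text, len);
        used += len;
        out[used] = '\0';
        kept++;
    }
    return kept;
}
```

With a threshold of 60, a misread word such as ‘giape’ with a confidence of 42 would be dropped while its higher-confidence neighbours are kept.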

Italic Detection

Another small part of my proposal was to detect if the formatting of the subtitles was italic. I had originally intended to do this either by estimating the text orientation with the Fourier transform, or by looking at the average angle of the longest character strokes found by a Hough transform.

However, none of that proved to be necessary since the Tesseract API had a call to detect word font attributes. So, whenever italic detection was to be done, I set the OCR mode to word-wise and called the Tesseract API which would then determine if the word was italic.

An excerpt from the video at https://www.facebook.com/uniladmag/videos/2282957648393948 (about British warship HMS Bulwark) is:-

1
00:00:00,000 --> 00:00:04,959
<i>The ship is an enormous machine </i>

2
00:00:04,961 --> 00:00:07,919
<i>one of the complicated machines that Britains ever built </i>

3
00:00:07,921 --> 00:00:09,879
<i>we make our own water from sea water </i>

4
00:00:09,881 --> 00:00:10,959
<i>we deal with our own sewage </i>

5
00:00:10,961 --> 00:00:12,959
<i>we cook our own food </i>

6
00:00:12,961 --> 00:00:15,879
<i>we’re a floating city. </i>

Google Summer of Code, Weeks 8 and 9 – Detecting DVB Subtitle Color

DVB Subtitles

DVB (Digital Video Broadcasting) is the standard for TV video in a large number of countries, and is especially prevalent in Europe. In a DVB video stream, subtitles are present as colored bitmap images, which are simply overlaid on the video if subtitles are turned on in the viewing system.

CCExtractor already had excellent support for DVB subtitle text recognition, using Tesseract. It was done by first binarizing the bitmap so that text and the background were separate. This resulted in accurate text recognition by cleaning up the image, but lost color information in cases where multiple colors of text were present in a single bitmap. An additional requirement was to detect the color of each word in the subtitle.

Why Color Is Important

Color changes in DVB subtitles refer to speaker changes in the program. Assigning a different color for a different speaker enriches the assistive capabilities of captions (e.g. for hearing impaired people). Speaker change detection also holds a very large importance for various text processing algorithms for which CCExtractor is a major data source.

Bitmaps and Color Histograms

Bitmaps are just 2-D arrays of numbers, along with an accompanying palette. A palette is like a dictionary which represents a mapping from the pixel value in the bitmap to the actual RGB value. For example, the bitmap may have values ranging from 1 to 8, and 1 may represent Black (0,0,0), 8 may represent White (255,255,255) and so on. DVB subtitles are also bitmaps with their corresponding palettes. Color histograms represent the frequency of each pixel value in the bitmap: the more often a particular pixel value occurs in the image, the higher its histogram value.

Word-Wise Color Quantization

The color detection for every word is done by iterating over the bounding boxes of every word obtained in the original DVB OCR results. For every bounding box, a color quantization process is performed. Color quantization essentially means changing the pixel value to a nearby value which has a much higher frequency in the histogram. Using this information from a two bin color quantization, the background and foreground colors are determined, and the foreground (text) color is assigned as the detected color of the text.

Successive words with the same color are grouped together and the points at which the color changed are marked with <font> tags.
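The per-word step can be sketched roughly as follows. This is only an illustration: word_text_index is a hypothetical name, and the sketch assumes that within a word’s bounding box the background covers more pixels than the text, so the second most frequent palette index is taken as the text color. The actual CCExtractor code may decide this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PALETTE_MAX 256

/* Build a histogram of palette indices inside one word's bounding box
 * (x, y, w, h) and return the second most frequent index, taken here as
 * the text color. If the box holds a single color, that color is returned.
 * Assumption: background pixels outnumber text pixels inside the box. */
int word_text_index(const uint8_t *bitmap, int stride,
                    int x, int y, int w, int h)
{
    size_t hist[PALETTE_MAX] = {0};
    int first = -1, second = -1;

    assert(bitmap != NULL && w > 0 && h > 0);

    for (int row = y; row < y + h; row++)
        for (int col = x; col < x + w; col++)
            hist[bitmap[row * stride + col]]++;

    /* Track the two most frequent indices (the "two bins"). */
    for (int i = 0; i < PALETTE_MAX; i++) {
        if (hist[i] == 0)
            continue;
        if (first < 0 || hist[i] > hist[first]) {
            second = first;
            first = i;
        } else if (second < 0 || hist[i] > hist[second]) {
            second = i;
        }
    }
    return second >= 0 ? second : first;
}
```

The returned index would then be looked up in the palette to get the RGB value that goes into the <font> tag.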

Output

A DVB frame with three colours along with the corresponding output is as shown:-

[Image: DVB subtitle frame with text in three colours]

34
00:01:47,780 --> 00:01:51,339
<font color="#00ff00">So he spent last night in a cell?</font>
<font color="#ececec">It’s a ROOM. Not a cell. </font><font color="#ffff00">Ian!</font>

The color values are exactly what their pixel values are in the bitmap.

DVB Crash Fix

As a result of working on DVB Color Detection, I also noticed and fixed an important bug which was causing a lot of periodic crashes while continuously processing DVB subtitles. The bug was largely due to Tesseract OCR returning multiple newlines at the end of a line. I made a quick fix by increasing the memory allocated to the resulting string variable. It resulted in a large increase in the stability of the DVB processing pipeline.
https://github.com/CCExtractor/ccextractor/issues/401
Although there are still a few issues and bugs in the program, the DVB system is quite stable.

Google Summer of Code, Weeks 6 and 7 – Detecting Colored Subtitles

Up to this point, I have a system which works well for burned-in white subtitles and generates a timed output file. The next step is to add the same support for colored subtitles too.

The HSV Color Space

The HSV color space, and the Hue component (H) in particular, is an excellent representation of the exact color value of a pixel. The normal RGB space requires 3 values to represent the color, whereas the H component takes a value in the range of 0-360 and gives the necessary color information.

You can read more about the color space here.

The chart below shows how the values of H vary for different types of colors.

[Image: chart of hue values for different colors]

I exploited a conversion from the RGB to the HSV space in order to detect colored subtitles. Just like there was a luminance threshold in order to detect white subtitles, there is a threshold around the range of the user-specified hue value in order to detect subtitles of a particular color.

This hue based thresholding, along with the existing vertical edge dilation was used to detect subtitles of a particular color.
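To make the idea concrete, here is a minimal sketch of the hue computation and the hue threshold. The function names rgb_to_hue and hue_matches and the tolerance parameter are hypothetical and chosen for this example; the real program works on Leptonica images rather than on single pixels. The conversion is the standard RGB to HSV hue formula, and the comparison wraps around so that hues near 360 still match red at 0.

```c
#include <assert.h>

/* Hue in degrees, in [0, 360), for an 8-bit RGB pixel.
 * Returns -1 for gray pixels (r == g == b), where hue is undefined. */
float rgb_to_hue(int r, int g, int b)
{
    int max = r > g ? (r > b ? r : b) : (g > b ? g : b);
    int min = r < g ? (r < b ? r : b) : (g < b ? g : b);
    float delta = (float)(max - min);
    float h;

    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);

    if (max == min)
        return -1.0f;

    if (max == r)
        h = 60.0f * ((float)(g - b) / delta);
    else if (max == g)
        h = 60.0f * ((float)(b - r) / delta) + 120.0f;
    else
        h = 60.0f * ((float)(r - g) / delta) + 240.0f;

    if (h < 0.0f)
        h += 360.0f;
    return h;
}

/* 1 if hue lies within tolerance degrees of target, measured around
 * the hue circle; 0 otherwise (including for undefined hue). */
int hue_matches(float hue, float target, float tolerance)
{
    float diff;

    if (hue < 0.0f)
        return 0;

    diff = hue - target;
    if (diff < 0.0f)
        diff = -diff;
    if (diff > 180.0f)
        diff = 360.0f - diff; /* take the shorter way around the circle */
    return diff <= tolerance;
}
```

For instance, a pure yellow pixel (255, 255, 0) gives a hue of 60, and a hue of 350 still matches a red target of 0 with a tolerance of 20.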

Color options in the program

The program has 7 predefined color names. The first and most prevalent case is White, the detection of which is luminance based. The other 6 are equally spaced in the hue value range. The colors, along with their hue values, are:-

  1. Yellow – 60
  2. Green – 120
  3. Cyan – 180
  4. Blue – 240
  5. Magenta – 300
  6. Red – 0

Each of these colors can be specified along with the -subcolor option. For example:-

ccextractor video.mp4 -hardsubx -subcolor yellow

In addition to these preset values, a custom hue value between 0 and 360 (not included) can be supplied to the -subcolor option. This helps users who need to extract subtitles of a precise hue when none of the presets matches their stream.
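A rough sketch of how such an argument could be interpreted is given below. parse_subcolor is a hypothetical function written for this post, not the parser in params.c, and it assumes that custom values from 0 up to (but not including) 360 are accepted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Interpret a -subcolor argument. Returns 0 on success, -1 on error.
 * On success either *is_white is set to 1 (luminance based detection),
 * or *is_white is 0 and *hue holds the hue in degrees.
 * Assumption: custom hues must satisfy 0 <= hue < 360. */
int parse_subcolor(const char *arg, int *is_white, float *hue)
{
    static const struct { const char *name; float hue; } presets[] = {
        {"yellow", 60.0f}, {"green", 120.0f}, {"cyan", 180.0f},
        {"blue", 240.0f}, {"magenta", 300.0f}, {"red", 0.0f}
    };
    char *end;
    float value;

    assert(arg != NULL && is_white != NULL && hue != NULL);
    *is_white = 0;

    if (strcmp(arg, "white") == 0) {
        *is_white = 1;
        return 0;
    }

    for (size_t i = 0; i < sizeof presets / sizeof presets[0]; i++) {
        if (strcmp(arg, presets[i].name) == 0) {
            *hue = presets[i].hue;
            return 0;
        }
    }

    /* Not a preset name: try a numeric hue. */
    value = strtof(arg, &end);
    if (end == arg || *end != '\0')
        return -1;
    if (!(value >= 0.0f && value < 360.0f))
        return -1; /* out of range (this also rejects NaN) */

    *hue = value;
    return 0;
}
```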

Local Adaptive Thresholding

In addition to detecting colored subtitles, I was also able to improve the detection of white subtitles using local adaptive thresholding, and Sauvola binarization in particular. This additional step marginally improved the quality of results for white subtitles (which always have a pixel value greater than their surroundings), but it could not be applied to colored subtitles in all cases because of the wide variety of contrasting backgrounds.
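For reference, the Sauvola method computes a separate threshold for each pixel from the mean m and standard deviation s of its local window: T = m * (1 + k * (s / R - 1)). The sketch below only shows this formula. The function names are hypothetical, the local statistics are assumed to be computed elsewhere, and treating pixels brighter than T as text is an assumption that fits white subtitles (the classic formulation targets dark text on a light page).

```c
#include <assert.h>

/* Sauvola threshold for one pixel, given the mean and standard deviation
 * of its local window. k is a tuning factor (values around 0.2 to 0.5 are
 * common) and r is the dynamic range of the standard deviation
 * (128 is the usual choice for 8-bit images). */
double sauvola_threshold(double mean, double stddev, double k, double r)
{
    assert(r > 0.0);
    return mean * (1.0 + k * (stddev / r - 1.0));
}

/* Assumption for white subtitles: a pixel brighter than its local
 * threshold is treated as text. */
int is_white_text(double pixel, double mean, double stddev, double k, double r)
{
    return pixel > sauvola_threshold(mean, stddev, k, r);
}
```

In flat regions (small s) the threshold drops below the local mean, while in high-contrast regions it rises towards the mean, which is what makes the method adapt to its surroundings.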

Google Summer of Code, Weeks 4 and 5 – Determining Subtitle Appearance Time

So far, I have been able to successfully extract white colored subtitles at an interval of 25 frames, and the output looks decent. However, I now need to actually create a timed transcript (e.g. an SRT file).

Original Plan

I had originally intended on having two strategies to determine subtitle time, which I had described in my proposal as:-

  1. A linear search across the video at a certain interval. Whenever a subtitle gets detected, a binary search will be performed in a window around that frame. Using this, we will detect the exact time of the beginning and the end of the particular subtitle line. This will be of benefit in sequentially processing a file (possible use case of processing a live stream as it is being recorded).
  2. If the entire video is already available to us, instead of doing linear search which will involve a lot of processing overheads for frames in which there are no subtitles, we can directly do a binary search on the entire video to detect subtitle lines. We will get the exact timing of the line as described above, but the overall processing will be faster

However, neither of them was possible, due to constraints which I had not originally anticipated.

FFMpeg Constraints

It turns out that binary search was not a viable option at all, because I could not arbitrarily seek to a timestamp in a video using the FFMpeg library. The closest thing which I could do was seek the file to the nearest I-frame and then iterate through frames to the desired timestamp and then reconstruct the needed frame. However, the whole point of the binary search was to optimize the search, and seeking this way would create a massive processing overhead and high redundancy in reconstructing frames. Instead, a linear search with a specified step-size seemed a much better option.

The problem that I described is fairly well documented online:-
http://stackoverflow.com/questions/17546073/how-can-i-seek-to-frame-no-x-with-ffmpeg
http://www.mjbshaw.com/2012/04/seeking-in-ffmpeg-know-your-timestamp.html

New Plan – Efficient Linear Search

I decided to use a linear search across the video with a specified step-size, which was a parameter called the minimum subtitle duration. I set the default value for this as 0.5 seconds, which seems a reasonable assumption for most subtitles.

I also needed to convert times to a single format (milliseconds), from the various different time bases that various different video streams could have. From here, I iterated through the video and sampled frames at regular intervals. I treated a subtitle line as the same as the last encountered one when its Levenshtein distance from that line was very low. This was necessary in order to combine successive detections which were off by a character or two, which happens quite often due to the natural noise present in the video stream. Whenever the detected subtitle line ended, I would encode it with the seen times.
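The two building blocks described above can be sketched as follows. The names pts_to_ms, levenshtein and same_subtitle, the MAX_LINE limit and the max_dist parameter are hypothetical choices for this example and are not taken from the HardsubX source. In real FFMpeg code a helper such as av_rescale_q would normally handle the time base conversion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LINE 256

/* Convert a timestamp expressed in a stream time base of
 * tb_num / tb_den seconds into milliseconds. */
int64_t pts_to_ms(int64_t pts, int tb_num, int tb_den)
{
    assert(tb_den > 0);
    return pts * 1000 * tb_num / tb_den;
}

/* Edit distance between two strings using the classic dynamic programming
 * recurrence with two rows. Inputs longer than MAX_LINE - 1 characters
 * are compared only up to that length. */
int levenshtein(const char *a, const char *b)
{
    int prev[MAX_LINE], curr[MAX_LINE];
    size_t la = strlen(a), lb = strlen(b);

    if (la >= MAX_LINE) la = MAX_LINE - 1;
    if (lb >= MAX_LINE) lb = MAX_LINE - 1;

    for (size_t j = 0; j <= lb; j++)
        prev[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;        /* delete from a */
            int ins = curr[j - 1] + 1;    /* insert into a */
            int sub = prev[j - 1] + cost; /* substitute or match */
            int best = del < ins ? del : ins;
            curr[j] = best < sub ? best : sub;
        }
        memcpy(prev, curr, (lb + 1) * sizeof prev[0]);
    }
    return prev[lb];
}

/* Two detections count as the same subtitle line when they differ
 * by at most max_dist edits. */
int same_subtitle(const char *a, const char *b, int max_dist)
{
    return levenshtein(a, b) <= max_dist;
}
```

With max_dist set to 2, a detection in which a single character was misread is merged into the subtitle line seen in the previous sampled frame instead of starting a new caption.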

Integrating with the CCExtractor Encoder

It was really easy to integrate the calculated time with the CCExtractor encoder structure (which brought with it full support for the existing output parameters). All I had to do was call two functions at the appropriate times in my code:-

add_cc_sub_text(ctx->dec_sub, subtitle_text, begin_time, end_time, "", "BURN", CCX_ENC_UTF_8);
encode_sub(enc_ctx, ctx->dec_sub);

That says a lot about how well written and modularized the existing library is.

And oh, I chose the subtitle mode ‘BURN’ myself. It stands for burned-in subtitles xD.

Example Output

The SRT output for the video at https://www.facebook.com/uefachampionsleague/videos/1255606254485834/ (A Gerard Pique interview), looked as follows:-

1
00:00:00,000 --> 00:00:06,919
Well, they’re both spectacles,

2
00:00:06,921 --> 00:00:08,879
NBA basketball as well as football here.

3
00:00:08,881 --> 00:00:12,879
It’s a spectacle for the fans, they enjoy it and see

4
00:00:12,881 --> 00:00:13,919
something different.

5
00:00:13,921 --> 00:00:19,919
And I think a player like Messi could be compared to Curry

6
00:00:19,921 --> 00:00:21,919
in the USA,

7
00:00:21,921 --> 00:00:24,879
because they have created something special

8
00:00:24,881 --> 00:00:26,999
something not seen before,

9
00:00:27,001 --> 00:00:30,919
and something that makes people excited, ecstatic.

It looks pretty good, and the times are pretty close to perfect, with some variation at the extremes due to those edge frames not being processed. A lower value for the minimum subtitle duration will give even more accurately timed results, but will take a longer processing time.

Google Summer of Code, Weeks 2 and 3 – Recognizing White Subtitles

These last two weeks were slightly challenging owing to the fact that I had to learn a lot of new things in order to complete my tasks.

Setting up the HardsubX expansion in CCExtractor

Before I could get started on diving deep into writing code, I needed to organize and set up the workflow of all the new code which I am supposed to write throughout the summer into the original program. This included parsing input parameters for the new type of extraction process, creating and organizing new files in the source code, and changing the compilation settings and dependencies to match what I would need for my pipeline.

The pipeline essentially comprises the following entities:-

  1. The ‘main’ file
    Handles the parsing of parameters and initializing the required data structures.
  2. The Decoder
    Gets the text of the burned in subtitle in the video by processing it
  3. The Timer
    Gets the precise timing of each extracted subtitle
  4. The Encoder
    Converts the output of the decoder and the timer into a standard output format such as a .srt(SubRip) file

I created separate files for each of these entities and their helper functions, along with one shared header file which would allow the internal librarization of the files (being able to use functions from one in another), as well as the potential external librarization (being able to be called from the main CCExtractor library).

You can view the project repository at https://github.com/Abhinav95/ccextractor. The new source code files are in the ‘src/lib_ccx’ directory and have the ‘hardsubx’ prefix in their names.

Processing a Video Stream in C

The very first step when trying to get subtitles from a video frame is to actually get the video frames themselves and store them in a data structure in the context of the program. FFMpeg is one of the most comprehensive open source media processing libraries in use today. I am using its C API to process the input video stream.

I needed to store the video stream format and codec information in the program context. Then, out of the many different kinds of streams present in the media file (video, audio, captions, others), I needed to find the ID of the video stream and then process only its packets. Every video stream packet is then decoded and the image content extracted and stored in a Leptonica PIX structure (for compatibility with Tesseract OCR). For the sake of efficiency and avoiding redundancy in frame extraction, I extract frames at an interval of 0.5 seconds, which I have assumed to be the minimum time that a subtitle line is present in the video. This number can be fine-tuned based on the real situation, but some threshold is required in order to avoid the massive processing overheads of reading every single frame in the video.

In a nutshell, the process goes like this. FFMpeg gives me the video frames at a certain interval, and then I further process them to detect subtitles.

Example frame:-

[Image: example video frame]

Detecting Subtitle Regions

The detection of white subtitle regions involved two steps:-

  1. Luminance based thresholding
  2. Vertical Edge detection and Dilation

The Luminance (L) of a particular pixel represents the ‘whiteness’ of the pixel. The closer it is to pure white, the higher is its luminance. When aiming to detect white subtitles, luminance based thresholding is useful because if we binarize the image in such a way that only regions of high luminance are retained, then all of the white subtitle regions will be retained (with possibly other white objects/artifacts too). This thresholding is done to narrow down the search for the candidate subtitle region.
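A minimal sketch of this thresholding step on a raw RGB buffer is shown below. The function names and the packed R, G, B layout are assumptions made for this example (the real pipeline works on Leptonica PIX structures), and the Rec. 601 luma weights are used as one common way of computing luminance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rec. 601 luma in integer arithmetic (the weights sum to 1000). */
static uint8_t luma(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((299 * r + 587 * g + 114 * b) / 1000);
}

/* Binarize a packed RGB buffer (3 bytes per pixel): mask[i] becomes 1
 * where the pixel's luminance is at least thresh, and 0 elsewhere. */
void threshold_luminance(const uint8_t *rgb, size_t n_pixels,
                         uint8_t thresh, uint8_t *mask)
{
    assert(rgb != NULL && mask != NULL);

    for (size_t i = 0; i < n_pixels; i++)
        mask[i] = luma(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]) >= thresh;
}
```

Note that a saturated color such as pure red has a low luminance and is rejected, even though one of its channels is at the maximum value.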

Thresholded Luminance image:-

[Image: thresholded luminance image]

The second part of the subtitle detection pipeline is the detection of vertical edges in the image, which is done by a vertical Sobel filter. This method is effective because subtitles have a high density of strong vertical edges in their region, due to the alternating white foreground letters and the non-white background. The edge image is then dilated with a horizontal structuring element in order to get the rough region of the subtitles.

Vertical edges:-

[Image: vertical edge image]

After dilation and thresholding:-

[Image: edge image after dilation and thresholding]

The final subtitle region is determined by taking a bitwise AND of the two feature images described above, i.e. regions which are both white (high luminance) and have strong vertical edges. Both these features are typical of white letters in the subtitle line. In some cases, one step alone may not work well. For instance, if there is a white background, then the thresholded luminance image will not be an accurate representation of the subtitle region. Also, if there is an object with lots of vertical edges near the subtitle region, the edge image will not be an accurate representation. But using both of them together gives us a high likelihood of accurately detecting the subtitle region.
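The edge, dilation and AND steps can be sketched on plain arrays as follows. This is an illustration under simplifying assumptions: the function names are hypothetical, images are row-major 8-bit buffers, and border pixels are simply marked as having no edge. The real code uses Leptonica for these operations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mark pixels where the Sobel response to vertical edges (the horizontal
 * intensity gradient) has magnitude >= thresh. Border pixels are set to 0. */
void vertical_edges(const uint8_t *gray, int w, int h, int thresh,
                    uint8_t *edges)
{
    assert(gray != NULL && edges != NULL && w > 0 && h > 0 && thresh > 0);

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int gx = 0;
            if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
                gx = -gray[(y - 1) * w + x - 1] + gray[(y - 1) * w + x + 1]
                     - 2 * gray[y * w + x - 1] + 2 * gray[y * w + x + 1]
                     - gray[(y + 1) * w + x - 1] + gray[(y + 1) * w + x + 1];
            }
            edges[y * w + x] = abs(gx) >= thresh;
        }
    }
}

/* Dilate a binary image with a horizontal 1 x (2 * radius + 1) element:
 * a pixel is set if any pixel within radius on the same row is set. */
void dilate_horizontal(const uint8_t *in, int w, int h, int radius,
                       uint8_t *out)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint8_t hit = 0;
            for (int dx = -radius; dx <= radius && !hit; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < w && in[y * w + xx])
                    hit = 1;
            }
            out[y * w + x] = hit;
        }
    }
}

/* Pixel-wise AND of two binary masks of n pixels each. */
void mask_and(const uint8_t *a, const uint8_t *b, int n, uint8_t *out)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] && b[i];
}
```

The candidate subtitle region is then the result of mask_and applied to the thresholded luminance mask and the dilated edge mask.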

Subtitle Recognition / OCR

Once the subtitle region of interest has been detected, the actual text needs to be recognized using OCR (Optical Character Recognition). The intuitive choice to perform this task was the Tesseract OCR library by Google, which has already been used by CCExtractor to recognize DVB subtitles (predominantly used in Europe), which essentially consist of a subtitle bitmap overlaid on the video frame. An OCR engine essentially works by classifying characters and words using models trained on labeled data. In layman’s terms, you show the OCR engine 1000 images of the letter ‘a’, and it learns to recognize the letter ‘a’ the next time it sees it.

For the output image of the previous steps:-

[Image: binarized subtitle region]

Tesseract’s Detected text : “Well, they’re both spectacles,”

All I need to do is pass the binarized image containing only the clean detected subtitle text to a Tesseract API handle and it returns the recognized text to me. Pretty cool, right?

Over the course of the summer, I will have to use the Tesseract API extensively, as compared to just directly making a call to get the recognized subtitle text. I will be using advanced Tesseract features such as the per character and the per word confidence ratings in order to refine and improve my text classification output. A common use case for this would be to root out simple misclassifications such as ‘giape’ instead of ‘grape’ in the recognized text, and to get the overall output to have the highest probability of being correct.

What’s Next?

The next thing that I need to work on is to accurately and optimally determine the time that each subtitle line was present in the video. This will involve seeking the video around the neighborhood of the frame of the originally detected subtitle, and then determining when that particular subtitle line appeared in the video for the first and the last time. A potential problem with optimizing this seems to be the fact that ffmpeg does not allow straightforward seeking to a given frame number or a timestamp, and I will have to manually seek to the desired location from the nearest I-frame (You can understand this problem better by understanding the GOP structure of video frames, explained here).

Here’s looking forward to weeks 4 and 5 and the mid-term evaluation which is on the near horizon. I’ll keep posting my progress, right here. Cheers!