TV Commercial Classification and Optimization of Visual Recognition Pipeline for News Videos
Organization: Red Hen Lab / CCExtractor
My project was officially hosted under CCExtractor, but my work spanned the interests and scope of both organizations, and I contributed code to both.
Table of Contents:
- Introduction
- TV Commercial Classification
- Visual Recognition Pipeline
- CCExtractor Improvements
- Repository Links
- Known Issues / Future Work
- Future Collaboration
- References / Licenses
Introduction
My project’s contributions have been of three different types:-
- Creating a TV Commercial Classification system
- Optimizing Red Hen’s Visual Recognition Pipeline
- Adding various OCR and international support improvements to CCExtractor
All of these are directly helpful to the processing of the NewsScape dataset.
TV Commercial Classification
Red Hen had an existing visual-feature-based classification and labeling system for news videos (optimizing it was the second part of my project), but TV commercials were not yet handled by the pipeline. I wanted to add a multimodal TV commercial classification system to the pipeline’s current capabilities.
Problem and Dataset Description
The problem at hand is to classify an input advertisement video into its product category, using the following 23 categories:-
- 01_alcoholic-drinks-tobacco
- 02_automotive
- 03_business-equipment-services
- 04_consumer-public-services
- 05_culture-leisure-sport
- 06_fast-food-outlets-restaurants
- 07_health-pharmaceuticals
- 08_household-maintenance-pet-products
- 09_industrial-agriculture
- 10_non-alcoholic-drinks
- 11_publishing-media
- 12_transport-travel-tourism
- 13_apparel-clothing-footwear
- 14_banking
- 15_confectionery-snacks
- 16_cosmetics-beauty-products
- 17_dairy-products-eggs
- 18_grocery-other-foods
- 19_home-electronics-home-appliances
- 20_hygiene-personal-care-products
- 21_internet
- 22_public-awareness
- 23_retail-distribution-rental-companies
The dataset I used is a subset of the Coloribus archive, a regularly updated advertising archive and the largest online collection of creative advertising pieces from around the world. It is a highly structured database containing information about brands, agencies, people involved, awards and other relevant data, combined with an advanced full-text search engine. I downloaded a number of samples from each category and used them to train a convolutional neural network.
Usage Instructions
python process_ad_video.py <path_to_video>
This generates an output of the form “label:probability” for each of the top 5 classes in a list.
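Internally, the classification boils down to extracting a keyframe with FFmpeg and running it through the fine-tuned Caffe model. Below is a minimal sketch of that flow, assuming pycaffe; the prototxt, weights, label file and keyframe names are placeholders, and process_ad_video.py handles these details itself.

```python
import subprocess
import numpy as np
import caffe

# Placeholder file names; process_ad_video.py sets the real paths internally.
MODEL_DEF = 'deploy.prototxt'
MODEL_WEIGHTS = 'ad_classifier.caffemodel'
LABELS = [line.strip() for line in open('categories.txt')]  # the 23 category names

def classify_keyframe(video_path):
    # Grab a single frame from the video with FFmpeg, scaled to the network input size.
    subprocess.call(['ffmpeg', '-y', '-i', video_path, '-vframes', '1',
                     '-vf', 'scale=227:227', 'keyframe.png'])

    net = caffe.Net(MODEL_DEF, MODEL_WEIGHTS, caffe.TEST)
    net.blobs['data'].reshape(1, 3, 227, 227)

    # Standard pycaffe preprocessing: HWC->CHW, [0,1]->[0,255], RGB->BGR.
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))
    transformer.set_raw_scale('data', 255)
    transformer.set_channel_swap('data', (2, 1, 0))

    img = caffe.io.load_image('keyframe.png')
    net.blobs['data'].data[...] = transformer.preprocess('data', img)

    out = net.forward()
    probs = out[net.outputs[0]][0]                 # softmax over the 23 classes
    top5 = np.argsort(probs)[::-1][:5]             # indices of the 5 highest probabilities
    return [(LABELS[i], float(probs[i])) for i in top5]

print(classify_keyframe('ad.mp4'))
```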
Dependencies
- Python (https://www.python.org/downloads/) : The language of the project. The code has been tested with Python 2.7.8 and 2.7.12. It should work with any recent version of Python 2. Python 3 and any other versions are your own adventure.
- Caffe (https://github.com/BVLC/caffe) : Neural network framework. Needs to be compiled with Python support (pycaffe). The code has been tested with Caffe rc3, and should work with the GitHub version.
- FFmpeg (https://github.com/FFmpeg/FFmpeg) : For video processing. The code has been tested with v2.8.2 and v3.1.0, and should work with the GitHub version.
Installation
Simply clone the repository while ensuring all dependencies are correctly installed.
Example Output
An example keyframe along with the probabilities for each class for a given input video is shown below:-
File: kentucky-fried-chickenkfc-chamber-test-360-45192.mp4
Top predicted categories:
- 06_fast-food-outlets-restaurants
- 13_apparel-clothing-footwear
- 05_culture-leisure-sport
- 01_alcoholic-drinks-tobacco
Performance Details
I fine-tuned a standard seven-layer deep neural network based on a modified AlexNet architecture, using the Caffe deep learning framework with fairly standard hyperparameters for such a task. The dataset consisted of ad videos across the 23 categories, and nearly 8000 training samples were supplied to the network for the fine-tuning process.
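For reference, fine-tuning in pycaffe essentially amounts to initializing the modified network from pretrained weights and running the solver. A minimal sketch, with placeholder file names, might look like this:

```python
import caffe

# Minimal fine-tuning sketch; solver.prototxt and the pretrained weights file are placeholders.
caffe.set_mode_gpu()                                        # or caffe.set_mode_cpu()
solver = caffe.SGDSolver('solver.prototxt')                 # standard SGD hyperparameters
solver.net.copy_from('bvlc_reference_caffenet.caffemodel')  # initialize from pretrained AlexNet weights
solver.solve()                                              # run the fine-tuning iterations
```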
The process gives us a best top-1 accuracy of 55.3%, a best top-3 accuracy of 83.2%, and a best top-5 accuracy of 89.75%.
These results are well above chance: a random classifier on the same dataset (top-k chance over 23 classes is k/23) would give a top-1 accuracy of 4.34%, a top-3 accuracy of 13.04%, and a top-5 accuracy of 21.73%.
More Documentation on GitHub
Training details and code are available at the following GitHub repo:-
https://github.com/Abhinav95/tv-ad-classification
Visual Recognition Pipeline
The Visual Recognition Pipeline was developed during last year's GSoC with the aim of extracting useful visual features from news videos using convolutional neural networks (CNNs), in order to establish a richer automated understanding of the news.
A video demo of the kind of information that can be extracted follows below:-
The aim is to generate such useful annotations from the visual modality for the entire NewsScape dataset.
Building on Shruti Gullapuram’s GSoC Project
The majority of my work this summer focused on developing optimization strategies for the visual recognition pipeline built by Shruti Gullapuram during GSoC 2016, in order to reduce the overall processing time for an input news video, which stood at around 2.5 to 3 hours per hour of video at the beginning of my project.
Detailed documentation of what Shruti achieved during last year's GSoC, including the kinds of labels produced, the process followed, and example outputs, can be seen at the following link:-
https://shrutigullapuram.wordpress.com/2016/08/22/gsoc-work-product-submission
The way the pipeline works (shot detection, keyframe extraction, and CNN-based classification and object detection on the keyframes) is shown below:-
Usage Instructions
This section details the installation and the usage of the system.
Dependencies
- Python (https://www.python.org/downloads/) : The language of the project. The code has been tested with Python 2.7.8 and 2.7.12. It should work with any recent version of Python 2. Python 3 and any other versions are your own adventure.
- Caffe (https://github.com/BVLC/caffe) : Neural network framework. Needs to be compiled with Python support (pycaffe). The code has been tested with Caffe rc3, and should work with the GitHub version.
- Intel Caffe [optional] (https://github.com/intel/caffe) : Gives significantly better CPU performance.
- FFmpeg (https://github.com/FFmpeg/FFmpeg) : For video processing. The code has been tested with v2.8.2 and v3.1.0, and should work with the GitHub version.
- PySceneDetect (https://github.com/Breakthrough/PySceneDetect) : For shot detection. The code has been tested with v0.3.5.
- Scikit-Learn (http://scikit-learn.org/stable/install.html) : For various classifiers. The pip installation of scikit-learn should work.
- Darknet (https://pjreddie.com/darknet/) : For YOLO v2. The standard installation by building from source should work.
- SLURM [optional, only for HPC deployment] (https://slurm.schedmd.com/) : Needed for the job manager when processing on HPC.
- Singularity [optional] (http://singularity.lbl.gov/) : Useful for HPC and/or portable installations
Installation
All the required external files and classifier models can be found here:
https://www.dropbox.com/sh/hv811iqnupcusp8/AAA-nn4mYD2LIP2-deK1VUSWa?dl=0
The paths to all external files required by the code can be modified in path_params.py
according to the user’s convenience.
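As an illustration, path_params.py simply collects the locations of these external files in one place. The variable names and paths below are hypothetical; check the actual file for the exact names it expects.

```python
# path_params.py (illustrative; the real file defines the actual variable names and paths)
CAFFE_ROOT = '/home/user/caffe/'      # pycaffe installation
MODEL_DIR = '/home/user/models/'      # classifier models downloaded from the Dropbox link above
DARKNET_DIR = '/home/user/darknet/'   # Darknet installation for YOLO v2
VIDEO_DIR = '/path/to/videos/'        # location of input news videos
```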
Normal Usage
On a CPU, without any additional dependencies, run:-
python ShotClass-01.py <path_to_video>
On a GPU enabled machine (with nvidia-smi), run:-
python ShotClass-02.py <path_to_video>
In both cases, the code will generate output files with the same base name as the video: a .sht file (in the Red Hen piped format) and a .json file (in the JSON Lines format).
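For illustration, JSON Lines output simply contains one JSON object per line, one per annotated shot. The field names below are hypothetical; the actual schema is defined by the pipeline code.

```python
import json

# Hypothetical shot annotation record; the real field names may differ.
shot = {"start": "00:00:05.120", "end": "00:00:09.480", "label": "Studio", "score": 0.91}

# JSON Lines format: append one JSON object per line.
with open("example_video.json", "a") as f:
    f.write(json.dumps(shot) + "\n")
```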
Usage on HPC
You can process news videos on Case HPC in two ways:
- Process a list of videos using the -l flag:
Run ./manager.sh -l <list>.txt
<list>.txt contains one entry per line of the form YYYY-MM-DD_HOUR_NETWORKNAME.mp4 (base names of files only)
- Process a particular day’s worth of news videos using the -d flag:
Run ./manager.sh -d YYYY/MM/DD
Edit the variable VIDEO_DST in manager.sh to change the path of the processed video files.
An alternative way of using the pipeline on HPC is:-
python ShotClass-03.py <path_to_video>
In this usage, GPU support is included and used when available, and the work is logically segmented into SLURM jobs, with separate jobs submitted for different parts of the pipeline in order to use the available resources as effectively as possible.
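A minimal sketch of how such segmented submission might look with SLURM is shown below. The stage script names are placeholders, while the sbatch flags (--parsable, --wrap, --dependency=afterok) are standard SLURM options.

```python
import subprocess

def submit(cmd, depends_on=None):
    """Submit a command as a SLURM job and return its job id (via sbatch --parsable)."""
    args = ['sbatch', '--parsable']
    if depends_on:
        args.append('--dependency=afterok:{0}'.format(depends_on))
    args += ['--wrap', cmd]
    return subprocess.check_output(args).strip()

video = 'path/to/video.mp4'
# Placeholder stage scripts; the real pipeline splits its work along similar lines.
shots = submit('python shot_detection.py {0}'.format(video))
feats = submit('python feature_extraction.py {0}'.format(video), depends_on=shots)
submit('python shot_classification.py {0}'.format(video), depends_on=feats)
```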
GPU Support
Earlier, the pipeline was designed to run as one single job on a CPU compute node, and even if the requested node was a GPU node, the capabilities of the GPU would not be used. I added the capability to use a GPU if available; if no GPU is available (or detected by the code), execution falls back to the default CPU mode.
The changes to make this happen can be seen at:-
https://github.com/gshruti95/news-shot-classification/pull/2
This is especially useful in speeding up the feature extraction and classification steps that involve deep neural networks (namely anything that involves running a Caffe model). GPU execution of these steps speeds up the runtime by a large factor, anywhere between 20 and 200 times depending on the individual computational power of the CPU and GPU in question. On my computer, a nearly 40 times speedup can be observed.
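A minimal sketch of the fallback logic, assuming pycaffe; the nvidia-smi check shown here is illustrative and may differ from the actual implementation in the pull request above:

```python
import subprocess
import caffe

def set_caffe_mode():
    """Use the GPU when nvidia-smi reports one, otherwise fall back to CPU execution."""
    try:
        subprocess.check_output(['nvidia-smi'])
        caffe.set_device(0)   # use the first available GPU
        caffe.set_mode_gpu()
        return 'gpu'
    except (OSError, subprocess.CalledProcessError):
        caffe.set_mode_cpu()
        return 'cpu'
```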
Optimizations for Runtime
A core aim of my project was to cut down the runtime of the system. I briefly explain a few changes I made in order to do this:-
GPU Benchmarks
In an earlier section, I showed how I added GPU support for more efficient processing. This section takes a look at some of the benchmarks I ran.
Running 100 forward iterations 10 times and averaging the times, with a batch size of 10 at test time, the benchmarks for my Nvidia GTX 1050 Ti 4 GB GPU and for an Intel Xeon CPU node look like this:-
| Model | Avg GPU mode time (s) | Avg CPU mode time (s) |
| --- | --- | --- |
| AlexNet | 7.34 | 283.66 |
| ResNet | 49.54 | 1812.31 |
We can clearly see a speedup of roughly 37-39x when running the models on a GPU node as compared to a CPU node.
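A timing loop along these lines can be written in a few lines of pycaffe; the exact benchmark script may differ, and the model file names here are placeholders:

```python
import time
import caffe

caffe.set_mode_gpu()   # switch to caffe.set_mode_cpu() for the CPU benchmark
net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)
net.blobs['data'].reshape(10, 3, 227, 227)   # batch size 10; 227x227 for AlexNet (ResNet uses 224x224)

runs = []
for _ in range(10):                          # repeat 10 times and average
    start = time.time()
    for _ in range(100):                     # 100 forward passes per run
        net.forward()
    runs.append(time.time() - start)
print('average time for 100 iterations: %.2f s' % (sum(runs) / len(runs)))
```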
CPU Performance Increase with Intel Caffe
Case HPC has CPU nodes with Intel Xeon processors, and they heavily outnumber the available GPU nodes on the cluster. Thus, it made sense to attempt to optimize the runtime of the pipeline on a CPU node as well.
I set up the pipeline to work with the Intel-optimized version of Caffe, which led to large efficiency gains on Intel architecture CPU nodes. The speedup factor for training was around 10x, with a substantial speedup for testing as well.
A detailed report showing the compiler level optimizations can be seen at:-
https://software.intel.com/sites/default/files/managed/bd/7d/Caffe_optimized_for_IA.pdf
Further Documentation on Red Hen Website
https://sites.google.com/site/distributedlittleredhen/home/the-cognitive-core-research-topics-in-red-hen/visual-recognition-pipeline
CCExtractor Improvements
I have also been contributing to CCExtractor (I am the maintainer of the OCR portions of the code). I pushed a few improvements that substantially improved caption extraction results for certain kinds of files. I briefly explain them in this section.
Improving French OCR Results
There were a few DVB files whose subtitle bitmaps had a transparent background. We were getting garbage results for these, and it was initially unclear why. On investigating, I realized that the transparent background was the issue: it poses a well-documented problem for Tesseract OCR. To overcome this, I converted the subtitle bitmap into an opaque grayscale image before performing OCR. This instantly improved results.
https://github.com/CCExtractor/ccextractor/pull/759
This was one of those problems that turned out to be a one-line fix, but it involved a large amount of analysis, reading and experimentation to get right. It took me the better part of a week to get this working as it does now, and hopefully it yields near-perfect OCR results for all use cases, including those with transparent DVB subtitles.
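The actual fix lives in CCExtractor's C OCR code, but the idea is easy to illustrate in a few lines of Python with Pillow: composite the transparent subtitle bitmap onto an opaque background and convert it to grayscale before handing it to Tesseract.

```python
from PIL import Image

# Illustration of the preprocessing idea (the real fix is in CCExtractor's C code):
# flatten the transparent DVB subtitle bitmap onto an opaque background, then grayscale it.
sub = Image.open('dvb_subtitle.png').convert('RGBA')
background = Image.new('RGBA', sub.size, (255, 255, 255, 255))   # opaque white canvas
ocr_ready = Image.alpha_composite(background, sub).convert('L')  # opaque grayscale image
ocr_ready.save('dvb_subtitle_ocr_ready.png')                     # ready for Tesseract
```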
Deduplication in Brazilian ISDB
There was an issue with the Brazilian ISDB subtitle decoder which caused broken timestamps (timestamps are the primary key for the Red Hen dataset, and without them the data cannot be part of the searchable archive). There was also a duplication issue with certain ISDB streams that had an incorrectly defined roll-up mode in the stream. I changed the logic for calling the deduplication code in such cases.
The issue conversation and fix can be seen at :-
https://github.com/CCExtractor/ccextractor/issues/739
Fixing end timestamp in transcripts from DVB
There was a bug that caused transcripts generated from any DVB file to have broken end timestamps (lacking the UNIX offset, so timestamps appeared in the year 1970, which is not ideal). I fixed this by adding the delay parameter at the required place.
The changes made can be seen at:-
https://github.com/CCExtractor/ccextractor/pull/755
Repository Links
All code that I wrote is publicly available on GitHub at the following repositories:-
- For the TV Commercial Classification system:-
https://github.com/Abhinav95/tv-ad-classification
- For the latest version of CCExtractor that includes my changes:-
https://github.com/CCExtractor/ccextractor
- For the latest version of the Visual Recognition Pipeline that includes my changes:-
https://github.com/gshruti95/news-shot-classification
Known Issues/Future Work
In this section, I outline a couple of open issues that need some thought before the system works well for certain kinds of captions, and I outline the next logical steps to build on my work.
Tickertape Caption Merging
I had previously worked on extracting text from tickertape-style captions, and the process worked fairly well. However, generating a timed transcript from these was a problem because the OCR system could not reliably recognize the text at the endpoints (the left and right edges) of the tickertape in a particular frame. The resulting garbage text at the terminal points made it difficult to merge the recognized text across successive frames, and this is not a problem that can be solved by a simple Levenshtein distance computation.
I have set up the basic framework for tickertape OCR here:-
https://github.com/CCExtractor/ccextractor/commit/3278b31a8f571f5188d376142b1981a3a99ccff2
However, some thought needs to be given on how to get an accurate timed transcript as the final step of this method.
Deduplication with Incorrect OCR Results
A similar deduplication issue as described above was seen in French DVB recordings of the TF2 channel in which subtitles appeared word by word. The obtained text was fine in most instances but sometimes mangled around the ends which made it very troublesome to merge the text into an accurate timed transcript.
Putting the System into Production
The core aim of this project was to put the visual recognition pipeline into production so that useful visual annotations are generated for the NewsScape dataset. This requires using the available computational resources optimally, which we try to achieve by means of the job manager, the Intel distribution of Caffe, and GPU support.
The system will soon be actively submitting jobs on the Case (and possibly Erlangen) HPC clusters and saving the visual annotations, which will be accessible through the Edge search engine in the future.
Future Collaboration
I have really enjoyed my collaboration with Red Hen, in a direct capacity this year and an indirect capacity last year. It was a great experience meeting everyone in person too at ICMC 2017 at Osnabrück, Germany. I will most certainly carry on working with the organization.
Red Hen is also applying to be a mentoring org for Google Code In this year. I have been a Code In mentor for CCExtractor in the past and I would most definitely enjoy a similar role for Red Hen in the near future as well.
References/Licenses
- Places205-AlexNet model: B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems 27 (NIPS), 2014. http://places.csail.mit.edu/downloadCNN.html
- Reference CaffeNet model: AlexNet trained on ILSVRC 2012, with a minor variation from the version described in "ImageNet Classification with Deep Convolutional Neural Networks" by Krizhevsky et al., NIPS 2012. Model trained by J. Donahue.
- GoogLeNet model: Szegedy et al., Going Deeper with Convolutions, CoRR 2014. http://arxiv.org/abs/1409.4842 (BVLC GoogLeNet model used, trained by S. Guadarrama.)
- YOLOv2 model: Redmon and Farhadi, YOLO9000: Better, Faster, Stronger. https://pjreddie.com/darknet/yolo/
- CWRU HPC: This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. https://sites.google.com/a/case.edu/hpc-upgraded-cluster/
- Coloribus archive: Coloribus is a thoroughly collected and daily updated advertising archive, and the biggest online collection of creative advertising pieces from all over the world. It is a highly structured database containing information about brands, agencies, people involved, awards and other relevant data, combined with an advanced full-text search engine.
- Red Hen Lab NewsScape dataset: This work made use of the NewsScape dataset and the facilities of the Distributed Little Red Hen Lab, co-directed by Francis Steen and Mark Turner. http://redhenlab.org