HardsubX: Burned-in Subtitle Extraction Subsystem for CCExtractor
My project was to add the capability of extracting burned-in (hard) subtitles from videos to CCExtractor. Until now, CCExtractor extracted caption data only when it was present in specific structures in the stream, and skipped the actual video data (the pixels) completely. However, many videos have hard subtitles burned into the picture. Extracting these is a computer vision problem, which CCExtractor previously could not handle.
CCExtractor can be compiled with HardsubX support as follows:-
This needs to be run from the
The -hardsubx flag must be passed to the ccextractor executable to enable burned-in subtitle extraction. The other options, with a description of each, are as follows:-
-ocr_mode: Set the OCR mode to either frame-wise, word-wise or letter-wise.
-subcolor: Specify the color of the subtitles, which may be one of white, yellow, green, cyan, blue, magenta, red. Alternatively, a custom hue value between 0 and 360 may be supplied, taking reference from a standard hue chart.
For example: -subcolor 270 (for violet)
-min_sub_duration: Specify the minimum duration in seconds that a subtitle line must exist on the screen. Lower values give better timed results, but increase processing time. The default value is 0.5.
For example: -min_sub_duration 1.0 (for a duration of 1 second)
-detect_italics: Specify whether italics are to be detected from the OCR text. Enabling italic detection automatically forces the OCR mode to word-wise.
-conf_thresh: Specify the classifier confidence threshold between 1 and 100. If the output contains a lot of garbage text, experiment with this threshold until you find a value that works for your content.
-whiteness_thresh: For white subtitles only, specify the luminance threshold between 1 and 100. This threshold is very content dependent, and adjusting values may give you better results. Recommended values are in the range 80 to 100. The default value is 95.
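To make the two color options more concrete, here is a minimal Python sketch. It is not CCExtractor's actual C code. It assumes that the named colors correspond to the standard HSV hue angles and that whiteness is measured as Rec. 601 luma scaled to a 0 to 100 range; the text above does not say which luminance measure is really used. The names `resolve_subcolor` and `is_white_pixel` are hypothetical.

```python
# Standard HSV hue angles in degrees for the named colors.
# White has no hue, so it is handled separately via luminance.
NAMED_HUES = {
    "red": 0.0,
    "yellow": 60.0,
    "green": 120.0,
    "cyan": 180.0,
    "blue": 240.0,
    "magenta": 300.0,
}


def resolve_subcolor(value):
    """Turn a -subcolor argument into ("white", None) or ("hue", degrees)."""
    name = value.strip().lower()
    if name == "white":
        return ("white", None)
    if name in NAMED_HUES:
        return ("hue", NAMED_HUES[name])
    hue = float(name)  # raises ValueError for an unknown color name
    if not 0.0 <= hue <= 360.0:
        raise ValueError("hue must be between 0 and 360")
    return ("hue", hue)


def is_white_pixel(r, g, b, whiteness_thresh=95.0):
    """Check an 8-bit RGB pixel against a luminance threshold (0 to 100).

    Assumption: luminance is Rec. 601 luma expressed as a percentage.
    """
    luma = 0.299 * r + 0.587 * g + 0.114 * b
    return 100.0 * luma / 255.0 >= whiteness_thresh
```

In this sketch a light gray pixel such as (230, 230, 230) has a luminance of roughly 90%, so it is rejected at the default threshold of 95 but accepted at 90. That is the kind of content-dependent difference the -whiteness_thresh description refers to.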
A composite example command is as follows:-
ccextractor video.mp4 -hardsubx -subcolor white -detect_italics -whiteness_thresh 90 -conf_thresh 60
00:00:00,000 --> 00:00:06,919
Well, they’re both spectacles,
00:00:06,921 --> 00:00:08,879
NBA basketball as well as football here.
00:00:08,881 --> 00:00:12,879
It’s a spectacle for the fans, they enjoy it and see
00:00:12,881 --> 00:00:13,919
00:00:13,921 --> 00:00:19,919
And I think a player like Messi could be compared to Curry
00:00:19,921 --> 00:00:21,919
in the USA,
00:00:21,921 --> 00:00:24,879
because they have created something special
00:00:24,881 --> 00:00:26,999
something not seen before,
00:00:27,001 --> 00:00:30,919
and something that makes people excited, ecstatic.
00:00:00,000 --> 00:00:08,999
This is one of my favorite places.
00:00:09,001 --> 00:00:18,165
I love that on one hand, if you look far enough,
there are the untouched snowy steppes
00:00:18,167 --> 00:00:24,874
and if you turn around there is
a powerful contrast.
00:00:24,876 --> 00:00:30,332
The large industrial complex,
a city in a snowy desert.
00:00:30,334 --> 00:00:35,749
It’s awe-inspiring that all of this
was created by the hands of men.
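The timing lines in the sample output above use the SRT timestamp layout HH:MM:SS,mmm. As a small, hedged illustration, the Python sketch below converts such a line to milliseconds and checks whether the cue lasts at least as long as a given -min_sub_duration value. The helper names `cue_times_ms` and `meets_min_duration` are hypothetical and are not part of CCExtractor.

```python
import re

# Matches one SRT timestamp: hours, minutes, seconds, milliseconds.
_TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _to_ms(parts):
    """Convert an (hh, mm, ss, mmm) tuple of strings to milliseconds."""
    h, m, s, ms = (int(p) for p in parts)
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def cue_times_ms(timing_line):
    """Return (start_ms, end_ms) for one SRT timing line."""
    stamps = _TIMESTAMP.findall(timing_line)
    if len(stamps) != 2:
        raise ValueError("expected exactly two timestamps")
    return _to_ms(stamps[0]), _to_ms(stamps[1])


def meets_min_duration(timing_line, min_sub_duration=0.5):
    """True if the cue is on screen for at least min_sub_duration seconds."""
    start, end = cue_times_ms(timing_line)
    return (end - start) >= min_sub_duration * 1000
```

Because the sketch only looks for the two timestamps, it does not depend on how the arrow between them is written. It can serve as a quick sanity check on generated output, for instance to spot cues shorter than the configured minimum duration.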
List of new files
hardsubx.c – The ‘main’ file of the subsystem, in which the required structures and context are initialized.
hardsubx.h – The header file which contains definitions that are used across the code of the subsystem.
hardsubx_classifier.c – Handles various levels of subtitle text recognition, italic detection, and confidence of the output text.
hardsubx_decoder.c – Decodes video frames using the FFmpeg library, reads frames into context structures, passes them to the appropriate OCR functions, and encodes correctly timed subtitles.
hardsubx_imgops.c – Handles conversion between color spaces necessary to process subtitles of a certain color.
hardsubx_utility.c – Contains various utility functions used in the subsystem.
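As a rough illustration of the confidence handling attributed to hardsubx_classifier.c and of the -detect_italics rule described earlier, the following Python sketch shows one way such logic could look. It is an assumption-laden sketch and not the subsystem's C implementation: the function names, the (text, confidence) word format and the mode labels are all hypothetical.

```python
def effective_ocr_mode(ocr_mode, detect_italics):
    """Italic detection forces word-wise OCR; otherwise keep the chosen mode."""
    return "word-wise" if detect_italics else ocr_mode


def filter_by_confidence(words, conf_thresh):
    """Keep recognized words whose confidence (0 to 100) meets the threshold.

    `words` is a list of (text, confidence) pairs, a hypothetical format
    standing in for whatever the OCR engine reports per word.
    """
    return " ".join(text for text, conf in words if conf >= conf_thresh)
```

With a threshold of 60, the input `[("Well,", 92.0), ("th3y're", 41.5), ("both", 88.0)]` is reduced to "Well, both". A higher threshold drops more doubtful words at the risk of also dropping correct ones.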
In addition to my main project, I also worked on improving existing CCExtractor features and fixing issues. I worked on the following features/issues:-
1. Reducing memory consumption by 180 MB
2. Case fixing for teletext subtitles
3. Adding color detection for DVB subtitles
4. Fixing a crash with DVB subtitles
5. Adding OCR support for potentially 99 different DVB languages
6. Adding parameters for DVB and OCR language selection
Link to Commits
All my commits to the mainstream master branch can be seen at:-
My changes to the cross-platform GUI can be seen at:-
All the merged pull requests which I have made to the mainstream master branch can be seen at:-
Link to blog posts
All the blog posts which I have written about my project can be seen at:-
Known Issues and Future Work
- Lower-resolution videos (e.g. 360p) do not work well with the current set of parameters. I tried upscaling the video, but without success. This is likely due to quantization artifacts in lower-quality video, which prevent the text from being recognized as reliably as in higher-resolution video.
- The entire system is a rule-based classifier. The current state of the art in text recognition uses advanced techniques such as neural networks and MRFs (Markov random fields). Integrating those into the current C code base would have been very difficult, as most libraries are written in other languages. I therefore chose to stick with the Leptonica C library (already a CCExtractor dependency) and a simple image processing approach.
- Videos whose background hues are similar to the color of the subtitle text are problematic. Relaxing the threshold value in the code caused too much garbage text to be detected, whereas tightening it meant that no text was reliably detected. This issue is highly content-dependent.
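The hue trade-off in the last point can be shown with a small Python sketch. It assumes a simple per-pixel test in HSV space with a hypothetical `tolerance` parameter in degrees. The real subsystem works in C with Leptonica, and its exact thresholding is not described here, so this only illustrates the general idea.

```python
import colorsys


def hue_distance(h1, h2):
    """Smallest angular distance between two hues, in degrees (0 to 180)."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def matches_subtitle_hue(rgb, target_hue, tolerance):
    """True if an 8-bit RGB pixel's hue lies within `tolerance` of the target.

    Simplification: saturation and brightness are ignored, so gray pixels
    (which colorsys reports with hue 0) are not treated specially.
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return hue_distance(h * 360.0, target_hue) <= tolerance
```

For yellow subtitles (hue 60), an orange background pixel such as (255, 200, 0) has a hue of about 47 degrees. A tolerance of 10 rejects it, while a tolerance of 20 accepts it along with the text. When the background sits that close to the subtitle color, no single tolerance separates the two cleanly.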
Future Contribution to CCExtractor
I really enjoyed working with the CCExtractor organization throughout the summer. I will continue to maintain my project in my own time, outside of GSoC. Anyone willing to contribute to the existing code, and HardsubX in particular, should feel free to fork the main repository at https://github.com/CCExtractor/ccextractor and send a patch. You will really like being a part of the community.