These last two weeks were slightly challenging because I had to learn a lot of new things to complete my tasks.

Setting up the HardsubX expansion in CCExtractor

Before I could dive into writing code, I needed to organize and set up the workflow for integrating all the new code I will write over the summer into the original program. This included parsing input parameters for the new type of extraction process, creating and organizing new files in the source tree, and changing the compilation settings and dependencies to match what my pipeline needs.

The pipeline essentially comprises the following entities:

  1. The ‘main’ file
    Handles the parsing of parameters and the initialization of the required data structures.
  2. The Decoder
    Gets the text of the burned-in subtitles by processing the video frames.
  3. The Timer
    Gets the precise timing of each extracted subtitle.
  4. The Encoder
    Converts the output of the decoder and the timer into a standard output format such as a .srt (SubRip) file.
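
To make the Encoder's job concrete, here is a minimal sketch of writing one SubRip cue from a text line and its start and end times. The function name `format_srt_cue` and the millisecond inputs are assumptions made for illustration, not the code in the repository.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes one SubRip cue into buf.
 * Times are given in milliseconds. Returns what snprintf returns,
 * i.e. the number of characters the full cue needs (excluding the NUL). */
int format_srt_cue(char *buf, size_t size, int index,
                   long start_ms, long end_ms, const char *text)
{
    assert(buf != NULL && text != NULL);
    assert(start_ms >= 0 && end_ms >= start_ms);

    /* SubRip layout: cue number, "HH:MM:SS,mmm --> HH:MM:SS,mmm",
     * the text, then an empty line that terminates the cue. */
    return snprintf(buf, size,
        "%d\n%02ld:%02ld:%02ld,%03ld --> %02ld:%02ld:%02ld,%03ld\n%s\n\n",
        index,
        start_ms / 3600000, (start_ms / 60000) % 60,
        (start_ms / 1000) % 60, start_ms % 1000,
        end_ms / 3600000, (end_ms / 60000) % 60,
        (end_ms / 1000) % 60, end_ms % 1000,
        text);
}
```

Calling it with index 1, a start of 83500 ms and an end of 85000 ms produces a cue whose timing line reads `00:01:23,500 --> 00:01:25,000`.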

I created separate files for each of these entities and their helper functions, along with one shared header file which would allow the internal librarization of the files (being able to use functions from one in another), as well as the potential external librarization (being able to be called from the main CCExtractor library).

The new source code files are in the ‘src/lib_ccx’ directory of the project repository and have the ‘hardsubx’ prefix in their names.

Processing a Video Stream in C

The very first step in getting subtitles from video frames is to actually get those frames and store them in a data structure in the context of the program. FFmpeg is a comprehensive and widely used open source media processing library. I am using its C API to process the input video stream.

I needed to store the video stream format and codec information in the program context. Then, out of the many different kinds of streams present in the media file (video, audio, captions, others), I needed to find the ID of the video stream and then process only its packets. Every video stream packet is then decoded and the image content extracted and stored in a Leptonica PIX structure (for compatibility with Tesseract OCR). For the sake of efficiency and avoiding redundancy in frame extraction, I extract frames at an interval of 0.5 seconds, which I have assumed to be the minimum time that a subtitle line is present in the video. This number can be fine-tuned based on the real situation, but some threshold is required in order to avoid the massive processing overheads of reading every single frame in the video.
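
The 0.5 second sampling rule can be sketched independently of the decoding code. The snippet below is only an illustration: the `frame_sampler` type and its functions are hypothetical names, and it assumes each decoded frame carries a presentation timestamp `pts` expressed in a rational time base (seconds = pts * num / den).

```c
#include <stdint.h>

/* Hypothetical sampler state, not part of CCExtractor. */
typedef struct {
    double interval_s; /* sampling interval in seconds (0.5 in this post) */
    double next_s;     /* time at which the next frame is due */
} frame_sampler;

void sampler_init(frame_sampler *s, double interval_s)
{
    s->interval_s = interval_s;
    s->next_s = 0.0;
}

/* Returns 1 if the frame with this timestamp should be processed,
 * 0 if it should be skipped. */
int sampler_accept(frame_sampler *s, int64_t pts, int tb_num, int tb_den)
{
    double t = (double)pts * tb_num / tb_den;

    if (t < s->next_s)
        return 0;

    /* Schedule the next sample; the loop also skips ahead
     * if there is a gap in the stream. */
    while (s->next_s <= t)
        s->next_s += s->interval_s;
    return 1;
}
```

For a 25 fps stream with a time base of 1/25, this keeps roughly every twelfth or thirteenth frame and skips the rest, so only about two frames per second reach the subtitle detector.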

In a nutshell, the process goes like this: FFmpeg gives me the video frames at a certain interval, and then I further process them to detect subtitles.

[Image: example frame]


Detecting Subtitle Regions

The detection of white subtitle regions involves two steps:

  1. Luminance-based thresholding
  2. Vertical edge detection and dilation

The luminance (L) of a pixel represents its ‘whiteness’: the closer the pixel is to pure white, the higher its luminance. When aiming to detect white subtitles, luminance-based thresholding is useful because if we binarize the image so that only regions of high luminance are retained, then all of the white subtitle regions will be retained (possibly along with other white objects and artifacts). This thresholding narrows down the search for the candidate subtitle region.
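
As a rough sketch of this step, the function below binarizes a packed RGB image by luminance. It assumes the Rec. 601 luma weights (0.299, 0.587, 0.114) as the definition of luminance and a caller-chosen threshold; the actual implementation may use a different color space or threshold, and the function name is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* rgb:  packed 8-bit R,G,B triples, n_pixels of them.
 * mask: one byte per pixel, set to 1 where the pixel is bright enough. */
void threshold_luminance(const uint8_t *rgb, size_t n_pixels,
                         uint8_t min_luma, uint8_t *mask)
{
    for (size_t i = 0; i < n_pixels; i++) {
        const uint8_t *p = rgb + 3 * i;
        /* Integer form of the assumed Rec. 601 weights, scaled by 1000. */
        unsigned luma = (299u * p[0] + 587u * p[1] + 114u * p[2]) / 1000u;
        mask[i] = luma >= (unsigned)min_luma;
    }
}
```

With a threshold of 200, pure white and light grey pixels survive, while black and saturated blue pixels are discarded.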

[Image: thresholded luminance image]


The second part of the subtitle detection pipeline is the detection of vertical edges in the image, which is done by a vertical Sobel filter. This method is effective because subtitles have a high density of strong vertical edges in their region, due to the alternating white foreground letters and the non-white background. The edge image is then dilated with a horizontal structuring element in order to get the rough region of the subtitles.
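
The two operations in this step can be sketched on a plain 8-bit grayscale buffer. This is an illustrative version only: the function names, the gradient threshold `min_grad` (assumed to be positive) and the dilation radius are assumptions, and border pixels are simply treated as having no edge.

```c
#include <stdint.h>
#include <stdlib.h>

/* Marks pixels whose horizontal Sobel response |Gx| is at least min_grad.
 * A strong horizontal intensity change corresponds to a vertical edge. */
void sobel_vertical_edges(const uint8_t *gray, int w, int h,
                          int min_grad, uint8_t *edges)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int g = 0;
            if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {
                const uint8_t *r0 = gray + (y - 1) * w + x;
                const uint8_t *r1 = gray + y * w + x;
                const uint8_t *r2 = gray + (y + 1) * w + x;
                /* Kernel rows: [-1 0 1], [-2 0 2], [-1 0 1] */
                g = (r0[1] - r0[-1]) + 2 * (r1[1] - r1[-1]) + (r2[1] - r2[-1]);
            }
            edges[y * w + x] = abs(g) >= min_grad;
        }
    }
}

/* Dilates a 0/1 mask with a 1 x (2*radius+1) horizontal structuring
 * element, so nearby edge pixels on a row merge into one wide blob. */
void dilate_horizontal(const uint8_t *in, int w, int h,
                       int radius, uint8_t *out)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint8_t v = 0;
            for (int dx = -radius; dx <= radius && !v; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < w && in[y * w + xx])
                    v = 1;
            }
            out[y * w + x] = v;
        }
    }
}
```

The horizontal element is what turns the scattered strokes of individual letters into a single band roughly covering the subtitle line.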

[Image: vertical edges]


[Image: after dilation and thresholding]


The final subtitle region is determined by taking a bitwise AND of the two feature images described above, i.e. keeping the regions which are both white and have strong vertical edges. Both of these features are typical of the white letters in a subtitle line. In some cases, one of the steps may not work well on its own. For instance, if there is a white background, the thresholded luminance image will not be an accurate representation of the subtitle region. Similarly, if there is an object with lots of vertical edges near the subtitle region, the edge image will not be an accurate representation. But using both of them together gives us a high likelihood of accurately detecting the subtitle region.
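
A minimal sketch of the combination step, assuming both feature images are stored as 0/1 byte masks of the same size. The bounding box returned here is an extra added for illustration (it is a convenient way to hand a region to the OCR stage), and the `region` type and function name are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result type: bounding box of the combined mask. */
typedef struct {
    int x0, y0, x1, y1; /* inclusive corners, valid only if found != 0 */
    int found;
} region;

/* ANDs the luminance mask with the dilated edge mask into out,
 * and returns the bounding box of the pixels that survive. */
region combine_masks(const uint8_t *luma_mask, const uint8_t *edge_mask,
                     int w, int h, uint8_t *out)
{
    region r = { w, h, -1, -1, 0 };

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int i = y * w + x;
            uint8_t v = luma_mask[i] & edge_mask[i];
            out[i] = v;
            if (v) {
                if (x < r.x0) r.x0 = x;
                if (x > r.x1) r.x1 = x;
                if (y < r.y0) r.y0 = y;
                if (y > r.y1) r.y1 = y;
                r.found = 1;
            }
        }
    }
    return r;
}
```

A pixel that is bright but has no edges nearby (a white wall), or that has edges but is not bright (a dark fence), is zero in one mask and therefore dropped.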

Subtitle Recognition / OCR

Once the subtitle region of interest has been detected, the actual text needs to be recognized using OCR (Optical Character Recognition). The natural choice for this task was the Tesseract OCR library by Google, which CCExtractor already uses to recognize DVB subtitles (predominantly used in Europe), which essentially consist of a subtitle bitmap overlaid on the video frame. An OCR engine essentially works by classifying characters and words using models trained on labeled data. In layman’s terms, you show the OCR engine 1000 images of the letter ‘a’, and it learns to recognize the letter ‘a’ the next time it sees it.

For the output image of the previous steps:

Tesseract’s detected text: “Well, they’re both spectacles,”

All I need to do is pass the binarized image containing only the clean detected subtitle text to a Tesseract API handle and it returns the recognized text to me. Pretty cool, right?

Over the course of the summer, I will have to use the Tesseract API extensively, rather than just making a single call to get the recognized subtitle text. I will be using advanced Tesseract features such as the per-character and per-word confidence ratings to refine and improve my text classification output. A common use case for this would be to root out simple misclassifications such as ‘giape’ instead of ‘grape’ in the recognized text, so that the overall output has the highest probability of being correct.
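
One simple way to use per-word confidences is to flag the words that fall below a cutoff so they can be re-examined or corrected. The sketch below does only that. The `ocr_word` struct, the function name and the cutoff are hypothetical, and it assumes confidences on a 0 to 100 scale; it does not call Tesseract itself.

```c
#include <stddef.h>

/* Hypothetical container for one recognized word and its confidence. */
typedef struct {
    const char *text;
    float confidence; /* assumed scale: 0 (worst) to 100 (best) */
} ocr_word;

/* Stores the indices of words with confidence below min_conf in
 * suspects (which must have room for n entries) and returns how many
 * were found. */
size_t find_suspect_words(const ocr_word *words, size_t n,
                          float min_conf, size_t *suspects)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (words[i].confidence < min_conf)
            suspects[count++] = i;
    }
    return count;
}
```

For a line recognized as ‘a’, ‘giape’, ‘spectacle’ with made-up confidences of 95, 41 and 88 and a cutoff of 60, only the index of ‘giape’ is returned, which is where a dictionary lookup or a second reading from a neighboring frame could be applied.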

What’s Next?

The next thing I need to work on is accurately and efficiently determining the time span during which each subtitle line was present in the video. This will involve seeking around the neighborhood of the frame where the subtitle was originally detected, and then determining when that particular subtitle line appeared in the video for the first and the last time. A potential problem with optimizing this is that FFmpeg does not allow straightforward seeking to an arbitrary frame number or timestamp, so I will have to seek to the nearest I-frame and manually step forward from there to the desired location (this problem is easier to understand once you know the GOP structure of video frames).
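
The seeking problem can be sketched without any media library. Assuming the frame numbers of the I-frames are already known and sorted (in practice they would have to be discovered while reading the stream), the hypothetical helper below finds the I-frame to seek to and how many frames must then be decoded to reach the target.

```c
#include <stddef.h>

/* keyframes: ascending frame numbers of the I-frames (assumed known).
 * On success returns 1 and sets *start to the last I-frame at or before
 * target, and *skip to the number of frames to step forward from there.
 * Returns 0 if no I-frame precedes the target. */
int plan_seek(const long *keyframes, size_t n, long target,
              long *start, long *skip)
{
    size_t lo = 0, hi = n; /* binary search window [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keyframes[mid] <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;

    *start = keyframes[lo - 1];
    *skip = target - *start;
    return 1;
}
```

With I-frames at frames 0, 12, 24 and 36, reaching frame 30 means seeking to frame 24 and decoding 6 frames forward, which is why the cost of a seek depends on the GOP length.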

Here’s looking forward to weeks 4 and 5 and the mid-term evaluation, which is on the near horizon. I’ll keep posting my progress right here. Cheers!

