So far, I have been able to successfully extract white colored subtitles at an interval of 25 frames, and the output looks decent. However, I now need to actually create a timed transcript (e.g. an SRT file).
I had originally intended to have two strategies for determining subtitle timing, which I had described in my proposal as:-
- A linear search across the video at a certain interval. Whenever a subtitle gets detected, a binary search will be performed in a window around that frame. Using this, we will detect the exact time of the beginning and the end of the particular subtitle line. This will be of benefit in sequentially processing a file (possible use case of processing a live stream as it is being recorded).
- If the entire video is already available to us, instead of doing a linear search, which will involve a lot of processing overhead for frames in which there are no subtitles, we can directly do a binary search on the entire video to detect subtitle lines. We will get the exact timing of the line as described above, but the overall processing will be faster.
However, neither of these was possible, due to constraints which I had not originally anticipated.
It turns out that binary search was not a viable option at all, because I could not arbitrarily seek to a timestamp in a video using the FFmpeg library. The closest I could get was to seek to the nearest I-frame and then iterate through frames up to the desired timestamp, reconstructing the needed frame along the way. In a binary search, the whole point of which was to optimize the search, this would create massive processing overhead and a lot of redundant frame reconstruction across probes. Instead, a linear search with a specified step size seemed a much better option.
The problem that I described is fairly well documented online.
New Plan – Efficient Linear Search
I decided to use a linear search across the video with a specified step size, controlled by a parameter I called the minimum subtitle duration. I set its default value to 0.5 seconds, which seems a reasonable assumption for most subtitles.
I also needed to convert times to a single format (milliseconds) from the various time bases that different video streams can have. From there, I iterated through the video and sampled frames at regular intervals. A detected subtitle line was considered the same as the last encountered one when its Levenshtein distance from it was very low. This was necessary in order to combine successive detections which were off by a character or two, which happens quite often due to the natural noise present in the video stream. Whenever the detected subtitle line ended, I would encode it with the times seen.
Integrating with the CCExtractor Encoder
It was really easy to integrate the calculated times with the CCExtractor encoder structure (which brought with it full output parameter functionality). All I had to do was call two functions at the appropriate times in my code:-
add_cc_sub_text(ctx->dec_sub, subtitle_text, begin_time, end_time, "", "BURN", CCX_ENC_UTF_8);
That says a lot about how well written and modularized the existing library is.
And oh, I chose the subtitle mode ‘BURN’ myself. It stands for burned-in subtitles xD.
The SRT output for the video at https://www.facebook.com/uefachampionsleague/videos/1255606254485834/ (a Gerard Pique interview) looked as follows:-
1
00:00:00,000 --> 00:00:06,919
Well, they're both spectacles,

2
00:00:06,921 --> 00:00:08,879
NBA basketball as well as football here.

3
00:00:08,881 --> 00:00:12,879
It's a spectacle for the fans, they enjoy it and see

4
00:00:12,881 --> 00:00:13,919

5
00:00:13,921 --> 00:00:19,919
And I think a player like Messi could be compared to Curry

6
00:00:19,921 --> 00:00:21,919
in the USA,

7
00:00:21,921 --> 00:00:24,879
because they have created something special

8
00:00:24,881 --> 00:00:26,999
something not seen before,

9
00:00:27,001 --> 00:00:30,919
and something that makes people excited, ecstatic.
It looks pretty good, and the times are very close to perfect, with some variation at the extremes because those edge frames are not processed. A lower value for the minimum subtitle duration would give even more accurately timed results, at the cost of longer processing time.