In some cases, the detected text may be filled with noise and unwanted artifacts, so the text classifier needed improvement to raise the quality of the detected captions. I set up three different levels of Tesseract subtitle line classifiers and added confidence-based thresholds as parameters that can improve the quality of the OCR results.
The Three Modes – Frame, Word and Letter
There are three different modes at which I used Tesseract to process a particular frame:
- Frame: In this mode, the entire frame is processed at once, and the entire UTF-8 text detected by Tesseract is written to the caption file.
- Word: In this mode, every word detected in the frame is processed individually. Each word can later be thresholded based on its confidence, or filtered on whether it is a dictionary word, an expletive, and so on.
- Letter: In this mode, every letter detected in the frame is processed individually, and can likewise be thresholded. This mode exists mainly because it was possible to provide; for any practical purpose, the first two should serve fine.
I created a parameter called -ocr_mode which allows the user to specify the level at which the OCR will be performed and interpreted. The default is the ‘frame’ mode.
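As a minimal sketch of what the three modes mean, the function below assembles captions from per-word OCR results at each level. The function name and the (text, confidence) data layout are illustrative, not the project's actual code:

```python
# Illustrative sketch of the three -ocr_mode levels.
# `words` mimics word-level Tesseract results for one frame:
# a list of (text, confidence) pairs.

def interpret_ocr(words, ocr_mode="frame"):
    """Assemble caption text from per-word OCR results.

    ocr_mode: 'frame'  -> one caption joining all words,
              'word'   -> each word as its own unit,
              'letter' -> each character as its own unit.
    """
    if ocr_mode == "frame":
        return [" ".join(text for text, _ in words)]
    if ocr_mode == "word":
        return [text for text, _ in words]
    if ocr_mode == "letter":
        return [ch for text, _ in words for ch in text]
    raise ValueError(f"unknown ocr_mode: {ocr_mode!r}")

words = [("floating", 91.0), ("city", 88.5)]
print(interpret_ocr(words))          # ['floating city']
print(interpret_ocr(words, "word"))  # ['floating', 'city']
```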
Tesseract Confidence Ratings
The Tesseract engine supplies confidence ratings along with its OCR predictions, and I chose to use these values to improve the quality of text recognition performed by the system. I created an optional parameter called -conf_thresh which allows the user to set a threshold on the confidence of Tesseract's text classification (default value 0, i.e., all classifications accepted). Only classification results with a confidence above the threshold are processed and written as captions.
Confidence thresholding works in each of the three OCR modes described above: the ‘frame’ mode uses the mean text confidence, the ‘word’ mode the per-word confidence, and the ‘letter’ mode the per-character confidence. Results below the threshold are discarded.
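A small sketch of the thresholding step, again on synthetic (text, confidence) pairs rather than a live Tesseract run; the helper names are hypothetical. Tesseract confidences range 0–100, and the filter keeps results whose confidence meets the threshold, so the default of 0 accepts everything:

```python
# Illustrative sketch of -conf_thresh on word-level results.

def threshold_words(words, conf_thresh=0.0):
    """Keep only words whose confidence meets or exceeds conf_thresh.

    At the default threshold of 0 every classification is kept.
    """
    return [(text, conf) for text, conf in words if conf >= conf_thresh]

def frame_confidence(words):
    """Mean text confidence, as used by the 'frame' mode."""
    return sum(conf for _, conf in words) / len(words)

words = [("The", 95.2), ("sh1p", 41.7), ("is", 92.8)]
print(threshold_words(words, conf_thresh=60))
# [('The', 95.2), ('is', 92.8)]
```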
Another small part of my proposal was detecting whether the subtitles were formatted in italics. I had originally intended to do this by estimating orientation with the Fourier transform, or by looking at the average angle of the longest lines found in the characters via the Hough transform.
However, none of that proved necessary, since the Tesseract API has a call to detect word font attributes. So, whenever italic detection is requested, I set the OCR mode to word-wise and call the Tesseract API, which determines whether each word is italic.
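The actual italic check goes through Tesseract's word font attributes call; the sketch below assumes those per-word italic flags are already in hand and shows how contiguous italic runs could be wrapped in the `<i>…</i>` tags used in the caption file. The function name and input layout are hypothetical:

```python
# Sketch: given per-word (text, is_italic) flags -- as reported by
# Tesseract's word font attribute query -- wrap contiguous italic
# runs in <i>...</i> for the caption output.

def tag_italics(words):
    out, run = [], []
    for text, italic in words:
        if italic:
            run.append(text)
        else:
            if run:  # close the current italic run
                out.append("<i>" + " ".join(run) + "</i>")
                run = []
            out.append(text)
    if run:  # flush a trailing italic run
        out.append("<i>" + " ".join(run) + "</i>")
    return " ".join(out)

print(tag_italics([("a", True), ("floating", True), ("city.", True)]))
# <i>a floating city.</i>
```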
An excerpt from the video at https://www.facebook.com/uniladmag/videos/2282957648393948 (about British warship HMS Bulwark) is:
00:00:00,000 --> 00:00:04,959
<i>The ship is an enormous machine </i>
00:00:04,961 --> 00:00:07,919
<i>one of the complicated machines that Britains ever built </i>
00:00:07,921 --> 00:00:09,879
<i>we make our own water from sea water </i>
00:00:09,881 --> 00:00:10,959
<i>we deal with our own sewage </i>
00:00:10,961 --> 00:00:12,959
<i>we cook our own food </i>
00:00:12,961 --> 00:00:15,879
<i>we’re a floating city. </i>