Here, I will try to extract the characters from the subtitles displayed at the bottom of a political broadcast. Since there is no busy background behind the subtitles, binarization seems likely to work well.
The Google Cloud Vision API can extract the characters and their positions with considerable accuracy, but here I will try other methods.
tesseract-ocr / pyocr
First, try character recognition using tesseract and pyocr.
This is the source image.
Extract the characters and positions with the script below.
import sys
import pyocr
import pyocr.builders
import cv2
from PIL import Image

def imageToText(src):
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]

    # Get word-level text and bounding boxes from tesseract
    dst = tool.image_to_string(
        Image.open(src),
        lang='jpn',
        builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6)
    )
    return dst

if __name__ == '__main__':
    img_path = sys.argv[1]
    out = imageToText(img_path)

    img = cv2.imread(img_path)
    sentence = []
    for d in out:
        sentence.append(d.content)
        # d.position is ((x1, y1), (x2, y2)) for each word box
        cv2.rectangle(img, d.position[0], d.position[1], (0, 0, 255), 2)

    print("".join(sentence).replace("。", "。\n"))

    cv2.imshow("img", img)
    cv2.imwrite("output.png", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
The recognized text came out as follows (OCR misreadings are left as-is):
Article 25 All citizens have the right to a minimum standard of healthy and cultured living.
2 The State must endeavor to improve and promote social welfare, social security and public health in all aspects of life.
(Right to education and agenda]Article 26 All citizens have the right to equal education according to their abilities, as provided for by law.
2 All citizens are obliged to have their children receive general education as required by law.
Compulsory education is free of charge.
[Rights and obligations of work, standards of working conditions and prohibition of child abuse] Article 27 All citizens have the right to work and are obliged to do so.
2 Standards for wages, working hours, rest and other working conditions are stipulated by law.
3 Children must not use this.
Workers' right to organize and collective bargaining] Article 28 The right to collective workers and the right to collective bargaining and other collective actions shall be guaranteed.
Property rights] Article 29 Property rights must not be infringed.
2 The content of property rights shall be stipulated by law so as to conform to the public welfare.
3 Private property may be used for the public with just compensation.
--Character position
In an image containing only text (such as one rendered from Word or HTML), the characters themselves can be extracted, but their exact positions seem difficult to obtain. What I want here is the position in sentence (line) units, but even after adjusting the tesseract_layout=6 parameter, I seem to get positions only in word units.
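One thing that may be worth trying (a sketch I have not verified on this footage): pyocr also provides a LineBoxBuilder, which returns one bounding box per detected text line rather than per word, so it could be closer to the sentence-level positions I want. The file name is a placeholder.

import pyocr
import pyocr.builders
from PIL import Image

tool = pyocr.get_available_tools()[0]
# LineBoxBuilder yields one LineBox per detected line of text
line_boxes = tool.image_to_string(
    Image.open("source.png"),
    lang='jpn',
    builder=pyocr.builders.LineBoxBuilder(tesseract_layout=6)
)
for line in line_boxes:
    # line.position is ((x1, y1), (x2, y2)); line.content is the line's text
    print(line.position, line.content)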
I also tried line extraction with binarization and a Hough transform, but for now I would like to run OCR on an ROI (a cropped part of the image) covering only the region where subtitles are likely to appear.
I wondered if I could extract only the gray subtitles within that region, but they overlap the people in the frame, so it is hard to tell what would work :scream: (a rough sketch of the gray-mask idea is below, followed by the full ROI script).
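This is a minimal sketch of that gray-mask attempt using cv2.inRange; the 160-210 band is a guessed value, not a tuned one, and the file name is a placeholder:

import cv2
import numpy as np

frame = cv2.imread("frame.png")
# Keep only pixels whose B, G, and R values all fall in a near-gray band
lower = np.array([160, 160, 160], np.uint8)
upper = np.array([210, 210, 210], np.uint8)
mask = cv2.inRange(frame, lower, upper)  # 255 where near-gray, 0 elsewhere
subtitles_only = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("gray_only.png", subtitles_only)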
import sys
import cv2
import os
import numpy as np
import pyocr
import pyocr.builders
from PIL import Image, ImageDraw, ImageFont
import time

def process(src):
    # Otsu binarization plus a morphological opening to remove noise,
    # then invert black and white
    kernel = np.ones((3, 3), np.uint8)
    gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
    o_ret, o_dst = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)
    dst = cv2.morphologyEx(o_dst, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_not(dst)

def imageToText(tool, src):
    # Go through a temporary file to hand the OpenCV image to PIL
    tmp_path = "temp.png"
    cv2.imwrite(tmp_path, src)
    dst = tool.image_to_string(
        Image.open(tmp_path),
        lang='jpn',
        builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6)
    )
    sentence = []
    for item in dst:
        sentence.append(item.content)
    return "".join(sentence)

def createTextImage(src, sentence, px, py, color=(8, 8, 8), fsize=28):
    # Draw Japanese text with PIL, since OpenCV's putText cannot render it
    tmp_path = "src_temp.png"
    cv2.imwrite(tmp_path, src)
    img = Image.open(tmp_path)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("./IPAfont00303/ipag.ttf", fsize)
    draw.text((px, py), sentence, fill=color, font=font)
    img.save(tmp_path)
    return cv2.imread(tmp_path)

if __name__ == '__main__':
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]

    cap = cv2.VideoCapture('one_minutes.mp4')
    cap_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    cap_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)

    telop_height = 50
    fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    writer = cv2.VideoWriter('extract_telop_text.mp4', fourcc, fps,
                             (cap_width, cap_height + telop_height))

    start = time.time()
    count = 0
    try:
        while True:
            if not cap.isOpened():
                break
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
            ret, frame = cap.read()
            if frame is None:
                break

            # Gray band appended below the frame to hold the recognized text
            telop = np.zeros((telop_height, cap_width, 3), np.uint8)
            telop[:] = (128, 128, 128)

            # Binarize the frame and run OCR only on the subtitle ROI
            gray_frame = process(frame)
            roi = gray_frame[435:600, :]
            txt = imageToText(tool, roi)

            images = [frame, telop]
            frame = np.concatenate(images, axis=0)

            # Overlay the elapsed time in the bottom-right corner
            font = cv2.FONT_HERSHEY_SIMPLEX
            seconds = round(count / fps, 4)
            cv2.putText(frame, "{:.4f} [sec]".format(seconds),
                        (cap_width - 250, cap_height + telop_height - 10),
                        font,
                        1,
                        (0, 0, 255),
                        2,
                        cv2.LINE_AA)

            writer.write(createTextImage(frame, txt, 20, cap_height + 10))
            count += 1
            print("{}[sec]".format(seconds))
    except cv2.error as e:
        print(e)

    writer.release()
    cap.release()
    print("Done!!! {}[sec]".format(round(time.time() - start, 4)))
--I use PIL instead of OpenCV to draw the Japanese characters, but to pass the data between them I temporarily save the image to a file. Because of that, it took more than 10 minutes to generate the video :sweat_smile: Is there a better way? :disappointed_relieved:
Example)
tmp_path = "src_temp.png"
# Write out the image data used by OpenCV
cv2.imwrite(tmp_path, src)
# Read the data back in with PIL
img = Image.open(tmp_path)
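One way to avoid the file I/O entirely should be to convert the ndarray in memory with Image.fromarray / np.array. A sketch, with src standing in for an OpenCV BGR image as in the example above:

import cv2
import numpy as np
from PIL import Image

src = cv2.imread("frame.png")  # any OpenCV BGR image

# OpenCV (BGR ndarray) -> PIL image, with no temporary file
img_pil = Image.fromarray(cv2.cvtColor(src, cv2.COLOR_BGR2RGB))

# ... draw the Japanese text with ImageDraw here ...

# PIL image -> OpenCV (back to a BGR ndarray)
dst = cv2.cvtColor(np.array(img_pil), cv2.COLOR_RGB2BGR)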
--The font is the IPA font: https://ipafont.ipa.go.jp/IPAfont/IPAfont00303.zip
--The flow before character recognition is as follows.
def process(src):
    kernel = np.ones((3, 3), np.uint8)
    # 1. Convert to grayscale
    gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
    # 2. Binarize with Otsu's method (the threshold is chosen automatically)
    o_ret, o_dst = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)
    # 3. Remove small noise with a morphological opening
    dst = cv2.morphologyEx(o_dst, cv2.MORPH_OPEN, kernel)
    # 4. Invert black and white
    return cv2.bitwise_not(dst)
The cue card (kampe) in the frame interferes with character recognition a little, but I think the text can be read to some extent.
Next, let's try character recognition with the Google Cloud Vision API. I tried the demo at https://cloud.google.com/vision/, and as expected, the accuracy is high.
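As a pointer, here is a minimal sketch of calling the Vision API's text detection from Python (assumes the google-cloud-vision package in its v2+ client style, credentials set via GOOGLE_APPLICATION_CREDENTIALS, and a placeholder file name):

import io
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with io.open("source.png", "rb") as f:
    content = f.read()
# text_detection returns the detected text plus per-block bounding polygons
response = client.text_detection(image=vision.Image(content=content))
for annotation in response.text_annotations:
    print(annotation.description, annotation.bounding_poly)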
--[Python] Read the expiration date with OCR (tesseract-ocr / pyocr) (image → string) [Home IT # 19]
--Convert image data to text with pyOCR in a Mac environment
--Put Japanese characters in an image with Python