Here, I will try to extract the characters from the subtitles displayed at the bottom of a political broadcast. Since there is no busy background behind the subtitles, binarization seems likely to work well.
The Google Cloud Vision API can extract the characters and their positions with considerable accuracy, but here I will try other methods.
tesseract-ocr / pyocr
First, try character recognition using tesseract and pyocr.
This is the source image.
Extract the characters and positions with the script below.
import sys
import pyocr
import pyocr.builders
import cv2
from PIL import Image

def imageToText(src):
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]

    # Get word-level text and bounding boxes from tesseract
    dst = tool.image_to_string(
        Image.open(src),
        lang='jpn',
        builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6)
    )
    return dst

if __name__ == '__main__':
    img_path = sys.argv[1]
    out = imageToText(img_path)

    img = cv2.imread(img_path)
    sentence = []
    for d in out:
        sentence.append(d.content)
        # d.position is ((x1, y1), (x2, y2)) for each word box
        cv2.rectangle(img, d.position[0], d.position[1], (0, 0, 255), 2)

    print("".join(sentence).replace("。", "。\n"))

    cv2.imshow("img", img)
    cv2.imwrite("output.png", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
The recognized text came out as follows (OCR misreadings are left as-is):
Article 25 All citizens have the right to a minimum standard of healthy and cultured living.
2 The State must endeavor to improve and promote social welfare, social security and public health in all aspects of life.
(Right to education and agenda]Article 26 All citizens have the right to equal education according to their abilities, as provided for by law.
2 All citizens are obliged to have their children receive general education as required by law.
Compulsory education is free of charge.
[Rights and obligations of work, standards of working conditions and prohibition of child abuse] Article 27 All citizens have the right to work and are obliged to do so.
2 Standards for wages, working hours, rest and other working conditions are stipulated by law.
3 Children must not use this.
Workers' right to organize and collective bargaining] Article 28 The right to collective workers and the right to collective bargaining and other collective actions shall be guaranteed.
Property rights] Article 29 Property rights must not be infringed.
2 The content of property rights shall be stipulated by law so as to conform to the public welfare.
3 Private property may be used for the public with just compensation.
--Character position
In an image containing only text (such as one rendered from Word or HTML), the characters themselves can be extracted, but their exact positions seem difficult to obtain. What I want here is the position in sentence (line) units, but even after adjusting the tesseract_layout=6 parameter, I seem to get positions only in word units.
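One thing that may be worth trying (a sketch I have not verified on this footage): pyocr also provides a LineBoxBuilder, which returns one bounding box per detected text line rather than per word, so it could be closer to the sentence-level positions I want. The file name is a placeholder.

import pyocr
import pyocr.builders
from PIL import Image

tool = pyocr.get_available_tools()[0]
# LineBoxBuilder yields one LineBox per detected line of text
line_boxes = tool.image_to_string(
    Image.open("source.png"),
    lang='jpn',
    builder=pyocr.builders.LineBoxBuilder(tesseract_layout=6)
)
for line in line_boxes:
    # line.position is ((x1, y1), (x2, y2)); line.content is the line's text
    print(line.position, line.content)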
I also tried line extraction with binarization and a Hough transform, but for now I would like to run OCR on an ROI (a cropped part of the image) covering only the region where subtitles are likely to appear.
I wondered if I could extract only the gray subtitles within that region, but they overlap the people in the frame, so it is hard to tell what would work :scream: (a rough sketch of the gray-mask idea is below, followed by the full ROI script).
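This is a minimal sketch of that gray-mask attempt using cv2.inRange; the 160-210 band is a guessed value, not a tuned one, and the file name is a placeholder:

import cv2
import numpy as np

frame = cv2.imread("frame.png")
# Keep only pixels whose B, G, and R values all fall in a near-gray band
lower = np.array([160, 160, 160], np.uint8)
upper = np.array([210, 210, 210], np.uint8)
mask = cv2.inRange(frame, lower, upper)  # 255 where near-gray, 0 elsewhere
subtitles_only = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("gray_only.png", subtitles_only)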
import sys
import cv2
import os
import numpy as np
import pyocr
import pyocr.builders
from PIL import Image, ImageDraw, ImageFont
import time

def process(src):
    # Otsu binarization plus a morphological opening to remove noise,
    # then invert black and white
    kernel = np.ones((3, 3), np.uint8)
    gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
    o_ret, o_dst = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)
    dst = cv2.morphologyEx(o_dst, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_not(dst)

def imageToText(tool, src):
    # Go through a temporary file to hand the OpenCV image to PIL
    tmp_path = "temp.png"
    cv2.imwrite(tmp_path, src)
    dst = tool.image_to_string(
        Image.open(tmp_path),
        lang='jpn',
        builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6)
    )
    sentence = []
    for item in dst:
        sentence.append(item.content)
    return "".join(sentence)

def createTextImage(src, sentence, px, py, color=(8, 8, 8), fsize=28):
    # Draw Japanese text with PIL, since OpenCV's putText cannot render it
    tmp_path = "src_temp.png"
    cv2.imwrite(tmp_path, src)
    img = Image.open(tmp_path)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("./IPAfont00303/ipag.ttf", fsize)
    draw.text((px, py), sentence, fill=color, font=font)
    img.save(tmp_path)
    return cv2.imread(tmp_path)

if __name__ == '__main__':
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]

    cap = cv2.VideoCapture('one_minutes.mp4')
    cap_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    cap_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)

    telop_height = 50
    fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    writer = cv2.VideoWriter('extract_telop_text.mp4', fourcc, fps,
                             (cap_width, cap_height + telop_height))

    start = time.time()
    count = 0
    try:
        while True:
            if not cap.isOpened():
                break
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
            ret, frame = cap.read()
            if frame is None:
                break

            # Gray band appended below the frame to hold the recognized text
            telop = np.zeros((telop_height, cap_width, 3), np.uint8)
            telop[:] = (128, 128, 128)

            # Binarize the frame and run OCR only on the subtitle ROI
            gray_frame = process(frame)
            roi = gray_frame[435:600, :]
            txt = imageToText(tool, roi)

            images = [frame, telop]
            frame = np.concatenate(images, axis=0)

            # Overlay the elapsed time in the bottom-right corner
            font = cv2.FONT_HERSHEY_SIMPLEX
            seconds = round(count / fps, 4)
            cv2.putText(frame, "{:.4f} [sec]".format(seconds),
                        (cap_width - 250, cap_height + telop_height - 10),
                        font,
                        1,
                        (0, 0, 255),
                        2,
                        cv2.LINE_AA)

            writer.write(createTextImage(frame, txt, 20, cap_height + 10))
            count += 1
            print("{}[sec]".format(seconds))
    except cv2.error as e:
        print(e)

    writer.release()
    cap.release()
    print("Done!!! {}[sec]".format(round(time.time() - start, 4)))
--I use PIL instead of OpenCV to draw the Japanese characters, but to pass the data between them I temporarily save the image to a file. Because of that, it took more than 10 minutes to generate the video :sweat_smile: Is there a better way? :disappointed_relieved:
Example)
tmp_path = "src_temp.png"
# Write out the image data used by OpenCV
cv2.imwrite(tmp_path, src)
# Read the data back in with PIL
img = Image.open(tmp_path)
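One way to avoid the file I/O entirely should be to convert the ndarray in memory with Image.fromarray / np.array. A sketch, with src standing in for an OpenCV BGR image as in the example above:

import cv2
import numpy as np
from PIL import Image

src = cv2.imread("frame.png")  # any OpenCV BGR image

# OpenCV (BGR ndarray) -> PIL image, with no temporary file
img_pil = Image.fromarray(cv2.cvtColor(src, cv2.COLOR_BGR2RGB))

# ... draw the Japanese text with ImageDraw here ...

# PIL image -> OpenCV (back to a BGR ndarray)
dst = cv2.cvtColor(np.array(img_pil), cv2.COLOR_RGB2BGR)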
--The font is the IPA font: https://ipafont.ipa.go.jp/IPAfont/IPAfont00303.zip
--The flow before character recognition is as follows.
def process(src):
    kernel = np.ones((3, 3), np.uint8)
    # 1. Convert to grayscale
    gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
    # 2. Binarize with Otsu's method (the threshold is chosen automatically)
    o_ret, o_dst = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)
    # 3. Remove small noise with a morphological opening
    dst = cv2.morphologyEx(o_dst, cv2.MORPH_OPEN, kernel)
    # 4. Invert black and white
    return cv2.bitwise_not(dst)
The cue card (kampe) in the frame interferes with character recognition a little, but I think the text can be read to some extent.
Next, let's try character recognition with the Google Cloud Vision API. I tried the demo at https://cloud.google.com/vision/, and as expected, the accuracy is high.
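As a pointer, here is a minimal sketch of calling the Vision API's text detection from Python (assumes the google-cloud-vision package in its v2+ client style, credentials set via GOOGLE_APPLICATION_CREDENTIALS, and a placeholder file name):

import io
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with io.open("source.png", "rb") as f:
    content = f.read()
# text_detection returns the detected text plus per-block bounding polygons
response = client.text_detection(image=vision.Image(content=content))
for annotation in response.text_annotations:
    print(annotation.description, annotation.bounding_poly)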
--[Python] Read the expiration date with OCR (tesseract-ocr / pyocr) (image → string) [Home IT # 19]
--Convert image data to text with pyOCR in a Mac environment
--Put Japanese characters in an image with Python