Hi. School has been closed lately because of the coronavirus, so I have endless free days, and I'm killing time by playing with a different technology every day. To be honest, it's a lot of fun. Somewhere between making Discord bots, implementing 2048, playing with esoteric languages, and playing with morphological analysis, I came across something. Yes: the much-talked-about **public comments** on the **Kagawa Prefecture Net Game Addiction Countermeasures Ordinance**, currently published on Google Drive [by the Netorabo editorial department](https://nlab.itmedia.co.jp/nl/articles/2004/25/news026.html). When I found them, I thought:
**This looks like fun to play with.**
The documents were read in with a scanner and saved as PDFs, so they can't be handled as data as they are and first have to be converted to text. That conversion already sounds fun. I haven't touched image processing yet, so I'm likely to learn something new. On top of that, from what I've heard, the data seems to contain some unnatural biases. Analyzing this is bound to be fun. So I decided to play.
First, convert the PDFs to images using pdf2image. The code is copied almost verbatim from this article. Sorry, I don't feel like I could write anything better...
Imaging.py
import pathlib
import pdf2image

pdf_files = pathlib.Path('PDF').glob('*.pdf')
for pdf_file in pdf_files:
    base = pdf_file.stem
    img_dir = pathlib.Path(f'image/{base}')
    img_dir.mkdir(parents=True, exist_ok=True)  # Also creates 'image' and tolerates reruns
    # Render every page as a grayscale image at 200 dpi
    images = pdf2image.convert_from_path(pdf_file, grayscale=True, dpi=200)
    for index, image in enumerate(images):
        image.save(img_dir / f'{index + 1}.png', 'png')
    print(base)  # For checking progress
This takes a while to run, so please wait patiently.
Once it finishes, it looks like this. Lined up like this, it feels like I really have the public comments in my hands.
Use Tesseract OCR. I worship at the computer in the hope that the text will be recognized nicely. It is important to bow as deeply as possible. An offering would be nice, too. Once you feel your worship has been received, let's try recognizing page 14 (chosen arbitrarily) of the January 23 "Agree" submissions.
C:\Users\usr\Documents\Kagawa>tesseract .\image\Agree 0123\14.png .\Character recognition\test -l jpn
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 344
Detected 201 diacritics
It looks like there are a lot of warnings, but that's probably just my imagination. Just my imagination. Now let's compare the input and the output. Here is the input image, and here is the output text.
test.txt
desknefs NEO -Page 171
Parliamentary Secretariat(glkeldprefr kagawa lg jp)
-------- "
―――(Omitted as blank lines continue)―――
Citizen: "Wagawa Prefecture Opinion / Inquiry Page"<hp- adm@pref. kagawa.Idg.]p>-
destination: gikaiGpref.kagawa.Ig.jp
CC :-"
subject:Posting from the freezing / inquiry page
White time:January 23, 2020(Book) 15:16
―――(Omitted as blank lines continue)―――
[Contents of opinions and inquiries]
[Opinion box to the prefectural assembly homepage]- 」
The prefectural assembly will continue to inform you of the status of the assembly in an easy-to-understand manner.
I will. Please let us know your opinions and impressions when you visit the parliamentary website.
I. We will refer to your opinions as valuable voices from everyone.
I will do it.
Please note$---
@We cannot accept petition by e-mail or e-mail about individual members of the Diet.
【(Residence] ---
【E-Maill . ,
[Subject] Opinions on public comments
[Opinions / impressions]
Age axis phone number
I agree with the Net Game Addiction Countermeasures Article.
I'm worried that there are children playing games and smartphones wherever I go
[ADDR]192. 168.7. 21
[DATE]2020/01/23 15:16: 42
[USERAGENT]Mozilla/5.0 (Windows NT 10.0: Win64: x64) AppleWeb
Kit/537.36 (KHTML, like Gecko) Chrome/70.0.3538. 102 faP/53 or 3
6 TOg9. 18362
Hmm... Some parts are quite shaky, but the date, which is what I plan to play with this time, comes through without any problem, so that's good enough for now.
It's PyOCR's turn.
Extract the lines that match the regular expression /^([^0-9\n]*\d){12}[^0-9\n]*$/ (lines containing exactly 12 digits). Digits seem to be recognized fairly accurately, so not many should be missed.
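To see why "exactly 12 digits" works as a filter, here is a small sketch that applies the same pattern to a few lines. The sample lines are illustrative, written in a zero-padded Japanese date format (an assumption based on the 12-digit results), and `date_lines` is a hypothetical helper name.

```python
import re

# Same pattern as in the text: a line with exactly 12 digits.
DATE_LINE = re.compile(r'^([^0-9\n]*\d){12}[^0-9\n]*$', re.MULTILINE)

def date_lines(text):
    """Return every line of text that contains exactly 12 digits."""
    return [m.group() for m in DATE_LINE.finditer(text)]

# Illustrative sample (assumed format, not copied from the real OCR output)
sample = "\n".join([
    "日時:2020年01月23日(木) 11:39 ー ー",  # 12 digits: kept
    "[ADDR]192. 168.7. 21",                  # 9 digits: dropped
    "[DATE]2020/01/23 15:16: 42",            # 14 digits (has seconds): dropped
])

for line in date_lines(sample):
    print(line)
```

Because the seconds push the [DATE] line to 14 digits, only the header-style date line survives.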
The extracted dates are stored in four text files: "Agree.txt", "Opposition.txt", "business person.txt", and "Recommendation.txt".
The code below, written based on this article, should be run while believing that some mysterious power will work a miracle and quadruple your PC's specs. Believe, and run it.
OCR.py
from pathlib import Path
import re

from PIL import Image
import pyocr
import pyocr.builders

count = 0
tool = pyocr.get_available_tools()[0]
folders = [p for p in Path("image").glob("*") if p.is_dir()]  # Get every folder under image
# Maps a category to its output file. Folder names are assumed to start
# with the category name, as in "Agree 0123".
dic = {"Agree": "Agree.txt", "Opposition": "Opposition.txt",
       "business person": "business person.txt", "Recommendation": "Recommendation.txt"}
for name in dic.values():
    open(name, "w").close()  # Initialize the text files once
date_line = re.compile(r'^([^0-9\n]*\d){12}[^0-9\n]*$', re.MULTILINE)
for fol in folders:
    # Judge the file to open from the category at the start of the folder name
    out_name = next(v for k, v in dic.items() if fol.name.startswith(k))
    with open(out_name, "a") as fil:
        for path in fol.glob("*"):
            count += 1
            text = tool.image_to_string(
                Image.open(path),
                lang="jpn",
                builder=pyocr.builders.TextBuilder(tesseract_layout=6)
            )
            match = date_line.search(text)
            if match is not None:  # For documents that span several pages, there may be no date anywhere on the page.
                fil.write(match.group() + "\n")
            print(count)  # For checking progress
By the way, no miracle happened for me, and the execution time was far too long. There may be a way to finish this a little faster.
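One possible way to shorten the run, untested on this data: if the tool in use is pyocr's command-line Tesseract wrapper, each page is recognized by a separate external process, so several pages could be handled at once from a thread pool. The sketch below only shows the pattern. `ocr_pages` and `fake_ocr` are hypothetical names, and `fake_ocr` stands in for the real OCR call.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages(paths, ocr_fn, workers=4):
    """Apply ocr_fn to every path concurrently and keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_fn, paths))

def fake_ocr(path):
    # Stand-in for the real OCR call (assumption: it takes a path, returns text)
    return f"text of {path}"

print(ocr_pages(["1.png", "2.png", "3.png"], fake_ocr))
```

In OCR.py, `ocr_fn` would be a small function wrapping the `tool.image_to_string(...)` call. Whether this actually helps depends on the number of CPU cores available.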
As a result of running this program, for example, the contents of "Agree.txt" look like this.
Agree.txt
Date and time:January 23, 2020(wood) 11:39 ー ー
Date and time:January 23, 2020(wood) 11:49 ー ー
-Time:January 23, 2020(Book) 11:50 .
Date and time:January 23, 2020(wood) 11:55 ---
Date and time:January 23, 2020(wood) 13:49
Date and time:January 23, 2020(Book) 15:16 ---.
.Date and time:January 23, 2020(wood) 15:31
Date and time:January 23, 2020(wood) 15:51 .---
Date and time:January 23, 2020(wood) 15:58 .
Date and time:January 23, 2020(wood) 17:55 . ----
Date and time:January 23, 2020(wood) 20:23 .
Date and time:January 23, 2020(wood) 12:22
Date and time:January 23, 2020(wood) 20:31 -"・
Date and time:January 23, 2020(wood) 13:10 ---.
Date and time:January 23, 2020(wood) 16:27 ] 」
Date and time:January 23, 2020(wood) 17:03
Date and time:January 23, 2020(wood) 18:09 ]---
Date and time:January 23, 2020(wood) 21:41
22812 050 Return presentation IO008 "1-
Date and time:January 24, 2020(Money) 08:49 ー ー
.Date and time:January 24, 2020(Money) 12:40 .
Date and time:January 24, 2020(Money) 13:28
Date and time:January 24, 2020(Money) 13:31
Date and time:January 24, 2020(Money) 13:34 -
Date and time:January 24, 2020(Money) 13:35
.Date and time:January 24, 2020(Money) 14:01 ]-
.Date and time:January 24, 2020(Money) 15:08 ー ー.
.. Date and time: "January 24, 2020(Money) 08:49 .---
Date and time:January 24, 2020(Money) 15:33 ー ー
Date and time:January 24, 2020(Money) 15:34
Date and time:January 24, 2020(Money) 15:37 ・
Date and time:January 24, 2020(Money) 15:44 ・
Date and time:January 24, 2020(Money) 16:03 」 - -
Date and time:January 24, 2020(Money) 16:13 ー ー
-Date and time:January 24, 2020(Money) 16:14
Date and time:January 24, 2020(Money) 16:16 -"-
Date and time:January 24, 2020(Money) 16:39 -
-.At the time of:January 24, 2020(Money) 08:50 ー ー
Date and time:January 24, 2020(Money) 16:47 -
(The following is omitted)
A few "non-dates" seem to be mixed in, but on the whole it looks successful. There were only a handful of "non-dates" overall, so I removed them by hand, which took only a moment.
Left as it is, the data is terribly noisy, so normalize it. The simple approach is to reduce each line to "all the digits in the date, concatenated". The length should then be fixed at 12 characters, so this should be enough to normalize the data.
Normalization.py
import re

for name in ["Agree", "Opposition", "business person", "Recommendation"]:
    with open(name + ".txt") as fil:
        contents = fil.read()
    # Keep only digits and newlines, so each line is reduced to its 12 digits
    match = re.findall(r'[0-9\n]', contents)
    with open(name + "_Normalization.txt", "w") as fil:
        fil.write("".join(match))
Agree_Normalization.txt
202001231139
202001231149
202001231150
202001231155
202001231349
202001231516
202001231531
202001231551
202001231558
202001231755
202001232023
202001231222
202001232031
202001231310
(The following is omitted)
That looks good.
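Since a few non-dates slipped through the 12-digit filter earlier, a quick sanity check on the normalized lines might look like the sketch below. `parse_stamp` is a hypothetical helper, and the second sample is the digit string of the non-date line seen in "Agree.txt".

```python
from datetime import datetime

def parse_stamp(line):
    """Return a datetime for a 12-digit YYYYMMDDhhmm string, or None."""
    s = line.strip()
    if len(s) != 12 or not s.isdigit():
        return None
    try:
        return datetime.strptime(s, "%Y%m%d%H%M")
    except ValueError:
        # 12 digits, but not a real date (for example, month 20)
        return None

for s in ["202001231139", "228120500081", "2020012311"]:
    print(s, parse_stamp(s))
```

Lines for which this returns None are candidates for manual removal.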
Finally, draw a scatter plot. The public comment period was **1/23 to 2/6** (isn't that short?), so for now let's plot the **distribution of the votes in favor** over this period. I referred to the best answer to this question on teratail.
Graph generation.py
import matplotlib.pyplot as plt
from datetime import datetime as dt

x = []
y = []
with open("Agree_Normalization.txt", "r") as fil:
    for line in fil:
        line = line.strip()
        if len(line) != 12:  # Skip blank or malformed lines
            continue
        y.append(dt.strptime(line[4:10], "%m%d%H"))  # Month, day, hour
        x.append(dt.strptime(line[10:12], "%M"))     # Minute
ax = plt.subplot()
ax.scatter(x, y, alpha=0.1, c='red', s=40)
ax.set_xlim([dt.strptime('00', '%M'),
             dt.strptime('59', '%M')])
ax.set_ylim([dt.strptime('01/23', '%m/%d'), dt.strptime('02/06', '%m/%d')])
plt.xticks(rotation=90)
plt.savefig("Graph.png")
Here is the output graph[^1]. **Obviously something is going on.** As mentioned in the footnote, the vertical axis is marked in "month, day, and hour" and the horizontal axis in "minutes". These two clearly dark lines are probably the result of public comments being posted at such a pace that they look continuous even at one-minute resolution. Well, that's interesting.
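To put a number on "continuous even at one-minute resolution", one option is to look for the longest run of consecutive minutes that each contain at least one submission. This is a minimal sketch, assuming the 12-digit lines from the normalization step. `longest_minute_run` is a hypothetical helper, and the plot above was not produced this way.

```python
from datetime import datetime, timedelta

def longest_minute_run(stamps):
    """Return (start, length) of the longest run of consecutive minutes
    in which at least one post was made. stamps are YYYYMMDDhhmm strings."""
    minutes = sorted({datetime.strptime(s, "%Y%m%d%H%M") for s in stamps})
    best_start, best_len = None, 0
    run_start, run_len, prev = None, 0, None
    for m in minutes:
        if prev is not None and m - prev == timedelta(minutes=1):
            run_len += 1                 # Extends the current run
        else:
            run_start, run_len = m, 1    # Starts a new run
        if run_len > best_len:
            best_start, best_len = run_start, run_len
        prev = m
    return best_start, best_len

# The first four lines of Agree_Normalization.txt shown earlier
stamps = ["202001231139", "202001231149", "202001231150", "202001231155"]
print(longest_minute_run(stamps))
```

For these four lines the longest run is the two minutes starting at 11:49. On the full file, a run lasting many minutes would correspond to one of the dark lines.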
That was a lot of fun. I'm stopping here for today because I'm sleepy, but the public comments are still publicly available, so if you have time, I think you should play with them too.
[^1]: I didn't set the axis labels because I was already sleepy, but to explain: the x-axis represents "minutes" (0-59), and the y-axis represents "month and day" in 1-hour increments (from 1/23 00:00 to 2/6 23:00).