Preface

Hi. School has been out for a while thanks to Corona, and with endless free days on my hands I've been killing time by playing with a different technology every day. To be honest, it's a lot of fun. In between making Discord bots, implementing 2048, playing with esoteric languages, and playing with morphological analysis, I found something. Yes, the **public comments** on the **Kagawa Prefecture Net Game Addiction Countermeasures Ordinance**, which the [Netorabo editorial department](https://nlab.itmedia.co.jp/nl/articles/2004/25/news026.html) is currently publishing on Google Drive. When I found them, I thought:

**This looks like fun to play with.**

The submissions were read in with a scanner and saved as PDFs, so they can't be handled as data as-is and first have to be converted to text. That conversion already sounds like fun. I haven't touched this area of image processing yet, so I'm likely to learn something new. On top of that, from what I hear, the data seems to contain some unnatural biases. Analyzing this is bound to be fun. So I decided to play with it.

Environment

Windows10
Python 3.8.1
pdf2image 1.12.1
Pillow 7.1.1
Poppler 0.68.0 (using Poppler for Windows and poppler-data-0.4.9)
tesseract 5.0.0-alpha.20200328
PyOCR 0.7.2
pathlib 1.0.1
re 3.4.1
Matplotlib 3.2.1

First, convert to images

First, convert the PDFs to images using pdf2image. The code is copied almost verbatim from this article. I'm sorry, I don't feel like I could write anything better...

`Imaging.py`


import pathlib
import pdf2image

# Every PDF in the "PDF" folder
pdf_files = pathlib.Path('PDF').glob('*.pdf')

for pdf_file in pdf_files:
    base = pdf_file.stem
    # One output folder per PDF: image/<PDF name>/
    img_dir = pathlib.Path(f'image/{base}')
    img_dir.mkdir(parents=True, exist_ok=True)  # Also creates "image" itself; safe to rerun
    images = pdf2image.convert_from_path(pdf_file, grayscale=True, dpi=200)
    for index, image in enumerate(images):
        image.save(img_dir / f'{index + 1}.png', 'png')  # Pages are numbered from 1
    print(base)  # For checking progress

It takes a while to run, so please wait patiently.

Once it finishes, you get something like this. Lined up like this, it really feels like I have the public comments in my hands.

From image to string

Use Tesseract OCR. I pray to the computer, hoping it will recognize everything nicely. It is important to bow as deeply as possible. An offering would not hurt either. Once you feel your prayers have been heard, let's try recognizing page 14 (chosen arbitrarily) of the "Agree" submissions from January 23rd.

C:\Users\usr\Documents\Kagawa>tesseract .\image\Agree 0123\14.png .\Character recognition\test -l jpn
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 344
Detected 201 diacritics

It looks like it is complaining about a lot of things, but that is surely just my imagination. Just my imagination. Now let's compare the input and the output. The input image is here, and the output text is below.

`test.txt`


desknefs NEO             -Page 171

Parliamentary Secretariat(glkeldprefr kagawa lg jp)

 

-------- "

―――(Omitted as blank lines continue)―――
 
Citizen: "Wagawa Prefecture Opinion / Inquiry Page"<hp- adm@pref. kagawa.Idg.]p>-
destination: gikaiGpref.kagawa.Ig.jp

CC :-"

subject:Posting from the freezing / inquiry page

White time:January 23, 2020(Book) 15:16

―――(Omitted as blank lines continue)―――

[Contents of opinions and inquiries]
[Opinion box to the prefectural assembly homepage]-                                」
The prefectural assembly will continue to inform you of the status of the assembly in an easy-to-understand manner.
I will. Please let us know your opinions and impressions when you visit the parliamentary website.
I. We will refer to your opinions as valuable voices from everyone.
I will do it.
Please note$---

@We cannot accept petition by e-mail or e-mail about individual members of the Diet.

【(Residence] ---
【E-Maill . ,
[Subject] Opinions on public comments

[Opinions / impressions]

Age axis phone number

I agree with the Net Game Addiction Countermeasures Article.
I'm worried that there are children playing games and smartphones wherever I go

 

[ADDR]192. 168.7. 21

[DATE]2020/01/23 15:16: 42
[USERAGENT]Mozilla/5.0 (Windows NT 10.0: Win64: x64) AppleWeb
Kit/537.36 (KHTML, like Gecko) Chrome/70.0.3538. 102 faP/53 or 3
6 TOg9. 18362

Hmmmm... Some spots are very shaky, but the date I plan to play with this time comes through without any problem, so it's fine for now.

Get all at once

It's PyOCR's turn. From each page, extract the string that matches the regular expression /^([^0-9\n]*\d){12}[^0-9\n]*$/ (a line containing exactly 12 digits). The digits seem to be recognized fairly accurately, so not many dates should be missed. The extracted dates are stored in four text files: "Agree", "Opposition", "business person", and "Recommendation". I wrote this code based on this article. Run it while believing that some mysterious power will cause a miracle and quadruple the specs of your PC. Believe, and execute.

`OCR.py`


from pathlib import Path
import re

from PIL import Image
import pyocr
import pyocr.builders

count = 0
tool = pyocr.get_available_tools()[0]
folders = list(Path("image").glob("*"))  # Get the paths of everything in the image folder

# Dictionary used in place of a switch statement: folder-name prefix -> output file.
# Assumes each folder name starts with its category, as in "Agree 0123".
dic = {"Agree": "Agree.txt", "Opposition": "Opposition.txt", "business person": "business person.txt", "Recommendation": "Recommendation.txt"}

for name in dic.values():
    open(name, "w").close()  # Initialize the text files once

for fol in folders:
    # Pick the output file from the start of the folder name
    out_name = next(v for k, v in dic.items() if fol.name.startswith(k))
    with open(out_name, "a") as fil:
        for path in fol.glob("*"):
            count += 1
            text = tool.image_to_string(
                Image.open(path),
                lang="jpn",
                builder=pyocr.builders.TextBuilder(tesseract_layout=6)
            )
            match = re.search(r'^([^0-9\n]*\d){12}[^0-9\n]*$', text, re.MULTILINE)
            if match is not None:  # In documents spanning several pages, some pages have no date at all
                fil.write(match.group() + "\n")
            print(count)  # For checking progress

By the way, no miracle happened for me, and the execution time was far too long. Maybe there is a way to finish this a little faster.
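The regular expression used in `OCR.py` can be tried on its own, without running any OCR. The sketch below is only an illustration: the sample text is made up, and the date line assumes a Japanese-style format with a zero-padded month (guessed from the 12-digit values that come out later), not real OCR output.

```python
import re

# Same pattern as in OCR.py: a line holding exactly 12 digits.
DATE_LINE = re.compile(r'^([^0-9\n]*\d){12}[^0-9\n]*$', re.MULTILINE)

# Made-up page text (an assumption for illustration, not real OCR output).
sample = "\n".join([
    "[ADDR]192. 168.7. 21",              # 9 digits: too few
    "日時:2020年01月23日(木) 15:16 ---",  # 12 digits: matches
    "[DATE]2020/01/23 15:16: 42",        # 14 digits: too many
])

match = DATE_LINE.search(sample)
found = match.group() if match else None
print(found)
```

Only the middle line is picked up, because the pattern has to consume exactly twelve digits between the start and the end of a single line.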

Acquisition result

As a result of running this program, for example, the contents of "Agree.txt" look like this.

`Agree.txt`



Date and time:January 23, 2020(wood) 11:39 ー ー
Date and time:January 23, 2020(wood) 11:49 ー ー
-Time:January 23, 2020(Book) 11:50                              .
Date and time:January 23, 2020(wood) 11:55 ---
Date and time:January 23, 2020(wood) 13:49
Date and time:January 23, 2020(Book) 15:16 ---.
.Date and time:January 23, 2020(wood) 15:31
Date and time:January 23, 2020(wood) 15:51   .---
Date and time:January 23, 2020(wood) 15:58                            .
Date and time:January 23, 2020(wood) 17:55    .                 ----
Date and time:January 23, 2020(wood) 20:23       .
Date and time:January 23, 2020(wood) 12:22
Date and time:January 23, 2020(wood) 20:31      -"・
Date and time:January 23, 2020(wood) 13:10 ---.
Date and time:January 23, 2020(wood) 16:27                            ]      」
Date and time:January 23, 2020(wood) 17:03
Date and time:January 23, 2020(wood) 18:09             ]---
Date and time:January 23, 2020(wood) 21:41
22812 050 Return presentation IO008 "1-
Date and time:January 24, 2020(Money) 08:49 ー ー
.Date and time:January 24, 2020(Money) 12:40                .
Date and time:January 24, 2020(Money) 13:28
Date and time:January 24, 2020(Money) 13:31
Date and time:January 24, 2020(Money) 13:34                    -
Date and time:January 24, 2020(Money) 13:35
.Date and time:January 24, 2020(Money) 14:01    ]-
.Date and time:January 24, 2020(Money) 15:08 ー ー.
.. Date and time: "January 24, 2020(Money) 08:49  .---
Date and time:January 24, 2020(Money) 15:33 ー ー
Date and time:January 24, 2020(Money) 15:34
Date and time:January 24, 2020(Money) 15:37 ・
Date and time:January 24, 2020(Money) 15:44 ・
Date and time:January 24, 2020(Money) 16:03            」      -       -
Date and time:January 24, 2020(Money) 16:13 ー ー
-Date and time:January 24, 2020(Money) 16:14
Date and time:January 24, 2020(Money) 16:16     -"-
Date and time:January 24, 2020(Money) 16:39    -
-.At the time of:January 24, 2020(Money) 08:50 ー ー
Date and time:January 24, 2020(Money) 16:47      -
(The following is omitted)

Some "non-dates" seem to be mixed in, but overall it looks like a success. There were only a few "non-dates" in the whole set, so I removed them by hand, which took only a moment.

Normalization

Left as it is, the data is terribly noisy, so let's normalize it. The simple approach is to reduce every line to "all the digits of the date, concatenated". The number of digits should always be 12, so this should be enough to normalize the data.

`Normalization.py`


import re

for name in ["Agree","Opposition","business person","Recommendation"]:
    with open(name + ".txt") as fil:
        contents = fil.read()
    # Keep only digits and newlines, so each line becomes a 12-digit string
    match = re.findall(r'[0-9\n]', contents)
    with open(name + "_Normalization.txt","w") as fil:
        fil.write("".join(match))

`Agree_Normalization.txt`



202001231139
202001231149
202001231150
202001231155
202001231349
202001231516
202001231531
202001231551
202001231558
202001231755
202001232023
202001231222
202001232031
202001231310
(The following is omitted)

Looking good.
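As a possible alternative to removing the "non-dates" by hand, the normalized lines could be checked mechanically. This is just a sketch: `parse_stamp` is a hypothetical helper that is not part of the scripts above, and the sample list is typed in by hand. The second value is what the stray line `22812 050 Return presentation IO008 "1-` turns into once only its digits are kept.

```python
from datetime import datetime

def parse_stamp(digits):
    """Return a datetime for a 12-digit YYYYMMDDhhmm string, or None if invalid."""
    if len(digits) != 12:
        return None
    try:
        return datetime(int(digits[0:4]), int(digits[4:6]), int(digits[6:8]),
                        int(digits[8:10]), int(digits[10:12]))
    except ValueError:  # Non-numeric text, month 20, hour 25, and so on
        return None

# Hand-typed sample: two real-looking stamps, one stray line, one blank line.
lines = ["202001231139", "228120500081", "", "202001240849"]
valid = [s for s in lines if parse_stamp(s) is not None]
print(valid)
```

The stray value is rejected because its "month" would be 20, and the blank line is rejected by the length check.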

Draw a scatter plot

Finally, draw a scatter plot. The public comment period ran from **1/23 to 2/6** (isn't that short?), so for now let's plot the **distribution of comments in favor** over this period. I relied on the best answer to this question on teratail.

`Graph generation.py`


import matplotlib.pyplot as plt
from datetime import datetime as dt

x = []  # Minute of each comment (horizontal axis)
y = []  # Month, day, and hour of each comment (vertical axis)
with open("Agree_Normalization.txt", "r") as fil:
    for line in fil:
        line = line.strip()
        if len(line) != 12:  # Skip blank or malformed lines
            continue
        y.append(dt.strptime(line[4:10], "%m%d%H"))
        x.append(dt.strptime(line[10:12], "%M"))

ax = plt.subplot()
ax.scatter(x, y, alpha=0.1, c='red', s=40)  # Low alpha, so overlapping points look darker
ax.set_xlim([dt.strptime('00', '%M'),
             dt.strptime('59', '%M')])
ax.set_ylim([dt.strptime('01/23', '%m/%d'), dt.strptime('02/06', '%m/%d')])
plt.xticks(rotation=90)
plt.savefig("Graph.png")

Here is the output graph[^1]. **Something is obviously going on.** As noted in the footnote, the vertical axis is marked in "month, day, and hour" and the horizontal axis in "minutes". These two clearly dark lines are probably the result of public comments being posted so quickly that they appear consecutive even at one-minute resolution. Well, that's interesting.
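One way to put a number on those dark lines: every horizontal row of the plot corresponds to one "month, day, hour" value, the same `line[4:10]` slice used for the y-axis, so counting how often each value occurs shows which rows are crowded. The sketch below uses made-up stamps purely for illustration, and `posts_per_row` is a hypothetical helper name.

```python
from collections import Counter

def posts_per_row(stamps):
    """Count 12-digit YYYYMMDDhhmm stamps per plot row (keyed by MMDDhh)."""
    return Counter(s[4:10] for s in stamps)

# Made-up stamps for illustration only (not taken from the real data).
stamps = ["202001310901", "202001310902", "202001310903", "202001311530"]
counts = posts_per_row(stamps)
print(counts.most_common(1))
```

With the real `Agree_Normalization.txt` lines as input, the rows with the highest counts should be the ones that look darkest in the graph.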

Finally

It was a lot of fun. I'm stopping here for today because I'm sleepy, but the public comments are still publicly available, so if you have time, I think you should play with them too.

Various things that I referred to

https://qiita.com/kikuyan8540/items/35751c573de014df205b
http://pdf-file.nnn2.com/?p=863
https://qiita.com/henjiganai/items/7a5e871f652b32b41a18
http://blog.machine-powers.net/2018/08/02/learning-tesseract-command-utility/#%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB
https://blog.14nigo.net/2018/03/tesseract-ocr.html
https://qiita.com/nabechi6011/items/3a367ca94dbd208efcc7
https://qiita.com/amowwee/items/e63b3610ea750f7dba1b
https://narito.ninja/blog/detail/72/
https://teratail.com/questions/143164
https://qiita.com/Alice1017/items/4ce5be3f46aa34f9f900

[^1]: I didn't set axis labels because I was too sleepy, but to explain: the x-axis represents "minutes" (0-59), and the y-axis represents "month and day" in 1-hour increments (roughly 1/23 00:00 to 2/6 23:00).

I extracted the dates from the Kagawa public comments and drew a graph