The composition about taxes, that's your homework, right? I've already done mine. For now, I looked up past award-winning compositions on what appears to be the official homepage that came up when I googled, and reading them... how should I put it, they all give off a sense of **"uniformity"**, or maybe they all hit exactly the same notes. Well, if I run these through morphological analysis I might get some interesting results, so let's try it. Yes.
This homepage carries the award-winning essays from 2017 through the first year of Reiwa (2019), so the plan is to download the whole lot. Let's get the data by scraping first. By the way, **for some reason** only the junior high school essays from Reiwa 1 are provided as PDF files instead of plain HTML, so the processing will be split between those and everything else. Incidentally, that group of essays is clearly much larger than the others, which is presumably why it was turned into PDFs.
As I found when checking, the format of the plain-HTML pages **differs by category**, splitting roughly into **"high school students in 2018 and Reiwa 1"** and **"everything else"**. Why this specification? No idea. Summarized as an easy-to-read table (reconstructed here from the URLs handled below):

| | 2017 (H29) | 2018 (H30) | Reiwa 1 |
|---|---|---|---|
| Junior high | Format A | Format A | PDF |
| High school | Format A | Format B | Format B |
What I wanted to do
This handles plain-HTML format A from the table above. A rough version:
text_dl_A.rb
require 'nokogiri'
require 'open-uri'
require 'fileutils'

# I could have generated these URLs programmatically, but it was less
# hassle to just copy and paste them.
base_urls = ["https://www.nta.go.jp/taxes/kids/sakubun/koko/h29/sakubun.htm",
             "https://www.nta.go.jp/taxes/kids/sakubun/chugaku/h29/sakubun.htm",
             "https://www.nta.go.jp/taxes/kids/sakubun/chugaku/h30/sakubun.htm"]

FileUtils.mkdir_p("texts") # make sure the output folder exists
count = 0
base_urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  doc.xpath("//p[@class='sakubun_texts']").each do |content|
    count += 1
    File.open("texts/#{count}.txt", "w") do |f|
      f.puts content.inner_text
    end
  end
  sleep 1 # be polite to the server
end
This handles plain-HTML format B from the table above, also thrown together roughly:
text_dl_B.rb
require 'nokogiri'
require 'open-uri'

base_urls = ["https://www.nta.go.jp/taxes/kids/sakubun/koko/h30/sakubun.htm",
             "https://www.nta.go.jp/taxes/kids/sakubun/koko/r1/sakubun.htm"]

count = 20 # continue numbering after the files saved by text_dl_A.rb
base_urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  doc.xpath("//p[@class='movePageTop']").each do |ptag|
    count += 1
    File.open("texts/#{count}.txt", "w") do |f|
      # The essay text itself has no class or ID set, so instead I locate
      # it as a sibling, working backwards from the "top of this page" button.
      f.puts ptag.previous.previous.inner_text
    end
  end
  sleep 1
end
That completes the download of the plain-HTML data: 39 files in total.
Next, the Reiwa 1 junior high PDFs. At first I thought I'd write code like this and grab the links roughly, but:
require 'nokogiri'
require 'open-uri'

urls = []
doc = Nokogiri::HTML(URI.open("https://www.nta.go.jp/taxes/kids/sakubun/chugaku/r01/index.htm"))
doc.xpath("//td[@class='mvC left']/a[@target='_blank']").each do |atag|
  urls.push("https://www.nta.go.jp" + atag[:href])
end
puts urls
For some reason it stopped partway through. Did I hit some kind of limit on how much can be scraped? I honestly don't know, and it doesn't seem worth chasing, so I'll fall back to a cruder method: enter
(?<=href=")\/taxes\/kids\/sakubun\/chugaku\/r01\/pdf\/.+\.pdf
in the "Extraction pattern" field. Extract the specified group that matches the pattern with "Extract"Great. Store this in a file with a name like ** urls.txt **.
Then another rough script:
pdf_dl.rb
require 'open-uri'

count = 0
File.foreach("urls.txt") do |line|
  count += 1
  url = "https://www.nta.go.jp" + line.chomp
  URI.open(url) do |dwn_f|
    File.open("pdfs/#{count}.pdf", "wb") do |out|
      out.write(dwn_f.read)
    end
  end
  sleep 1
end
That drops 130 PDF files into the "pdfs" folder. That's a lot, clearly far more than the other categories.
PDF→TXT: there's a gem called pdf-reader, so I'll use it for the conversion.
pdf2txt
require 'pdf-reader'

Dir.glob("pdfs/*.pdf").each do |i|
  first_indent_flag = true
  text = ""
  reader = PDF::Reader.new(i)
  pdf = ""
  reader.pages.each do |page|
    pdf += page.text
  end
  pdf.each_line do |line|
    if first_indent_flag
      # Treat the first line indented by exactly one half-width space as
      # the start of the essay body, and skip everything before it.
      if /^ [^ ].*/ === line.scrub
        first_indent_flag = false
        text += line
      end
    else
      text += line
    end
  end
  # Continue numbering after the 39 HTML-derived files.
  File.open("texts/#{i.match(/[0-9]+/)[0].to_i + 39}.txt", "w") do |f|
    f.puts text
  end
end
Maybe I fiddled a bit more than that, but running this code should leave a total of 169 text files (39 + 130) directly under the texts folder. I suspect some content spilled somewhere along the way, but I couldn't pin down where.
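The spillage could at least be localized with a quick check. This sketch is my own addition, assuming the numbering scheme above (files named 1.txt through 169.txt): count what actually landed in the folder and list any missing numbers.

```ruby
# Sanity check: which of the expected numbered text files are absent?
# (Helper name is hypothetical; expected_total is 39 HTML + 130 PDF = 169.)
def missing_numbers(dir, expected_total)
  (1..expected_total).reject { |n| File.exist?("#{dir}/#{n}.txt") }
end

puts "#{Dir.glob('texts/*.txt').length} files found"
puts "missing: #{missing_numbers('texts', 169).inspect}"
```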
On to morphological analysis. I'll do this part in Python, reusing the code from an article I wrote before.
import MeCab
from wordcloud import WordCloud

t = MeCab.Tagger()
s = []
for i in range(1, 170):  # files 1.txt through 169.txt
    with open(f'texts/{i}.txt', encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.feature[:2] == "名詞":  # keep nouns only
            s.append(nodes.surface)
        nodes = nodes.next
wc = WordCloud(width=720, height=480, background_color="black", stopwords={
    "これ", "ため", "それ", "よう", "こと", "もの"}, font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('test.png')
And an image like this comes out. Great! It really does feel like "a composition about taxes"! There's plenty more one could dig into here, but I'm getting tired of it, so I'll wrap up the article. I think you can read quite a lot out of this one image alone, so please try it yourself! See you!
After writing that taxes are necessary for daily life, grounded in something medical (ideally mentioning a grandfather or grandmother at this point), it seems best to close with something like "let's fulfill our duty as citizens and pay our taxes". And don't forget to mention the consumption tax.