The composition about taxes, that's your homework, right? I've already done mine. For now, I looked up past award-winning compositions on what appears to be the official homepage that came up when I googled, and reading them... how should I put it, they all give off a sense of **"uniformity"**, or maybe they all hit exactly the same notes. Well, if I run these through morphological analysis I might get some interesting results, so let's try it. Yes.
This homepage carries the award-winning essays from 2017 through the first year of Reiwa (2019), so the plan is to download the whole lot. Let's get the data by scraping first. By the way, **for some reason** only the junior high school essays from Reiwa 1 are provided as PDF files instead of plain HTML, so the processing will be split between those and everything else. Incidentally, that group of essays is clearly much larger than the others, which is presumably why it was turned into PDFs.
As I found when checking, the format of the plain-HTML pages **differs by category**, splitting roughly into **"high school students in 2018 and Reiwa 1"** and **"everything else"**. Why this specification? No idea. Summarized as an easy-to-read table (reconstructed here from the URLs handled below):

| | 2017 (H29) | 2018 (H30) | Reiwa 1 |
|---|---|---|---|
| Junior high | Format A | Format A | PDF |
| High school | Format A | Format B | Format B |
What I wanted to do
This handles plain-HTML format A from the table above. A rough version:
text_dl_A.rb
require 'nokogiri'
require 'open-uri'
require 'fileutils'

# I could have generated these URLs programmatically, but it was less
# hassle to just copy and paste them.
base_urls = ["https://www.nta.go.jp/taxes/kids/sakubun/koko/h29/sakubun.htm",
             "https://www.nta.go.jp/taxes/kids/sakubun/chugaku/h29/sakubun.htm",
             "https://www.nta.go.jp/taxes/kids/sakubun/chugaku/h30/sakubun.htm"]

FileUtils.mkdir_p("texts") # make sure the output folder exists
count = 0
base_urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  doc.xpath("//p[@class='sakubun_texts']").each do |content|
    count += 1
    File.open("texts/#{count}.txt", "w") do |f|
      f.puts content.inner_text
    end
  end
  sleep 1 # be polite to the server
end
This handles plain-HTML format B from the table above, also thrown together roughly:
text_dl_B.rb
require 'nokogiri'
require 'open-uri'

base_urls = ["https://www.nta.go.jp/taxes/kids/sakubun/koko/h30/sakubun.htm",
             "https://www.nta.go.jp/taxes/kids/sakubun/koko/r1/sakubun.htm"]

count = 20 # continue numbering after the files saved by text_dl_A.rb
base_urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  doc.xpath("//p[@class='movePageTop']").each do |ptag|
    count += 1
    File.open("texts/#{count}.txt", "w") do |f|
      # The essay text itself has no class or ID set, so instead I locate
      # it as a sibling, working backwards from the "top of this page" button.
      f.puts ptag.previous.previous.inner_text
    end
  end
  sleep 1
end
That completes the download of the plain-HTML data: 39 files in total.
Next, the Reiwa 1 junior high PDFs. At first I thought I'd write code like this and grab the links roughly, but:
require 'nokogiri'
require 'open-uri'

urls = []
doc = Nokogiri::HTML(URI.open("https://www.nta.go.jp/taxes/kids/sakubun/chugaku/r01/index.htm"))
doc.xpath("//td[@class='mvC left']/a[@target='_blank']").each do |atag|
  urls.push("https://www.nta.go.jp" + atag[:href])
end
puts urls
For some reason it stopped partway through. Did I hit some kind of limit on how much can be scraped? I honestly don't know, and it doesn't seem worth chasing, so I'll fall back to a cruder method: enter
(?<=href=")\/taxes\/kids\/sakubun\/chugaku\/r01\/pdf\/.+\.pdf
in the "Extraction pattern" field. Extract the specified group that matches the pattern with "Extract"Great. Store this in a file with a name like ** urls.txt **.
Then another rough script:
pdf_dl.rb
require 'open-uri'

count = 0
File.foreach("urls.txt") do |line|
  count += 1
  url = "https://www.nta.go.jp" + line.chomp
  URI.open(url) do |dwn_f|
    File.open("pdfs/#{count}.pdf", "wb") do |out|
      out.write(dwn_f.read)
    end
  end
  sleep 1
end
That drops 130 PDF files into the "pdfs" folder. That's a lot, clearly far more than the other categories.
PDF→TXT: there's a gem called pdf-reader, so I'll use it for the conversion.
pdf2txt
require 'pdf-reader'

Dir.glob("pdfs/*.pdf").each do |i|
  first_indent_flag = true
  text = ""
  reader = PDF::Reader.new(i)
  pdf = ""
  reader.pages.each do |page|
    pdf += page.text
  end
  pdf.each_line do |line|
    if first_indent_flag
      # Treat the first line indented by exactly one half-width space as
      # the start of the essay body, and skip everything before it.
      if /^ [^ ].*/ === line.scrub
        first_indent_flag = false
        text += line
      end
    else
      text += line
    end
  end
  # Continue numbering after the 39 HTML-derived files.
  File.open("texts/#{i.match(/[0-9]+/)[0].to_i + 39}.txt", "w") do |f|
    f.puts text
  end
end
Maybe I fiddled a bit more than that, but running this code should leave a total of 169 text files (39 + 130) directly under the texts folder. I suspect some content spilled somewhere along the way, but I couldn't pin down where.
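The spillage could at least be localized with a quick check. This sketch is my own addition, assuming the numbering scheme above (files named 1.txt through 169.txt): count what actually landed in the folder and list any missing numbers.

```ruby
# Sanity check: which of the expected numbered text files are absent?
# (Helper name is hypothetical; expected_total is 39 HTML + 130 PDF = 169.)
def missing_numbers(dir, expected_total)
  (1..expected_total).reject { |n| File.exist?("#{dir}/#{n}.txt") }
end

puts "#{Dir.glob('texts/*.txt').length} files found"
puts "missing: #{missing_numbers('texts', 169).inspect}"
```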
On to morphological analysis. I'll do this part in Python, reusing the code from an article I wrote before.
import MeCab
from wordcloud import WordCloud

t = MeCab.Tagger()
s = []
for i in range(1, 170):  # files 1.txt through 169.txt
    with open(f'texts/{i}.txt', encoding="UTF-8") as txt_file:
        text = txt_file.read()
    nodes = t.parseToNode(text)
    while nodes:
        if nodes.feature[:2] == "名詞":  # keep nouns only
            s.append(nodes.surface)
        nodes = nodes.next
wc = WordCloud(width=720, height=480, background_color="black", stopwords={
    "これ", "ため", "それ", "よう", "こと", "もの"}, font_path=r"C:\Windows\Fonts\HGRGE.TTC")
wc.generate(" ".join(s))
wc.to_file('test.png')
And an image like this comes out. Great! It really does feel like "a composition about taxes"! There's plenty more one could dig into here, but I'm getting tired of it, so I'll wrap up the article. I think you can read quite a lot out of this one image alone, so please try it yourself! See you!
After writing that taxes are necessary for daily life, grounded in something medical (ideally mentioning a grandfather or grandmother at this point), it seems best to close with something like "let's fulfill our duty as citizens and pay our taxes". And don't forget to mention the consumption tax.