1. Extract the text by scraping
2. Split it into words with MeCab
3. Create the word cloud
The site hosts "Night on the Galactic Railroad", so let's extract only the body text from it.
<div class="main_text">
As you can see, it looks like we just need to extract the text nested under this div!
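To make "extract the text nested under the div" concrete, here is a minimal, standard-library-only sketch (no BeautifulSoup) that collects the text inside a `<div class="main_text">`; the sample HTML string is made up for illustration:

```python
from html.parser import HTMLParser

# Collect the text that appears inside <div class="main_text">,
# tracking nesting depth so inner divs don't end the capture early.
class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # >0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == "main_text":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

sample = '<html><body><div class="main_text">ある夜<br/>汽車は走る</div></body></html>'
p = MainTextExtractor()
p.feed(sample)
print("".join(p.chunks))  # ある夜汽車は走る
```

In practice BeautifulSoup (used below) does this far more conveniently, but the idea is the same: everything under that one div is the story text.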
import urllib.request
from bs4 import BeautifulSoup

text = []
# URL of the target page
url = 'https://www.aozora.gr.jp/cards/000081/files/456_15050.html'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# Note that the keyword argument is class_, not class
ginga = soup.find_all('div', class_='main_text')
for i in ginga:
    # Take out only the text and append it
    text.append(i.text)
# Save it as a text file named ginga.txt
file = open('ginga.txt', 'w', encoding='utf-8')
file.writelines(text)
file.close()
I was able to confirm that the full text was extracted properly!
MeCab decomposes sentences into morphemes (the smallest units of language that carry meaning) and analyzes them using the grammar of the target language and part-of-speech information. See the site below for details.
[Technical explanation] What is morphological analysis? From MeCab installation procedure to execution example in Python https://mieruca-ai.com/ai/morphological_analysis_mecab/
import MeCab

# Open the saved text file
data = open("ginga.txt", "rb").read()
text = data.decode('utf-8')

mecab = MeCab.Tagger("-Ochasen")
# Run the morphological analysis with parseToNode
# and put the result in node
node = mecab.parseToNode(text)

ginga_text = []
# Keep or drop each word according to its part of speech
while node:
    # the word itself
    word = node.surface
    # the part of speech (first field of the feature string)
    hinnsi = node.feature.split(",")[0]
    # Append the word only if its part of speech is in the list
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:  # verb, adverb, adjective, noun
        ginga_text.append(word)
    else:
        # Check which words were not added (not strictly necessary)
        print("|{0}| has part of speech {1}, so it is not added".format(
            node.surface, node.feature.split(",")[0]))
        print("-" * 35)
    node = node.next
By changing the contents of this part-of-speech list (verb, adverb, adjective, noun), you can change which words are added to the array.
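The filtering step above can be sketched in isolation. The two (surface, feature) pairs below are hand-written samples in MeCab's IPAdic feature format, not real program output; the part-of-speech tag is the first comma-separated field:

```python
# Parts of speech to keep: verb, adverb, adjective, noun
KEEP = ["動詞", "副詞", "形容詞", "名詞"]

# Hand-written samples in MeCab's feature format (illustrative only)
samples = [
    ("銀河", "名詞,一般,*,*,*,*,銀河,ギンガ,ギンガ"),
    ("の",   "助詞,連体化,*,*,*,*,の,ノ,ノ"),
    ("走る", "動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル"),
]

# Keep a word only when its first feature field is in KEEP
kept = [surface for surface, feature in samples
        if feature.split(",")[0] in KEEP]
print(kept)  # ['銀河', '走る'] -- the particle の is dropped
```

Shrinking KEEP to just `["名詞"]`, for example, would give a noun-only word cloud.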
Just a little more work and the WordCloud is done!
To create the WordCloud you need to install a module: install it with **pip install wordcloud**. That should be enough to use it; if it doesn't work, please look into it yourself (sorry)!
I added the following below the previous code.
from wordcloud import WordCloud

text = ' '.join(ginga_text)
# Path to a font that can render Japanese
fpath = "C:/Windows/Fonts/YuGothM.ttc"
wordcloud = WordCloud(background_color="white",  # white background
                      font_path=fpath, width=800, height=600).generate(text)
# Save it as a png
wordcloud.to_file("./ginga.png")
[Generated image: ginga.png]
If you also exclude words that don't mean much on their own, such as "yo" and "na", when adding words to the array, the result becomes much easier to interpret.
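One simple way to do that is a stop-word set checked before appending. The word list and the stop words below are made-up stand-ins for the real data; extend the set with whatever fillers show up in your own cloud:

```python
# Hypothetical stop-word list -- tune it to your own text
stop_words = {"よ", "な", "よう", "の", "ん"}

# Stand-in for the real list built by the MeCab loop
ginga_text = ["銀河", "よ", "見る", "な", "汽車"]

# Drop every word that appears in the stop-word set
filtered = [w for w in ginga_text if w not in stop_words]
text = ' '.join(filtered)
print(text)  # 銀河 見る 汽車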
**I'm satisfied with this for now!**
Next I want to lay the words out over an image of Kenji Miyazawa. ↓ The image I prepared
I'll modify the part where the WordCloud was created earlier.
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

text = ' '.join(ginga_text)
imagepath = "./miyazawa.png"
img_color = np.array(Image.open(imagepath))
wc = WordCloud(width=800,
               height=800,
               font_path=fpath,
               mask=img_color,
               background_color="white",
               collocations=False).generate(text)
wc.to_file("./wc_miyazawa.png")
**I'm very happy it came out so nicely!**
References:
I tried to visualize the lyrics of Kenshi Yonezu with WordCloud
Power BI x Python with Japanese Word Cloud -Python Visual Edition-
I'm glad it turned out more beautiful than I expected. Next, I'd like to try visualizing news articles. Thank you for reading to the end.