1. Extract the text by scraping
2. Split it into words with MeCab
3. Create the word cloud
The site hosts "Night on the Galactic Railroad", so let's extract only the body text from it.
<div class="main_text">
As you can see, it looks like we just need to extract the text nested under this div!
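To make "extract the text nested under the div" concrete, here is a minimal, standard-library-only sketch (no BeautifulSoup) that collects the text inside a `<div class="main_text">`; the sample HTML string is made up for illustration:

```python
from html.parser import HTMLParser

# Collect the text that appears inside <div class="main_text">,
# tracking nesting depth so inner divs don't end the capture early.
class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # >0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == "main_text":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

sample = '<html><body><div class="main_text">ある夜<br/>汽車は走る</div></body></html>'
p = MainTextExtractor()
p.feed(sample)
print("".join(p.chunks))  # ある夜汽車は走る
```

In practice BeautifulSoup (used below) does this far more conveniently, but the idea is the same: everything under that one div is the story text.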
import urllib.request
from bs4 import BeautifulSoup

text = []
# URL of the target page
url = 'https://www.aozora.gr.jp/cards/000081/files/456_15050.html'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# Note that the keyword argument is class_, not class
ginga = soup.find_all('div', class_='main_text')
for i in ginga:
    # Take out only the text and append it
    text.append(i.text)
# Save it as a text file named ginga.txt
file = open('ginga.txt', 'w', encoding='utf-8')
file.writelines(text)
file.close()
I was able to confirm that the full text was extracted properly!
MeCab decomposes sentences into morphemes (the smallest units of language that carry meaning) and analyzes them using the grammar of the target language and part-of-speech information. See the site below for details.
[Technical explanation] What is morphological analysis? From MeCab installation procedure to execution example in Python https://mieruca-ai.com/ai/morphological_analysis_mecab/
import MeCab

# Open the saved text file
data = open("ginga.txt", "rb").read()
text = data.decode('utf-8')

mecab = MeCab.Tagger("-Ochasen")
# Run the morphological analysis with parseToNode
# and put the result in node
node = mecab.parseToNode(text)

ginga_text = []
# Keep or drop each word according to its part of speech
while node:
    # the word itself
    word = node.surface
    # the part of speech (first field of the feature string)
    hinnsi = node.feature.split(",")[0]
    # Append the word only if its part of speech is in the list
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:  # verb, adverb, adjective, noun
        ginga_text.append(word)
    else:
        # Check which words were not added (not strictly necessary)
        print("|{0}| has part of speech {1}, so it is not added".format(
            node.surface, node.feature.split(",")[0]))
        print("-" * 35)
    node = node.next
By changing the contents of this part-of-speech list (verb, adverb, adjective, noun), you can change which words are added to the array.
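The filtering step above can be sketched in isolation. The two (surface, feature) pairs below are hand-written samples in MeCab's IPAdic feature format, not real program output; the part-of-speech tag is the first comma-separated field:

```python
# Parts of speech to keep: verb, adverb, adjective, noun
KEEP = ["動詞", "副詞", "形容詞", "名詞"]

# Hand-written samples in MeCab's feature format (illustrative only)
samples = [
    ("銀河", "名詞,一般,*,*,*,*,銀河,ギンガ,ギンガ"),
    ("の",   "助詞,連体化,*,*,*,*,の,ノ,ノ"),
    ("走る", "動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル"),
]

# Keep a word only when its first feature field is in KEEP
kept = [surface for surface, feature in samples
        if feature.split(",")[0] in KEEP]
print(kept)  # ['銀河', '走る'] -- the particle の is dropped
```

Shrinking KEEP to just `["名詞"]`, for example, would give a noun-only word cloud.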
Just a little more work and the WordCloud is done!
To create the WordCloud you need to install a module: install it with **pip install wordcloud**. That should be enough to use it; if it doesn't work, please look into it yourself (sorry)!
I added the following below the previous code.
from wordcloud import WordCloud

text = ' '.join(ginga_text)
# Path to a font that can render Japanese
fpath = "C:/Windows/Fonts/YuGothM.ttc"
wordcloud = WordCloud(background_color="white",  # white background
                      font_path=fpath, width=800, height=600).generate(text)
# Save it as a png
wordcloud.to_file("./ginga.png")
[Generated image: ginga.png]
If you also exclude words that don't mean much on their own, such as "yo" and "na", when adding words to the array, the result becomes much easier to interpret.
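One simple way to do that is a stop-word set checked before appending. The word list and the stop words below are made-up stand-ins for the real data; extend the set with whatever fillers show up in your own cloud:

```python
# Hypothetical stop-word list -- tune it to your own text
stop_words = {"よ", "な", "よう", "の", "ん"}

# Stand-in for the real list built by the MeCab loop
ginga_text = ["銀河", "よ", "見る", "な", "汽車"]

# Drop every word that appears in the stop-word set
filtered = [w for w in ginga_text if w not in stop_words]
text = ' '.join(filtered)
print(text)  # 銀河 見る 汽車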
**I'm satisfied with this for now!**
Next I want to lay the words out over an image of Kenji Miyazawa. ↓ The image I prepared
I'll modify the part where the WordCloud was created earlier.
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

text = ' '.join(ginga_text)
imagepath = "./miyazawa.png"
img_color = np.array(Image.open(imagepath))
wc = WordCloud(width=800,
               height=800,
               font_path=fpath,
               mask=img_color,
               background_color="white",
               collocations=False).generate(text)
wc.to_file("./wc_miyazawa.png")
**I'm very happy it came out so nicely!**
References:
I tried to visualize the lyrics of Kenshi Yonezu with WordCloud
Power BI x Python with Japanese Word Cloud -Python Visual Edition-
I'm glad it turned out more beautiful than I expected. Next, I'd like to try visualizing news articles. Thank you for reading to the end.