[Python] I tried to visualize "Night on the Galactic Railroad" with WordCloud!

Introduction

Environment

Flow

1. Text extraction by scraping
2. Use MeCab to separate words
3. Create the WordCloud

1. Scraping

Aozora Bunko hosts the full text of "Night on the Galactic Railroad", so I extract only the body text from that page.

キャプチャ14.PNG


<div class="main_text">

As you can see, it should be fine to extract the text nested under this div!

import urllib.request
from bs4 import BeautifulSoup

text = []

# URL of the target site
url = 'https://www.aozora.gr.jp/cards/000081/files/456_15050.html'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html,'html.parser')

# Note that the keyword argument is class_, not class (class is a reserved word in Python)
ginga = soup.find_all('div', class_='main_text')

for i in ginga:
    # Take out only the text and add it to the list
    text.append(i.text)

# Save as a text file with the name ginga.txt
with open('ginga.txt', 'w', encoding='utf-8') as file:
    file.writelines(text)
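
To see what "extract the text under this div" means without touching the network, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup. The class name MainTextExtractor and the sample HTML are made up for illustration; only the main_text class name comes from the article.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text nested inside <div class="main_text"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while we are inside the target div
        self.chunks = []  # text fragments found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested div inside the target: track it so we know when we leave
            self.depth += 1
        elif "main_text" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Made-up sample page, not the real Aozora Bunko markup
sample_html = (
    '<div class="header">title</div>'
    '<div class="main_text">first line<br />second <span>line</span></div>'
    '<div class="footer">notes</div>'
)

parser = MainTextExtractor()
parser.feed(sample_html)
extracted = "".join(parser.chunks)
print(extracted)
```

Only the text inside the main_text div is kept; the header and footer are skipped. BeautifulSoup's i.text does the same job in one attribute access, which is why the article uses it.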


Check the text file

I confirmed that the full text was extracted properly! (Screenshot: t.PNG)

2. Use MeCab to separate words

MeCab splits sentences into morphemes (the smallest units of a word that carry meaning) and analyzes them, based on the grammar of the target language and part-of-speech information for each word. Please refer to the site below for details.

[Technical explanation] What is morphological analysis? From MeCab installation procedure to execution example in Python  https://mieruca-ai.com/ai/morphological_analysis_mecab/

import MeCab

# Open the saved text file
with open("ginga.txt", encoding="utf-8") as f:
    text = f.read()

mecab = MeCab.Tagger("-Ochasen")
# Morphological analysis with parseToNode
# Put the analysis result in node
node = mecab.parseToNode(text)

ginga_text = []

# Separate words using part of speech
# (assuming the IPA dictionary, which reports parts of speech in Japanese:
#  動詞 = verb, 副詞 = adverb, 形容詞 = adjective, 名詞 = noun)
while node:
    # Word
    word = node.surface
    # Part of speech
    hinnsi = node.feature.split(",")[0]
    # Specify the words to be added to the array by part of speech
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        ginga_text.append(word)
    else:
        # Check which words have not been added (not necessary)
        print("|{0}| has part of speech {1}, so it is not added".format(word, hinnsi))
        print("-" * 35)
    node = node.next

By changing the contents of the part-of-speech list in the if statement, you can change which words are added to the array.
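
The filter in the loop above relies on node.feature being a comma-separated string whose first field is the part of speech. The sketch below shows that idea on hand-written sample strings, so it runs without MeCab installed. The sample strings assume the IPA dictionary layout and are not real MeCab output; other dictionaries may lay out the fields differently. The names sample_nodes, KEEP, and part_of_speech are hypothetical.

```python
# Hand-written samples in the IPA-dictionary layout (assumption, not real MeCab output):
# part of speech, sub-categories, conjugation type, conjugation form, base form, reading, pronunciation
sample_nodes = [
    ("銀河", "名詞,一般,*,*,*,*,銀河,ギンガ,ギンガ"),
    ("の", "助詞,連体化,*,*,*,*,の,ノ,ノ"),
    ("走る", "動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル"),
]

# Parts of speech to keep: noun, verb, adjective, adverb
KEEP = {"名詞", "動詞", "形容詞", "副詞"}


def part_of_speech(feature):
    """Return the first comma-separated field of a feature string."""
    return feature.split(",")[0]


kept = [surface for surface, feature in sample_nodes
        if part_of_speech(feature) in KEEP]
print(kept)
```

Here the particle の is dropped because its first field is 助詞 (particle), which is not in the list of parts of speech to keep.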

We're almost ready to create the WordCloud!

3. Create Word Cloud

To create a WordCloud, you need to install the module. Install it with pip install wordcloud. That should be enough to use it. If it doesn't work, please look up the error yourself (sorry)!

I added this code below the previous code, in the same file.

from wordcloud import WordCloud

text = ' '.join(ginga_text)
# Path to a font that can render Japanese
fpath = "C:/Windows/Fonts/YuGothM.ttc"
wordcloud = WordCloud(background_color="white",  # white background
                      font_path=fpath, width=800, height=600).generate(text)

# Save as png
wordcloud.to_file("./ginnga.png")
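
WordCloud draws frequent words larger, so it can help to preview the most common words before rendering. The following is a minimal sketch using collections.Counter. The helper top_words and the sample list are made up for illustration, and WordCloud's own counting may differ slightly because it tokenizes the joined text itself.

```python
from collections import Counter


def top_words(words, n=5):
    """Return the n most frequent words with their counts (hypothetical helper)."""
    return Counter(words).most_common(n)


# Made-up sample standing in for ginga_text
sample_words = ["ジョバンニ", "カムパネルラ", "ジョバンニ", "銀河", "ジョバンニ", "銀河"]
top = top_words(sample_words, 2)
print(top)
```

Calling top_words(ginga_text, 20) on the real list would show which words are likely to dominate the image.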

Result

ginnga.png ginnga_2.png

If you filter out fragments that mean little on their own, such as "yo" and "na", before adding words to the array, the result becomes more meaningful.

I'm satisfied with this for now!
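
One way to drop such fragments is a stop-word list applied before the words are joined. This is only a sketch: the helper remove_stopwords and the entries in STOPWORDS are hypothetical, and the list would need tuning against the actual output.

```python
def remove_stopwords(words, stopwords):
    """Return words with every entry found in stopwords removed (hypothetical helper)."""
    stop = set(stopwords)
    return [w for w in words if w not in stop]


# Hypothetical stop words; adjust after inspecting the real result
STOPWORDS = {"よう", "な"}

# Made-up sample standing in for ginga_text
sample_words = ["ジョバンニ", "よう", "銀河", "な", "鉄道"]
filtered = remove_stopwords(sample_words, STOPWORDS)
print(filtered)
```

The WordCloud constructor also accepts a stopwords argument (a set of strings), which may be a simpler option if the filtering only matters for the image.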

Bonus

I want to fit the words into the shape of a portrait of Kenji Miyazawa. Prepared image: miyazawa.png

I changed the part of the code that created the WordCloud earlier.

import numpy as np
from wordcloud import WordCloud
from PIL import Image

text = ' '.join(ginga_text)

# Load the image as an array and use it as the mask for the word cloud
image_path = "./miyazawa.png"
img_color = np.array(Image.open(image_path))
wc = WordCloud(width=800,
               height=800,
               font_path=fpath,
               mask=img_color,
               background_color="white",
               collocations=False).generate(text)

wc.to_file("./wc_miyazawa.png")

Result

wc_miyazawa2.png

I'm very happy that it came out so cleanly!

Reference article

I tried to visualize the lyrics of Kenshi Yonezu with WordCloud
Power BI x Python with Japanese Word Cloud-Python Visual Edition-

Conclusion

I'm glad it turned out more beautiful than I expected. Next, I'd like to try visualizing news articles. Thank you for reading to the end.
