I ran Janome, which I studied in the article below, in my local environment and tried text mining the diary I keep. https://mocobeta.github.io/janome/
- Python 3.7.4
- Janome 0.3.10
- wordcloud 1.7.0
Installing the modules
pip install Janome
pip install wordcloud
If you install from the source archive instead, don't forget to cd into the module folder and run the following (I forgot):
python setup.py install
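To check that the installation worked, a quick smoke test like the following (using the sample sentence from the Janome documentation) should print one token per line together with its part-of-speech details:

from janome.tokenizer import Tokenizer

# tokenize a sample sentence and print each token with its POS information
t = Tokenizer()
for token in t.tokenize('すもももももももものうち'):
    print(token)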
Processing flow
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import *
from janome.tokenfilter import *
from wordcloud import WordCloud
#A function that strips unwanted characters and keeps only the specified parts of speech
def create_analyzer():
    tokenizer = Tokenizer()
    # filter that removes ruby annotations such as 《...》
    char_filters = [RegexReplaceCharFilter('《.*?》', '')]
    # keep nouns (名詞), adjectives (形容詞), adjectival nouns (形容動詞) and interjections (感動詞),
    # drop dependent nouns (名詞,非自立) and pronouns (名詞,代名詞)
    token_filters = [POSKeepFilter(['名詞', '形容詞', '形容動詞', '感動詞']),
                     POSStopFilter(['名詞,非自立', '名詞,代名詞']),
                     ExtractAttributeFilter('base_form')]
    # POSKeepFilter keeps the listed parts of speech, POSStopFilter excludes them,
    # and ExtractAttributeFilter returns only the base form of each token.
    # This time I focused on nouns, adjectives, adjectival nouns and interjections.
    return Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)
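As a side check (not part of the script), running the analyzer on a made-up sentence shows what the filters do: the 《...》 part is stripped, particles and auxiliary verbs are dropped, and the surviving words come back in base form.

a = create_analyzer()
# the example sentence is made up; the exact tokens depend on the dictionary,
# but something like ['今日', '楽しい', '一日'] is expected
print(list(a.analyze('今日は《メモ》とても楽しい一日だった。')))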
#A function that splits each sentence into words and writes the result to a text file
def split_text(src, out):
    # Reads the file passed in src, splits each line into words and writes them to out.
    # (User dictionary information could also be applied here as preprocessing.)
    a = create_analyzer()
    with open(src, encoding='utf-8') as f1:
        with open(out, mode='w', encoding='utf-8') as f2:
            for line in f1:
                tokens = list(a.analyze(line))
                f2.write('%s\n' % ' '.join(tokens))
split_text('data/diary.txt', 'words.txt')
with open("words.txt",encoding='utf-8')as f:
text=f.read()
wc = WordCloud(width=1920, height=1080,
               font_path="fonts/ipagp.ttf",  # path to a Japanese font (IPA P Gothic); download it beforehand
               max_words=100,                # maximum number of words shown in the word cloud
               background_color="white",     # background color
               stopwords={"自分", "ない", "いい", "良い"})  # words excluded from the cloud
wc.generate(text)
wc.to_file('data/test_wordcloud.png')
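If you want to see which words actually made it into the cloud, the frequencies computed by generate() are kept on the object as well; a minimal sketch using the words_ attribute:

# print the ten most frequent words used for the cloud (normalized frequencies)
for word, freq in sorted(wc.words_.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(word, round(freq, 3))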
In the very first function, create_analyzer, you can also pass the Tokenizer a CSV user dictionary that registers technical terms, but I omitted that this time; a minimal sketch is shown after this paragraph. Again, you can study this on the page below. https://mocobeta.github.io/janome/
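A minimal sketch, assuming a hypothetical file userdic.csv written in Janome's simplified dictionary format (one "surface form, part of speech, reading" entry per line):

# userdic.csv (hypothetical) contains lines such as:
#   東京スカイツリー,カスタム名詞,トウキョウスカイツリー
tokenizer = Tokenizer('userdic.csv', udic_type='simpledic', udic_enc='utf8')
# passing this tokenizer to Analyzer makes the registered terms come out as single tokens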
The following PNG file is created. In the future, I would like to combine this with information and APIs picked up by web scraping, and read the input from JSON files.