I ran Janome, which I studied in the article below, in my local environment and tried text mining the diary I keep. https://mocobeta.github.io/janome/
- Python 3.7.4
- Janome 0.3.10
- wordcloud 1.7.0
Installing the modules
pip install Janome
pip install wordcloud
If you install from the source archive instead, don't forget to cd into the module folder and run the following (I forgot):
python setup.py install
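To check that the installation worked, a quick smoke test like the following (using the sample sentence from the Janome documentation) should print one token per line together with its part-of-speech details:

from janome.tokenizer import Tokenizer

# tokenize a sample sentence and print each token with its POS information
t = Tokenizer()
for token in t.tokenize('すもももももももものうち'):
    print(token)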
Processing flow
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.charfilter import *
from janome.tokenfilter import *
from wordcloud import WordCloud
#A function that strips unwanted characters and keeps only the specified parts of speech
def create_analyzer():
    tokenizer = Tokenizer()
    # filter that removes ruby annotations such as 《...》
    char_filters = [RegexReplaceCharFilter('《.*?》', '')]
    # keep nouns (名詞), adjectives (形容詞), adjectival nouns (形容動詞) and interjections (感動詞),
    # drop dependent nouns (名詞,非自立) and pronouns (名詞,代名詞)
    token_filters = [POSKeepFilter(['名詞', '形容詞', '形容動詞', '感動詞']),
                     POSStopFilter(['名詞,非自立', '名詞,代名詞']),
                     ExtractAttributeFilter('base_form')]
    # POSKeepFilter keeps the listed parts of speech, POSStopFilter excludes them,
    # and ExtractAttributeFilter returns only the base form of each token.
    # This time I focused on nouns, adjectives, adjectival nouns and interjections.
    return Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)
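As a side check (not part of the script), running the analyzer on a made-up sentence shows what the filters do: the 《...》 part is stripped, particles and auxiliary verbs are dropped, and the surviving words come back in base form.

a = create_analyzer()
# the example sentence is made up; the exact tokens depend on the dictionary,
# but something like ['今日', '楽しい', '一日'] is expected
print(list(a.analyze('今日は《メモ》とても楽しい一日だった。')))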
#A function that splits each sentence into words and writes the result to a text file
def split_text(src, out):
    # Reads the file passed in src, splits each line into words and writes them to out.
    # (User dictionary information could also be applied here as preprocessing.)
    a = create_analyzer()
    with open(src, encoding='utf-8') as f1:
        with open(out, mode='w', encoding='utf-8') as f2:
            for line in f1:
                tokens = list(a.analyze(line))
                f2.write('%s\n' % ' '.join(tokens))
split_text('data/diary.txt', 'words.txt')
with open("words.txt",encoding='utf-8')as f:
text=f.read()
wc = WordCloud(width=1920, height=1080,
               font_path="fonts/ipagp.ttf",  # path to a Japanese font (IPA P Gothic); download it beforehand
               max_words=100,                # maximum number of words shown in the word cloud
               background_color="white",     # background color
               stopwords={"自分", "ない", "いい", "良い"})  # words excluded from the cloud
wc.generate(text)
wc.to_file('data/test_wordcloud.png')
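If you want to see which words actually made it into the cloud, the frequencies computed by generate() are kept on the object as well; a minimal sketch using the words_ attribute:

# print the ten most frequent words used for the cloud (normalized frequencies)
for word, freq in sorted(wc.words_.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(word, round(freq, 3))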
In the very first function, create_analyzer, you can also pass the Tokenizer a CSV user dictionary that registers technical terms, but I omitted that this time; a minimal sketch is shown after this paragraph. Again, you can study this on the page below. https://mocobeta.github.io/janome/
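A minimal sketch, assuming a hypothetical file userdic.csv written in Janome's simplified dictionary format (one "surface form, part of speech, reading" entry per line):

# userdic.csv (hypothetical) contains lines such as:
#   東京スカイツリー,カスタム名詞,トウキョウスカイツリー
tokenizer = Tokenizer('userdic.csv', udic_type='simpledic', udic_enc='utf8')
# passing this tokenizer to Analyzer makes the registered terms come out as single tokens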
The following PNG file is created. In the future, I would like to combine this with information and APIs picked up by web scraping, and read the input from JSON files.