I tried morphological analysis and vectorization of words

Try using Word2vec

Install gensim to use word2vec.
Install janome for morphological analysis of the Japanese text.

pip install gensim
pip install janome

Code for training word2vec on an Aozora Bunko text

#Import required libraries

from janome.tokenizer import Tokenizer
from gensim.models import word2vec
import re

#Open the txt file in binary mode and read its contents
binarydata = open("kazeno_matasaburo.txt", "rb").read()

#For reference, I printed the type at each step to check
binarydata = open("kazeno_matasaburo.txt", "rb")
print(type(binarydata))

Execution result: <class '_io.BufferedReader'>

binarydata = open("kazeno_matasaburo.txt", "rb").read()
print(type(binarydata))

Execution result: <class 'bytes'>

#Decode the bytes into a str (the file is encoded in Shift_JIS)
text = binarydata.decode('shift_jis')
#Remove the header before the body and the bibliographic footer that starts with 底本：
text = re.split(r'\-{5,}',text)[2]
text = re.split(r'底本：',text)[0]
text = text.strip()
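
To make the decoding and trimming steps concrete, here is a small self-contained sketch that runs on a made-up sample instead of the real file. The sample contents, the `extract_body` helper name, and the assumption that the body sits after two dashed separator lines and before a footer beginning with 底本： are all illustrative, so check them against the file you actually downloaded.

```python
import re

# Hypothetical miniature of an Aozora Bunko file: a title block, a notes block
# fenced by two dashed lines, the body, and a footer that starts with 底本：.
sample = (
    "風の又三郎\r\n宮沢賢治\r\n\r\n"
    "-------------------------------------------------------\r\n"
    "【テキスト中に現れる記号について】\r\n"
    "-------------------------------------------------------\r\n"
    "\r\nどっどど　どどうど\r\n青いくるみも吹きとばせ\r\n\r\n"
    "底本：（書誌情報）\r\n"
)
raw = sample.encode("shift_jis")  # stands in for open(..., "rb").read()


def extract_body(data: bytes) -> str:
    """Decode Shift_JIS bytes and keep only the body text."""
    text = data.decode("shift_jis")
    # Splitting on runs of 5+ hyphens gives [title block, notes, body + footer].
    text = re.split(r"-{5,}", text)[2]
    # Everything from the footer marker onward is dropped.
    text = re.split(r"底本：", text)[0]
    return text.strip()


body = extract_body(raw)
print(body.split("\r\n"))
```

If a file has a different number of dashed separators, the index `[2]` would pick the wrong piece, so it is worth printing the split result once before relying on it.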

#Perform morphological analysis
t = Tokenizer()
results = []
lines = text.split("\r\n")  #Split into lines

for line in lines:
    s = line
    s = s.replace('｜','')  #Remove the ruby start marker
    s = re.sub(r'《.+?》','',s)  #Remove ruby readings
    s = re.sub(r'［＃.+?］','',s)  #Remove editor annotations
    tokens = t.tokenize(s)  #Holds the analysis result
    r = []
    #Take the tokens out one by one; each token has base_form and surface attributes
    for token in tokens:
        if token.base_form == "*":
            w = token.surface
        else:
            w = token.base_form
        ps = token.part_of_speech
        hinshi = ps.split(',')[0]
        #janome reports part-of-speech names in Japanese: noun, adjective, verb, symbol
        if hinshi in ['名詞','形容詞','動詞','記号']:
            r.append(w)
    rl = ("　".join(r)).strip()
    results.append(rl)
    print(rl)
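
The three cleanup lines at the top of the loop are easier to follow on a single example. The sketch below applies the same substitutions to one invented line; the `strip_aozora_markup` name and the sample sentence are assumptions for illustration, and it assumes the ruby start marker is the full-width ｜ used in Aozora Bunko files.

```python
import re


def strip_aozora_markup(line: str) -> str:
    """Remove Aozora Bunko ruby and annotation markup from one line."""
    line = line.replace("｜", "")          # ruby start marker
    line = re.sub(r"《.+?》", "", line)     # ruby readings such as 《またさぶろう》
    line = re.sub(r"［＃.+?］", "", line)   # editor annotations such as ［＃３字下げ］
    return line


sample = "［＃３字下げ］｜又三郎《またさぶろう》が来た"
cleaned = strip_aozora_markup(sample)
print(cleaned)
```

The non-greedy `.+?` matters here: a greedy `.+` would delete everything between the first 《 and the last 》 on a line that contains several readings.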

#Write the tokenized lines to a file (the file is created when it is opened)
wakachigaki_file = "matasaburo.wakati"
with open(wakachigaki_file,'w', encoding='utf-8') as fp:
    fp.write('\n'.join(results))
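
The file written here holds one sentence per line with the words separated by whitespace, which is the layout `word2vec.LineSentence` reads. As a rough illustration that does not need gensim, the sketch below writes a tiny made-up list the same way and reads it back into lists of words; the sample words and the temporary file are assumptions for the demo only.

```python
import os
import tempfile

# Invented stand-in for the `results` list built by the tokenizer loop.
results = ["風 又三郎 来る", "青い くるみ 吹く"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.wakati")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("\n".join(results))
    # Reading back: one sentence per line, split on whitespace into words.
    with open(path, encoding="utf-8") as fp:
        sentences = [line.split() for line in fp]

print(sentences)
```

Python's `str.split()` with no argument also treats the full-width space as whitespace, so lines joined with either kind of space come back as the same word lists.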

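#Train the model
data = word2vec.LineSentence(wakachigaki_file)
#gensim 4.x names the dimension parameter vector_size; 3.x and earlier call it size
model = word2vec.Word2Vec(data,vector_size=200,window=10,hs=1,min_count=2,sg=1)
model.save('matasaburo.model')

#Try the model: look up the words most similar to 学校 (school)
#gensim 4.x provides most_similar on model.wv
model.wv.most_similar(positive=['学校'])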
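
`most_similar` ranks the words in the vocabulary by the cosine similarity between their vectors and the query vector. As a rough picture of that ranking, here is a tiny pure-Python sketch; the three 3-dimensional vectors are invented for illustration and are not taken from the trained model, and the `most_similar` function here is a simplified stand-in, not gensim's implementation.

```python
import math

# Made-up toy vectors; a real model would hold 200-dimensional ones.
vectors = {
    "学校": [0.9, 0.1, 0.0],
    "教室": [0.8, 0.2, 0.1],
    "風": [0.0, 0.3, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("学校")
print(ranking)
```

With a corpus as small as a single story, the real similarity scores tend to be noisy, so the ranking is best read as a rough indication.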

Summary

① Get the text you want to analyze.
② Trim it down to the body text only, removing things like the bibliography at the end.
③ Take the text out line by line with a for statement and remove the unnecessary markup.
④ Run morphological analysis with the tokenizer and collect the words in a list.
⑤ Write the resulting list to a file.
⑥ Train a model on the morphologically analyzed file.
