I wrote a corpus reader that reads the results of MeCab analysis

As a study exercise, I wrote mecab.py, modeled on the chasen.py written by mhagiwara.

mecab.py

This assumes that NLTK is installed and nltk_data has been downloaded. Place the data under nltk_data/corpora, or create a symbolic link to it there.
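The linking step can be sketched in Python as follows. This is only an illustration: `link_corpus` is a hypothetical helper (not part of NLTK or mecab.py), and the demo uses a temporary directory rather than a real nltk_data directory, whose location depends on your installation.

```python
import os
import tempfile
from pathlib import Path


def link_corpus(source_dir, nltk_data_dir, name):
    """Link source_dir to <nltk_data_dir>/corpora/<name> and return the link path.

    Hypothetical helper for illustration; not part of NLTK or mecab.py.
    """
    corpora_dir = Path(nltk_data_dir) / "corpora"
    corpora_dir.mkdir(parents=True, exist_ok=True)
    link = corpora_dir / name
    if not link.exists():
        # Point the link at the absolute location of the analyzed files.
        os.symlink(Path(source_dir).resolve(), link, target_is_directory=True)
    return link


if __name__ == "__main__":
    # Demonstrate in a throwaway directory instead of a real nltk_data.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "mecab_output"
        source.mkdir()
        (source / "sample.mecab").write_text("EOS\n", encoding="utf8")
        link = link_corpus(source, Path(tmp) / "nltk_data", "test")
        print(sorted(p.name for p in link.iterdir()))
```

With a real setup, the second argument would be your own nltk_data directory, and "test" matches the corpora/test path used in this article.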

from __future__ import print_function  # lets print() behave the same on Python 2 and 3

import nltk
# Assumes the mecab.py mentioned above is on the import path.
from mecab import MeCabCorpusReader

# The data must be stored or linked under nltk_data/corpora.
corpora_path = nltk.data.find('corpora/test')

# fileids: a regular expression (string) or a list of file names
# that selects the corpus files to read.
fileids = r'.*\.mecab'

reader = MeCabCorpusReader(corpora_path, fileids, encoding='utf8')
print(reader.raw())
print(', '.join(reader.words()))
for w, t in reader.tagged_words():
    print(w, t)
for para in reader.paras():
    for sent in para:
        for word in sent:
            print(word)
for para in reader.tagged_paras():
    for sent in para:
        for (word, pos) in sent:
            print(word, pos)
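To see which file names a pattern such as r'.*\.mecab' selects, the regular expression can be tried on its own. The sketch below only illustrates the pattern: the file names are made up, `select_fileids` is a hypothetical helper, and it does not reproduce how NLTK resolves fileids internally.

```python
import re

# The same pattern that is passed as fileids in the usage code.
FILEIDS_PATTERN = r'.*\.mecab'


def select_fileids(names, pattern=FILEIDS_PATTERN):
    """Return the names that match the pattern in full (illustrative helper)."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Made-up file names, for illustration only.
candidates = ['sumomo.mecab', 'notes.txt', 'sub/other.mecab', 'sumomo.mecab.bak']
print(select_fileids(candidates))
```

Only the names ending in .mecab are kept, including one inside a subdirectory.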

corpora/test is a directory containing files of MeCab analysis output, each with the extension .mecab. The contents of a file look like this.

Plum noun,General,*,*,*,*,Plum,Plum,Plum
Also particles,Particle,*,*,*,*,Also,Mo,Mo
Peach noun,General,*,*,*,*,Peaches,peach,peach
Also particles,Particle,*,*,*,*,Also,Mo,Mo
Peach noun,General,*,*,*,*,Peaches,peach,peach
Particles,Attributive,*,*,*,*,of,No,No
Noun,Non-independent,Adverbs possible,*,*,*,home,Uchi,Uchi
EOS

The output is

raw()
Plum noun,General,*,*,*,*,Plum,Plum,Plum
Also particles,Particle,*,*,*,*,Also,Mo,Mo
Peach noun,General,*,*,*,*,Peaches,peach,peach
Also particles,Particle,*,*,*,*,Also,Mo,Mo
Peach noun,General,*,*,*,*,Peaches,peach,peach
Particles,Attributive,*,*,*,*,of,No,No
Noun,Non-independent,Adverbs possible,*,*,*,home,Uchi,Uchi
EOS

words()
Plum,Also,Alsoも,Also,Alsoも,of,home

tagged_words()
Plum info:noun,General,*,*,*,*,Plum,Plum,Plum
Also info:Particle,Particle,*,*,*,*,Also,Mo,Mo
Peach info:noun,General,*,*,*,*,Peaches,peach,peach
Also info:Particle,Particle,*,*,*,*,Also,Mo,Mo
Peach info:noun,General,*,*,*,*,Peaches,peach,peach
Info:Particle,Attributive,*,*,*,*,of,No,No
Of info:noun,Non-independent,Adverbs possible,*,*,*,home,Uchi,Uchi

paras()
Plum
Also
Peaches
Also
Peaches
of
home

tagged_paras()
Plum info:noun,General,*,*,*,*,Plum,Plum,Plum
Also info:Particle,Particle,*,*,*,*,Also,Mo,Mo
Peach info:noun,General,*,*,*,*,Peaches,peach,peach
Also info:Particle,Particle,*,*,*,*,Also,Mo,Mo
Peach info:noun,General,*,*,*,*,Peaches,peach,peach
Info:Particle,Attributive,*,*,*,*,of,No,No
Of info:noun,Non-independent,Adverbs possible,*,*,*,home,Uchi,Uchi
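For readers who want to see how such a file maps onto words and tagged words, here is a minimal standard-library sketch. It is not the code in mecab.py: `parse_mecab`, `words`, and `tagged_words` are hypothetical names, and the sketch assumes MeCab's default output format, in which each line holds a surface form, a tab, and comma-separated features, and a line containing only EOS ends a sentence. The sample is a short piece of Japanese MeCab output.

```python
# Two tokens of MeCab output followed by the end-of-sentence marker.
SAMPLE = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)


def parse_mecab(text):
    """Split MeCab output into sentences of (surface, features) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "EOS":
            # EOS closes the current sentence; empty sentences are skipped.
            if current:
                sentences.append(current)
            current = []
        elif line:
            # Everything before the first tab is the surface form.
            surface, _, features = line.partition("\t")
            current.append((surface, features))
    if current:
        # Tolerate input whose final sentence lacks a closing EOS.
        sentences.append(current)
    return sentences


def words(sentences):
    """Flatten the sentences into a list of surface forms."""
    return [surface for sent in sentences for surface, _ in sent]


def tagged_words(sentences):
    """Flatten the sentences into a list of (surface, features) pairs."""
    return [pair for sent in sentences for pair in sent]


sents = parse_mecab(SAMPLE)
print(', '.join(words(sents)))
for surface, features in tagged_words(sents):
    print(surface, features)
```

A corpus reader built on NLTK would typically wrap this kind of parsing in a corpus view so that large files are read lazily. The sketch keeps everything in memory for simplicity.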
