This time, I would like to write about what I learned while studying natural language processing on Kikagaku, a site where you can learn about deep learning for free.
・ macOS
・ Python 3.6 (Anaconda)
・ VS Code
References:
・ [Illustration! Thorough explanation of how to use Python Beautiful Soup! (select, find, find_all, install, scraping, etc.)](https://ai-inter1.com/beautifulsoup_1/)
・ [Python] Case conversion of character strings (lower and upper functions)
・ [Cloning a repository](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/cloning-a-repository)
・ Comparison of morphological analyzers at the end of 2019
・ Preparing the environment for using MeCab on Mac
・ Morphological analysis with MeCab on Mac
・ Settings for using JUMAN++ in Python's pyenv environment on Mac
・ I tried morphological analysis with pyknp (JUMAN++)
・ Running JUMAN++ with Python
First up is BeautifulSoup.
BeautifulSoup is a library that lets you extract only the information you need from HTML. For example, HTML text on the web is wrapped in tags such as `div` and `h1`. Those tags get in the way when you want to parse the text itself, so BeautifulSoup is used to pull out just the content without them.
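As a minimal sketch of what that looks like (the HTML string and tag names here are just made-up examples, not from the original article):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet wrapped in div/h1/p tags
html = "<div><h1>Deep learning from scratch</h1><p>Natural language processing</p></div>"

soup = BeautifulSoup(html, "html.parser")

# Pull out only the text inside the h1 tag
print(soup.find("h1").text)   # Deep learning from scratch

# Or strip every tag and keep just the plain text
print(soup.get_text())
```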
There are also the `lower` and `upper` functions, which convert a string to **lowercase** and **uppercase**, respectively.
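These are ordinary Python string methods, so a quick sketch looks like this:

```python
text = "Natural Language Processing"

print(text.lower())  # natural language processing
print(text.upper())  # NATURAL LANGUAGE PROCESSING
```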
I installed MeCab and JUMAN++ because they are well suited to morphological analysis. MeCab is probably the best in terms of speed, while JUMAN++ seems to be the better choice if accuracy is all you care about.
Installing MeCab was pretty easy.
```
$ brew install mecab
$ brew install mecab-ipadic
$ pip install mecab-python3
```
```
$ git clone url
```
For the URL after `git clone`, paste the URL you copied from the GitHub repository.
```python
import MeCab

# The -d option points MeCab at the mecab-ipadic-NEologd dictionary
m = MeCab.Tagger('-d/usr/local/lib/mecab/dic/mecab-ipadic-neologd')
text = '<html>Deep learning from scratch</html>'
# parse() returns one line per morpheme together with its features
print(m.parse(text))
```
MeCab code is written like this. By the way, you can specify which dictionary to use inside the parentheses of `Tagger()`; here the `-d` option points at the mecab-ipadic-NEologd dictionary.
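As a small extra sketch that is not from the original article, `Tagger()` also accepts output-format options such as `-Owakati`, which prints just the tokens separated by spaces:

```python
import MeCab

# -Owakati asks MeCab for space-separated tokens (wakati-gaki) instead of full feature lines
wakati = MeCab.Tagger('-Owakati')
print(wakati.parse('すもももももももものうち'))  # a classic Japanese example sentence
```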
JUMAN++ also needed pyknp to be installed before it could be used from Python.
```
$ brew install jumanpp
$ pip install pyknp
```
This completes the installation of JUMAN++ and pyknp.
Next, I will write about how to use JUMAN++ from Python.
```python
from pyknp import Juman

juman = Juman()
# Analyze the sample sentence (JUMAN++ is a morphological analyzer for Japanese text)
result = juman.analysis("Foreigners to vote")
for mrph in result.mrph_list():
    # midasi = surface form, yomi = reading
    print(mrph.midasi, mrph.yomi)
```
This is how JUMAN++ is written in Python. Even though I am using JUMAN++, it seems to be fine to leave the class name as `Juman` when writing the code.
In addition to `midasi` and `yomi` in the print part, you can also output `mrph.genkei` (base form), `mrph.hinsi` (part of speech), `mrph.bunrui` (POS subcategory), `mrph.katuyou1` and `mrph.katuyou2` (conjugation type and form), `mrph.imis` (semantic information), and `mrph.repname` (representative notation).
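For example, a small sketch that reuses the `result` from the loop above and prints a few more of those attributes:

```python
for mrph in result.mrph_list():
    # surface form, reading, base form, and part of speech for each morpheme
    print(mrph.midasi, mrph.yomi, mrph.genkei, mrph.hinsi)
```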
I also used `split`, a function that lets you break a sentence up word by word.
At first I thought it was a function specialized for natural language processing, but it turned out to be an ordinary Python function.
**split outputs the sentence separated into individual words.**
However, if you use split on its own, the whitespace after each comma is kept in the output, which does not look very good, so I also used the `strip` function. With it, you can output the pieces with that surrounding whitespace removed.
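Here is a quick sketch of both (the sentence is just a made-up example):

```python
text = "natural language processing, deep learning, morphological analysis"

# split() with no argument splits on whitespace, i.e. word by word
print(text.split())

# Splitting on commas leaves the space after each comma attached
words = text.split(",")
print(words)   # ['natural language processing', ' deep learning', ' morphological analysis']

# strip() removes the leading/trailing whitespace from each piece
print([w.strip() for w in words])
```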
This time, the morphological analysis part took the longest. But since it is knowledge I will need for natural language processing going forward, I am glad I was able to learn it carefully.