When I was preparing material for the Advent calendar, I needed to perform morphological analysis on English text and got stuck for a bit, so I'm leaving my notes here.
In the Japanese-speaking world, MeCab often comes up as a morphological analysis tool, but there are far fewer examples of morphological analysis for English. Polyglot is a library that offers many natural language analysis functions, such as language detection and named entity recognition, in addition to morphological analysis of English sentences.
As described in the official documentation, the supported languages differ depending on the function.
Function name | Number of supported languages | Description |
---|---|---|
Tokenization | 165 languages | Splits a string into tokens, the smallest units handled in natural language processing |
Language detection | 196 languages | Identifies the language of the string to be parsed |
Named entity recognition | 40 languages | Extracts named entities from the string to be parsed; Polyglot can extract three types: Place, Organization, and Person |
Part-of-speech tagging | 16 languages | Attaches a part-of-speech tag to each token of the string to be parsed |
Sentiment analysis | 136 languages | Returns one of three labels: negative, neutral, or positive |
Distributed representation | 137 languages | Maps words to a d-dimensional vector space |
Morphological analysis | 135 languages | Divides the string to be parsed into the smallest meaningful units (morphemes) |
Transliteration | 69 languages | Converts the input string into a string in the specified language's script |
As you can see from the table above, it supports many languages.
Install
Let's set up Polyglot so we can actually use it.
$ sudo pip3 install -U polyglot
polyglot itself can be installed simply by running the above command. However, to actually perform language analysis with polyglot, you need to download a dictionary for the language to be analyzed, and if ICU is not installed, fetching the dictionary will throw an error. So, before downloading, run the following commands to install the required libraries.
$ sudo apt-get -y install libicu-dev
$ sudo pip3 install -U pyicu
$ sudo pip3 install -U morfessor
In addition, **pycld2** is required to download the model.
In a normal Linux environment, you can install it by simply running $ sudo pip install pycld2
. However, when I executed that command on the Raspberry Pi, the following error was displayed.
arm-linux-gnueabihf-gcc: error: unrecognized command line option ‘-m64’
error: command 'arm-linux-gnueabihf-gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for pycld2
The above error occurs because the compiler for the ARM architecture does not support the -m64 option, so compilation fails. As things stand, pycld2 cannot be installed, which means Polyglot cannot run on the Raspberry Pi. I'm in trouble ...
Since it cannot be installed as is, we need to remove the -m64 compile option specified in pycld2's setup.py and then run the installation again. Clone the repository below with git clone and edit setup.py.
aboSamoor/pycld2 - GitHub
$ git clone https://github.com/aboSamoor/pycld2.git
$ cd pycld2/
Move into the cloned pycld2 directory, delete **-m64** from the array of compile options on line 78 of the setup.py located directly under it, and save the file.
Before the change
language="c++",
# TODO: -m64 may break 32 bit builds
extra_compile_args=["-w", "-O2", "-m64", "-fPIC"],
After the change
language="c++",
# TODO: -m64 may break 32 bit builds
extra_compile_args=["-w", "-O2", "-fPIC"],
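If you would rather script the edit than open setup.py in an editor, a sed one-liner can strip the flag. This is a sketch that assumes the option string appears exactly as in the snippet above, with the trailing comma and space:

```shell
# Remove the "-m64" entry from extra_compile_args in setup.py.
# Assumes the flag is written exactly as: "-m64", 
sed -i 's/"-m64", //' setup.py

# Check that the flag is gone before reinstalling.
grep -n "extra_compile_args" setup.py
```

After running this, the grep should show the line without -m64, matching the "After the change" snippet above.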
After making the changes, execute the following command.
$ sudo pip3 install hogehoge/pycld2/
Successfully built pycld2
Installing collected packages: pycld2
Successfully installed pycld2-0.42
If "Successfully installed" is displayed after execution, the installation succeeded.
You can download the model by executing the following command. This time we will perform morphological analysis on English sentences, so download the English model.
$ polyglot download morph2.en
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/pi/polyglot_data...
All you have to do now is run the sample code below.
morph.py
from polyglot.text import Text

sample_text = "One Hamburger and a Medium Coffee please."
tokens = Text(sample_text)   # wrap the string in a polyglot Text object
print(tokens.morphemes)      # split it into morphemes
When you actually run the script, you get a result in the following form.
$ python3 morph.py
['One', ' ', 'Ham', 'burg', 'er and a Medium Coffee p', 'lease', '.']
This time I used Polyglot for the first time while building a small program. Since Polyglot can also detect the language, in a program tied to the Twitter API you could, for example, hand Japanese text to MeCab and leave everything else to Polyglot. I don't expect to use English natural language processing at work, but I'm leaving this here as a memorandum for future reference.
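The MeCab-or-Polyglot routing idea above can be sketched as a small dispatcher. The analyzer functions below are hypothetical stand-ins (a real version would call MeCab for Japanese and Polyglot's Text(...).morphemes for everything else); only the dispatch logic is the point:

```python
# Hypothetical sketch: route text to an analyzer based on a detected language code.
# analyze_ja / analyze_other are placeholders for MeCab and Polyglot respectively.

def analyze_ja(text):
    # Placeholder for MeCab-based morphological analysis.
    return ("mecab", text.split())

def analyze_other(text):
    # Placeholder for Polyglot-based morphological analysis.
    return ("polyglot", text.split())

def analyze(text, lang_code):
    """Dispatch to the right analyzer by ISO 639-1 language code."""
    analyzers = {"ja": analyze_ja}
    return analyzers.get(lang_code, analyze_other)(text)

print(analyze("One Hamburger please.", "en"))
```

In a real pipeline, lang_code would come from a language detector (Polyglot's 196-language detection, for instance) rather than being passed in by hand.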