When I was preparing material for the Advent calendar, I needed to perform morphological analysis on English text and got stuck for a bit, so I'm leaving my notes here.
In the Japanese-speaking world, MeCab often comes up as a morphological analysis tool, but there are far fewer examples of morphological analysis for English. Polyglot is a library that offers many natural language analysis functions, such as language detection and named entity recognition, in addition to morphological analysis of English sentences.
As described in the official documentation, the supported languages differ depending on the function.
Function name | Number of supported languages | Description |
---|---|---|
Tokenization | 165 languages | Splits a string into tokens, the smallest units handled in natural language processing |
Language detection | 196 languages | Identifies the language of the string to be parsed |
Named entity recognition | 40 languages | Extracts named entities from the string to be parsed; Polyglot can extract three types: Place, Organization, and Person |
Part-of-speech tagging | 16 languages | Attaches a part-of-speech tag to each token of the string to be parsed |
Sentiment analysis | 136 languages | Returns one of three labels: negative, neutral, or positive |
Distributed representation | 137 languages | Maps words to a d-dimensional vector space |
Morphological analysis | 135 languages | Divides the string to be parsed into the smallest meaningful units (morphemes) |
Transliteration | 69 languages | Converts the input string into a string in the specified language's script |
As you can see from the table above, it supports many languages.
Install
Let's set up Polyglot so we can actually use it.
$ sudo pip3 install -U polyglot
polyglot itself can be installed simply by running the above command. However, to actually perform language analysis with polyglot, you need to download a dictionary for the language to be analyzed, and if ICU is not installed, fetching the dictionary will throw an error. So, before downloading, run the following commands to install the required libraries.
$ sudo apt-get -y install libicu-dev
$ sudo pip3 install -U pyicu
$ sudo pip3 install -U morfessor
In addition, **pycld2** is required to download the model.
In a normal Linux environment, you can install it by simply running $ sudo pip install pycld2
. However, when I executed that command on the Raspberry Pi, the following error was displayed.
arm-linux-gnueabihf-gcc: error: unrecognized command line option ‘-m64’
error: command 'arm-linux-gnueabihf-gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for pycld2
The above error occurs because the compiler for the ARM architecture does not support the -m64 option, so compilation fails. As things stand, pycld2 cannot be installed, which means Polyglot cannot run on the Raspberry Pi. I'm in trouble ...
Since it cannot be installed as is, we need to remove the -m64 compile option specified in pycld2's setup.py and then run the installation again. Clone the repository below with git clone and edit setup.py.
aboSamoor/pycld2 - GitHub
$ git clone https://github.com/aboSamoor/pycld2.git
$ cd pycld2/
Move into the cloned pycld2 directory, delete **-m64** from the array of compile options on line 78 of the setup.py located directly under it, and save the file.
Before the change
language="c++",
# TODO: -m64 may break 32 bit builds
extra_compile_args=["-w", "-O2", "-m64", "-fPIC"],
After the change
language="c++",
# TODO: -m64 may break 32 bit builds
extra_compile_args=["-w", "-O2", "-fPIC"],
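If you would rather script the edit than open setup.py in an editor, a sed one-liner can strip the flag. This is a sketch that assumes the option string appears exactly as in the snippet above, with the trailing comma and space:

```shell
# Remove the "-m64" entry from extra_compile_args in setup.py.
# Assumes the flag is written exactly as: "-m64", 
sed -i 's/"-m64", //' setup.py

# Check that the flag is gone before reinstalling.
grep -n "extra_compile_args" setup.py
```

After running this, the grep should show the line without -m64, matching the "After the change" snippet above.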
After making the changes, execute the following command.
$ sudo pip3 install hogehoge/pycld2/
Successfully built pycld2
Installing collected packages: pycld2
Successfully installed pycld2-0.42
If "Successfully installed" is displayed after execution, the installation succeeded.
You can download the model by executing the following command. This time we will perform morphological analysis on English sentences, so download the English model.
$ polyglot download morph2.en
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/pi/polyglot_data...
All you have to do now is run the sample code below.
morph.py
from polyglot.text import Text

sample_text = "One Hamburger and a Medium Coffee please."
tokens = Text(sample_text)   # wrap the string in a polyglot Text object
print(tokens.morphemes)      # split it into morphemes
When you actually run the script, you get a result in the following form.
$ python3 morph.py
['One', ' ', 'Ham', 'burg', 'er and a Medium Coffee p', 'lease', '.']
This time I used Polyglot for the first time while building a small program. Since Polyglot can also detect the language, in a program tied to the Twitter API you could, for example, hand Japanese text to MeCab and leave everything else to Polyglot. I don't expect to use English natural language processing at work, but I'm leaving this here as a memorandum for future reference.
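The MeCab-or-Polyglot routing idea above can be sketched as a small dispatcher. The analyzer functions below are hypothetical stand-ins (a real version would call MeCab for Japanese and Polyglot's Text(...).morphemes for everything else); only the dispatch logic is the point:

```python
# Hypothetical sketch: route text to an analyzer based on a detected language code.
# analyze_ja / analyze_other are placeholders for MeCab and Polyglot respectively.

def analyze_ja(text):
    # Placeholder for MeCab-based morphological analysis.
    return ("mecab", text.split())

def analyze_other(text):
    # Placeholder for Polyglot-based morphological analysis.
    return ("polyglot", text.split())

def analyze(text, lang_code):
    """Dispatch to the right analyzer by ISO 639-1 language code."""
    analyzers = {"ja": analyze_ja}
    return analyzers.get(lang_code, analyze_other)(text)

print(analyze("One Hamburger please.", "en"))
```

In a real pipeline, lang_code would come from a language detector (Polyglot's 196-language detection, for instance) rather than being passed in by hand.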