You will become an engineer in 100 days - Day 66 - Programming - About natural language processing

Click here for the articles up to yesterday

Starting today, the topic is natural language processing.

What is natural language processing?

Languages that arose naturally through human use, such as English and Japanese, are called natural languages.

In contrast, artificial languages defined by explicit rules, such as programming languages, are called formal languages.

Natural language processing refers to the family of technologies that let a computer process the natural language humans use every day.

Many technologies are included in natural language processing.

Major natural language processing technologies

The main techniques in natural language processing are organized as follows.

Name	Contents
Morphological analysis	Splits text into morphemes and identifies the part of speech of each morpheme
Parsing	Takes the morphemes and clarifies the syntactic relationships between them, for example by drawing them as a diagram
Semantic analysis	Interprets the meaning of a sentence using a concept dictionary or a similar resource
Context analysis	Examines how multiple sentences connect to each other

When processing Japanese with a computer, morphological analysis is the basic building block. Language also changes day by day, which makes it hard for computers to handle.

This is because humans do not process linguistic information exhaustively; they choose a plausible interpretation out of many possible ones. Implementing that sense of plausibility on a computer is difficult.

Semantic analysis and anything beyond it is quite difficult, and further research is awaited.

About morphological analysis

Morphological analysis separates a sentence into morphemes, the smallest meaningful units of words, and identifies the part of speech of each morpheme.

** Word-separated writing **

This is a writing style that puts a space between words, as English does. A Japanese sentence written this way looks like: Watashi ga hentai desu

** English morphological analysis **

This is very easy in languages like English, where words are separated by spaces. The procedure for English morphological analysis is summarized below.

1. Lowercase the entire sentence so that the same word is not treated differently depending on its position in the sentence

2. Split contractions such as it's and don't (it's → it 's, don't → do n't)

3. Separate the period at the end of the sentence from the preceding word (do not separate periods that are unrelated to the end of the sentence, such as the one in Mr.)

4. Split on spaces
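The four steps above can be sketched in Python roughly as follows. This is only a minimal illustration: the function name `tokenize_english` and the small `ABBREVIATIONS` set are my own assumptions, and for simplicity it detaches any word-final period that is not part of a listed abbreviation.

```python
import re

# Abbreviations whose trailing period should stay attached (illustrative, not complete).
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def tokenize_english(sentence):
    """Tokenize an English sentence following the four steps above."""
    # Step 1: lowercase the whole sentence.
    text = sentence.lower()
    tokens = []
    # Step 4 is applied early for convenience: split on whitespace.
    for word in text.split():
        # Step 3: detach a trailing period unless the word is a known abbreviation.
        trailing = []
        if word.endswith(".") and word not in ABBREVIATIONS:
            word, trailing = word[:-1], ["."]
        # Step 2: split contractions (don't -> do n't, it's -> it 's).
        match = re.fullmatch(r"(\w+)(n't)", word) or re.fullmatch(r"(\w+)('\w+)", word)
        if match:
            tokens.extend(match.groups())
        elif word:
            tokens.append(word)
        tokens.extend(trailing)
    return tokens


print(tokenize_english("It's fine. Mr. Smith doesn't know."))
```

Doing the whitespace split before the period and contraction handling keeps each rule local to a single word, which is easier to follow than rewriting the whole string in place.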

** Japanese morphological analysis **

Unlike English, Japanese is written almost without spaces, so the boundaries between words are not visible. Splitting therefore has to be done with rules backed by a dedicated dictionary.

If you do morphological analysis yourself, you need to define and implement these splitting rules on your own.

Several libraries have been developed for Japanese morphological analysis, and it is common to use one of them.

A representative library is MeCab.

https://ja.wikipedia.org/wiki/MeCab

For Python, there is also a library called janome.

https://mocobeta.github.io/janome/

With a library like these, morphological analysis can be done relatively easily.

How these libraries work internally is explained in the article "Peek behind the scenes of Japanese morphological analysis! How MeCab performs morphological analysis".

The basic idea is to build a lattice and select the best path.

A lattice is a graph that holds every candidate way of splitting the text into words.

The following article has an easy-to-understand example, so I will refer to it.

Reference: https://techlife.cookpad.com/entry/2016/05/11/170000

This is the lattice, from which the optimal path is selected based on the cost.

The cost depends on the dictionary used for morphological analysis.

A commonly used dictionary for morphological analysis is the NAIST dictionary, which lists precomputed occurrence costs and connection costs. These cost values appear to be derived from a corpus.

The path with the lowest total cost then becomes the result of the morphological analysis.
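To make the idea of "build a lattice and pick the cheapest path" concrete, here is a small sketch. Everything in it is hypothetical: the dictionary entries, the cost values, and the names `WORD_COST`, `CONNECTION_COST` and `best_path` are made up for illustration and are not taken from MeCab or the NAIST dictionary. For brevity, connection costs are keyed on pairs of words here.

```python
# Toy dictionary: surface form -> occurrence cost (all values are made up).
WORD_COST = {"東": 60, "京": 80, "都": 50, "東京": 30, "京都": 40}
# Toy connection costs between adjacent words; unlisted pairs use the default.
CONNECTION_COST = {("東京", "都"): 5, ("東", "京都"): 40}
DEFAULT_CONNECTION = 20


def best_path(text):
    """Return (total_cost, words) for the cheapest segmentation of text."""
    n = len(text)
    # best[i] maps "last word ending at position i" -> (cost so far, words so far).
    best = [dict() for _ in range(n + 1)]
    best[0]["<BOS>"] = (0, [])
    for start in range(n):
        if not best[start]:
            continue  # no candidate segmentation reaches this position
        for end in range(start + 1, n + 1):
            word = text[start:end]
            if word not in WORD_COST:
                continue  # not a dictionary word, so not a lattice node
            for prev, (cost, words) in best[start].items():
                if prev == "<BOS>":
                    conn = 0
                else:
                    conn = CONNECTION_COST.get((prev, word), DEFAULT_CONNECTION)
                total = cost + conn + WORD_COST[word]
                # Keep only the cheapest way to arrive at this node.
                if word not in best[end] or total < best[end][word][0]:
                    best[end][word] = (total, words + [word])
    if not best[n]:
        raise ValueError("no segmentation found with this dictionary")
    return min(best[n].values(), key=lambda item: item[0])


print(best_path("東京都"))  # (85, ['東京', '都'])
```

With these made-up costs, the split 東京 / 都 (30 + 5 + 50 = 85) beats 東 / 京都 (60 + 40 + 40 = 140), so the first one is selected.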

Of course, a word that is not in the dictionary, such as a proper noun, ends up being split into ordinary words. Maintaining the dictionary is indispensable for correct morphological analysis.

Newly coined words are sometimes called unknown words. In morphological analysis work, handling such unknown words and maintaining dictionaries takes up most of the development effort.

Companies that work on natural language processing register large numbers of words on their own and build databases to handle unknown words.

About parsing

Parsing, also called dependency analysis, is another natural language processing technique. After the sentence has been split into morphemes, it analyzes which words modify which.

There is a famous library called CaboCha.

https://taku910.github.io/cabocha/

It is not well suited to very long sentences, so it is better to feed it short ones.

The result of the analysis looks like this.

Ichiro filled the holes made by Jiro with potatoes purchased in Hokkaido.

Ichiro-------------D
    Jiro-D         |
  Had made-D       |
 In the hole-------D
   In Hokkaido-D   |
       Purchased-D |
          Potatoes-D
             Stuffed

Dependency analysis is useful when you want to analyze the meaning of a sentence. I think it can be used to work out the grammatical structure and, from there, clarify what the sentence means.
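As a rough illustration of how such a result can be used in a program, the tree above can be held as a list of chunks, each with the index of the chunk it modifies. This is a plain data structure of my own for illustration, not CaboCha's actual output format or API; the names `CHUNKS`, `modifiers_of` and `chain_to_root` are hypothetical.

```python
# Each entry is (chunk text, index of the chunk it modifies); -1 marks the root.
# The head indices are read off the dependency tree shown above.
CHUNKS = [
    ("Ichiro", 7),
    ("Jiro", 2),
    ("Had made", 3),
    ("In the hole", 7),
    ("In Hokkaido", 5),
    ("Purchased", 6),
    ("Potatoes", 7),
    ("Stuffed", -1),
]


def modifiers_of(index):
    """Return the chunks that directly modify the chunk at index."""
    return [text for text, head in CHUNKS if head == index]


def chain_to_root(index):
    """Follow the head links from one chunk up to the root."""
    path = []
    while index != -1:
        text, index = CHUNKS[index]
        path.append(text)
    return path


print(modifiers_of(7))   # ['Ichiro', 'In the hole', 'Potatoes']
print(chain_to_root(1))  # ['Jiro', 'Had made', 'In the hole', 'Stuffed']
```

Looking at what modifies the final verb gives a quick "who did what with what" view of the sentence, which is the kind of information dependency analysis is meant to expose.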

Terms that often appear in natural language processing

Regular expressions

A regular expression is a notation that describes a whole set of character strings with a single pattern. It is often used when processing large amounts of text according to fixed rules.
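As a small illustration using Python's standard `re` module, one pattern can pick out every date in a text, and another can tidy up spacing. The sample text and variable names are made up for this example.

```python
import re

text = "Order A-102 shipped on 2021-01-05; order B-7 shipped on 2021-02-14."

# One pattern describes every string of the form YYYY-MM-DD.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
dates = date_pattern.findall(text)
print(dates)  # ['2021-01-05', '2021-02-14']

# Replace every run of whitespace with a single space.
normalized = re.sub(r"\s+", " ", "too    many   spaces")
print(normalized)  # too many spaces
```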

Click here for details: You will become an engineer in 100 days - Day 46 - Programming - Regular expressions

N-Gram

An N-gram is a text segmentation method that splits an arbitrary character string or document into runs of n consecutive characters. When n is 1 it is called a uni-gram, when n is 2 a bi-gram, and when n is 3 a tri-gram.

Character-based, using 今日はいい天気 ("the weather is nice today") as the sample sentence:

# unigram
'今', '日', 'は', 'い', 'い', '天', '気'

# bigram
'今日', '日は', 'はい', 'いい', 'い天', '天気'

# trigram
'今日は', '日はい', 'はいい', 'いい天', 'い天気'

Word-based N-grams instead join n consecutive words produced by morphological analysis.

# unigram
'今日', 'は', 'いい', '天気'

# bigram
'今日は', 'はいい', 'いい天気'

# trigram
'今日はいい', 'はいい天気'
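Both the character-based and the word-based variants come down to the same sliding window, so they can be sketched with one small function. The function name `ngrams` is my own, the sample sentence 今日はいい天気 ("the weather is nice today") is used for illustration, and the word list is assumed to be the output of a morphological analyzer rather than computed here.

```python
def ngrams(items, n):
    """Return every run of n consecutive items, joined into one string."""
    return ["".join(items[i:i + n]) for i in range(len(items) - n + 1)]


sentence = "今日はいい天気"

# Character-based: the items are single characters.
char_bigrams = ngrams(list(sentence), 2)
print(char_bigrams)  # ['今日', '日は', 'はい', 'いい', 'い天', '天気']

# Word-based: the items are words (assumed result of morphological analysis).
words = ["今日", "は", "いい", "天気"]
word_bigrams = ngrams(words, 2)
print(word_bigrams)  # ['今日は', 'はいい', 'いい天気']
```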

** word vector **

After the sentences are split into words, each word is assigned to a column of a table and each sentence becomes one row of data. A cell is 1 if the word appears in the sentence and 0 otherwise.

[1,0,0,0,0,0,1,1,1],
[1,0,0,0,0,0,1,1,0], ...
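A minimal sketch of building such 0/1 vectors is shown below. The two sample documents and the function names `build_vocabulary` and `to_binary_vector` are made up for illustration, and the documents are assumed to be already split into words.

```python
def build_vocabulary(documents):
    """Collect each distinct word once, in order of first appearance."""
    vocabulary = []
    for words in documents:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_binary_vector(words, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical documents, already split into words.
documents = [
    ["today", "is", "good", "weather"],
    ["today", "is", "rain"],
]
vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(words, vocabulary) for words in documents]

print(vocabulary)  # ['today', 'is', 'good', 'weather', 'rain']
print(vectors)     # [[1, 1, 1, 1, 0], [1, 1, 0, 0, 1]]
```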

TF-IDF

TF-IDF is a weighting scheme for the words in a document and is used in fields such as information retrieval and text summarization. It is calculated from the word vectors and is used to judge how rare, and therefore how characteristic, a word is.
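Here is a rough sketch of one common textbook formulation, where tf is the word's share of the document and idf is log(number of documents / number of documents containing the word). Libraries often use smoothed variants, so treat the exact formula, the sample documents, and the function name `tf_idf` as assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document (plain tf * idf)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))
    weights = []
    for words in documents:
        counts = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical documents, already split into words.
documents = [
    ["today", "is", "good", "weather"],
    ["today", "is", "rain"],
]
weights = tf_idf(documents)
print(weights[0]["today"])  # 0.0, because "today" appears in every document
print(weights[1]["rain"])   # positive, because "rain" appears in only one document
```

Words that occur in every document get a weight of zero, while words that occur in only a few documents stand out, which is the "rarity" the text refers to.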

Summary

Natural language processing is one of the most difficult research fields, but a field where research has not yet advanced far is, conversely, a field with many opportunities.

Research on Japanese is particularly difficult, and the part that analyzes meaning is very hard to implement, so you need to settle in and work on it patiently.

If you are interested, let's learn natural language processing.

34 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython