Click here for the articles up to yesterday
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - Javascript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS Basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
From here on, the topic is natural language processing.
Languages that humans have come to use spontaneously, such as English and Japanese, are called natural languages.
Artificial languages defined by rules, such as programming languages, are called formal languages to distinguish them.
Natural language processing refers to a set of technologies that let a computer process the natural language that humans use every day.
It includes many individual technologies, which are organized as follows.
Name | Contents |
---|---|
Morphological analysis | Splits text into morphemes and identifies the part of speech of each morpheme |
Parsing | Takes the morphemes and clarifies the syntactic relationships between them, often by diagramming them |
Semantic analysis | Interprets the meaning of a sentence using a concept dictionary or a similar resource |
Context analysis | Examines how multiple sentences connect to each other |
Morphological analysis is the basic technology for processing Japanese with a computer. Language changes day by day, which makes it hard for computers to handle.
Humans do not process linguistic information exhaustively. They pick a reasonable interpretation out of many possible ones, and that sense of what is reasonable is difficult to implement on a computer.
Semantic analysis and anything beyond it is quite difficult, and further research is awaited.
Morphological analysis splits a sentence into morphemes, the smallest meaningful units of a language, and identifies the part of speech of each morpheme.
**Word-separated writing**
This is a writing style that puts a space between words, as in English.
Watashi ga hentai desu
**English morphological analysis**
This is very easy in languages like English, where words are separated by spaces. The procedure for English morphological analysis is summarized below.
1. Lowercase the entire sentence so that a word is not treated as a different word just because of its position (for example, capitalized at the start of a sentence)
2. Split contractions such as it's and don't (it's → it 's, don't → do n't)
3. Separate the period at the end of the sentence from the preceding word (do not separate periods that do not mark the end of a sentence, such as the one in Mr.)
4. Split on spaces
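The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete tokenizer: the function name `tokenize_english` and the small `ABBREVIATIONS` set are assumptions made for this example, and the steps are applied word by word after splitting on spaces rather than in the listed order.

```python
import re

# Hypothetical, deliberately small list of abbreviations whose period is kept.
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def tokenize_english(sentence):
    """Split an English sentence into tokens following the four steps."""
    tokens = []
    # Step 1 (lowercase) and step 4 (split on spaces).
    for word in sentence.lower().split():
        # Step 3: detach a trailing period unless the word is a known abbreviation.
        period = ""
        if word.endswith(".") and word not in ABBREVIATIONS:
            word, period = word[:-1], "."
        # Step 2: split contractions, e.g. it's -> it 's, don't -> do n't.
        match = re.fullmatch(r"(.+?)(n't|'s)", word)
        if match:
            tokens.extend(match.groups())
        elif word:
            tokens.append(word)
        if period:
            tokens.append(period)
    return tokens


print(tokenize_english("Mr. Smith says it's fine and I don't mind."))
```

A real tokenizer needs a much longer abbreviation list and more contraction patterns, which is why libraries are normally used even for English.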
**Japanese morphological analysis**
Unlike English, Japanese is written almost without spaces, so the boundaries between words are not visible.
Segmentation therefore has to be done by rule, using a dedicated dictionary.
If you do morphological analysis on your own, you need to define and implement these segmentation rules yourself.
Several libraries have been developed for Japanese morphological analysis, and it is common to use one of them.
A typical library is MeCab.
https://ja.wikipedia.org/wiki/MeCab
There is also a library for the Python language called janome.
https://mocobeta.github.io/janome/
If implemented using such a library, morphological analysis can be performed relatively easily.
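As a small sketch of what this looks like with janome, the function below returns each morpheme together with the first field of its part-of-speech string. It assumes janome is installed (`pip install janome`); the function name `tokenize_with_janome` is made up for this example.

```python
def tokenize_with_janome(text):
    """Return (surface, part of speech) pairs for each morpheme in text."""
    # Imported here so the dependency is explicit: janome is a third-party package.
    from janome.tokenizer import Tokenizer

    tokenizer = Tokenizer()
    return [
        # part_of_speech is a comma-separated string; keep only the first field.
        (token.surface, token.part_of_speech.split(",")[0])
        for token in tokenizer.tokenize(text)
    ]
```

Calling `tokenize_with_janome("今日はいい天気")` should give one pair per morpheme. The exact split depends on the dictionary bundled with janome.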
How these libraries work internally is explained in the article "A peek behind the scenes of Japanese morphological analysis: how MeCab performs morphological analysis".
The basic idea is to build a lattice and select the best path through it.
A lattice is a graph that holds every candidate way of splitting the text into words.
I think the following is an easy-to-understand example, so I will refer to it.
Reference: https://techlife.cookpad.com/entry/2016/05/11/170000
The referenced article shows such a lattice, from which the optimal path is selected based on the cost.
The cost depends on the dictionary used for morphological analysis.
The NAIST dictionary is commonly used for morphological analysis, and it lists precomputed values for the occurrence cost of each word and the concatenation cost between adjacent words.
It seems that these cost values are calculated from a corpus.
It seems that the path with the lowest total cost becomes the result of the morphological analysis.
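To make the lattice idea concrete, here is a rough sketch that finds the lowest-cost path with dynamic programming. Everything in it is an assumption for illustration: the toy dictionary, the cost values, and the rule that unlisted word pairs get a default concatenation cost are all made up and are not taken from the NAIST dictionary or from MeCab, which are considerably more elaborate.

```python
# Hypothetical toy dictionary: surface form -> occurrence cost (made-up values).
OCCURRENCE_COST = {"東": 6, "京": 7, "都": 5, "東京": 3, "京都": 3, "に": 1, "住む": 2}

# Hypothetical concatenation costs between adjacent words (made-up values).
# Pairs that are not listed fall back to DEFAULT_CONCATENATION_COST.
CONCATENATION_COST = {("東", "京都"): 4, ("東京", "都"): 1}
DEFAULT_CONCATENATION_COST = 2


def best_path(text):
    """Return (total_cost, words) for the cheapest segmentation of text."""
    # best[i] maps the last word of a path ending at position i to (cost, path).
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<BOS>"] = (0, [])
    for start in range(len(text)):
        if not best[start]:
            continue  # no candidate word ends at this position
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word not in OCCURRENCE_COST:
                continue  # not a dictionary word, so not a node in the lattice
            for prev, (cost, path) in best[start].items():
                link = CONCATENATION_COST.get((prev, word), DEFAULT_CONCATENATION_COST)
                total = cost + link + OCCURRENCE_COST[word]
                if word not in best[end] or total < best[end][word][0]:
                    best[end][word] = (total, path + [word])
    if not best[len(text)]:
        raise ValueError("no segmentation found with this dictionary")
    return min(best[len(text)].values(), key=lambda item: item[0])


cost, words = best_path("東京都に住む")
print(cost, words)
```

With these made-up costs the path 東京 / 都 / に / 住む is cheaper than 東 / 京都 / に / 住む, so it is the one returned. Changing the costs changes the result, which is why the dictionary matters so much.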
Of course, proper nouns and other words that are not in the dictionary end up being split into ordinary words.
Maintaining the dictionary is indispensable for correct morphological analysis.
Newly coined words that are missing from the dictionary are called unknown words, and in practical morphological analysis work, handling unknown words and maintaining dictionaries takes up most of the development man-hours.
Companies that work on natural language processing register large numbers of words on their own and build databases to deal with unknown words.
Syntax analysis, also called dependency analysis, is another natural language processing technology.
After splitting the sentence into morphemes, it analyzes which words modify which other words.
A well-known library is CaboCha.
https://taku910.github.io/cabocha/
It is not well suited to parsing very long sentences, so it is best to work with short sentences.
The result of the analysis looks like this.
Ichiro filled the holes made by Jiro with potatoes purchased in Hokkaido.
Ichiro-------------D
Jiro-D |
Had made-D |
In the hole-------D
In Hokkaido-D |
Purchased-D |
Potatoes-D
Stuffed
Dependency analysis is useful when you want to analyze the meaning of a sentence. I think it can be used to expose the grammatical structure and so clarify what the sentence means.
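One simple way to hold such a result in a program is a list of chunks, each with the index of the chunk it modifies. The sketch below encodes the Ichiro example by hand. The head indices are my reading of the tree, not output produced by CaboCha, and the function name `dependency_pairs` is made up for this example.

```python
# Each entry is (chunk, index of the chunk it modifies); -1 marks the root.
# The indices are read off the example tree by hand.
chunks = [
    ("Ichiro", 7),
    ("Jiro", 2),
    ("Had made", 3),
    ("In the hole", 7),
    ("In Hokkaido", 5),
    ("Purchased", 6),
    ("Potatoes", 7),
    ("Stuffed", -1),
]


def dependency_pairs(chunks):
    """Return (modifier, head) pairs for every chunk except the root."""
    return [(text, chunks[head][0]) for text, head in chunks if head != -1]


for modifier, head in dependency_pairs(chunks):
    print(f"{modifier} -> {head}")
```

Listing the pairs this way makes it easy to answer questions such as "what was stuffed?" by looking for the chunks whose head is the final verb.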
Regular expressions
A regular expression is a notation for describing a set of character strings with a single pattern. It is often used when processing large amounts of text according to fixed rules.
For details, see: You will become an engineer in 100 days - Day 46 - Programming - Regular expressions
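As a quick illustration using Python's standard `re` module, one pattern can pull every day number out of a piece of text. The sample text is made up for this example.

```python
import re

# Sample text made up for this example.
text = "Day 46 covered regular expressions; Day 63 covered probability."

# "Day " followed by one or more digits; the parentheses capture only the digits.
day_numbers = [int(number) for number in re.findall(r"Day (\d+)", text)]
print(day_numbers)  # [46, 63]
```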
N-Gram
A text segmentation method that splits an arbitrary character string or document into runs of n consecutive characters.
When n is 1 the result is called a uni-gram, when n is 2 a bi-gram, and when n is 3 a tri-gram.
Character-based (for the sentence 今日はいい天気, "The weather is nice today")
# unigram
'今', '日', 'は', 'い', 'い', '天', '気'
# bigram
'今日', '日は', 'はい', 'いい', 'い天', '天気'
# trigram
'今日は', '日はい', 'はいい', 'いい天', 'い天気'
If it is word-based, each n-gram is a concatenation of n consecutive morphologically analyzed words (here 今日 / は / いい / 天気).
# unigram
'今日', 'は', 'いい', '天気'
# bigram
'今日は', 'はいい', 'いい天気'
# trigram
'今日はいい', 'はいい天気'
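Both variants come down to sliding a window of length n over a sequence. Here is a minimal sketch. The function names are made up for this example, and the word list for 今日はいい天気 ("The weather is nice today") is written by hand rather than produced by a morphological analyzer.

```python
def char_ngrams(text, n):
    """Split a string into every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(words, n):
    """Concatenate every run of n consecutive words."""
    return ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "今日はいい天気"
words = ["今日", "は", "いい", "天気"]  # segmented by hand for this example

print(char_ngrams(sentence, 2))  # ['今日', '日は', 'はい', 'いい', 'い天', '天気']
print(word_ngrams(words, 2))     # ['今日は', 'はいい', 'いい天気']
```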
**Word vector**
After splitting the sentences into words, each word is assigned to a column of a table and each sentence becomes a row of data. A cell is 1 if the sentence contains the word and 0 otherwise.
[1,0,0,0,0,0,1,1,1],
[1,0,0,0,0,0,1,1,0], ...
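A minimal sketch of how such 0/1 rows can be built in plain Python follows. The two sample sentences and the function names are made up for this example, so the resulting rows differ from the ones shown above.

```python
def build_vocabulary(documents):
    """Collect each distinct word once, in order of first appearance."""
    vocabulary = []
    for words in documents:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_binary_vector(words, vocabulary):
    """Return 1 for each vocabulary word the document contains, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]


# Sample documents made up for this example, already split into words.
documents = [["today", "is", "nice", "weather"], ["today", "is", "cold"]]
vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(words, vocabulary) for words in documents]

print(vocabulary)  # ['today', 'is', 'nice', 'weather', 'cold']
print(vectors)     # [[1, 1, 1, 1, 0], [1, 1, 0, 0, 1]]
```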
TF-IDF
tf-idf is a weight given to each word in a document and is used in fields such as information retrieval and text summarization.
It is calculated from the word vectors and is used to judge how rare a word is.
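There are several variants of the formula. The sketch below uses one common textbook form, which is an assumption on my part and not necessarily what any particular library computes: tf is the word's count divided by the document length, and idf is the logarithm of the number of documents divided by the number of documents containing the word. The sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document (documents are word lists)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for words in documents for word in set(words))
    weights = []
    for words in documents:
        counts = Counter(words)
        weights.append({
            # tf = count / document length, idf = log(N / df)
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Sample documents made up for this example, already split into words.
documents = [["today", "is", "nice", "weather"], ["today", "is", "cold"]]
for weights in tf_idf(documents):
    print(weights)
```

Words that appear in every document, such as "today" here, get a weight of 0, while words that appear in only one document get a positive weight. That is the sense in which tf-idf measures rarity.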
Natural language processing is one of the more difficult research fields, but a field where research has not yet advanced far is, conversely, a field with many opportunities.
Japanese is particularly difficult to work with, and implementing the part that analyzes meaning is very hard, so you need to settle in and take your time with it.
If you are interested, give natural language processing a try.
34 days until you become an engineer
Otsu py's website: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython