If you are reading this, you have probably already tried some kind of machine learning that extracts features from text data, for example document classification.
A quick search on Qiita turns up several "I tried it" articles, such as "I tried automatically classifying Morning Musume. blog posts", "Natural language processing with R", and "Attempting document classification with Naive Bayes".
In document classification, the basic approach is to build matrix data using words as features: a so-called frequency matrix, where each row is a document and each column records how often a word occurs.
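As a quick illustration (this is not the package, just plain Python), a frequency matrix over already-tokenized documents can be built like this:

```python
from collections import Counter

# Two tokenized documents (each document is a list of words).
documents = [
    ["aa", "aa", "bb"],
    ["bb", "cc", "cc", "cc"],
]

# Vocabulary: one column per distinct word.
vocabulary = sorted({word for doc in documents for word in doc})

# Frequency matrix: one row per document, one column per word.
frequency_matrix = [[Counter(doc)[word] for word in vocabulary] for doc in documents]

print(vocabulary)        # ['aa', 'bb', 'cc']
print(frequency_matrix)  # [[2, 1, 0], [0, 1, 3]]
```

With a realistic vocabulary this matrix is almost entirely zeros, which is why the package introduced below keeps everything in scipy sparse matrices.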
Now, one question comes up here. __Aren't there a lot of words that have nothing to do with the classification? Is that okay?__
Good question. It is not okay. Features that are unrelated to the classification act as noise, and noise gets in the way of improving classification performance. Troubling, troubling.
So the idea comes up: __"just keep the relevant features"__. Yes, this is what is called feature selection.
There are two merits to selecting features: you remove the noise that gets in the way of classification performance, and you shrink the feature space, so the data becomes smaller and easier to handle.
Doing feature selection seriously by hand is quite a hassle, so I packaged up some feature selection methods.
It works with Python 3.x; Python 2.x support is planned.
All internal processing uses scipy sparse matrices, and everything that can be parallelized runs in multiple processes, so it is reasonably fast.
If you throw in a dict of morphologically segmented (tokenized) documents, it will even build the sparse matrix for you.
For example, the input dict looks like this:

    input_dict = {
        "label_a": [
            ["I", "aa", "aa", "aa", "aa", "aa"],
            ["bb", "aa", "aa", "aa", "aa", "aa"],
            ["I", "aa", "hero", "some", "ok", "aa"]
        ],
        "label_b": [
            ["bb", "bb", "bb"],
            ["bb", "bb", "bb"],
            ["hero", "ok", "bb"],
            ["hero", "cc", "bb"]
        ],
        "label_c": [
            ["cc", "cc", "cc"],
            ["cc", "cc", "bb"],
            ["xx", "xx", "cc"],
            ["aa", "xx", "cc"]
        ]
    }

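Just to show the flow, here is a minimal sketch of feeding such a dict to the package. I am writing the call from memory of the package README, so treat the module and method names (interface.run_feature_selection, ScoreMatrix2ScoreDictionary) as assumptions and check the documentation of the version you install.

```python
# Sketch only: names below follow my memory of the DocumentFeatureSelection
# README and may differ in the version you have installed.
from DocumentFeatureSelection import interface

input_dict = {
    "label_a": [["I", "aa", "aa", "aa", "aa", "aa"]],
    "label_b": [["bb", "bb", "bb"]],
}

# Run PMI-based scoring; 'soa' should be accepted as a method name as well.
scored_object = interface.run_feature_selection(input_dict=input_dict, method="pmi")

# Convert the internal sparse score matrix into {label, word, score} records.
print(scored_object.ScoreMatrix2ScoreDictionary())
```

The {label, word, score} dicts shown in the experiment below are exactly this kind of record.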
Since I went to the trouble of making it, let's give it a try. I put the IPython notebook I used on Gist.
The notebook uses scipy, a morphological analysis wrapper package, and the feature selection package (version 0.9).
I prepared texts in five genres, collected by picking up texts that looked suitable from around the web and copy-pasting them. (~~This is collective intelligence~~)
The five genres are iranian_cities, conan_movies, av_actress, terror, and airplane. [^1]
I tried PMI (pointwise mutual information) and SOA (strength of association).
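For reference, here is a rough sketch of what the two scores compute. This is the textbook formulation, not necessarily the package's exact implementation (smoothing, log base, and normalization may differ): PMI(w, c) measures how much more often word w occurs under label c than it would by chance, and SOA(w, c) = PMI(w, c) − PMI(w, ¬c), so it goes negative when a word is more associated with the other labels.

```python
import math

def pmi(n_w_c, n_w, n_c, n_total):
    """PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), estimated from raw counts.

    n_w_c   -- occurrences of word w in documents labelled c
    n_w     -- occurrences of word w across all labels
    n_c     -- total word occurrences under label c
    n_total -- total word occurrences in the whole corpus
    (In practice you would smooth the counts to avoid log(0).)
    """
    return math.log((n_w_c / n_total) / ((n_w / n_total) * (n_c / n_total)))

def soa(n_w_c, n_w, n_c, n_total):
    """SOA(w, c) = PMI(w, c) - PMI(w, not-c): positive when w leans towards
    label c, negative when it leans towards the other labels."""
    return pmi(n_w_c, n_w, n_c, n_total) - pmi(n_w - n_w_c, n_w, n_total - n_c, n_total)
```

For example, a word that occurs 10 times, 8 of them under a label covering a quarter of the corpus of 1000 tokens, gets pmi(8, 10, 250, 1000) = log(3.2) ≈ 1.16.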
Let's pull a few entries out of the results. First, the PMI results, viewed in descending order of score:

    {'label': 'iranian_cities', 'score': 0.67106056632551592, 'word': 'population'},
    {'label': 'conan_movies', 'score': 0.34710665998172219, 'word': 'Appearance'},
    {'label': 'av_actress', 'score': 0.30496452198069324, 'word': 'AV actress'},
    {'label': 'av_actress', 'score': 0.26339266409673928, 'word': 'Appearance'},
    {'label': 'av_actress', 'score': 0.2313987055319647, 'word': 'Female'},

The words "Uh, yeah, that's right ~" are lined up.
Words that are clearly related to their labels get high weights, so as feature selection this looks like a success.
From the standpoint of exploring the data, though, nothing particularly insightful jumps out.
Conversely, what does the low-scoring end look like?

    {'label': 'av_actress', 'score': 5.7340738217327128e-06, 'word': 'Man'},
    {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': '3'},
    {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': 'To'},
    {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': 'Notation'},
    {'label': 'terror', 'score': 5.7340738217327128e-06, 'word': 'Mold'}

Some puzzling entries are mixed in. Mostly these seem to be words used functionally in the documents. The number "3" sneaking in is a morphological-analysis slip (this happens fairly often with MeCab's Neologd dictionary).
Function words are pushed down to low scores, so in that respect it seems to be working.
Next, SOA. The ordering changes only slightly compared with PMI; this is probably because SOA is defined in terms of PMI.

    [{'label': 'conan_movies', 'score': 5.3625700793847084, 'word': 'Appearance'},
    {'label': 'iranian_cities', 'score': 5.1604646721932461, 'word': 'population'},
    {'label': 'av_actress', 'score': 5.1395513523987937, 'word': 'AV actress'},
    {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Sa'},
    {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Hmm'},
    {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Female'},
    {'label': 'terror', 'score': 4.8765169465650002, 'word': 'Syria'},

Now let's look at the low-scoring end. A low (strongly negative) SOA score can be read as "irrelevant to this label".

    {'label': 'terror', 'score': -1.4454111483223628, 'word': 'population'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'By the way'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'thing'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'During ~'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'Manufacturing'},
    {'label': 'iranian_cities', 'score': -2.009460329249066, 'word': 'thing'},
    {'label': 'airplane', 'score': -3.3923174227787602, 'word': 'Man'}]

Somehow these don't quite click. But if you check the frequencies, each of these words appears only once or twice in the documents. In other words, their relationship to the label genuinely is weak, so it is reasonable that they get large negative values.
In this article, I talked about feature selection and about a package that makes feature selection easy.
This time I did not check document-classification performance after feature selection.
However, these methods have proven effective enough in previous studies, so please give them a try in your document-classification tasks.
You can install the package with pip install DocumentFeatureSelection.
From version 1.0 of the package, the input data can be designed more flexibly. For example, if you want to use (surface word, POS) bigrams as features, you can pass arrays of tuples like the one below. Here, (("he", "N"), ("is", "V")) is a single feature.

    input_dict_tuple_feature = {
        "label_a": [
            [(("he", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),)],
            [(("you", "N"), ("are", "V")), (("very", "ADV"), ("awesome", "ADJ")), (("guy", "N"),)],
            [(("i", "N"), ("am", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),)]
        ],
        "label_b": [
            [(("she", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("girl", "N"),)],
            [(("you", "N"), ("are", "V")), (("very", "ADV"), ("awesome", "ADJ")), (("girl", "N"),)],
            [(("she", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),)]
        ]
    }

Since tuples can be given as the features of input sentences, users are free to design their own features.[^2] For example, this is handy for tasks such as the following (a small construction sketch follows the list):
- When you want to use (surface word, some tag) as a feature
- When you want to use dependency edge labels as features
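As a small illustrative sketch (not part of the package), this is one way to turn a POS-tagged sentence into the (surface word, POS) pair features used in the dict above. The tagger output and the helper function here are hypothetical, and grouping tokens into non-overlapping pairs (as below) versus sliding bigrams is part of your feature design:

```python
# Hypothetical tagger output: a sentence as (surface word, POS) tuples.
tagged_sentence = [("he", "N"), ("is", "V"), ("very", "ADV"), ("good", "ADJ"), ("guy", "N")]

def to_pair_features(tagged_tokens):
    """Group neighbouring (surface, POS) tuples into non-overlapping pairs;
    a leftover token becomes a 1-tuple, matching the example dict above."""
    return [tuple(tagged_tokens[i:i + 2]) for i in range(0, len(tagged_tokens), 2)]

print(to_pair_features(tagged_sentence))
# [(('he', 'N'), ('is', 'V')), (('very', 'ADV'), ('good', 'ADJ')), (('guy', 'N'),)]
```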
[^1]: I'm often asked, "Why do the sample texts include adult videos and Persian/Iranian topics?" The adult videos are because I like them; Iran is because I studied Persian and have an attachment to it.
[^2]: Even before, you could force this kind of feature by concatenating everything into a single str like surface word_tag. But that is not very elegant, is it? You end up needing extra pre-processing and post-processing, so I added this capability instead.