If you are reading this, you have probably already tried some kind of machine learning that extracts features from text data, for example document classification.
A quick search on Qiita turns up several "I tried it" articles, such as "I tried automatically classifying Morning Musume. blog posts", "Natural language processing with R", and "Attempting document classification with Naive Bayes".
In document classification, the basic approach is to build matrix data using words as features: a so-called frequency matrix, where each row is a document and each column records how often a word occurs.
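As a quick illustration (this is not the package, just plain Python), a frequency matrix over already-tokenized documents can be built like this:

```python
from collections import Counter

# Two tokenized documents (each document is a list of words).
documents = [
    ["aa", "aa", "bb"],
    ["bb", "cc", "cc", "cc"],
]

# Vocabulary: one column per distinct word.
vocabulary = sorted({word for doc in documents for word in doc})

# Frequency matrix: one row per document, one column per word.
frequency_matrix = [[Counter(doc)[word] for word in vocabulary] for doc in documents]

print(vocabulary)        # ['aa', 'bb', 'cc']
print(frequency_matrix)  # [[2, 1, 0], [0, 1, 3]]
```

With a realistic vocabulary this matrix is almost entirely zeros, which is why the package introduced below keeps everything in scipy sparse matrices.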
Now, one question comes up here. __Aren't there a lot of words that have nothing to do with the classification? Is that okay?__
Good question. It is not okay. Features that are unrelated to the classification act as noise, and noise gets in the way of improving classification performance. Troubling, troubling.
So the idea comes up: __"just keep the relevant features"__. Yes, this is what is called feature selection.
There are two merits to selecting features: you remove the noise that gets in the way of classification performance, and you shrink the feature space, so the data becomes smaller and easier to handle.
Doing feature selection seriously by hand is quite a hassle, so I packaged up some feature selection methods.
It works with Python 3.x; Python 2.x support is planned.
All internal processing uses scipy sparse matrices, and everything that can be parallelized runs in multiple processes, so it is reasonably fast.
If you throw in a dict of morphologically segmented (tokenized) documents, it will even build the sparse matrix for you.
For example, the input dict looks like this:

    input_dict = {
        "label_a": [
            ["I", "aa", "aa", "aa", "aa", "aa"],
            ["bb", "aa", "aa", "aa", "aa", "aa"],
            ["I", "aa", "hero", "some", "ok", "aa"]
        ],
        "label_b": [
            ["bb", "bb", "bb"],
            ["bb", "bb", "bb"],
            ["hero", "ok", "bb"],
            ["hero", "cc", "bb"]
        ],
        "label_c": [
            ["cc", "cc", "cc"],
            ["cc", "cc", "bb"],
            ["xx", "xx", "cc"],
            ["aa", "xx", "cc"]
        ]
    }

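Just to show the flow, here is a minimal sketch of feeding such a dict to the package. I am writing the call from memory of the package README, so treat the module and method names (interface.run_feature_selection, ScoreMatrix2ScoreDictionary) as assumptions and check the documentation of the version you install.

```python
# Sketch only: names below follow my memory of the DocumentFeatureSelection
# README and may differ in the version you have installed.
from DocumentFeatureSelection import interface

input_dict = {
    "label_a": [["I", "aa", "aa", "aa", "aa", "aa"]],
    "label_b": [["bb", "bb", "bb"]],
}

# Run PMI-based scoring; 'soa' should be accepted as a method name as well.
scored_object = interface.run_feature_selection(input_dict=input_dict, method="pmi")

# Convert the internal sparse score matrix into {label, word, score} records.
print(scored_object.ScoreMatrix2ScoreDictionary())
```

The {label, word, score} dicts shown in the experiment below are exactly this kind of record.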
Since I went to the trouble of making it, let's give it a try. I put the IPython notebook I used on Gist.
The notebook uses scipy, a morphological analysis wrapper package, and the feature selection package (version 0.9).
I prepared texts in five genres, collected by picking up texts that looked suitable from around the web and copy-pasting them. (~~This is collective intelligence~~)
The five genres are iranian_cities, conan_movies, av_actress, terror, and airplane. [^1]
I tried PMI (pointwise mutual information) and SOA (strength of association).
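For reference, here is a rough sketch of what the two scores compute. This is the textbook formulation, not necessarily the package's exact implementation (smoothing, log base, and normalization may differ): PMI(w, c) measures how much more often word w occurs under label c than it would by chance, and SOA(w, c) = PMI(w, c) − PMI(w, ¬c), so it goes negative when a word is more associated with the other labels.

```python
import math

def pmi(n_w_c, n_w, n_c, n_total):
    """PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), estimated from raw counts.

    n_w_c   -- occurrences of word w in documents labelled c
    n_w     -- occurrences of word w across all labels
    n_c     -- total word occurrences under label c
    n_total -- total word occurrences in the whole corpus
    (In practice you would smooth the counts to avoid log(0).)
    """
    return math.log((n_w_c / n_total) / ((n_w / n_total) * (n_c / n_total)))

def soa(n_w_c, n_w, n_c, n_total):
    """SOA(w, c) = PMI(w, c) - PMI(w, not-c): positive when w leans towards
    label c, negative when it leans towards the other labels."""
    return pmi(n_w_c, n_w, n_c, n_total) - pmi(n_w - n_w_c, n_w, n_total - n_c, n_total)
```

For example, a word that occurs 10 times, 8 of them under a label covering a quarter of the corpus of 1000 tokens, gets pmi(8, 10, 250, 1000) = log(3.2) ≈ 1.16.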
Let's pull a few entries out of the results. First, the PMI results, viewed in descending order of score:

    {'label': 'iranian_cities', 'score': 0.67106056632551592, 'word': 'population'},
    {'label': 'conan_movies', 'score': 0.34710665998172219, 'word': 'Appearance'},
    {'label': 'av_actress', 'score': 0.30496452198069324, 'word': 'AV actress'},
    {'label': 'av_actress', 'score': 0.26339266409673928, 'word': 'Appearance'},
    {'label': 'av_actress', 'score': 0.2313987055319647, 'word': 'Female'},

The words "Uh, yeah, that's right ~" are lined up.
Words that are clearly related to their labels get high weights, so as feature selection this looks like a success.
From the standpoint of exploring the data, though, nothing particularly insightful jumps out.
Conversely, what does the low-scoring end look like?

    {'label': 'av_actress', 'score': 5.7340738217327128e-06, 'word': 'Man'},
    {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': '3'},
    {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': 'To'},
    {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': 'Notation'},
    {'label': 'terror', 'score': 5.7340738217327128e-06, 'word': 'Mold'}

Some puzzling entries are mixed in. Mostly these seem to be words used functionally in the documents. The number "3" sneaking in is a morphological-analysis slip (this happens fairly often with MeCab's Neologd dictionary).
Function words are pushed down to low scores, so in that respect it seems to be working.
Next, SOA. The ordering changes only slightly compared with PMI; this is probably because SOA is defined in terms of PMI.

    [{'label': 'conan_movies', 'score': 5.3625700793847084, 'word': 'Appearance'},
    {'label': 'iranian_cities', 'score': 5.1604646721932461, 'word': 'population'},
    {'label': 'av_actress', 'score': 5.1395513523987937, 'word': 'AV actress'},
    {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Sa'},
    {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Hmm'},
    {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Female'},
    {'label': 'terror', 'score': 4.8765169465650002, 'word': 'Syria'},

Now let's look at the low-scoring end. A low (strongly negative) SOA score can be read as "irrelevant to this label".

    {'label': 'terror', 'score': -1.4454111483223628, 'word': 'population'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'By the way'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'thing'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'During ~'},
    {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'Manufacturing'},
    {'label': 'iranian_cities', 'score': -2.009460329249066, 'word': 'thing'},
    {'label': 'airplane', 'score': -3.3923174227787602, 'word': 'Man'}]

Somehow these don't quite click. But if you check the frequencies, each of these words appears only once or twice in the documents. In other words, their relationship to the label genuinely is weak, so it is reasonable that they get large negative values.
In this article, I talked about feature selection and about a package that makes feature selection easy.
This time I did not check document-classification performance after feature selection.
However, these methods have proven effective enough in previous studies, so please give them a try in your document-classification tasks.
You can install the package with pip install DocumentFeatureSelection.
From version 1.0 of the package, the input data can be designed more flexibly. For example, if you want to use (surface word, POS) bigrams as features, you can pass arrays of tuples like the one below. Here, (("he", "N"), ("is", "V")) is a single feature.

    input_dict_tuple_feature = {
        "label_a": [
            [(("he", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),)],
            [(("you", "N"), ("are", "V")), (("very", "ADV"), ("awesome", "ADJ")), (("guy", "N"),)],
            [(("i", "N"), ("am", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),)]
        ],
        "label_b": [
            [(("she", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("girl", "N"),)],
            [(("you", "N"), ("are", "V")), (("very", "ADV"), ("awesome", "ADJ")), (("girl", "N"),)],
            [(("she", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),)]
        ]
    }

Since tuples can be given as the features of input sentences, users are free to design their own features.[^2] For example, this is handy for tasks such as the following (a small construction sketch follows the list):
- When you want to use (surface word, some tag) as a feature
- When you want to use dependency edge labels as features
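As a small illustrative sketch (not part of the package), this is one way to turn a POS-tagged sentence into the (surface word, POS) pair features used in the dict above. The tagger output and the helper function here are hypothetical, and grouping tokens into non-overlapping pairs (as below) versus sliding bigrams is part of your feature design:

```python
# Hypothetical tagger output: a sentence as (surface word, POS) tuples.
tagged_sentence = [("he", "N"), ("is", "V"), ("very", "ADV"), ("good", "ADJ"), ("guy", "N")]

def to_pair_features(tagged_tokens):
    """Group neighbouring (surface, POS) tuples into non-overlapping pairs;
    a leftover token becomes a 1-tuple, matching the example dict above."""
    return [tuple(tagged_tokens[i:i + 2]) for i in range(0, len(tagged_tokens), 2)]

print(to_pair_features(tagged_sentence))
# [(('he', 'N'), ('is', 'V')), (('very', 'ADV'), ('good', 'ADJ')), (('guy', 'N'),)]
```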
[^1]: I'm often asked, "Why do the sample texts include adult videos and Persian/Iranian topics?" The adult videos are because I like them; Iran is because I studied Persian and have an attachment to it.
[^2]: Even before, you could force this kind of feature by concatenating everything into a single str like surface word_tag. But that is not very elegant, is it? You end up needing extra pre-processing and post-processing, so I added this capability instead.