In this article, we explain the TF (Term Frequency) method as a feature extraction technique for implementing a document classifier.
Document classification uses the word information in a document. Unlike English, Japanese does not delimit words with spaces, so each sentence in a document must first be divided into words. Dividing a sentence into words and estimating the part of speech of each word is called morphological analysis.
Here we use the open-source morphological analysis software MeCab. • http://taku910.github.io/mecab/
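To illustrate what a morphological analyzer returns without requiring MeCab to be installed, the sketch below parses MeCab-style feature strings: each token has a surface form and a comma-separated feature string whose first field is the part of speech (returned in Japanese). The sample (surface, feature) pairs are hypothetical examples of MeCab output, not values produced by running MeCab.

```python
# Hypothetical examples of what MeCab nodes expose: a surface form and a
# comma-separated feature string whose first field is the part of speech.
tokens = [
    ("吸引", "名詞,サ変接続,*,*,*,*,吸引,キュウイン,キューイン"),
    ("力", "名詞,接尾,一般,*,*,*,力,リョク,リョク"),
    ("も", "助詞,係助詞,*,*,*,*,も,モ,モ"),
    ("素晴らしい", "形容詞,自立,*,*,形容詞・イ段,基本形,素晴らしい,スバラシイ,スバラシイ"),
]

for surface, feature in tokens:
    pos = feature.split(",")[0]  # the part of speech is the first field
    print(surface, pos)
```

This is exactly the `node.feature.split(",")[0]` lookup that tf.py below uses to filter words by part of speech.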
When dealing with a classification problem, the information in the data that is used for classification is generally called a feature, and the work of extracting features from the data is called feature extraction. In document classification, the words in a document are used as features.
The frequency of occurrence of each word in a document is often used as the word's weight. This weighting scheme is called the TF (Term Frequency) method. In the TF method, words that appear more frequently are considered more characteristic of the document. Note that word order and position of appearance are not taken into account.
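The idea can be sketched in a few lines: tokenize, count, and treat the counts as weights. The toy example below uses whitespace tokenization on English text purely for illustration; for Japanese, the tokenization step would be replaced by MeCab as described above.

```python
from collections import Counter

# Toy TF example: whitespace tokenization stands in for morphological analysis.
doc = "the cat sat on the mat and the cat slept"
tf = Counter(doc.split())

# Words with higher counts are treated as more characteristic of the document.
print(tf.most_common(2))  # -> [('the', 3), ('cat', 2)]
```

Note that `most_common()` discards word order entirely, which matches the TF method's assumption that only frequency matters.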
tf.py
import sys
import fileinput
from collections import Counter
from pathlib import Path

import MeCab as mc


def mecab_analysis(text):
    # Use the ChaSen output format so that node.feature is a comma-separated
    # string whose first field is the part of speech.
    t = mc.Tagger("-Ochasen")
    t.parse('')  # workaround for a known encoding issue in the Python binding
    node = t.parseToNode(text)
    output = []
    while node:
        if node.surface != "":
            word_type = node.feature.split(",")[0]
            # Keep adjectives, verbs, nouns, adverbs, auxiliary verbs,
            # symbols, and particles. MeCab returns POS names in Japanese.
            if word_type in ["形容詞", "動詞", "名詞", "副詞", "助動詞", "記号", "助詞"]:
                output.append(node.surface)
        node = node.next
    return output


if Path(sys.argv[1]).exists():
    for line in fileinput.input():
        if line:
            line = line.replace('"', '')
            line = line.replace('\\', '')
            words = mecab_analysis(line)
            counter = Counter(words)
            for word, count in counter.most_common():
                if len(word) > 0:
                    print("%s:%d " % (word, count), end="")
            print("")
        else:
            break
test.txt
It is compact and can be leaned against a corner of the room, so it is ready to use the moment you want it. The suction power is also wonderful; I was surprised at how much dust there was. The price is reasonable, and it is a recommended product.
I decided to buy it after seeing this review. My vacuum cleaner is made by Hitachi (purchased last year), but the nozzle doesn't fit, and the included attachment only stays on loosely. Because it's wobbly, it naturally comes off many times while I'm using it. Sure, you can vacuum the futon without it sucking the fabric in, but I was disappointed that it didn't fit a product from the same maker, Hitachi. I was looking forward to seeing how much dust it would collect, so I put in a new paper pack and then vacuumed. After finishing everything, I looked inside the paper pack, but hardly anything had collected. Isn't it failing to suck in dust because it's wobbly? I'm disappointed again.
Each line becomes an input document.
Execute the command as follows.
python3 tf.py test.txt > output.txt
The target file is given as the first argument, and the result of feature extraction by the TF method is output to output.txt here.
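The output format (one line per document, with space-separated word:count pairs sorted by descending frequency) can be reproduced without MeCab. The sketch below reuses the print loop from tf.py, substituting whitespace tokenization for morphological analysis.

```python
from collections import Counter


def tf_line(text):
    # Format a document as "word:count " pairs, most frequent first,
    # mirroring the print loop in tf.py.
    counter = Counter(text.split())
    return "".join("%s:%d " % (w, c) for w, c in counter.most_common())


print(tf_line("a b b a a"))  # -> "a:3 b:2 "
```

Each input line thus becomes one output line of weighted features, which can then be fed to a classifier.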
output.txt
[One line per input document of space-separated word:count pairs in descending order of frequency. The tokens are Japanese, so Japanese particles and punctuation dominate the top counts, e.g. 。:7 and 、:4 in the second document; the remaining token text was garbled in translation.]
The feature extraction result is output.