This article builds on item ❷ of Collection and classification of machine learning related information (concept).
For example, suppose there is a news item saying that a company has built a Q&A system using a service on the cloud. Then, as you can infer from the local file system folder example in ❷, an Internet shortcut for this news item must be placed in at least three locations:

・ Tools / Cloud / A certain service
・ Machine learning / Applications / Bot・dialogue systems
・ Social trends / Companies / A certain company

These classifications are not mutually exclusive, so this is what is called multi-label classification.
Python / scikit-learn provides various algorithms that handle this kind of problem, and the API interface appears to be unified, so it should be possible to write code that keeps working when you simply swap in a different algorithm. So, in this article, let's check that API interface.
The script that handles the actual crawl results is too specific to present on Qiita, so I will publish it on GitHub; this article deals only with a sample script for checking the interface. Random Forest [^1] is used as the algorithm simply because it is an easy choice. Also, the data uses artificial values so that the correspondence between the numbers is easy to follow, and is not really meaningful data. Please understand these points.
As a framework for handling multi-class and multi-label algorithms in scikit-learn, there is sklearn.multiclass, described in [Translation] scikit-learn 0.18 User Guide 1.12. Multiclass and multilabel algorithms. However, this is a general-purpose module that decomposes the problem into binary classification problems, and it is not optimized for any individual classification algorithm.
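For reference, here is a minimal sketch of that decomposition approach (my own example, not taken from the user guide):

```python
# A minimal sketch: sklearn.multiclass wraps any binary classifier and
# decomposes a multiclass problem into one-vs-rest binary problems
# behind the same fit/predict interface.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 2, 3]
clf = OneVsRestClassifier(LinearSVC())  # one binary LinearSVC per class
clf.fit(X, y)
print(clf.predict([[1, 1]]))  # predicted class for a new sample
```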
Decision trees, [Random Forest](https://ja.wikipedia.org/wiki/%E3%83%A9%E3%83%B3%E3%83%80%E3%83%A0%E3%83%95%E3%82%A9%E3%83%AC%E3%82%B9%E3%83%88), and the nearest neighbor method, on the other hand, implement multi-label classification within each algorithm itself, so you use each algorithm directly.
Now let's write a sample script.
```python:random-forest.py
# Swap the import (e.g. the commented-out DecisionTreeClassifier) to try
# another algorithm with otherwise identical code.
#from sklearn.tree import DecisionTreeClassifier as classifier
from sklearn.ensemble import RandomForestClassifier as classifier
from gensim import matutils

# Sparse corpus: three documents as lists of (word ID, frequency) pairs.
corpus = [[(1,10),(2,20)],[(3,30),(4,40)],[(5,50),(6,60)]]
# Teacher data: three label values per document (multi-label).
labels = [[100,500,900],[300,400,800],[200,600,700]]

dense = matutils.corpus2dense(corpus, 7)  # 7 = vocabulary size (word IDs 0-6)
print(dense)    #=> (*1)
print(dense.T)  #=> (*2)

clf = classifier(random_state=777)
clf.fit(dense.T, labels)

for target in [[[0,10,20, 0, 0, 0, 0]],  #=> (*3) one of the trained patterns
               [[0,10,20,30,40,50,60]],  #=> (*4) a blend of all patterns
               [[0,10,10, 0, 0, 0, 0],   #=> (*5) perturbed versions of each pattern
                [0, 0, 0,20,20, 0, 0],
                [0, 0, 0, 0, 0,30,30]]]:
    print(clf.predict(target))
    print(clf.predict_proba(target))
```
・ classifier

If you change the import source of the classifier algorithm, as in the commented-out line, the data will be classified by that algorithm instead.
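For example, the nearest neighbor method can be dropped in the same way (a sketch continuing from the sample script above; n_neighbors=1 is my choice, not something the article specifies):

```python
# Sketch: KNeighborsClassifier also accepts the 2-D labels directly.
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(dense.T, labels)
print(clf.predict(dense.T))  # with n_neighbors=1 this reproduces the teacher data
```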
The input corpus is a sparse matrix that describes the vocabulary frequencies of three documents:

・ First document: word ID 1 appears 10 times, word ID 2 appears 20 times, all other words 0 times
・ Next document: word ID 3 appears 30 times, word ID 4 appears 40 times, all other words 0 times
・ Last document: word ID 5 appears 50 times, word ID 6 appears 60 times, all other words 0 times
The teacher data labels describe the classification of each document:

・ First document: the first label has the value 100, the second label 500, the last label 900
・ Next document: the first label has the value 300, the second label 400, the last label 800
・ Last document: the first label has the value 200, the second label 600, the last label 700

In other words:

・ First label: three classes with the values 100, 200, 300
・ Next label: three classes with the values 400, 500, 600
・ Last label: three classes with the values 700, 800, 900
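As a quick check (not in the original script), the teacher data forms a documents-by-labels matrix:

```python
# Quick check: labels form a (documents x labels) matrix.
import numpy as np

labels = [[100,500,900],[300,400,800],[200,600,700]]
print(np.asarray(labels).shape)  #=> (3, 3): 3 documents, 3 labels each
```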
The classification algorithm provided by scikit-learn expects a dense matrix as input, so we have to convert the sparse matrix to a dense matrix.
You can use corpus2dense in the gensim.matutils module for this. Let's see the result (* 1).
dense (*1):

```
[[  0.   0.   0.]
 [ 10.   0.   0.]
 [ 20.   0.   0.]
 [  0.  30.   0.]
 [  0.  40.   0.]
 [  0.   0.  50.]
 [  0.   0.  60.]]
```
Huh? It is indeed a dense matrix that spells out the zeros explicitly, but the rows are word IDs and the columns are document numbers, so it does not correspond to labels as it is. To feed it into the classification algorithm, the rows and columns must be transposed (→ result (*2)).
dense.T (*2):

```
[[  0.  10.  20.   0.   0.   0.   0.]
 [  0.   0.   0.  30.  40.   0.   0.]
 [  0.   0.   0.   0.   0.  50.  60.]]
```
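Just to confirm the orientation (a quick check, not in the original script):

```python
# fit() expects (n_samples, n_features) = (documents, word IDs).
print(dense.shape)    #=> (7, 3): rows are word IDs, columns are documents
print(dense.T.shape)  #=> (3, 7): rows are documents, as fit() expects
```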
I would expect the use cases that need dense.T to overwhelmingly outnumber those that need dense as-is, so I am not sure why corpus2dense is specified this way.
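Incidentally (an alternative sketch, not used in the article), gensim's corpus2csc builds a scipy sparse matrix with the same terms-by-documents orientation, so the same transpose is needed:

```python
# corpus2csc also returns a (num_terms, num_docs) matrix, so it needs the
# same transpose before being handed to the classifier.
from gensim import matutils

corpus = [[(1,10),(2,20)],[(3,30),(4,40)],[(5,50),(6,60)]]
sparse = matutils.corpus2csc(corpus, num_terms=7)
print(sparse.T.toarray())  # documents-by-terms, same as dense.T above
```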
・ Classifier generation
```python
clf = RandomForestClassifier(random_state=777)
```
The Random Forest algorithm uses random numbers internally, so if you don't fix random_state, you'll get different results each time.
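For instance (a trivial check, not in the original):

```python
# With the same random_state, two forests trained on the same data always
# agree; remove the seed and the results can differ from run to run.
from sklearn.ensemble import RandomForestClassifier

X, y = [[0, 0], [1, 1], [0, 1]], [0, 1, 1]
a = RandomForestClassifier(random_state=777).fit(X, y)
b = RandomForestClassifier(random_state=777).fit(X, y)
assert (a.predict(X) == b.predict(X)).all()  # reproducible
```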
・ Classification
```python
print(clf.predict(target))
print(clf.predict_proba(target))
```
In this sample script, predict computes the classification result, and predict_proba computes the estimated probability values [^2].
Let's look at the results in order.
(*3) input [[0,10,20, 0, 0, 0, 0]]:

```
[[ 100.  500.  900.]]
[array([[ 0.8,  0.1,  0.1]]), array([[ 0.1,  0.8,  0.1]]), array([[ 0.1,  0.1,  0.8]])]
```
Since this is one of the trained patterns, the classification is expected to match the teacher data.

・ First label: value 100 with probability 0.8, 200 with 0.1, 300 with 0.1 → 100
・ Next label: value 400 with probability 0.1, 500 with 0.8, 600 with 0.1 → 500
・ Last label: value 700 with probability 0.1, 800 with 0.1, 900 with 0.8 → 900

The teacher data is indeed reproduced. Note that the class values are managed in numerical order, not in order of appearance in the teacher data.
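You can confirm this ordering with the classes_ attribute (a quick check continuing from the sample script above; the exact printed formatting may vary by scikit-learn version):

```python
# For a multi-label fit, classes_ is a list with one array of class values
# per label, sorted numerically rather than by order of appearance.
print(clf.classes_)
#=> roughly: [array([100., 200., 300.]), array([400., 500., 600.]), array([700., 800., 900.])]
```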
(*4) input [[0,10,20,30,40,50,60]]:

```
[[ 200.  600.  700.]]
[array([[ 0.3,  0.5,  0.2]]), array([[ 0.2,  0.3,  0.5]]), array([[ 0.5,  0.2,  0.3]])]
```
・ First label: value 100 with probability 0.3, 200 with 0.5, 300 with 0.2 → 200
・ Next label: value 400 with probability 0.2, 500 with 0.3, 600 with 0.5 → 600
・ Last label: value 700 with probability 0.5, 800 with 0.2, 900 with 0.3 → 700
(*5) input [[0,10,10,0,0,0,0],[0,0,0,20,20,0,0],[0,0,0,0,0,30,30]]:

```
[[ 100.  500.  900.]
 [ 100.  400.  800.]
 [ 200.  600.  700.]]
[array([[ 0.7,  0.2,  0.1],
       [ 0.4,  0.2,  0.4],
       [ 0.3,  0.5,  0.2]]),
 array([[ 0.1,  0.7,  0.2],
       [ 0.4,  0.4,  0.2],
       [ 0.2,  0.3,  0.5]]),
 array([[ 0.2,  0.1,  0.7],
       [ 0.2,  0.4,  0.4],
       [ 0.5,  0.2,  0.3]])]
```
First document:

・ First label: value 100 with probability 0.7, 200 with 0.2, 300 with 0.1 → 100
・ Next label: value 400 with probability 0.1, 500 with 0.7, 600 with 0.2 → 500
・ Last label: value 700 with probability 0.2, 800 with 0.1, 900 with 0.7 → 900

Next document:

・ First label: value 100 with probability 0.4, 200 with 0.2, 300 with 0.4 → 100
・ Next label: value 400 with probability 0.4, 500 with 0.4, 600 with 0.2 → 400
・ Last label: value 700 with probability 0.2, 800 with 0.4, 900 with 0.4 → 800

Last document:

・ First label: value 100 with probability 0.3, 200 with 0.5, 300 with 0.2 → 200
・ Next label: value 400 with probability 0.2, 500 with 0.3, 600 with 0.5 → 600
・ Last label: value 700 with probability 0.5, 800 with 0.2, 900 with 0.3 → 700
Note that the outermost dimension of predict_proba's return value is the label number. When sklearn.multiclass decomposes the problem into binary classification problems, the outermost iteration is inevitably over labels [^3], so I think this is a natural specification.
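If a per-document view is more convenient, the output can be regrouped (a sketch continuing from the script above; np.stack is my choice, not something the article uses):

```python
# Regroup predict_proba output from per-label to per-document.
import numpy as np

proba = clf.predict_proba(target)  # list of (n_docs, n_classes) arrays, one per label
per_doc = np.stack(proba, axis=1)  # shape (n_docs, n_labels, n_classes)
print(per_doc[0])                  # all label probabilities for the first document
```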
As this shows, what each dimension of the matrices at the interface corresponds to is complicated and easy to get wrong, especially in the multi-label case, so I think it was worthwhile to leave this memo.
I would appreciate it if you could point out any misunderstandings.
[^1]: Regarding Random Forest, there is the article Classifying news articles by scikit-learn and gensim.
[^2]: Depending on the classification algorithm, this value cannot necessarily be interpreted as a probability. For example, with DecisionTree the value seems to come out as either 0.0 or 1.0, since the tree is deterministic ("decided").
[^3]: I have not actually confirmed this.