liblinear is a machine learning library for linear classification of data with millions of instances and features. https://www.csie.ntu.edu.tw/~cjlin/liblinear/ Its distinguishing feature is that it can handle a huge number of features quickly, which sets it apart from other libraries.
Anyway, put it on your desktop and try it out. First, import what you need. Rather than opening a terminal (or cmd) and typing the ipython command right away, move to the working directory (folder) first and then launch ipython.
import sys
# Add liblinear's bundled Python bindings to the module search path
sys.path.append('/Users/xxxx/Desktop/liblinear-2.1/python')
from liblinearutil import *
from liblinear import *
The data used is news20, from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html The data format liblinear expects is a special one; to actually use the library, you need to process your data into this shape.
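For reference, each line of a LIBSVM-format file is "label index1:value1 index2:value2 ...", where the feature indices are 1-based, ascending, and only non-zero features appear. The concrete line below is made up purely for illustration:

3 1:0.5 28:1 62017:0.25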
y, x = svm_read_problem('news20.txt')
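svm_read_problem returns the labels as a list of numbers and the features as a list of {index: value} dicts, one per instance. A quick peek (the printed values below are illustrative, not from an actual run):

print(y[0])                        # e.g. 1.0
print(sorted(x[0].items())[:3])    # e.g. [(12, 0.084), (27, 0.108), (35, 0.061)]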
train → predict
In [20]: len(y)
Out[20]: 15935
# There are 15935 data points, so train on the first 5000 as teacher (training) data.
m = train(y[:5000], x[:5000])
save_model('news20.model', m)
optimization finished, #iter = 1000
WARNING: reaching max number of iterations
Using -s 2 may be faster (also see FAQ)
Objective value = -38.201637
nSV = 1028
...
optimization finished, #iter = 17
Objective value = -18.665411
nSV = 903
I'm not entirely sure what all of this means, but it looks like training succeeded. Save the created model and take a look inside. It is generated in the working folder as news20.model.
weight = open('news20.model').readlines()
weight[:10]
['solver_type L2R_L2LOSS_SVC_DUAL\n', 'nr_class 20\n', 'label 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20\n', 'nr_feature 62017\n', 'bias -1\n', 'w\n', '-0.339495505987624 -0.2729835882642053 -0.1726449590446147 -0.2479101793530862 -0.4274669775000416 -0.2412066297893888 -0.2293917779069297 -0.1540898055174211 -0.215426735582579 -0.2955027766952972 -0.07665514316560521 -0.2067955978156952 -0.2129682323900661 -0.3178416173675406 -0.1100450398128613 -0.1089058297966 0.2118441015471185 -0.1789025390838444 -0.2308991526979358 -0.3216302447541755 \n', '0.03464116990799743 0.03296276686709169 -0.005516289618528965 0 0 8.487270131488089e-19 -0.03693284638681263 0 0 0 -0.0005436471560843025 0 4.336808689942018e-19 0 0 0 -1.355252715606881e-20 0.005881877772996123 0.0004078249397363432 -0.005592803559260878 \n', '0 0 0 0 -0.006337527074141217 0 -0.01043809306013021 -0.02848401075118318 -0.02192217208113558 0 -0.002743696876587976 -0.002823046244597745 5.421010862427522e-19 0 -0.01184141317622985 -0.00327656833111874 -0.00300798970221013 0.07620931881353635 0.07709902339068471 -0.007496992406231962 \n', '0 0.000336438903090087 -0.002105522336459381 -0.003408253600602967 0.04532864192038737 0.00358490636419236 -0.01288493688454648 -0.03829009043077678 -0.02192217208113558 0 -0.002743696876587976 -0.006148372938504376 0.04416917489366715 0 -0.03749035441444219 0.00486249738297638 -0.003188508027714593 0.1323725656877747 0.09645265180639011 -0.01123137774909418 \n']
There are 20 labels, and each has a weight for every feature. The vocabulary.txt from http://qwone.com/~jason/20Newsgroups/ serves as the feature index, so now you can see which words are effective for the classification.
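As a minimal sketch of that idea (assuming the feature numbering of the news20 file lines up with the line numbers of vocabulary.txt, and that vocabulary.txt is in the working directory), the top-weighted words for the first label could be pulled out like this:

lines = open('news20.model').readlines()
vocab = [w.strip() for w in open('vocabulary.txt')]
# Weight rows start right after the 'w' header line;
# row i holds the 20 per-label weights of feature i+1.
start = lines.index('w\n') + 1
weights = [[float(v) for v in line.split()] for line in lines[start:] if line.strip()]
# Ten features with the largest weight for the first label (label 1).
top = sorted(enumerate(weights), key=lambda t: t[1][0], reverse=True)[:10]
for i, ws in top:
    word = vocab[i] if i < len(vocab) else '(index beyond vocabulary.txt)'
    print(word, ws[0])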
predict
p_label, p_acc, p_val = predict(y[5000:], x[5000:], m)
Accuracy = 74.3576% (8131/10935) (classification)
The accuracy is about 74%, which seems fair enough.
Let's look at the prediction results. First, put the correct answers y, the predicted labels p_label, and the per-label scores p_val into a DataFrame so they are easy to inspect.
import pandas as pd
a = pd.DataFrame([y[5000:], p_label, p_val])
a[0]
a[0][2]
0                                                    1
1                                                    1
2    [-0.434941406833, -2.4992939688, -1.9156773889...
Name: 0, dtype: object
[-0.43494140683299093, -2.499293968803961, -1.9156773889387406, -1.652996684855934, -0.64663025115734, -1.981531321375946, -2.0506304515990794, -1.9845217707935987, -1.816531448715213, -1.9993917151454117, -2.6192052686130403, -2.375782174561902, -2.1841316767499994, -2.787946449405093, -1.981463462884227, -2.4769599630955956, -1.3508140247538216, -1.7235783924583472, -1.7785165908522975, -2.2096245620379604]
The first score (label 1) is the largest, so the prediction is 1, which matches the correct answer.
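To confirm that the predicted label is simply the one with the largest decision value (get_labels() is part of liblinear's Python model interface, and its label order matches the columns of p_val):

labels = m.get_labels()
print(labels[p_val[0].index(max(p_val[0]))])   # same as p_label[0]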
Let's see how well the predictions match overall.
b = pd.DataFrame([y[5000:], p_label])
b
       0   1   2   3   4   5   6   7   8   9  ...  10925  10926  10927  10928  10929  10930  10931  10932  10933  10934
0      1   2   3   4   5   6   7   8   9  10  ...     18     19      7      9     15     16     17     18     19     17
1      1   2   2   4   5   7   4   8   9   4  ...     18     18      7      9     15     16     17     15     19     17
It seems the classifier does a reasonably good job.
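As a quick numeric sanity check, the match rate between the two rows can be computed directly; it should agree with the accuracy reported above:

print((b.loc[0] == b.loc[1]).mean())   # roughly 0.74 (8131/10935)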
By the way, I don't really understand the parameters yet ... And the other question is how to create data in this format ...
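For the parameters, train accepts an option string (this is liblinear's standard interface; the -s 2 suggestion comes straight from the warning in the training log above):

m = train(y[:5000], x[:5000], '-s 2 -c 1')   # -s picks the solver, -c sets the cost parameter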
For converting from CSV, I found phraug's csv2libsvm.py: https://github.com/zygmuntz/phraug https://github.com/zygmuntz/phraug/blob/master/csv2libsvm.py
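If you'd rather not depend on that script, a minimal hand-rolled converter along the same lines (this is a sketch, not the phraug script itself; it assumes the label is in the first CSV column and the remaining columns are numeric features) might look like:

import csv

def csv_to_libsvm(src, dst):
    # Write 'label idx:value ...' per row, keeping only non-zero features;
    # LIBSVM feature indices are 1-based.
    with open(src) as f_in, open(dst, 'w') as f_out:
        for row in csv.reader(f_in):
            label, feats = row[0], row[1:]
            pairs = ['%d:%s' % (i + 1, v) for i, v in enumerate(feats) if float(v) != 0]
            f_out.write(label + ' ' + ' '.join(pairs) + '\n')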
If anyone knows a good way to do this, please let me know.