This time, we will cluster Qiita articles from a few specified tags using unsupervised learning (the k-means method) and check how well the clusters match the tags.
See the following article for how to get the article data for a specified tag.
・ How to get article data using Qiita API https://qiita.com/wakudar/items/8c594c8cc7bda9b93b4e
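# Word segmentation (wakati-gaki) with MeCab
import MeCab

def wakatigaki(text):
    mecab = MeCab.Tagger()
    # MeCab returns one token per line; drop the trailing "EOS" marker
    mecab_result = mecab.parse(text).replace("EOS", "").split('\n')
    # Strip "#" and quotes, then split each line on tab/comma so that element 0 is the surface form
    mecab_result = [i.replace("#", "").replace("\"", "").replace("\'", "").replace("\t", "_").replace(",","_").split("_") for i in mecab_result if i != ""]
    return mecab_result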
# Load the article data and label each article with its tag
import glob

def load_article():
    category = ["Vagrant", "iOS", "numpy"]
    category_num = [0, 1, 2]
    docs = []
    labels = []
    labels_num = []
    for c_name, c_num in zip(category, category_num):
        files = glob.glob("./qiita/{c_name}/*.txt".format(c_name=c_name))
        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                lines = f.read().splitlines()
            # Join all lines into one string and drop full-width spaces
            body = "".join(lines).replace('\u3000', '')
            # Keep only the surface form of each token, separated by spaces
            text = " ".join([w[0] for w in wakatigaki(body)])
            docs.append(text)
            labels.append(c_name)
            labels_num.append(c_num)
    return docs, labels, category

# Load the articles
docs, labels, category = load_article()
The article data is saved as ./qiita/<tag name>/*.txt, one text file per article. This time, we will estimate the categories of articles from three tags saved in advance: "Vagrant", "iOS", and "numpy".
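As a minimal sketch of the directory layout described above, the following walks a `<root>/<tag name>/*.txt` tree and pairs each file's text with its tag. The helper names `iter_tagged_files` and `build_sample_tree`, the file names, and the file contents are all made up for illustration, and the MeCab step is left out so the example runs with the standard library only.

```python
from pathlib import Path
import tempfile


def iter_tagged_files(root, tags):
    """Yield (tag, text) for every .txt file under root/<tag>/ (hypothetical helper)."""
    for tag in tags:
        # sorted() makes the order reproducible; glob order is otherwise arbitrary
        for path in sorted(Path(root, tag).glob("*.txt")):
            yield tag, path.read_text(encoding="utf-8")


def build_sample_tree(root):
    """Create a tiny <root>/<tag>/<name>.txt tree with made-up contents."""
    samples = {
        "Vagrant": {"a.txt": "vagrant up", "b.txt": "vagrant ssh"},
        "iOS": {"c.txt": "swift ui"},
    }
    for tag, files in samples.items():
        tag_dir = Path(root, tag)
        tag_dir.mkdir(parents=True)
        for name, text in files.items():
            (tag_dir / name).write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_sample_tree(tmp)
    loaded = list(iter_tagged_files(tmp, ["Vagrant", "iOS"]))

print(loaded)
```

Sorting the paths keeps the document order stable between runs, which matters when document indices are later used as file names.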
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Number of clusters: one per tag (3 here)
n_cluster = len(category)
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Convert the documents to TF-IDF vectors
vecs = vectorizer.fit_transform(docs)
# Run the k-means method
kmeans_model = KMeans(n_clusters=n_cluster, random_state=0).fit(vecs)
# Cluster label assigned to each document
predict_labels = kmeans_model.labels_
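To make the vectorization step concrete, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The function name `tfidf` and the toy documents are made up for illustration, and the sketch simply splits on whitespace, whereas scikit-learn also lowercases the text and uses its own token pattern.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # L2-normalize so that every non-empty document vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


toy_docs = ["vagrant up box", "swift ios app", "numpy array ios"]
toy_vectors = tfidf(toy_docs)
for vector in toy_vectors:
    print(vector)
```

In the toy data, "ios" appears in two documents, so it gets a lower weight than words that appear in only one. Words shared across tags contribute less to separating the clusters for the same reason.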
# Count, for each cluster, how many articles of each tag it contains
res = {
    0: {},
    1: {},
    2: {}
}
for pre_label, r_label in zip(predict_labels, labels):
    # Add 1 to the count, starting from 0 the first time a tag appears in a cluster
    res[pre_label][r_label] = res[pre_label].get(r_label, 0) + 1
# Print the counts for each cluster
for i in range(n_cluster):
    print(res[i])
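The same tally can also be written with `collections.Counter`, which removes the need to create one dictionary per cluster by hand. This is only a sketch on made-up labels, and `tally_clusters` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def tally_clusters(predicted, actual):
    """Count, per cluster id, how many items of each true label it holds."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(predicted, actual):
        counts[cluster_id][label] += 1
    return counts


# Made-up example: 6 articles assigned to 2 clusters
toy_predicted = [0, 0, 1, 1, 0, 1]
toy_actual = ["iOS", "iOS", "Vagrant", "Vagrant", "Vagrant", "iOS"]
toy_counts = tally_clusters(toy_predicted, toy_actual)
for cluster_id in sorted(toy_counts):
    # most_common(1) gives the majority label of the cluster
    majority = toy_counts[cluster_id].most_common(1)[0][0]
    print(cluster_id, dict(toy_counts[cluster_id]), majority)
```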
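# Majority category name of each cluster
major_cat = []
# Index (in category) of each cluster's majority category
major_num = []
for i in range(n_cluster):
    major_cat.append(max(res[i], key=res[i].get))
    major_num.append(category.index(major_cat[i]))
# Expected cluster id for every article, in the same order as docs
adjusted_labels = []
# Number of articles in each category
article_num = [900, 900, 900]
for i in range(n_cluster):
    # Category i is expected in the cluster whose majority category is i
    # (assumes each category is the majority of exactly one cluster)
    adjusted_labels.extend([major_num.index(i)] * article_num[i])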
import os

# For every article whose expected and predicted labels differ, write the article text to a file
for cnt, (label1, label2) in enumerate(zip(adjusted_labels, predict_labels)):
    if label1 != label2:
        # The output directory is named "<expected label>-<predicted label>"
        dir_w = "./result/" + str(label1) + "-" + str(label2)
        os.makedirs(dir_w, exist_ok=True)
        # cnt is the article's index in docs and is also used as the file name
        path_w = dir_w + "/" + str(cnt) + ".txt"
        with open(path_w, mode='w', encoding="utf-8") as f:
            f.write(docs[cnt])
{'Vagrant': 108, 'iOS': 900, 'numpy': 333}
{'Vagrant': 792}
{'numpy': 567}
The accuracy for each tag came out as iOS: about 67% (900 of the 1,341 articles in its cluster), Vagrant: about 88% (792 of its 900 articles), and numpy: about 63% (567 of its 900 articles).
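The percentages above can be reproduced from the three printed dictionaries. The sketch below computes two figures for each cluster's majority tag: the share of the cluster taken by that tag, and the share of that tag's articles that landed in the cluster. The variable names are my own, and the counts are copied from the printed result.

```python
# Per-cluster counts copied from the printed result
clusters = [
    {"Vagrant": 108, "iOS": 900, "numpy": 333},
    {"Vagrant": 792},
    {"numpy": 567},
]

# Total number of articles per tag across all clusters
totals = {}
for counts in clusters:
    for tag, n in counts.items():
        totals[tag] = totals.get(tag, 0) + n

report = {}
for counts in clusters:
    major = max(counts, key=counts.get)
    report[major] = {
        # Share of the cluster that belongs to the majority tag
        "share_of_cluster": counts[major] / sum(counts.values()),
        # Share of the tag's articles that ended up in this cluster
        "share_of_tag": counts[major] / totals[major],
    }

for tag, scores in report.items():
    print(tag, {name: round(value, 3) for name, value in scores.items()})
```

With these counts, the 67% for iOS is the share of its cluster, while the 88% and 63% for Vagrant and numpy are the share of each tag's articles. All 900 iOS articles fall in the first cluster, but that cluster also absorbs 108 Vagrant and 333 numpy articles.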
This time the accuracy wasn't very good... The program is almost the same as the one I used for the Livedoor news corpus, so it is possible that the source code included in many Qiita articles is affecting the result. To improve accuracy, I think I will need to consider other approaches, such as a different learning method, and I would like to try that when I have time!
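One way to test the guess that source code is skewing the vectors would be to drop code from each article before tokenizing it. The following is a rough sketch, assuming the saved article bodies are Markdown with triple-backtick fences. `strip_code` is a hypothetical helper, and whether this actually improves the clustering has not been tested here.

```python
import re

# A fenced block: a line starting with three backticks up to the next such line
FENCED_BLOCK = re.compile(r"^`{3}.*?^`{3}[^\n]*$", re.DOTALL | re.MULTILINE)
# Inline code: text between single backticks on one line
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_code(markdown):
    """Remove fenced code blocks and inline code spans from Markdown text."""
    without_blocks = FENCED_BLOCK.sub("", markdown)
    return INLINE_CODE.sub("", without_blocks)


# Build the fence marker in code so this example contains no literal fence line
fence = "`" * 3
sample = "\n".join([
    "Install the package first.",
    fence + "python",
    "import numpy as np",
    fence,
    "Then call `np.array` as usual.",
])
stripped = strip_code(sample)
print(stripped)
```

The stripped text could then be passed to `wakatigaki` in place of the raw article body.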
・ For this experiment, I searched several combinations of tags and selected the three tags with the least overlap (Vagrant, iOS, numpy).
・ I first tried classifying the Android and iOS tags, but the results were disappointing. I suspect that many smartphone development articles carry both tags.
・ Unsupervised sentence classification (sentence clustering) [python] https://appswingby.com/2019/08/15/python%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E6%96%87%E7%AB%A0%E5%88%86%E9%A1%9E%EF%BC%88%E6%96%87%E7%AB%A0%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%EF%BC%89/
・ Qiita tag list https://qiita.com/tags
・ Articles with both the tag "iOS" and the tag "Vagrant" https://qiita.com/search?q=tag%3A+iOS+tag%3AVagrant