Patent texts are long, so I want to read them efficiently, or at least grasp the overall tendency of a patent group. For that, it helps to classify the sentences along a "problem (purpose)" axis and a "solution" axis and map them. The figure looks like the one below.
Reference: http://www.sato-pat.co.jp/contents/service/examination/facture.html
I want to extract this problem axis and solution axis (labels) automatically from the text. The motivation is almost the same as in this article. One candidate method is LDA, but with plain LDA you cannot steer the topics. Guided LDA is a way for a human to nudge the model: "I want this topic to contain (emphasize) these words." Let's see whether the axes can be set up the way we want.
See here and here for an overview of Guided LDA, and the official repository.
First, let's get it to the point where it produces output.
!pip install guidedlda
import numpy as np
import pandas as pd
import guidedlda
# Function that builds a document-term count matrix (one-hot encoded result) from a corpus
from sklearn.feature_extraction.text import CountVectorizer

def get_X_vocab(corpus):
    vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
    X = vectorizer.fit_transform(corpus)
    return X.toarray(), vectorizer.get_feature_names()
# Function to print the top words for each topic
def out1(model, vocab):
    n_top_words = 10
    dic = {}
    topic_word = model.topic_word_
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
        print('Topic {}: {}'.format(i, ' '.join(topic_words)))
        dic['topic'+str(i)] = ' '.join(topic_words)
    return dic
Tokenize with MeCab (it does not have to be MeCab). col is the column to be processed.
df[col+'_1g']= df[col].apply(wakati,args=('DE',))
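The wakati function used above is not shown in the article; a minimal sketch using MeCab might look like the following. The second argument ('DE') and the '|'-joined output are assumptions, based only on how the column is later split with split("|").

import MeCab

# Hypothetical sketch of the tokenizer used above: splits Japanese text into
# words with MeCab and joins them with '|', matching the later split("|").
# The mode argument (e.g. 'DE') is assumed to control part-of-speech filtering
# in the original; it is ignored in this simplified sketch.
def wakati(text, mode='DE'):
    tagger = MeCab.Tagger('-Owakati')
    tokens = tagger.parse(str(text)).strip().split()
    return '|'.join(tokens)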
col_name = "Problem that the invention tries to solve _1g" # @param {type: "string"} col_name2 = "Claims _1g" # @ param {type: "string"}
df[col_name].replace({'\d+':''},regex=True,inplace=True)
df[col_name2].replace({'\d+':''},regex=True,inplace=True)
#Corpus ⇒ X (Document word frequency matrix & vocal list output corpus = df[col_name].apply(lambda x:" ".join(x.split("|"))) X,vocab = get_X_vocab(corpus) word2id = dict((v,idx) for idx,v in enumerate(vocab))
#Corpus ⇒ X (Document word frequency matrix & vocal list output corpus2 = df[col_name2].apply(lambda x:" ".join(x.split("|"))) X2,vocab2 = get_X_vocab(corpus2) word2id2 = dict((v,idx) for idx,v in enumerate(vocab2))
print ("Extracted vocabulary list ---------------") print(vocab) print ("word count:" + str (len (vocab))) pd.DataFrame (vocab) .to_csv (col_name + "word list.csv") print(vocab2) print ("word count:" + str (len (vocab2))) pd.DataFrame (vocab2) .to_csv (col_name2 + "word list.csv") print ("The word list was saved in a virtual file as" word list.xlsx "")
Choosing the seed words for each axis properly is tedious, so here I just pick them roughly (slices of the vocabulary list).
# Specifying the seed word list here is the really important part
topic0_subj = ",".join(vocab[51:60])
topic1_subj = ",".join(vocab[61:70])
topic2_subj = ",".join(vocab[71:80])
topic3_subj = ",".join(vocab[81:90])
topic4_subj = ",".join(vocab[91:100])
topic5_subj = ",".join(vocab[101:110])
topic6_subj = ",".join(vocab[111:120])
input_topic0 = topic0_subj.split(",")
input_topic1 = topic1_subj.split(",")
input_topic2 = topic2_subj.split(",")
input_topic3 = topic3_subj.split(",")
input_topic4 = topic4_subj.split(",")
input_topic5 = topic5_subj.split(",")
input_topic6 = topic6_subj.split(",")
topic_list = [input_topic0
,input_topic1
,input_topic2
,input_topic3
,input_topic4
,input_topic5]
seed_topic_list = []
for k, topic in enumerate(topic_list):
    if topic[0] == "":
        pass
    else:
        seed_topic_list.append(topic)

# The number of topics is the number of seeded topics + 1 (an extra "other" topic)
num_topic = len(seed_topic_list) + 1
s_conf = 0.12 #@param {type:"slider", min:0, max:1, step:0.01}
model = guidedlda.GuidedLDA(n_topics=num_topic, n_iter=100, random_state=7, refresh=20)
seed_topics = {}
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        seed_topics[word2id[word]] = t_id
model.fit(X, seed_topics=seed_topics, seed_confidence=s_conf)
# Get the document-topic distribution from the seeded model
# (calling fit_transform again with empty seed_topics would refit and discard the seeding)
docs = model.transform(X)
print(docs)
print ("Result --- Typical words for each topic after learning ------------------------------ ---------- ") print ("The last topic was automatically inserted" Other "topic ----------------------------") dic = out1(model,vocab)
print ("Result of topic assignment to each application ---------------------------------------- ---------------- ") print("") df["no"]=df.index.tolist() df ['LDA result_subj'] = df ["no"] .apply (lambda x: "topic" + str (docs [x] .argmax ())) df [["Application number", "LDA result_subj"]] df ['LDA result_subj'] = df ['LDA result_subj'] .replace (dic)
The solution axis is processed in the same way, as sketched below.
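The code for the solution axis is not reproduced in the article; a sketch that repeats the same pipeline on the claims column (using the X2, vocab2, and word2id2 prepared above) might look like this. The column name 'LDA result_kai' is the one used in the crosstab below; the seed slices are placeholders.

# Assumed sketch: the same Guided LDA pipeline applied to the claims column
seed_topic_list2 = [list(vocab2[51:60]), list(vocab2[61:70]), list(vocab2[71:80])]  # placeholder seeds
num_topic2 = len(seed_topic_list2) + 1
model2 = guidedlda.GuidedLDA(n_topics=num_topic2, n_iter=100, random_state=7, refresh=20)
seed_topics2 = {}
for t_id, st in enumerate(seed_topic_list2):
    for word in st:
        seed_topics2[word2id2[word]] = t_id
model2.fit(X2, seed_topics=seed_topics2, seed_confidence=s_conf)
docs2 = model2.transform(X2)
dic2 = out1(model2, vocab2)
df['LDA result_kai'] = df["no"].apply(lambda x: "topic" + str(docs2[x].argmax()))
df['LDA result_kai'] = df['LDA result_kai'].replace(dic2)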
ct = pd.crosstab(df['LDA result_kai'], df['LDA result_subj'], df['Application number'], aggfunc=','.join)
ct
Result ↓ One thing I devised: if the output is left as-is, the axis names just show up as "topic N", so I made it output the top 10 representative words contained in each topic instead.
If you want to display the number of cases instead:
ct = pd.crosstab(df['LDA result_kai'], df['LDA result_subj'], df['Application number'], aggfunc=np.size)
~~It was a mess...~~ The words are grouped reasonably well, so next time I should try to properly reproduce the map made by humans. The code also feels redundant (the same processing is duplicated for the two axes), so I need to think about how to write it more concisely.
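As one idea for reducing that redundancy (my own sketch, not from the original article): the Guided LDA step could be wrapped in a single function and called once per axis.

# Sketch: wrap the Guided LDA step so it can be reused for both axes
def run_guided_lda(X, vocab, seed_topic_list, s_conf=0.12, n_iter=100):
    word2id = dict((v, idx) for idx, v in enumerate(vocab))
    seed_topics = {}
    for t_id, st in enumerate(seed_topic_list):
        for word in st:
            seed_topics[word2id[word]] = t_id
    model = guidedlda.GuidedLDA(n_topics=len(seed_topic_list) + 1, n_iter=n_iter,
                                random_state=7, refresh=20)
    model.fit(X, seed_topics=seed_topics, seed_confidence=s_conf)
    docs = model.transform(X)
    dic = out1(model, vocab)
    return docs, dic

# Usage (assumed): one call per axis
# docs, dic = run_guided_lda(X, vocab, seed_topic_list, s_conf)
# docs2, dic2 = run_guided_lda(X2, vocab2, seed_topic_list2, s_conf)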