The other day, I noticed this statement by Shinjiro Koizumi.
"Imagine the future of the Diet, and when the same question is asked many times, I want you to use artificial intelligence. There are many things you can do, such as the use of artificial intelligence and the future of the Diet. Yes ”(Liberal Democratic Party, Mr. Koizumi“ Similar question, in the future, flip with AI ”)
So I wondered whether that could actually be done. While also studying Doc2Vec, I tried it on question statements (質問主意書), for which data is relatively easy to obtain. With BERT (Bidirectional Encoder Representations from Transformers) and ELMo (Embeddings from Language Models) appearing these days, some people may ask "Doc2Vec?", but I think it is still perfectly serviceable for light analysis: it is packaged in gensim and easy to use.
When you think of questions in the Diet, you probably picture a Diet member questioning a minister orally. Questioning the Cabinet about bills and policies is an important part of a lawmaker's job, but such questions can also be submitted in writing. That is the question statement.
Time for oral questions in the Diet is limited, especially for smaller parties, so the question statement is often used in such cases. It has long been an important tool of pursuit for Diet members, for example in pressing the government over the HIV-tainted blood scandal.
As a general rule, the Cabinet must respond in writing within seven days of receiving a question statement, and the response must be approved at a cabinet meeting. Partly because it has passed a cabinet decision, the answer remains on record as the government's official position, so it appears to be a heavy burden for the ministries that draft it.
Reference: "What is the 'question statement' that Kasumigaseki dreads?" (ABC of Current Affairs Terms). As a preview, I made a word cloud of the submitters of question statements over the period covered this time. Muneo Suzuki's presence is overwhelming...
Doc2Vec
Doc2Vec (document-to-vector) represents the features of a document as a vector. It was proposed by Tomas Mikolov, who also developed Word2Vec, and comes with the following two training methods.
In PV-DM, the document vector is trained so that, concatenated with the vectors of the surrounding words, it predicts the next word.
In PV-DBOW, the document vector is trained to predict the words contained in the document, ignoring word order. Distributed representations learned this way are said to be cheaper to obtain than with PV-DM, but less accurate. (Reference: "How to use natural language processing technology: predicting the quality of papers with Doc2Vec and DAN!")
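As a rough illustration (not from the original article), here is a minimal gensim sketch showing how the two modes are selected; the toy documents are made up, and the dm parameter is what switches between PV-DM and PV-DBOW.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: each document is a list of tokens plus a tag (document id)
toy_docs = [TaggedDocument(words=["税制", "改正", "質問"], tags=[0]),
            TaggedDocument(words=["年金", "制度", "質問"], tags=[1])]

# dm=1 -> PV-DM: the document vector plus surrounding word vectors predict the next word
model_dm = Doc2Vec(documents=toy_docs, vector_size=50, window=2, min_count=1, dm=1)

# dm=0 -> PV-DBOW: the document vector alone predicts the words in the document, ignoring order
model_dbow = Doc2Vec(documents=toy_docs, vector_size=50, min_count=1, dm=0)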
This time I use the question statements of the House of Representatives and the House of Councillors obtained by crawling and scraping. I originally planned to extract everything back to the 1st Diet session, but my resolve broke along the way, so I limited the extraction to the Heisei and Reiwa eras: the 114th through the 200th Diet sessions. Since Doc2Vec is the main topic, I omit the crawling and scraping code and show only the results. In this article the model is built on the 114th through 199th Diet sessions, and similarity is then estimated for the question statements of the 200th session.
Before looking at the data, let's import the libraries. Since the plots contain Japanese text, I use the font NotoSansMonoCJKjp-Regular.otf for that purpose. Farewell to tofu (the empty boxes you get when Japanese glyphs are missing).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
from matplotlib.font_manager import FontProperties
font_path = "/mnt/c/u_home/font/NotoSansMonoCJKjp-Regular.otf"
font_prop = FontProperties(fname=font_path)
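Any Japanese text drawn with matplotlib then needs this font property passed explicitly, for example (an illustrative line, not from the original code):
plt.title("質問主意書の件数", fontproperties=font_prop)  # Japanese title rendered with the CJK font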
By the way, the data looks like this. From the top: the question statements from the 114th through 199th Diet sessions, the question statements of the 200th Diet session, and the periods during which the Diet was in session (obtained from [this page](http://www.shugiin.go.jp/internet/itdb_annai.nsf/html/statics/shiryo/kaiki.htm)).
data = pd.read_excel("data.xlsx")
data_200 = pd.read_excel("data_200.xlsx")
data_diet_open = pd.read_excel("data_diet_open.xlsx")
Sort by date and display the top two and bottom two rows.
data.sort_values(by=['date']).iloc[np.r_[0:2, -2:0]]
As a supplement, the columns are:
time: the Diet session in which the question statement was submitted
diet: House of Representatives or House of Councillors
title: the title of the question statement
name: the submitter of the question statement
date: the date the question statement was submitted
question: the main text of the question statement
q_numb: the number of questions contained in the question statement (described in more detail later)
year: the year the question statement was submitted
month: the month the question statement was submitted
day: the day the question statement was submitted
The oldest document in the data is the "Question Statement on the Interest-Rate Regulation of Money Lenders" submitted by Juji Inokuma on 1989-01-20, and the newest is the "Question Statement on 'Japan Post Group's Investigation of Contracts and Efforts for Improvement'" submitted by Yukihito Koga on 2019-08-05. There are 15,515 records. Not too small a dataset, I'd say?
data.shape
(15515, 10)
The same goes for the data of the 200th Diet session; it has 167 records.
data_200.sort_values(by=['date']).iloc[np.r_[0:2, -2:0]]
Finally, a list of parliamentary sessions
EDA
Let's explore the data a bit before moving on to Doc2Vec.
First, a plot by year. Submissions have risen sharply since around 2006.
rcParams['figure.figsize'] = 8,4
data.groupby('year').size().plot.bar()
Next, by month. June has the most, and July through September have few.
rcParams['figure.figsize'] = 8,4
data.groupby('month').size().plot.bar()
Finally, by day of the month. The 31st is low simply because fewer months have a 31st; apart from that there is no particular pattern.
rcParams['figure.figsize'] = 8,4
data.groupby('day').size().plot.bar()
To see how many question statements were submitted on each day, I count submissions per date, fill the days without submissions with zero, and build a time series running from the first day of the 114th Diet (December 30, 1988) to the last day of the 199th Diet (August 5, 2019).
def convert_to_daily(data):
    # set the first and last days of the target period
    time_index = pd.DataFrame(index=pd.date_range('19881230','20190805'))
    # count the number of question statements per date, in a column named 'doc_num'
    doc_num_daily = pd.DataFrame(data.groupby('date').size(), columns=['doc_num'])
    data_daily = pd.concat([time_index, doc_num_daily], axis=1)  # merge
    data_daily['doc_num'] = data_daily['doc_num'].fillna(0)  # fill missing values with zero
    data_daily = data_daily.iloc[:,0]  # convert to a pandas Series
    return data_daily

data_daily = convert_to_daily(data)  # run
I also wrote a function that shades the Diet session periods in gray in the plots.
def plot_daily_data(data_daily, start_day, end_day):
    # Diet sessions that end within the plotted period (data_diet_open is used as a global here)
    subdata = data_diet_open[(data_diet_open['end'] >= start_day) & (data_diet_open['end'] <= end_day)].sort_values(by='diet_time').reset_index(drop=True)
    plt.plot(data_daily.loc[start_day:end_day])
    plt.title("Number of documents between " + start_day + " and " + end_day)
    for i in range(subdata.shape[0]):
        plt.axvspan(subdata.start[i], subdata.end[i], color=sns.xkcd_rgb['grey'], alpha=0.5)
So plot.
rcParams['figure.figsize'] = 20,5
start_day = "1988-12-30"; end_day = "2019-12-31"
plot_daily_data(data_daily, start_day, end_day)
It seems to be trending upward, and appears to have been increasing since 2004, but it is hard to see... so I split the period into three parts.
start_day = "1988-12-30"; end_day = "1999-12-31"
plot_daily_data(data_daily, start_day, end_day)
start_day = "2000-01-01"; end_day = "2009-12-31"
plot_daily_data(data_daily, start_day, end_day)
start_day = "2010-01-01"; end_day = "2019-12-31"
plot_daily_data(data_daily, start_day, end_day)
Since question statements can only be submitted while the Diet is in session, submissions clearly spike at the end of each session (the right edge of each gray band). Also, although I split the graph into three panels, the scale on the left axis grows from panel to panel, which shows clearly that question statements are used far more often these days.
Here is a histogram of the number of question statements submitted per day (days with at least one submission). A beautifully downward-sloping distribution.
plt.hist(data_daily[data_daily > 0], bins=50)
The maximum is 84, on September 25, 2015! What happened that day will be described later...
data_daily.max()
doc_num 84.0
dtype: float64
data.groupby('date').size().idxmax()
Timestamp('2015-09-25 00:00:00')
Let's look at which Diet members submit question statements frequently. The data contains 789 distinct submitter names. Note that joint submissions are kept as a single name, such as "Seiken Akamine, Chizuko Takahashi, Hidekatsu Yoshii".
len(data.name.unique())
789
Without further ado, here are the top 30 submitters.
subdata = data.groupby('name').filter(lambda x: len(x) >= 100)
len(subdata.name.unique())
30  # 30 members have submitted 100 or more question statements
plot_value = subdata.groupby('name').size().sort_values(ascending=False)
plot_index = subdata.groupby('name').size().sort_values(ascending=False).index
rcParams['figure.figsize'] = 20,5
plt.bar(plot_index, plot_value)
plt.xticks(plot_index, rotation=90, fontproperties=font_prop)
for i, v in enumerate(plot_value):
    plt.text(i - 0.5, v + 5, str(v), color='blue')
Muneo Suzuki leads by a huge margin over second place and below. 2,155 of them... [Wikipedia](https://ja.wikipedia.org/wiki/%E8%B3%AA%E5%95%8F%E4%B8%BB%E6%84%8F%E6%9B%B8#%E6%8F%90%E5%87%BA%E6%95%B0) describes this nicely.
"An example of a lot of submissions is Muneo Suzuki of the New Party Daichi, who submitted 1900 questions in the opposition era, also known as the" King of Questions. " Muneo left the Diet after losing his job in 2010 (Heisei 22), but after that he continued to submit the questionnaire to Takahiro Asano, who became the successor to the same New Party Daichi. Furthermore, when Takako Suzuki, the eldest daughter of Muneo, was elected in June 2013, she has been attacking through Takako with a written inquiry. ”
Muneo Suzuki's presence is said to have greatly changed the role of the question statement; an NHK article also touches on this. So let's look at the number of question statements submitted by Muneo Suzuki.
muneo_daily = convert_to_daily(data[data['name']=="Muneo Suzuki"].reset_index(drop=True))
rcParams['figure.figsize'] = 20,5
plt.plot(data_daily)
plt.plot(muneo_daily)
for i in range(data_diet_open.shape[0]):
    plt.axvspan(data_diet_open.start[i], data_diet_open.end[i], color=sns.xkcd_rgb['grey'], alpha=0.4)
The orange line shows the number of question statements submitted by Muneo Suzuki. Zooming in on the period when he was submitting them:
start_day = "2005-10-03"; end_day = "2010-08-04"
subdata = data_diet_open[(data_diet_open['end'] >= start_day) & (data_diet_open['end'] <= end_day)].sort_values(by='diet_time').reset_index(drop=True)
plt.plot(data_daily.loc[start_day:end_day])
plt.plot(muneo_daily.loc[start_day:end_day])
plt.title("Number of docments between " + start_day + " and " + end_day)
for i in range(subdata.shape[0]):
plt.axvspan(subdata.start[i],subdata.end[i], color=sns.xkcd_rgb['grey'], alpha=0.5)
You can see that from 2006 to 2007 most of the submitted question statements were Muneo Suzuki's.
Next, let's look for Diet members who submit many question statements in a single day.
subdata = data.groupby(['name','date']).filter(lambda x: len(x) >= 15)
pd.DataFrame(subdata.groupby(['name','date']).size().sort_values(ascending=False))
The top spot goes to Hiroyuki Konishi, who submitted 55 in one day on September 25, 2015. No wonder he is nicknamed the "quiz king of the Diet". By the way, here are the titles of the question statements Konishi submitted that day.
Similar titles follow one after another, and Nos. 39 and 40 in particular are questionable... With the cost of question statements being decried, this looks more like a spot-the-difference puzzle.
This article also commented on the matter.
"Looking at the questions asked by Representative Konishi, who submitted a record number of 55 questions this time, there was a case in which only one and a half lines of questions were asked in multiple times with the same theme. If the book has the same theme, you can itemize it and ask multiple questions. Many of the contents also ask the interpretation of the wording, and at the Budget Committee, Prime Minister Abe once said, "Questions like quizzes are productive. It's an impression that the questions lacking in the big picture are lined up, which reminds me of the scene where I was told. "
In the explanation of the data columns I wrote that q_numb would be described later. When scraping, I extracted not only the number of question statements but also the number of individual questions each one contains. For example, the question statement below asks three questions. The kanji-numeral headings mark the individual questions. The idea is that the real burden is better measured by the number of questions asked, not just the number of question statements submitted. A rough sketch of how such a count could be extracted is shown below; after that, let's start with the total number of questions per day.
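This sketch is my own assumption (a regex over kanji-numeral headings), not the article's actual scraping code; the fallback to 1 is also an assumption.
import re

def count_questions(question_text):
    # count kanji-numeral headings (一, 二, ..., 十三, ...) at the start of a line,
    # which mark the individual questions within a question statement
    hits = re.findall(r"^[一二三四五六七八九十]+\s", question_text, flags=re.MULTILINE)
    return max(len(hits), 1)  # treat a statement with no numbered headings as a single question

# e.g. data["q_numb"] = data["question"].map(count_questions)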
def convert_to_daily_qnum_sum(data):
    time_index = pd.DataFrame(index=pd.date_range('19881230','20190805'))
    doc_num_daily = data.groupby(['date'], as_index=False)['q_numb'].sum()  # total number of questions per date
    doc_num_daily.set_index('date', inplace=True)
    data_daily = pd.concat([time_index, doc_num_daily], axis=1)
    data_daily['q_numb'] = data_daily['q_numb'].fillna(0)
    data_daily = data_daily["q_numb"]
    return data_daily
convert_to_daily_qnum_sum(data).plot()
for i in range(data_diet_open.shape[0]):
    plt.axvspan(data_diet_open.start[i], data_diet_open.end[i], color=sns.xkcd_rgb['grey'], alpha=0.4)
This, too, trends upward from around 2000. In particular, around 2007 there was a day with 269 questions (2007-07-03), a big jump, and days with more than 200 questions have appeared since then. Let's also look at the average number of questions per question statement on each day.
def convert_to_daily_qnum_mean(data):
    time_index = pd.DataFrame(index=pd.date_range('19881230','20190805'))
    doc_num_daily = data.groupby(['date'], as_index=False)['q_numb'].mean()  # average number of questions per statement per date
    doc_num_daily.set_index('date', inplace=True)
    data_daily = pd.concat([time_index, doc_num_daily], axis=1)
    data_daily['q_numb'] = data_daily['q_numb'].fillna(0)
    data_daily = data_daily["q_numb"]
    return data_daily
convert_to_daily_qnum_mean(data).plot()
for i in range(data_diet_open.shape[0]):
    plt.axvspan(data_diet_open.start[i], data_diet_open.end[i], color=sns.xkcd_rgb['grey'], alpha=0.4)
Here there is no particularly noticeable change... Incidentally, the question statement containing the most questions was Kazunori Yamai's "Question Statement on National Strategic Special Zones in the Employment Field", with 68 questions! With items such as "Give three concrete examples of clarifying the dismissal rules" and "Does Article 16 of the Labor Contract Act simply not apply in the special zones in the first place?", it starts to feel like sitting an exam...
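As a quick check (a one-liner I added, assuming the columns described earlier), the row with the largest q_numb can be pulled out like this:
data.loc[data['q_numb'].idxmax(), ['name', 'title', 'q_numb']]  # question statement containing the most questions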
Doc2Vec
The data exploration ran long, but now for the main part. First, each question statement body is split into words (wakachi-gaki, i.e. tokenized text). For example, a sentence like "I am a man" is divided into word-level tokens such as [I, am, a, man], and the tokens are joined with half-width spaces. I used MeCab for the tokenization. The following function does the tokenization; it extracts only the parts of speech requested via POS1.
import MeCab

def word_pos(text, POS1=['連体詞','名詞','副詞','動詞','接頭詞','接続詞',
                         '助動詞','助詞','形容詞','記号','感動詞','フィラー','その他']):
    # POS1 uses the Japanese part-of-speech tags of the IPAdic/NEologd dictionary
    # (adnominal, noun, adverb, verb, prefix, conjunction, auxiliary verb,
    #  particle, adjective, symbol, interjection, filler, other)
    tagger = MeCab.Tagger('mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd')
    tagger.parse('')
    node = tagger.parseToNode(text)
    word_class = []
    while node:
        word = node.surface
        wclass = node.feature.split(',')
        if wclass[0] != u'BOS/EOS':
            if wclass[0] in POS1:
                word_class.append((word, wclass[0], wclass[1]))
        node = node.next
    word_class = pd.DataFrame(word_class, columns=['term', 'pos1', 'pos2'])
    return word_class
After removing the full-width spaces from the question bodies, we extract only nouns and adjectives, then drop the subcategories 'suffix' (接尾), 'non-independent' (非自立), and 'number' (数), which carry little meaning. The result is stored as word_list in data.
data.question = data.question.str.replace("\u3000", " ")
data_200.question = data_200.question.str.replace("\u3000", " ")

data['word_list'] = ""
for i in range(data.shape[0]):
    each_data = word_pos(data.question[i], ["名詞","形容詞"])   # keep only nouns and adjectives
    each_data1 = each_data[each_data['pos2'] != '接尾']         # drop suffixes
    each_data1 = each_data1[each_data1['pos2'] != '非自立']     # drop non-independent words
    each_data1 = each_data1[each_data1['pos2'] != '数']         # drop numbers
    data.loc[i,"word_list"] = " ".join(list(each_data1.term))
data_200['word_list'] = ""
for i in range(data_200.shape[0]):
each_data = word_pos(data_200.question[i], ["noun","adjective"])
each_data1 = each_data[each_data['pos2'] != 'suffix']
each_data1 = each_data1[each_data1['pos2'] != 'Non-independent']
each_data1 = each_data1[each_data1['pos2'] != 'number']
data_200.loc[i,"word_list"] = " ".join(list(each_data1.term))
To give a concrete picture, here is a single example. Processing the sentence "In Kamakura City, Kanagawa Prefecture, a tsunami evacuation drill was held at a beach in the city" with word_pos gives the following result.
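For reference, the call looks roughly like this (the sentence is my own re-typing of the example above, and the exact tokens depend on the dictionary):
sample = "神奈川県鎌倉市では、市内の海岸での津波避難訓練を実施した。"  # re-typed example sentence, may differ from the original
word_pos(sample, ["名詞", "形容詞"])
# -> a DataFrame with columns term / pos1 / pos2; e.g. nouns such as 神奈川県, 鎌倉市, 海岸, 津波, 避難, 訓練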
Doc2Vec
Here we put the tokenized words into a list for each question statement. The structure is roughly [([word 1, word 2, word 3], document id), ...]. In the code below, words is the list of words contained in the document (duplicates allowed) and tags is the identifier of the question statement (given as a list; more than one tag can be attached to a single question statement).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docments = data.word_list
tagged_data = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(docments)]
model = Doc2Vec(documents=tagged_data, size=300, window=10, min_count=5, workers=4, seed=1, dm=1)
The parameters are:
size: the length (dimensionality) of the vectors
window: the window size
min_count: ignore words that appear fewer than this many times
workers: the number of threads
seed: fixes the random seed
dm: with dm=1 training uses PV-DM, otherwise PV-DBOW
Besides these there are alpha and min_alpha, which control the learning rate, but after trying various settings and evaluating the results qualitatively I settled on the parameters above. To be honest, gensim has no built-in way to compute the loss automatically here, so I felt it might be easier to write the model myself for this kind of tuning. Also, you often see training code like the following.
for epoch in range(10):
    print('iteration {0}'.format(epoch+1))
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.0002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
However, according to the creator of gensim, this is training code for an older version of Doc2Vec, and with the current version you should not do it this way unless you are a very advanced user.
What does epochs mean in Doc2Vec and train when I have to manually run the iteration?
"An advanced user who needed to do some mid-training logging or analysis or adjustment might split the training over multiple train() calls, and very consciously manage the effective alpha parameters for each call. An extremely advanced user experimenting with further training on an already-trained model might also try it, aware of all of the murky quality/balance issues that might involve. But essentially, unless you already know specifically why you'd need to do so, and the benefits and risks, it's a bad idea."
There is also an interesting comment from the creator of gensim about dataset size. Is the amount of data this time just barely enough?
what is the minimum dataset size needed for good performance with doc2vec?
In general, word2vec/paragraph-vector techniques benefit from a lot of data and variety of word-contexts. I wouldn't expect good results without at least tens-of-thousands of documents. Documents longer than a few words each work much better. Results may be harder to interpret if wildly-different-in-size or -kind documents are mixed in the same training – such as mixing tweets and books.
model.save("doc2vec.model") #Save model
model = Doc2Vec.load('doc2vec.model') #Model call
You can search for similar question statements with model.docvecs.most_similar().
model.docvecs.most_similar(10531)
[(12348, 0.8008440732955933),
(10543, 0.7899609804153442),
(10534, 0.7879745960235596),
(12278, 0.7819333076477051),
(14764, 0.7807815074920654),
(13340, 0.7798347473144531),
(11314, 0.7743450403213501),
(14881, 0.7730422616004944),
(1828, 0.7719383835792542),
(14701, 0.7534374594688416)]
Now let's display the original document and the top two most similar ones.
pd.set_option('display.max_colwidth', -1)
data.iloc[10531:10532,:6]
idx = model.docvecs.most_similar(10531)[0][0]
pd.set_option('display.max_colwidth', -1)
data.iloc[idx:idx+1,:6]
idx = model.docvecs.most_similar(10531)[1][0]
pd.set_option('display.max_colwidth', -1)
data.iloc[idx:idx+1,:6]
All of them deal with overtime work and work-life balance, so I think this is a fair result.
Now for the main subject, at last.
"Imagine what the Diet should look like in the future: when the same questions are asked over and over, I want artificial intelligence to be used. There are many things we could do, such as using artificial intelligence while thinking about what the Diet should be." (Shinjiro Koizumi)
We now estimate similarity for the question statements of the 200th Diet, which are not in the model. model.infer_vector() turns a new document into a vector: it takes the new document (as a list of words) and steps, the number of inference iterations. I set steps to 20, following the advice of the creator of gensim.
The inferred vector is then passed to model.docvecs.most_similar(positive=[new_docvec], topn=1) to compute similarity against the documents in the existing model; topn is how many of the most similar documents to return.
list_most_sim_doc = []
list_most_sim_value = []

for i in range(data_200.shape[0]):
    new_doc = data_200.loc[i,'word_list']
    new_doc = new_doc.split(" ")
    new_docvec = model.infer_vector(new_doc, steps=20)
    most_sim_doc = model.docvecs.most_similar(positive=[new_docvec], topn=1)[0][0]
    most_sim_value = model.docvecs.most_similar(positive=[new_docvec], topn=1)[0][1]
    list_most_sim_doc.append(most_sim_doc)
    list_most_sim_value.append(most_sim_value)
The distribution of the similarities is as follows; the closer to 1, the more similar.
plt.hist(list_most_sim_value, bins=50)
Now attach the ID of the most similar existing document and its similarity score to the 200th Diet data.
new_doc_sim = pd.DataFrame({"sim_doc":list_most_sim_doc,"sim_value":list_most_sim_value})
data_200_sim = pd.concat([data_200, new_doc_sim], axis= 1)
Here are the top three by similarity. All are above 0.8.
pd.reset_option('^display.', silent=True)
data_200_sim = data_200_sim.sort_values(by='sim_value', ascending=False).reset_index(drop=True)
data_200_sim.head(3)
What are they about? First, document 1 of the 200th Diet. It asks why the drug Gardasil has not been approved.
idx = data_200_sim.index[0]
pd.set_option('display.max_colwidth', -1)
data_200_sim.iloc[idx:idx+1,:6]
Next, the document in the model that is closest to document 1 of the 200th Diet. Long... It concerns the approval of a drug called Iressa. A reasonable result.
idx = data_200_sim.sim_doc[0]
pd.set_option('display.max_colwidth', -1)
data.iloc[idx:idx+1,:6]
Document 2 of the 200th Diet. Long... It is about economic stimulus measures based on MMT (Modern Monetary Theory).
idx = data_200_sim.index[1]
pd.set_option('display.max_colwidth', -1)
data_200_sim.iloc[idx:idx+1,:6]
Next, the document in the model closest to document 2 of the 200th Diet. Even longer... It is about economic stimulus measures in general. Not a bad result.
idx = data_200_sim.sim_doc[1]
pd.set_option('display.max_colwidth', -1)
data.iloc[idx:idx+1,:6]
Document 3 of the 200th Diet. The content is about the amnesty of the Reiwa era.
idx=data_200_sim.index[2]
pd.set_option('display.max_colwidth', -1)
data_200_sim.iloc[idx:idx+1,:6]
The closest document in the model is also about the amnesty of the Reiwa era. This is also a good result.
idx=data_200_sim.sim_doc[2]
pd.set_option('display.max_colwidth', -1)
data.iloc[idx:idx+1,:6]
The data exploration turned out to be so much fun that this post ran rather long, but the upshot is that a light try with Doc2Vec estimated similarity reasonably well for new documents, namely the question statements of the 200th Diet. That said, my impression is that a world where AI handles repeated questions, as Shinjiro Koizumi suggests, is still a fairly distant future.
This time I only explored the data and judged similarity with Doc2Vec. Next I would like to try clustering with party information and, what I really want to attempt, judging the "quality" of question statements. In particular, if their "quality" could be assessed, I think we could push back against the notion that simply submitting large numbers of question statements counts as a result.
I have the crawled and scraped dataset, so if anyone would like to analyze various things with it, I would love to team up!