It has been a year since the ban on online election campaigning was lifted, and a few uses that defy the Public Offices Election Act have started to appear. How is everyone doing? This time, I will analyze the homepages of each political party, find out which words they use, and extract the parties with similar characteristics.
**Homepage analysis of each political party in 2014** http://needtec.sakura.ne.jp/analyze_election/page/analyzehp/2014
Source code https://github.com/mima3/analyze_election/tree/master/script_comp_manifesto
Download the source code and execute the following script.
# Download the homepages and store them in the DB
python create_parties_db.py parties_hp_2014.sqlite party_hp_json_2014.json
# Run morphological analysis and count the occurrences of each word
python create_parties_tokens.py parties_hp_2014.sqlite
# Calculate tf-idf and cosine similarity, and record the results as JSON and PNG
python create_tf_idf_report.py parties_hp_2014.sqlite party_hp_result_2014.json party_hp_result_2014.png "ms ui gothic"
To run it, you need to install the following libraries.
・ nltk
・ lxml
・ MeCab
・ urllib2
・ pydot
The tf-idf value of word x in sentence y is calculated as follows (here each "sentence" is one document, that is, the text of one party's homepage).
tf = number of times word x appears in sentence y / total number of words in sentence y
idf = 1.0 + log(total number of sentences / number of sentences in which word x appears)
tf-idf = tf × idf
Words that appear in many documents have a lower importance and a lower score, and words that appear only in a specific document have a higher importance and a higher score.
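The formula above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the function name tf_idf and the sample token lists are hypothetical, and the actual scripts in the repository read their tokens from SQLite rather than from in-memory lists.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per document.

    docs is a list of token lists (hypothetical input for illustration).
    """
    n_docs = len(docs)

    # Document frequency: the number of documents each word appears in.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        doc_scores = {}
        for word, count in counts.items():
            tf = count / total
            idf = 1.0 + math.log(n_docs / df[word])
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores


# Made-up sample data: "economy" appears in both documents, "tax" in one.
docs = [
    ["tax", "economy", "tax"],
    ["economy", "defense"],
]
scores = tf_idf(docs)
for doc_scores in scores:
    print(doc_scores)
```

In this sample, "economy" appears in every document, so its idf is exactly 1.0 and its score is just its tf, while "tax" appears in only one document and gets a higher score.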
Suppose sentence 1 contains the words (A, B, C) with TF-IDF values (0.1, 0.2, 0.3), and sentence 2 contains the words (C, D, E) with TF-IDF values (0.4, 0.5, 0.6).
Treating the TF-IDF of any word that does not appear in a sentence as 0, build a TF-IDF vector over all words.
For sentence 1, (A, B, C, D, E) becomes (0.1, 0.2, 0.3, 0, 0). For sentence 2, (A, B, C, D, E) becomes (0, 0, 0.4, 0.5, 0.6).
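The zero-filling step can be sketched as follows. The helper name to_vectors is hypothetical and introduced only for illustration; it takes two {word: tf-idf} dictionaries and aligns them over the combined vocabulary.

```python
def to_vectors(scores_a, scores_b):
    """Align two {word: tf-idf} dicts over their combined vocabulary."""
    vocab = sorted(set(scores_a) | set(scores_b))
    # Words missing from a sentence get a TF-IDF of 0.
    vec_a = [scores_a.get(word, 0.0) for word in vocab]
    vec_b = [scores_b.get(word, 0.0) for word in vocab]
    return vocab, vec_a, vec_b


sentence1 = {"A": 0.1, "B": 0.2, "C": 0.3}
sentence2 = {"C": 0.4, "D": 0.5, "E": 0.6}

vocab, vec1, vec2 = to_vectors(sentence1, sentence2)
print(vocab)  # ['A', 'B', 'C', 'D', 'E']
print(vec1)   # [0.1, 0.2, 0.3, 0.0, 0.0]
print(vec2)   # [0.0, 0.0, 0.4, 0.5, 0.6]
```

Sorting the vocabulary keeps the word order identical in both vectors, which is what makes them comparable position by position.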
The cosine of the angle between the vectors for sentence 1 and sentence 2 represents how similar the two are. If the sentences are exactly the same, the angle between them is 0 degrees and the cosine is 1. If they share no words at all, the cosine is 0.
In Python, this is calculated with nltk.cluster.util.cosine_distance.
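To see what is being computed, here is a minimal sketch that calculates the cosine directly with the standard library instead of NLTK. The function name cosine_similarity is hypothetical. One caveat that is worth checking against your installed version: recent NLTK releases document cosine_distance as returning 1 minus the cosine, so the value may need to be converted before it is read as a similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# TF-IDF vectors over (A, B, C, D, E) from the example in the text.
sentence1 = [0.1, 0.2, 0.3, 0.0, 0.0]
sentence2 = [0.0, 0.0, 0.4, 0.5, 0.6]

sim_12 = cosine_similarity(sentence1, sentence2)
sim_11 = cosine_similarity(sentence1, sentence1)
print(sim_12)  # only word C is shared, so the similarity is well below 1
print(sim_11)  # identical vectors: the angle is 0 degrees
```

The sketch does not guard against all-zero vectors, which would cause a division by zero; a real implementation would need to handle that case.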