Analyzing the homepages of each political party

Purpose

It has been a year since the ban on online election campaigning was lifted, and some uses of the internet seem to pick a fight with the Public Offices Election Act. How is everyone doing? In this article, I analyze the homepages of each political party, find out which words they use, and identify parties with similar characteristics.

Procedure

Download the homepage of each political party.
Run morphological analysis on the downloaded pages with MeCab.
Compute a score for each word with tf-idf.
Measure the distance between documents by cosine similarity to find the distance between political parties.
Visualize the relationships with GraphViz.
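
As a rough illustration of the last step, pairwise similarities can be written out as GraphViz DOT text. This is only a minimal sketch and not the code in the repository: the function name `similarities_to_dot`, the `threshold` parameter, and the sample party names and scores are all hypothetical.

```python
def similarities_to_dot(similarities, threshold=0.5):
    """Build an undirected GraphViz graph from pairwise similarity scores.

    similarities: dict mapping (name_a, name_b) -> similarity in [0, 1].
    threshold: hypothetical cutoff; pairs scoring below it are not drawn.
    """
    lines = ["graph parties {"]
    for (a, b), score in sorted(similarities.items()):
        if score >= threshold:
            # Label each edge with its similarity score.
            lines.append('    "%s" -- "%s" [label="%.2f"];' % (a, b, score))
    lines.append("}")
    return "\n".join(lines)


# Made-up scores, for illustration only.
sample = {("Party A", "Party B"): 0.82, ("Party A", "Party C"): 0.31}
dot_text = similarities_to_dot(sample)
print(dot_text)
```

The resulting text can be saved to a file and rendered with the `dot` command, for example `dot -Tpng parties.dot -o parties.png`.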

Result

**Homepage analysis of each political party in 2014** http://needtec.sakura.ne.jp/analyze_election/page/analyzehp/2014

Source code https://github.com/mima3/analyze_election/tree/master/script_comp_manifesto

Download the source code and run the following scripts in order.

# Download the homepages and store them in the DB
python create_parties_db.py parties_hp_2014.sqlite party_hp_json_2014.json

# Run morphological analysis and count the words
python create_parties_tokens.py parties_hp_2014.sqlite

# Calculate tf-idf and cosine similarity, and record the results as JSON and PNG
python create_tf_idf_report.py parties_hp_2014.sqlite party_hp_result_2014.json party_hp_result_2014.png "ms ui gothic"

To run the scripts, you need the following libraries: nltk, lxml, MeCab, urllib2, and pydot. (urllib2 is part of the Python 2 standard library, so it needs no separate installation.)

Commentary

Analysis of documents by tf-idf

The tf-idf value of a word x in a document y is defined as follows.

tf = (number of times word x appears in document y) / (total number of words in document y)
idf = 1.0 + log(total number of documents / number of documents in which word x appears)
tf-idf = tf × idf

Words that appear in many documents have a lower importance and a lower score, and words that appear only in a specific document have a higher importance and a higher score.
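
The formula can be tried on a toy corpus. The sketch below is an assumption-laden illustration rather than the repository's implementation: the function name `tf_idf` and the sample words are made up, documents are assumed to be already split into lists of words, and `log` is taken to be the natural logarithm.

```python
import math


def tf_idf(word, document, documents):
    """Score `word` in `document` with tf * (1.0 + log(N / df)).

    document: list of words; documents: list of such lists.
    Assumes a non-empty document and the natural logarithm.
    """
    tf = document.count(word) / float(len(document))
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0  # the word appears nowhere, so there is nothing to score
    idf = 1.0 + math.log(len(documents) / float(containing))
    return tf * idf


# Toy corpus (made up): three tiny "documents" of already-split words.
docs = [
    ["tax", "economy", "tax"],
    ["economy", "defense"],
    ["economy", "welfare"],
]
common = tf_idf("economy", docs[0], docs)  # appears in every document
rare = tf_idf("tax", docs[0], docs)        # appears only in the first document
print(common, rare)
```

Here "economy" appears in all three documents, so its idf is just 1.0, while "tax" is specific to the first document and ends up with the higher score.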

Measuring document distance by cosine similarity

Document 1 contains the words (A, B, C), whose TF-IDF values are (0.1, 0.2, 0.3). Document 2 contains the words (C, D, E), whose TF-IDF values are (0.4, 0.5, 0.6).

Treating the TF-IDF of any word that does not appear in a document as 0, build a TF-IDF vector over all words.

Over (A, B, C, D, E), document 1 becomes (0.1, 0.2, 0.3, 0, 0) and document 2 becomes (0, 0, 0.4, 0.5, 0.6).
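
The zero-filling step above can be sketched as follows. The helper name `to_vectors` is hypothetical, and each input is assumed to be a dict mapping a word to its TF-IDF value.

```python
def to_vectors(scores_1, scores_2):
    """Align two word -> TF-IDF dicts on a shared, sorted vocabulary."""
    vocabulary = sorted(set(scores_1) | set(scores_2))
    # A word missing from a dict gets a TF-IDF of 0.
    vector_1 = [scores_1.get(word, 0.0) for word in vocabulary]
    vector_2 = [scores_2.get(word, 0.0) for word in vocabulary]
    return vocabulary, vector_1, vector_2


doc_1 = {"A": 0.1, "B": 0.2, "C": 0.3}
doc_2 = {"C": 0.4, "D": 0.5, "E": 0.6}
vocabulary, vector_1, vector_2 = to_vectors(doc_1, doc_2)
print(vocabulary)  # ['A', 'B', 'C', 'D', 'E']
print(vector_1)    # [0.1, 0.2, 0.3, 0.0, 0.0]
print(vector_2)    # [0.0, 0.0, 0.4, 0.5, 0.6]
```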

The cosine of the angle between the vectors of document 1 and document 2 represents how similar the two are. For two identical documents, the angle between the vectors is 0 degrees, so the cosine is 1.

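In Python, this can be calculated with nltk.cluster.util.cosine_distance. Note that NLTK 3 documents this function as returning 1 minus the cosine of the angle, so check which convention your installed version follows.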
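
To see what the calculation amounts to without depending on nltk, here is a minimal pure-Python sketch applied to the two example vectors. The function name `cosine_similarity` is my own and is not an nltk API; it assumes neither vector is all zeros.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vector_1 = [0.1, 0.2, 0.3, 0.0, 0.0]
vector_2 = [0.0, 0.0, 0.4, 0.5, 0.6]
similarity = cosine_similarity(vector_1, vector_2)
same = cosine_similarity(vector_1, vector_1)
print(similarity, same)
```

The two documents share only the word C, so the similarity comes out at roughly 0.37, whereas comparing a vector with itself gives 1.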