In my previous post, I worked through Chapter 4 (principal component analysis) of "Statistics Understood through Manga: Factor Analysis" with Python.
This time, I take on principal component analysis of text data with Python.
Originally, it was this text analytics book by Akitetsu Kim that led me to principal component analysis: when I wanted to cluster text data, I found this book's use of principal component analysis interesting, and that is what prompted me to study the method.
The analysis targets are essays written on three themes (friends, cars, Japanese food): 3 themes × 11 people, 33 documents in total.
The data can be obtained from the source code download on the book's support page.
The data are not raw text but are already in Bag of Words (BoW) format, so preprocessing such as morphological analysis and BoW conversion is not covered this time.
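For reference, if the data had been raw text, a BoW conversion step would come first. Below is a minimal sketch of that step using scikit-learn's CountVectorizer; it is a hypothetical example on whitespace-separated tokens, not part of this article's pipeline (Japanese text would first need morphological analysis with a tokenizer such as MeCab).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, already tokenized into space-separated words
docs = ['friend best_friend school', 'car traffic accident', 'japanese rice food']
vec = CountVectorizer()
bow = vec.fit_transform(docs)        # sparse document-term count matrix
print(vec.get_feature_names_out())   # the vocabulary (BoW columns)
print(bow.toarray())                 # word counts per document
```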
The code below is adapted from the previous article.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from matplotlib import rcParams
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Hiragino Maru Gothic Pro', 'Yu Gothic', 'Meirio', 'Takao', 'IPAexGothic', 'IPAPGothic', 'Noto Sans CJK JP']
#Read the essay data (adjust the file path to your environment)
df = pd.read_csv('./sakubun3f.csv',encoding='cp932')
data = df.values
# "Words"Column,"OTHERS"Exclude columns
d = data[:,1:-1].astype(np.int64)
#Standardize the data (the standard deviation is the unbiased estimate, ddof=1)
X = (d - d.mean(axis=0)) / d.std(ddof=1,axis=0)
#Find the correlation matrix
XX = np.round(np.dot(X.T,X) / (len(X) - 1), 2)
#Find the eigenvalues and eigenvectors of the correlation matrix
w, V = np.linalg.eig(XX)
print('-------eigenvalue-------')
print(np.round(w,3))
print('')
#Find the first principal component
z1 = np.dot(X,V[:,0])
#Find the second principal component
z2 = np.dot(X,V[:,1])
##############################################################
#Plot the first and second principal component scores obtained above
##############################################################
#Generating objects for graphs
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
#Insert grid lines
ax.grid()
#Boundary of data to draw
lim = [-6.0, 6.0]
ax.set_xlim(lim)
ax.set_ylim(lim)
#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
#Adjust the axis scale spacing
ticks = np.arange(-6.0, 6.0, 2.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)
#Data plot
for (i, j, k) in zip(z1, z2, data[:, 0]):
    ax.plot(i, j, 'o')
    ax.annotate(k, xy=(i, j), fontsize=16)
#drawing
plt.show()
-------eigenvalue-------
[ 5.589e+00 4.433e+00 2.739e+00 2.425e+00 2.194e+00 1.950e+00
1.672e+00 1.411e+00 1.227e+00 1.069e+00 9.590e-01 9.240e-01
7.490e-01 6.860e-01 5.820e-01 5.150e-01 4.330e-01 3.840e-01
2.970e-01 2.200e-01 1.620e-01 1.080e-01 8.800e-02 7.800e-02
4.600e-02 3.500e-02 -7.000e-03 -2.000e-03 4.000e-03 1.700e-02
1.300e-02]
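Since the data were standardized, each eigenvalue divided by the sum of all eigenvalues gives that component's contribution ratio (the proportion of variance it explains). Below is a quick check for the first two components, as a sketch using the w computed above. Note that np.linalg.eig does not sort the eigenvalues, though here the first two happen to be the largest, and the tiny negative values at the end are artifacts of rounding the correlation matrix to two decimals.

```python
#Contribution ratio of each principal component
ratios = w / w.sum()
print('PC1 contribution:', np.round(ratios[0], 3))
print('PC2 contribution:', np.round(ratios[1], 3))
print('PC1+PC2 cumulative:', np.round(ratios[:2].sum(), 3))
```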
According to the book's explanation, essays whose labels end in 9 are on "Japanese food", those ending in 2 are on "friends", and those ending in 5 are on "cars".
The scatter plot comes out mirrored relative to the book, but the three themes are cleanly separated: "Japanese food" toward the upper left, "friends" toward the upper right, and "cars" toward the lower right. (The mirrored figure is likely because an eigenvector is only determined up to a constant multiple; in particular, its sign is arbitrary.)
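Concretely, if v is an eigenvector of the correlation matrix, then so is -v, and np.linalg.eig may return either orientation. Below is a minimal sketch of flipping the signs to mirror the plot back to the book's orientation (any consistent sign convention is equally valid):

```python
#Eigenvector signs are arbitrary: flipping a column of V mirrors that axis
V_flipped = V.copy()
V_flipped[:, 0] *= -1   # mirror the first principal component axis
V_flipped[:, 1] *= -1   # mirror the second principal component axis
z1_flipped = np.dot(X, V_flipped[:, 0])
z2_flipped = np.dot(X, V_flipped[:, 1])
```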
#Coordinates with the eigenvector for the largest eigenvalue on the horizontal axis and the eigenvector for the second-largest eigenvalue on the vertical axis
V_ = np.array([(V[:,0]),V[:,1]]).T
V_ = np.round(V_,2)
#Data for graph drawing
data_name=df.columns[1:-1]
z1 = V_[:,0]
z2 = V_[:,1]
#Generating objects for graphs
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
#Insert grid lines
ax.grid()
#Boundary of data to draw
lim = [-0.4, 0.4]
ax.set_xlim(lim)
ax.set_ylim(lim)
#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
#Adjust the axis scale spacing
ticks = np.arange(-0.4, 0.4, 0.2)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)
#Data plot
for (i, j, k) in zip(z1, z2, data_name):
    ax.plot(i, j, 'o')
    ax.annotate(k, xy=(i, j), fontsize=14)
#drawing
plt.show()
The factor loadings are also mirrored, but the result is almost the same as in the book.
Words likely related to the theme "Japanese food", such as "Japanese" and "rice", appear toward the upper left; words likely related to "friends", such as "best friend" and "friend", toward the upper right; and words likely related to "cars", such as "traffic" and "accident", toward the lower right.
Comparing this with the scatter plot of the principal component scores shows that the words likely related to each theme point in the same directions as the corresponding essays; the overlay sketched below makes this correspondence visible in a single figure.
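Such an overlay is a so-called biplot. The following sketch is not from the book; it assumes X, V, data, and data_name from the code above, and scales the loadings by an arbitrary factor so the arrows are visible on the score axes.

```python
#Biplot: essay scores as points, word loadings as arrows
scores1, scores2 = np.dot(X, V[:, 0]), np.dot(X, V[:, 1])
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
ax.grid()
for (i, j, k) in zip(scores1, scores2, data[:, 0]):
    ax.plot(i, j, 'o', color='lightgray')
    ax.annotate(k, xy=(i, j), fontsize=10, color='gray')
scale = 10   # arbitrary factor to bring the loadings onto the score axes
for (i, j, k) in zip(V[:, 0], V[:, 1], data_name):
    ax.arrow(0, 0, i * scale, j * scale, color='red', alpha=0.5)
    ax.annotate(k, xy=(i * scale, j * scale), fontsize=10, color='red')
plt.show()
```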
Reusing the code from the previous article, I was able to run principal component analysis on text data more easily than I had expected.
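As a cross-check, the same decomposition can be reproduced with scikit-learn. This is a sketch assuming the standardized matrix X from above; the component signs may differ for the eigenvector-sign reason noted earlier, and the values may differ slightly because the correlation matrix above was rounded to two decimals.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # principal component scores
print(pca.explained_variance_ratio_)   # contribution ratios of PC1 and PC2
print(scores[:5])                      # compare with the score plot above (matches up to sign)
```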
This time the data were already cleanly preprocessed, so the results came out well. Next time, I would like to check whether news articles can be classified as neatly.
end