Various algorithms such as classification, regression, and clustering⁴ are also implemented.
| Library | Overview | Purpose of use this time | Official URL |
| --- | --- | --- | --- |
| NumPy | Numerical calculation library | Used for numerical calculations in statistical processing | https://www.numpy.org |
| pandas | Data analysis library | Used for reading and formatting the data | https://pandas.pydata.org |
| Matplotlib | Graph drawing library | Used for data visualization | https://matplotlib.org |
| scikit-learn | Machine learning library | Used to implement principal component analysis (PCA) | https://scikit-learn.org |
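For reference, a minimal set of imports covering the libraries above might look like the following; the actual import cell lives in Chap.1, and the aliases (np, pd, plt) are the conventional ones assumed by the code in this article.

```python
import numpy as np                      # numerical calculation
import pandas as pd                     # data reading and formatting
import matplotlib.pyplot as plt        # data visualization
from sklearn.decomposition import PCA   # principal component analysis
```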
Chap.2 Reading data
Now that the necessary libraries have been loaded in Chap.1, prepare the data to be handled this time. The data cover three species: finless porpoise (*Neophocaena phocaenoides*), striped dolphin (*Stenella coeruleoalba*), and melon-headed whale (*Peponocephala electra*). For each species, download the measured data and samples from ChemTHEATER's Sample Search and load them in the same way as in Part 1 and Part 2.
- Select "Sample Search" from the menu bar of Chem THEATER.
- Select "Biotic --Mammals --Marine mammals" in "Sample Type".
- Select "Neophocaena phocaenoides" from "Scientific Name".
- Click the "Search" button to output a list of samples that meet the conditions.
- "Export samples" outputs the sample information, and "Export measured data" outputs the measured values of the target chemical substance as a tab-delimited text file.
- Repeat for the other two types.
After successfully reading each of the three files, concatenate the three measureddata files and the three samples files so that they can be used later in the process. The pandas concat function is convenient here. If the parameter ignore_index is set to True, the row index of the concatenated pandas DataFrame⁵ is renumbered.
data_file1 = "measureddata_20191004033314.tsv" #Change the string entered in the variable to the tsv file name of your measured data
data_file2 = "measureddata_20191004033325.tsv" #Change the string entered in the variable to the tsv file name of your measured data
data_file3 = "measureddata_20191004033338.tsv" #Change the string entered in the variable to the tsv file name of your measured data
data1 = pd.read_csv(data_file1, delimiter="\t")
data2 = pd.read_csv(data_file2, delimiter="\t")
data3 = pd.read_csv(data_file3, delimiter="\t")
data = pd.concat([data1, data2, data3], ignore_index=True) #Combine with Pandas concat function
data
| | MeasuredID | ProjectID | SampleID | ScientificName | ChemicalID | ChemicalName | ExperimentID | MeasuredValue | AlternativeData | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 80 | PRA000002 | SAA000087 | Neophocaena phocaenoides | CH0000154 | TBT | EXA000001 | 170.0000 | NaN | ... |
| 1 | 81 | PRA000002 | SAA000087 | Neophocaena phocaenoides | CH0000155 | DBT | EXA000001 | 220.3591 | NaN | ... |
| 2 | 82 | PRA000002 | SAA000087 | Neophocaena phocaenoides | CH0000156 | MBT | EXA000001 | 44.5445 | NaN | ... |
| 3 | 83 | PRA000002 | SAA000087 | Neophocaena phocaenoides | CH0000157 | ΣBTs | EXA000001 | 434.9036 | NaN | ... |
| 4 | 84 | PRA000002 | SAA000087 | Neophocaena phocaenoides | CH0000158 | TPT | EXA000001 | 12.9220 | NaN | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9774 | 24272 | PRA000036 | SAA002163 | Peponocephala electra | CH0000151 | cis-nonachlor | EXA000001 | 170.0000 | NaN | ... |
| 9775 | 24056 | PRA000036 | SAA002163 | Peponocephala electra | CH0000152 | ΣCHLs | EXA000001 | 1100.0000 | NaN | ... |
| 9776 | 24376 | PRA000036 | SAA002163 | Peponocephala electra | CH0000153 | HCB | EXA000001 | 120.0000 | NaN | ... |
| 9777 | 24646 | PRA000036 | SAA002163 | Peponocephala electra | CH0000533 | TCPMe | EXA000001 | 6.1000 | NaN | ... |
| 9778 | 24700 | PRA000036 | SAA002163 | Peponocephala electra | CH0000534 | TCPMOH | EXA000001 | 16.0000 | NaN | ... |

9779 rows × 13 columns
sample_file1 = "samples_20191004033311.tsv" #Change the string entered in the variable to the tsv file name of your samples
sample_file2 = "samples_20191004033323.tsv" #Change the string entered in the variable to the tsv file name of your samples
sample_file3 = "samples_20191004033334.tsv" #Change the string entered in the variable to the tsv file name of your samples
sample1 = pd.read_csv(sample_file1, delimiter="\t")
sample2 = pd.read_csv(sample_file2, delimiter="\t")
sample3 = pd.read_csv(sample_file3, delimiter="\t")
sample = pd.concat([sample1, sample2, sample3], ignore_index=True) #Combine with Pandas concat function
sample
| | ProjectID | SampleID | SampleType | TaxonomyID | UniqCodeType | UniqCode | SampleName | ScientificName | CommonName | CollectionYear | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PRA000002 | SAA000087 | ST004 | 34892 | es-BANK | EW00884 | NaN | Neophocaena phocaenoides | Finless porpoises | 1996 | ... |
| 1 | PRA000002 | SAA000088 | ST004 | 34892 | es-BANK | EW00812 | NaN | Neophocaena phocaenoides | Finless porpoises | 1999 | ... |
| 2 | PRA000002 | SAA000089 | ST004 | 34892 | es-BANK | EW00873 | NaN | Neophocaena phocaenoides | Finless porpoises | 1995 | ... |
| 3 | PRA000002 | SAA000090 | ST004 | 34892 | es-BANK | EW04787 | NaN | Neophocaena phocaenoides | Finless porpoises | 2000 | ... |
| 4 | PRA000002 | SAA000091 | ST004 | 34892 | es-BANK | EW00867 | NaN | Neophocaena phocaenoides | Finless porpoises | 1998 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 312 | PRA000036 | SAA002159 | ST004 | 103596 | es-BANK | EW04779 | 060301-1 | Peponocephala electra | Melon-headed whale | 2006 | ... |
| 313 | PRA000036 | SAA002160 | ST004 | 103596 | es-BANK | EW00115 | M32625 | Peponocephala electra | Melon-headed whale | 2001 | ... |
| 314 | PRA000036 | SAA002161 | ST004 | 103596 | es-BANK | EW00122 | M32633 | Peponocephala electra | Melon-headed whale | 2001 | ... |
| 315 | PRA000036 | SAA002162 | ST004 | 103596 | es-BANK | EW00116 | M32626 | Peponocephala electra | Melon-headed whale | 2001 | ... |
| 316 | PRA000036 | SAA002163 | ST004 | 103596 | es-BANK | EW00117 | M32627 | Peponocephala electra | Melon-headed whale | 2001 | ... |

317 rows × 66 columns
Chap.3 Data preparation
After reading the data, the next step is to prepare it for analysis. First, extract the records to be used this time from measureddata.
This time, we use five chemical substance groups (ΣPCBs, ΣDDTs, ΣPBDEs, ΣCHLs, ΣHCHs) measured in the three species: finless porpoise, striped dolphin, and melon-headed whale.
First, extract the records whose unit is ng/g lipid. As mentioned in Part 1, data with different units cannot be compared directly.
Next, extract the records whose chemical substance name is one of the five groups above, and use the reset_index method to renumber the rows.
data_lipid = data[(data["Unit"] == "ng/g lipid")]
data_lipid = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") | (data_lipid["ChemicalName"] == "ΣDDTs") |
(data_lipid["ChemicalName"] == "ΣPBDEs") | (data_lipid["ChemicalName"] == "ΣCHLs") |
(data_lipid["ChemicalName"] == "ΣHCHs")]
data_lipid = data_lipid.reset_index(drop=True)
data_lipid[data_lipid["SampleID"] == "SAA001941"] #Data list with Sample ID SAA001941
| | MeasuredID | ProjectID | SampleID | ScientificName | ChemicalID | ChemicalName | ExperimentID | MeasuredValue | AlternativeData | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 115 | 25112 | PRA000030 | SAA001941 | Neophocaena phocaenoides | CH0000033 | ΣDDTs | EXA000001 | 9400.0 | NaN | ... |
| 116 | 25098 | PRA000030 | SAA001941 | Neophocaena phocaenoides | CH0000096 | ΣPCBs | EXA000001 | 1100.0 | NaN | ... |
| 117 | 25140 | PRA000030 | SAA001941 | Neophocaena phocaenoides | CH0000146 | ΣHCHs | EXA000001 | 41.0 | NaN | ... |
| 118 | 25126 | PRA000030 | SAA001941 | Neophocaena phocaenoides | CH0000152 | ΣCHLs | EXA000001 | 290.0 | NaN | ... |
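Incidentally, the same extraction can be written more compactly with pandas' isin method; the following sketch is equivalent to the chained comparisons above.

```python
# Equivalent extraction using isin instead of chained | comparisons
target_chems = ["ΣPCBs", "ΣDDTs", "ΣPBDEs", "ΣCHLs", "ΣHCHs"]
data_lipid = data[(data["Unit"] == "ng/g lipid") & data["ChemicalName"].isin(target_chems)]
data_lipid = data_lipid.reset_index(drop=True)
```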
After extracting the measured data, the next step is to merge the samples and measureddata DataFrames.
In Part 1 and Part 2, sample information (scientific name, collection site, etc.) was attached to the measured values of each chemical substance; this time, by contrast, the measured values of the chemical substances are attached to each sample, so note that the processing differs from before.
```python
df = sample[["SampleID", "ScientificName"]]  # Create a DataFrame with only SampleID and ScientificName

for chem in ["ΣCHLs", "ΣDDTs", "ΣHCHs", "ΣPBDEs", "ΣPCBs"]:  # Loop over the chemical substance names
    # Create a DataFrame containing only the current chemical
    data_lipid_chem = data_lipid[data_lipid["ChemicalName"] == chem]
    # Keep only SampleID and the measured values, renaming "MeasuredValue" to the chemical name
    data_lipid_chem = data_lipid_chem[["SampleID", "MeasuredValue"]].rename(columns={"MeasuredValue": chem})
    data_lipid_chem = data_lipid_chem.drop_duplicates(subset="SampleID")
    # Merge df (SampleID + scientific name) with data_lipid_chem (SampleID + measured value)
    df = pd.merge(df, data_lipid_chem, on="SampleID")

df = df.dropna(how="any")  # Remove rows containing NaN
df = df.drop("SampleID", axis=1)
df
```
| | ScientificName | ΣCHLs | ΣDDTs | ΣHCHs | ΣPBDEs | ΣPCBs |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Neophocaena phocaenoides | 770.0 | 68000.0 | 1100.0 | 170.0 | 5700.0 |
| 1 | Neophocaena phocaenoides | 1200.0 | 140000.0 | 3500.0 | 120.0 | 6500.0 |
| 2 | Neophocaena phocaenoides | 1000.0 | 140000.0 | 5500.0 | 86.0 | 5500.0 |
| 3 | Neophocaena phocaenoides | 950.0 | 130000.0 | 6600.0 | 100.0 | 5800.0 |
| 4 | Neophocaena phocaenoides | 1400.0 | 280000.0 | 6100.0 | 140.0 | 11000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 89 | Peponocephala electra | 3000.0 | 15000.0 | 170.0 | 260.0 | 12000.0 |
| 90 | Peponocephala electra | 5100.0 | 23000.0 | 380.0 | 490.0 | 19000.0 |
| 91 | Peponocephala electra | 5700.0 | 33000.0 | 240.0 | 300.0 | 25000.0 |
| 92 | Peponocephala electra | 2800.0 | 12000.0 | 220.0 | 230.0 | 9300.0 |
| 93 | Peponocephala electra | 5700.0 | 27000.0 | 240.0 | 180.0 | 21000.0 |

94 rows × 6 columns
As shown above, in the current state the scientific name is stored in the DataFrame as a string. Since the species name will be used for visualization later, prepare a numerical variable representing each scientific name for convenience. First, create a dictionary that maps each scientific name to a value (finless porpoise: 0, striped dolphin: 1, melon-headed whale: 2).
species = df["ScientificName"].unique().tolist() #Output a list of scientific names in df
class_dic = {}
for i in range(len(species)):
class_dic[species[i]] = i
class_dic
{'Neophocaena phocaenoides': 0,
'Stenella coeruleoalba': 1,
'Peponocephala electra': 2}
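The same mapping can also be built in one line with enumerate and a dictionary comprehension, which is the more idiomatic form:

```python
# One-line equivalent of the loop above
class_dic = {name: i for i, name in enumerate(species)}
```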
Next, add a Class column to df and fill each row of the Class column with the value corresponding to its scientific name.
```python
df["Class"] = 0
for irow in range(len(df)):
    df.at[irow, "Class"] = class_dic[df.at[irow, "ScientificName"]]
df = df.loc[:, ["Class", "ScientificName", "ΣPCBs", "ΣDDTs", "ΣPBDEs", "ΣCHLs", "ΣHCHs"]]
df
```
| | Class | ScientificName | ΣPCBs | ΣDDTs | ΣPBDEs | ΣCHLs | ΣHCHs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Neophocaena phocaenoides | 5700.0 | 68000.0 | 170.0 | 770.0 | 1100.0 |
| 1 | 0 | Neophocaena phocaenoides | 6500.0 | 140000.0 | 120.0 | 1200.0 | 3500.0 |
| 2 | 0 | Neophocaena phocaenoides | 5500.0 | 140000.0 | 86.0 | 1000.0 | 5500.0 |
| 3 | 0 | Neophocaena phocaenoides | 5800.0 | 130000.0 | 100.0 | 950.0 | 6600.0 |
| 4 | 0 | Neophocaena phocaenoides | 11000.0 | 280000.0 | 140.0 | 1400.0 | 6100.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 89 | 2 | Peponocephala electra | 12000.0 | 15000.0 | 260.0 | 3000.0 | 170.0 |
| 90 | 2 | Peponocephala electra | 19000.0 | 23000.0 | 490.0 | 5100.0 | 380.0 |
| 91 | 2 | Peponocephala electra | 25000.0 | 33000.0 | 300.0 | 5700.0 | 240.0 |
| 92 | 2 | Peponocephala electra | 9300.0 | 12000.0 | 230.0 | 2800.0 | 220.0 |
| 93 | 2 | Peponocephala electra | 21000.0 | 27000.0 | 180.0 | 5700.0 | 240.0 |

94 rows × 7 columns
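Incidentally, the row-by-row loop above can be replaced by pandas' map method, which is shorter and avoids explicit iteration; a minimal equivalent sketch:

```python
# Equivalent to the loop: map each scientific name to its class value
df["Class"] = df["ScientificName"].map(class_dic)
```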
Since a pivot table turns out to be easier to work with here, the processing from data_lipid onward is redone as follows.
```python
df = pd.pivot_table(data_lipid, index=["ScientificName", "SampleID"], columns="ChemicalName", values=["MeasuredValue"])
df = df.dropna(how="any")  # Remove rows containing NaN
df
```
*(all value columns are MeasuredValue, labeled by ChemicalName)*

| ScientificName | SampleID | ΣCHLs | ΣDDTs | ΣHCHs | ΣPBDEs | ΣPCBs |
| --- | --- | --- | --- | --- | --- | --- |
| Neophocaena phocaenoides | SAA001903 | 770.0 | 68000.0 | 1100.0 | 170.0 | 5700.0 |
| | SAA001904 | 1200.0 | 140000.0 | 3500.0 | 120.0 | 6500.0 |
| | SAA001905 | 1000.0 | 140000.0 | 5500.0 | 86.0 | 5500.0 |
| | SAA001906 | 950.0 | 130000.0 | 6600.0 | 100.0 | 5800.0 |
| | SAA001907 | 1400.0 | 280000.0 | 6100.0 | 140.0 | 11000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| Stenella coeruleoalba | SAA001984 | 730.0 | 3500.0 | 40.0 | 84.0 | 3200.0 |
| | SAA001985 | 7300.0 | 44000.0 | 300.0 | 530.0 | 26000.0 |
| | SAA001986 | 12000.0 | 88000.0 | 490.0 | 850.0 | 42000.0 |
| | SAA001987 | 9100.0 | 56000.0 | 380.0 | 650.0 | 32000.0 |
| | SAA001988 | 11000.0 | 67000.0 | 520.0 | 610.0 | 36000.0 |

94 rows × 5 columns
From this point onward, the analysis uses this pivot-table DataFrame.
Chap.4 Visualization of data overview
Now that the data preparation is complete, let's first visualize the data as a whole. The data handled this time are five-dimensional (there are five explanatory variables). Since humans cannot grasp five-dimensional space, the distribution of the full five-dimensional data cannot be visualized as it is. Here, scatter plots for the ${}_5 C_2 = 10$ pairs of variables are generated.
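The number of pairwise plots follows from the combination formula:

$$
{}_5 C_2 = \frac{5!}{2!\,(5-2)!} = \frac{5 \times 4}{2 \times 1} = 10
$$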
pandas' pandas.plotting.scatter_matrix() function is convenient for drawing scatter plots of multiple variables at once. In the scatter_matrix() function, the data to be visualized is specified with the frame parameter; this time, the MeasuredValue columns of df are passed, and the points are color-coded by the values in their 0th column⁶.
```python
pd.plotting.scatter_matrix(
    frame=df["MeasuredValue"],
    c=list(df["MeasuredValue"].iloc[:, 0]),
    alpha=0.5)
plt.show()
```
Chap.5 Principal component analysis
In Chap.4, ${}_5 C_2 = 10$ scatter plots were generated as an overview of the entire data. However, these only show the distribution for each pair of chemical substances; they do not give an overview of the full five-dimensional data. Therefore, the five-dimensional data is summarized and visualized by principal component analysis.
First, since each axis currently has a different scale, standardize⁷ each variable so that all five axes have a unified scale.
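Concretely, standardization replaces each measured value $x$ of a variable with mean $\bar{x}$ and standard deviation $s$ by its z-score:

$$
z = \frac{x - \bar{x}}{s}
$$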
```python
dfs = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
dfs
```
*(all value columns are MeasuredValue, labeled by ChemicalName)*

| ScientificName | SampleID | ΣCHLs | ΣDDTs | ΣHCHs | ΣPBDEs | ΣPCBs |
| --- | --- | --- | --- | --- | --- | --- |
| Neophocaena phocaenoides | SAA001903 | -0.743053 | 0.320011 | 0.179064 | -0.308268 | -0.768856 |
| | SAA001904 | -0.588580 | 1.574027 | 1.297534 | -0.528241 | -0.693375 |
| | SAA001905 | -0.660428 | 1.574027 | 2.229592 | -0.677822 | -0.787727 |
| | SAA001906 | -0.678390 | 1.399859 | 2.742224 | -0.616230 | -0.759421 |
| | SAA001907 | -0.516732 | 4.012392 | 2.509209 | -0.440252 | -0.268792 |
| ... | ... | ... | ... | ... | ... | ... |
| Stenella coeruleoalba | SAA001984 | -0.757422 | -0.803378 | -0.314927 | -0.686621 | -1.004736 |
| | SAA001985 | 1.602782 | -0.097994 | -0.193759 | 1.275535 | 1.146484 |
| | SAA001986 | 3.291208 | 0.668349 | -0.105213 | 2.683360 | 2.656112 |
| | SAA001987 | 2.249413 | 0.111009 | -0.156477 | 1.803469 | 1.712595 |
| | SAA001988 | 2.931969 | 0.302594 | -0.091233 | 1.627491 | 2.090001 |

94 rows × 5 columns
Alternatively, scikit-learn's StandardScaler can be used for standardization.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)  # Learn the mean and standard deviation of each column
dfs = pd.DataFrame(scaler.transform(df), columns=df.columns)
dfs
```
*(all value columns are MeasuredValue, labeled by ChemicalName)*

| | ΣCHLs | ΣDDTs | ΣHCHs | ΣPBDEs | ΣPCBs |
| --- | --- | --- | --- | --- | --- |
| 0 | -0.747037 | 0.321727 | 0.180024 | -0.309921 | -0.772979 |
| 1 | -0.591736 | 1.582467 | 1.304491 | -0.531073 | -0.697093 |
| 2 | -0.663969 | 1.582467 | 2.241547 | -0.681457 | -0.791950 |
| 3 | -0.682027 | 1.407365 | 2.756927 | -0.619534 | -0.763493 |
| 4 | -0.519503 | 4.033907 | 2.522663 | -0.442612 | -0.270233 |
| ... | ... | ... | ... | ... | ... |
| 89 | -0.761484 | -0.807686 | -0.316615 | -0.690303 | -1.010123 |
| 90 | 1.611376 | -0.098520 | -0.194798 | 1.282374 | 1.152631 |
| 91 | 3.308856 | 0.671933 | -0.105778 | 2.697748 | 2.670354 |
| 92 | 2.261475 | 0.111604 | -0.157316 | 1.813139 | 1.721777 |
| 93 | 2.947690 | 0.304217 | -0.091722 | 1.636218 | 2.101208 |

94 rows × 5 columns
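Note that the two standardized tables differ slightly (e.g. -0.743053 vs -0.747037 in the first cell). This is because pandas' std() uses the sample standard deviation (ddof=1) by default, while StandardScaler divides by the population standard deviation (ddof=0). A quick sanity check of the StandardScaler result:

```python
# Each standardized column should have mean ~0 and population standard deviation ~1
print(dfs.mean().round(6))       # ~0 for every chemical
print(dfs.std(ddof=0).round(6))  # ~1 when computed with ddof=0, matching StandardScaler
```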
After standardization, perform the principal component analysis. The StandardScaler result is used here.
For the principal component analysis, scikit-learn's sklearn.decomposition.PCA() is used.
In principal component analysis, a synthetic variable (the first principal component) is first created so that its axis points in the direction that maximizes the variance of the data; the second, third, and subsequent principal components are then created as synthetic variables that successively capture the information the earlier components miss. Once the data has been transferred into the space spanned by the principal components, all that remains is interpretation.
```python
pca = PCA()
pca.fit(dfs)                  # Perform the principal component analysis
feature = pca.transform(dfs)  # Transfer the data into principal component space
pd.DataFrame(feature, columns=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
```
| | PC1 | PC2 | PC3 | PC4 | PC5 |
| --- | --- | --- | --- | --- | --- |
| 0 | -0.931614 | 0.574985 | 0.348535 | -0.188477 | -0.178934 |
| 1 | -0.525203 | 2.177125 | 0.038871 | -0.242288 | -0.498098 |
| 2 | -0.700197 | 2.892482 | -0.248495 | 0.251040 | -0.282242 |
| 3 | -0.720222 | 3.155379 | -0.366880 | 0.634978 | -0.017083 |
| 4 | 0.554487 | 4.585518 | 0.039662 | -0.952484 | -0.965309 |
| ... | ... | ... | ... | ... | ... |
| 89 | -1.607963 | -0.471519 | 0.077579 | 0.103413 | -0.068820 |
| 90 | 2.190969 | -0.704970 | 0.009171 | 0.563020 | 0.053168 |
| 91 | 4.974287 | -0.699630 | 0.023300 | 0.765578 | 0.013150 |
| 92 | 3.218248 | -0.753396 | -0.005256 | 0.688830 | 0.065009 |
| 93 | 3.804695 | -0.750210 | -0.513720 | 0.727646 | -0.224109 |

94 rows × 5 columns
Let's also look at the eigenvectors. An eigenvector can be thought of as the "force that tilts the axes" applied when the original data is transferred into the principal component space.
In Python, the eigenvectors are available through the components_ attribute of scikit-learn's PCA.
```python
# With the pivot-table df, df.columns already holds exactly the five chemical columns
pd.DataFrame(pca.components_, columns=df.columns, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
```
*(all value columns are MeasuredValue, labeled by ChemicalName)*

| | ΣCHLs | ΣDDTs | ΣHCHs | ΣPBDEs | ΣPCBs |
| --- | --- | --- | --- | --- | --- |
| PC1 | 0.572482 | 0.307088 | -0.005288 | 0.488078 | 0.582848 |
| PC2 | -0.212874 | 0.641425 | 0.730929 | -0.048824 | -0.081347 |
| PC3 | -0.328297 | 0.180897 | -0.249982 | 0.782267 | -0.430192 |
| PC4 | 0.427788 | -0.529630 | 0.566847 | 0.279604 | -0.370130 |
| PC5 | -0.579799 | -0.425488 | 0.286195 | 0.263203 | 0.575856 |
Now that we have the eigenvectors, let's find the eigenvalue of each principal component. An eigenvalue is the coefficient associated with an eigenvector and is an index of how much variance each principal component axis carries.
In Python, the eigenvalues can be obtained through the explained_variance_ attribute of scikit-learn's PCA.
```python
pd.DataFrame(pca.explained_variance_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
```
| | 0 |
| --- | --- |
| PC1 | 2.301828 |
| PC2 | 1.526633 |
| PC3 | 0.681585 |
| PC4 | 0.450884 |
| PC5 | 0.092834 |
Next, find the contribution rate. The contribution rate is an index showing how much of the information in the data each principal component axis explains. In other words, a principal component with a contribution rate of 60% can explain 60% of the original data, and 40% is lost.
In Python, the contribution rate can be obtained through the explained_variance_ratio_ attribute of scikit-learn's PCA.
```python
pd.DataFrame(pca.explained_variance_ratio_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
```
| | 0 |
| --- | --- |
| PC1 | 0.455468 |
| PC2 | 0.302078 |
| PC3 | 0.134867 |
| PC4 | 0.089218 |
| PC5 | 0.018369 |
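As a quick check, since all components are kept here, the contribution rate is simply each eigenvalue divided by the sum of all eigenvalues:

```python
import numpy as np

# explained_variance_ratio_ equals the eigenvalues normalized to sum to 1
print(np.allclose(pca.explained_variance_ratio_,
                  pca.explained_variance_ / pca.explained_variance_.sum()))  # True
```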
Here, the cumulative contribution rate is also calculated. The cumulative contribution rate is needed to see what percentage of the data is preserved and what percentage is lost. In other words, by looking only at the principal components up to a chosen cumulative contribution rate, we can reduce the number of dimensions while keeping an overview of the entire data.
Here, the cumulative sum is calculated with numpy's cumsum function.
```python
plt.plot([0] + list(np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative contribution rate")
plt.grid()
plt.show()
```
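The selection rules used in the final visualization step below (eigenvalue of at least 1, or a cumulative contribution threshold) can also be computed directly; a small sketch:

```python
# Number of components with eigenvalue >= 1 (here: 2, i.e. PC1 and PC2)
print((pca.explained_variance_ >= 1).sum())
# Smallest number of components whose cumulative contribution reaches 80% (here: 3)
print(int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.8)) + 1)
```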
As it stands, it is hard to say what each generated principal component axis means. However, by visualizing the contribution of each variable to each principal component, the meaning of the components becomes visible. Here, the contributions of the variables to the first and second principal components are plotted on a scatter plot.
```python
plt.figure()
for x, y, name in zip(pca.components_[0], pca.components_[1], df["MeasuredValue"].columns[:]):
    plt.text(x, y, name)
plt.scatter(pca.components_[0], pca.components_[1], alpha=0.8)
plt.grid()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```
Finally, visualize the result of the principal component analysis. Generally, the principal components with eigenvalues of 1 or more, or those up to a cumulative contribution rate of about 80%, are adopted, and the distribution is examined for combinations of the adopted components. Here, the first two principal components are adopted and visualized in a graph. As with an ordinary scatter plot, matplotlib's scatter function can be used, specifying the first principal component (the 0th column of feature) and the second principal component (the 1st column of feature) as the X and Y parameters.
```python
plt.figure()
# Color-code the points by species, using the class_dic mapping created in Chap.3
# (with the pivot-table df, the species is read from the ScientificName index level)
plt.scatter(x=feature[:, 0], y=feature[:, 1], alpha=0.8,
            c=[class_dic[name] for name in df.index.get_level_values("ScientificName")])
plt.grid()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```
Footnotes
1. An explanatory variable is a variable that explains the objective variable, i.e. the result of interest. In other words, for an objective variable that represents a result, the variables that cause it are the explanatory variables. Also called an independent variable.
2. A variable created by combining multiple variables. In principal component analysis, each principal component is a synthetic variable.
3. Software or libraries whose source code, the blueprint of the program, is widely disclosed to the general public. As long as the license permits, anyone can use it free of charge. Also, since anyone can take part in development, updates from the user's perspective and fixes for bugs that only a few people notice happen readily. Python, Jupyter Notebook, and Anaconda are all open-source software.
4. All are methods used in machine learning. Classification is used for tasks such as image classification, regression for predicting future values such as stock prices, and clustering for grouping similar data based on trends in the data.
5. The data type that stores tabular data in pandas. Data is manipulated by specifying cells, rows, and columns, as if a CSV or TSV file had been opened in Excel.
6. In Python, numbering starts at 0.
7. The operation of subtracting the mean from each measured value and then dividing by the standard deviation.