Following "Matplotlib learned from chemoinformatics", this article explains scikit-learn, one of Python's representative libraries, using lipidomics (the comprehensive analysis of lipids) as the theme. We will mainly cover practical examples from chemoinformatics, so if you want to check the basics first, please read the following article before this one.
Pharmaceutical researcher summarized scikit-learn
scikit-learn is a library for machine learning.
Here, we consider predicting the retention time (RT) in liquid chromatography (LC) from the physical properties of a compound using partial least squares (PLS) regression.
First, create a dataset for machine learning.
import pandas as pd
params_fatty_acids = ['Heavy atoms', 'Rotatable Bonds', 'van der Waals Molecular Volume', 'logP', 'Molar Refractivity']
lauric = [14, 10, 231.10, 3.99, 59.48]
myristic = [16, 12, 265.70, 4.77, 68.71]
palmitic = [18, 14, 300.30, 5.55, 77.95]
palmitoleic = [18, 13, 297.66, 5.33, 77.85]
stearic = [20, 16, 334.90, 6.33, 87.18]
oleic = [20, 15, 332.26, 6.11, 87.09]
linoleic = [20, 14, 329.62, 5.88, 86.99]
linolenic = [20, 13, 326.98, 5.66, 86.90]
stearidonic = [20, 12, 324.34, 5.44, 86.81]
arachidic = [22, 18, 369.50, 7.11, 96.42]
bishomo_gamma_linolenic = [22, 15, 361.58, 6.44, 96.13]
arachidonic = [22, 14, 358.94, 6.22, 96.04]
eicosapentaenoic = [22, 13, 356.30, 5.99, 95.95]
behenic = [24, 20, 404.10, 7.89, 105.65]
adrenic = [24, 16, 393.54, 7.00, 105.27]
docosapentaenoic = [24, 15, 390.90, 6.77, 105.18]
docosahexaenoic = [24, 14, 388.26, 6.55, 105.09]
df_fatty_acids = pd.DataFrame([lauric, myristic, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, stearidonic, arachidic, bishomo_gamma_linolenic, arachidonic, eicosapentaenoic, behenic, adrenic, docosapentaenoic, docosahexaenoic], columns=params_fatty_acids)
df_fatty_acids['Experimental Retention Time (min)'] = [4.53, 7.52, 11.02, 10.59, 14.45, 11.86, 9.76, 8.31, 6.71, 17.52, 11.20, 9.96, 8.27, 20.40, 12.75, 11.52, 9.84]
print(df_fatty_acids)
Here, params_fatty_acids is the list of physical property parameter names used as explanatory variables.
The physical property values are taken from the LIPID MAPS database.
The RT values are taken from the reversed-phase LC RT data published on RIKEN's PRIMe website (.riken.jp/Metabolomics_Software/MrmDatabase/Detail%20of%20LCQqQMS%20method%20(ODS-lipids).xlsx).
In practice, data are often read from CSV files and similar sources with pandas.read_csv or a related function.
In addition, data preprocessing such as imputing missing values is often required.
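As a minimal sketch of that workflow, the snippet below reads a small CSV with pandas.read_csv and fills a missing value with the column median. The CSV text is an in-memory stand-in for a real file, and the blank logP entry for palmitic acid was introduced here purely for illustration. Median imputation is only one of several possible strategies.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical layout).
# The logP value for palmitic acid is left blank on purpose to simulate a missing value.
csv_text = """name,Heavy atoms,Rotatable Bonds,logP
lauric,14,10,3.99
myristic,16,12,4.77
palmitic,18,14,
stearic,20,16,6.33
"""

df_csv = pd.read_csv(io.StringIO(csv_text))
n_missing_before = int(df_csv['logP'].isna().sum())

# Fill the missing value with the column median (one simple imputation strategy)
df_csv['logP'] = df_csv['logP'].fillna(df_csv['logP'].median())
n_missing_after = int(df_csv['logP'].isna().sum())

print(df_csv)
```

With a real file, the io.StringIO object would simply be replaced by the file path.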
Next, we build a prediction model and use it to calculate the predicted values.
from sklearn.cross_decomposition import PLSRegression
X = df_fatty_acids[params_fatty_acids] #Explanatory variable
y = df_fatty_acids['Experimental Retention Time (min)'] #Objective variable
pls_rt = PLSRegression()
pls_rt.fit(X, y) #Build a PLS prediction model
y_pred = pls_rt.predict(X) #Calculate the predicted value
df_fatty_acids['Predicted Retention Time (min)'] = y_pred
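# Signed prediction error in minutes
df_fatty_acids['Diff (min)'] = df_fatty_acids['Predicted Retention Time (min)'] - df_fatty_acids['Experimental Retention Time (min)']
# Signed relative error (%) with respect to the experimental value; values near 0 mean better agreement
df_fatty_acids['Accuracy (%)'] = (df_fatty_acids['Diff (min)'] / df_fatty_acids['Experimental Retention Time (min)']) * 100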
print(df_fatty_acids)
The relationship between the measured value and the predicted value is shown below.
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(y, y_pred)
plt.xlabel('Experimental Retention Time (min)')
plt.ylabel('Predicted Retention Time (min)')
plt.savefig('rts_fatty_acids.png')
plt.show()
For this data, the measured and predicted values appear to agree well.
You can check how well the constructed model fits with r2_score.
from sklearn.metrics import r2_score
print(r2_score(y, y_pred))
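r2_score returns the coefficient of determination (R²). Its maximum is 1, it can be negative for a model that fits worse than always predicting the mean, and the closer it is to 1, the better the predicted values agree with the measured values.
For this data, r2_score exceeds 0.98, which indicates a fairly good fit.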
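As a small worked example of how this score is computed, the sketch below evaluates R² = 1 - SS_res / SS_tot by hand and compares it with r2_score. The numbers are toy values chosen for illustration, not taken from the fatty acid dataset, and r2_manual is a helper name introduced here. Note that the score falls below 0 when the predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values for illustration only (not from the fatty acid dataset)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # close to the measured values
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


r2_good = r2_manual(y_true, y_good)
r2_bad = r2_manual(y_true, y_bad)

print(r2_good, r2_score(y_true, y_good))
print(r2_bad, r2_score(y_true, y_bad))
```

The first pair of values is approximately 0.98 and the second is approximately -3, and in both cases the hand calculation agrees with r2_score.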
This time, we used five physical property parameters to predict RT. Let's see which of them contributes most to the prediction.
print(pls_rt.coef_)
From this result, it can be seen that for this data the coefficient for Rotatable Bonds has the largest absolute value, 3.44, which suggests that this physical property contributes strongly to the prediction of RT.
So far we have discussed the prediction accuracy for the data used to build the model. Finally, let's see how accurately the model predicts data that were not used to build it.
lignoceric = [26, 22, 438.70, 8.67, 114.88]
x_lignoceric = pd.DataFrame([lignoceric], columns=params_fatty_acids)
y_pred_lignoceric = pls_rt.predict(x_lignoceric)
y_exp_lignoceric = 22.31 #Measured value
print(y_exp_lignoceric)
print(y_pred_lignoceric)
Here, we predicted the RT of lignoceric acid (FA 24:0). The difference between the predicted and measured values is about 1.2 minutes. Opinions may differ on whether this difference is large or small, but I personally think the prediction accuracy is rather low. The likely reason is that lignoceric acid is more hydrophobic than any fatty acid species in the dataset used to build the model, so no data with physical properties close to those of lignoceric acid were fitted during model construction.
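One simple way to check whether a new compound lies outside the range covered by the training data is to compare each of its descriptors with the minimum and maximum in the training set. The sketch below does this for lignoceric acid using the values listed earlier. The variable names and the report layout are introduced here for illustration.

```python
import pandas as pd

params = ['Heavy atoms', 'Rotatable Bonds', 'van der Waals Molecular Volume',
          'logP', 'Molar Refractivity']

# Descriptor values of the 17 fatty acids used to build the model
training_rows = [
    [14, 10, 231.10, 3.99, 59.48], [16, 12, 265.70, 4.77, 68.71],
    [18, 14, 300.30, 5.55, 77.95], [18, 13, 297.66, 5.33, 77.85],
    [20, 16, 334.90, 6.33, 87.18], [20, 15, 332.26, 6.11, 87.09],
    [20, 14, 329.62, 5.88, 86.99], [20, 13, 326.98, 5.66, 86.90],
    [20, 12, 324.34, 5.44, 86.81], [22, 18, 369.50, 7.11, 96.42],
    [22, 15, 361.58, 6.44, 96.13], [22, 14, 358.94, 6.22, 96.04],
    [22, 13, 356.30, 5.99, 95.95], [24, 20, 404.10, 7.89, 105.65],
    [24, 16, 393.54, 7.00, 105.27], [24, 15, 390.90, 6.77, 105.18],
    [24, 14, 388.26, 6.55, 105.09],
]
lignoceric = [26, 22, 438.70, 8.67, 114.88]

df_train = pd.DataFrame(training_rows, columns=params)
new_compound = pd.Series(lignoceric, index=params)

# One row per descriptor: training range, new value, and whether it falls outside
report = pd.DataFrame({
    'train min': df_train.min(),
    'train max': df_train.max(),
    'lignoceric': new_compound,
})
report['outside range'] = (
    (report['lignoceric'] < report['train min'])
    | (report['lignoceric'] > report['train max'])
)
print(report)
```

For this data, every descriptor of lignoceric acid exceeds the training maximum, so the prediction is an extrapolation in all five variables.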
PLS regression can be performed by the above procedure.
Although not discussed in detail here, the number of latent variables, n_components, is also important when performing PLS regression.
This time, I used the default value of 2, but changing it changes the prediction accuracy somewhat.
I would like to explain this on another occasion.
Here, we have explained scikit-learn, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points again.
--By using scikit-learn, you can easily perform machine learning.
--Machine learning is performed in the flow of data preprocessing, model construction, and prediction.
--Proceed while checking the r2_score of the constructed model and the difference between the predicted and measured values.