Introduction

Following on from "Matplotlib learned from chemoinformatics", this article explains scikit-learn, one of Python's representative libraries, using lipidomics (the comprehensive analysis of lipids) as the theme. It focuses on practical examples from chemoinformatics, so if you want to check the basics first, please read the following article before this one.

Pharmaceutical researcher summarized scikit-learn

Data set preparation

scikit-learn is a library for machine learning.

Here, we consider predicting the retention time (RT) in liquid chromatography (LC) from the physical properties of a compound using partial least squares (PLS) regression.

First, create a dataset for machine learning.

import pandas as pd


params_fatty_acids = ['Heavy atoms', 'Rotatable Bonds', 'van der Waals Molecular Volume', 'logP', 'Molar Refractivity']

lauric = [14, 10, 231.10, 3.99, 59.48]
myristic = [16, 12, 265.70, 4.77, 68.71]
palmitic = [18, 14, 300.30, 5.55, 77.95]
palmitoleic = [18, 13, 297.66, 5.33, 77.85]
stearic = [20, 16, 334.90, 6.33, 87.18]
oleic = [20, 15, 332.26, 6.11, 87.09]
linoleic = [20, 14, 329.62, 5.88, 86.99]
linolenic = [20, 13, 326.98, 5.66, 86.90]
stearidonic = [20, 12, 324.34, 5.44, 86.81]
arachidic = [22, 18, 369.50, 7.11, 96.42]
bishomo_gamma_linolenic = [22, 15, 361.58, 6.44, 96.13]
arachidonic = [22, 14, 358.94, 6.22, 96.04]
eicosapentaenoic = [22, 13, 356.30, 5.99, 95.95]
behenic = [24, 20, 404.10, 7.89, 105.65]
adrenic = [24, 16, 393.54, 7.00, 105.27]
docosapentaenoic = [24, 15, 390.90, 6.77, 105.18]
docosahexaenoic = [24, 14, 388.26, 6.55, 105.09]

df_fatty_acids = pd.DataFrame([lauric, myristic, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, stearidonic, arachidic, bishomo_gamma_linolenic, arachidonic, eicosapentaenoic, behenic, adrenic, docosapentaenoic, docosahexaenoic], columns=params_fatty_acids)
df_fatty_acids['Experimental Retention Time (min)'] = [4.53, 7.52, 11.02, 10.59, 14.45, 11.86, 9.76, 8.31, 6.71, 17.52, 11.20, 9.96, 8.27, 20.40, 12.75, 11.52, 9.84]

print(df_fatty_acids)

Here, params_fatty_acids is the list of physical property names used as explanatory variables. The physical property values are taken from the LIPID MAPS database. The RT values are taken from reverse-phase LC data published on RIKEN's PRIMe website (.riken.jp/Metabolomics_Software/MrmDatabase/Detail%20of%20LCQqQMS%20method%20(ODS-lipids).xlsx). In practice, data are usually read from a CSV file or similar with pandas.read_csv, and preprocessing such as filling in missing values is often required.
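As a rough illustration of the CSV workflow mentioned above, the following sketch reads a small table with pandas.read_csv and fills a missing value. The CSV text, the column selection, and the choice of mean imputation are assumptions made for this example, not part of the original dataset preparation.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk;
# the empty logP cell on the myristic row is a missing value.
csv_text = """name,Heavy atoms,logP
lauric,14,3.99
myristic,16,
palmitic,18,5.55
"""

# pandas.read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before deciding how to handle them.
n_missing = df_csv.isna().sum()

# One simple option (an assumption here): fill gaps with the column mean.
df_filled = df_csv.fillna(df_csv.mean(numeric_only=True))

print(n_missing)
print(df_filled)
```

Mean imputation is only one option. Depending on the data, dropping incomplete rows with dropna may be more appropriate.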

Model building

Next, we build a prediction model and use it to calculate predicted values.

from sklearn.cross_decomposition import PLSRegression


X = df_fatty_acids[params_fatty_acids]  # Explanatory variables
y = df_fatty_acids['Experimental Retention Time (min)']  # Objective variable

pls_rt = PLSRegression()
pls_rt.fit(X, y)  # Build a PLS prediction model

y_pred = pls_rt.predict(X)  # Calculate the predicted values

df_fatty_acids['Predicted Retention Time (min)'] = y_pred
df_fatty_acids['Diff (min)'] = df_fatty_acids['Predicted Retention Time (min)'] - df_fatty_acids['Experimental Retention Time (min)']
# Note: despite its name, 'Accuracy (%)' holds the signed relative error (Diff / experimental RT)
df_fatty_acids['Accuracy (%)'] = (df_fatty_acids['Diff (min)'] / df_fatty_acids['Experimental Retention Time (min)']) * 100

print(df_fatty_acids)

The relationship between the measured value and the predicted value is shown below.

%matplotlib inline
import matplotlib.pyplot as plt


plt.scatter(y, y_pred)
plt.xlabel('Experimental Retention Time (min)')
plt.ylabel('Predicted Retention Time (min)')

plt.savefig('rts_fatty_acids.png')
plt.show()

In this data, the measured and predicted values appear to agree well. You can check how well the built model fits with r2_score.

from sklearn.metrics import r2_score


print(r2_score(y, y_pred))

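r2_score (the coefficient of determination) is at most 1, and the closer it is to 1, the better the predicted values agree with the measured values. It can be negative when a model fits worse than always predicting the mean. In this data, r2_score exceeds 0.98, so the model fits this data quite well.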
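To make the meaning of r2_score concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot and compares it with sklearn.metrics.r2_score. The toy arrays and the helper name r2_manual are made up for this illustration and are unrelated to the fatty acid data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values invented for this illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # close to the true values
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # reversed, worse than predicting the mean


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


r2_good = r2_score(y_true, y_good)
r2_bad = r2_score(y_true, y_bad)

print(r2_good, r2_manual(y_true, y_good))  # both approximately 0.98
print(r2_bad, r2_manual(y_true, y_bad))    # both approximately -3.0
```

The second case shows that the score is not bounded below by 0: predictions that are worse than the mean give a negative value.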

This time, we used 5 types of physical property parameters to predict RT, but let's see which of them contributes significantly to the prediction.

print(pls_rt.coef_)

From this result, the coefficient for Rotatable Bonds has the largest absolute value in this data, at 3.44, indicating that this physical property contributes strongly to the prediction of RT.

Prediction using a model

So far, we have discussed prediction accuracy on the data used to build the model. Finally, let's see how accurately the model predicts data that was not used to build it.

lignoceric = [26, 22, 438.70, 8.67, 114.88]
x_lignoceric = pd.DataFrame([lignoceric], columns=params_fatty_acids)
y_pred_lignoceric = pls_rt.predict(x_lignoceric)

y_exp_lignoceric = 22.31  # Measured value

print(y_exp_lignoceric)
print(y_pred_lignoceric)

Here, I predicted the RT of lignoceric acid (FA 24:0). The predicted value differs from the measured value by about 1.2 minutes. Opinions will vary on whether this difference is large or small, but I personally think the prediction accuracy is rather low. The likely reason is that lignoceric acid is more hydrophobic than any fatty acid species in the dataset used to build the model, so the model was never fitted to data with physical properties close to those of lignoceric acid.
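One simple way to check whether a new compound lies outside the range covered by the training data is to compare each of its descriptors with the training minimum and maximum. The sketch below does this for three of the descriptors, with values copied from the dataset above. The per-column range check is a simplified stand-in for a proper applicability-domain analysis, not a method used in the original article.

```python
import pandas as pd

# Three descriptor columns copied from the 17 training fatty acids above
train = pd.DataFrame({
    'Heavy atoms': [14, 16, 18, 18, 20, 20, 20, 20, 20,
                    22, 22, 22, 22, 24, 24, 24, 24],
    'Rotatable Bonds': [10, 12, 14, 13, 16, 15, 14, 13, 12,
                        18, 15, 14, 13, 20, 16, 15, 14],
    'logP': [3.99, 4.77, 5.55, 5.33, 6.33, 6.11, 5.88, 5.66, 5.44,
             7.11, 6.44, 6.22, 5.99, 7.89, 7.00, 6.77, 6.55],
})

# The same descriptors for lignoceric acid
new = pd.Series({'Heavy atoms': 26, 'Rotatable Bonds': 22, 'logP': 8.67})

# True where the new value falls outside the training range for that column
outside = (new < train.min()) | (new > train.max())

summary = pd.DataFrame({
    'train min': train.min(),
    'train max': train.max(),
    'new value': new,
    'outside range': outside,
})
print(summary)
```

Every checked descriptor of lignoceric acid exceeds the training maximum, so the model is extrapolating for this compound.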

PLS regression can be performed with the procedure above. Although not covered here, the number of latent variables, n_components, is also important in PLS regression. This time I used the default value of 2, but changing it changes the prediction accuracy. I would like to explain this on another occasion.

Summary

Here, we have explained scikit-learn, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points again.

- By using scikit-learn, you can easily perform machine learning.
- Machine learning is performed in the flow of data preprocessing, model construction, and prediction.
- Proceed while observing the r2_score of the constructed model and the difference between the predicted value and the measured value.
