Building a predictive model is all well and good, but when you make predictions, are you applying exactly the same preprocessing conditions that were used to build the model? That is the topic here. It seems to be a very important point in operating a machine learning system.
Especially in chemoinformatics, models are often built by combining various commercial and free software: compounds are preprocessed with tool A, descriptors are calculated with tool B, the prediction model is built with tool C, and so on. Building a model that way is fine, but this time I verified what happens when the user does not apply the same preprocessing at prediction time.
There are many kinds of preprocessing, but since I happened to run into this particular case, I went with the following scenario.
- When handling compound data, hydrogens may or may not be added explicitly (obvious ones are often omitted).
- With RDKit's Morgan fingerprint, the values of the output explanatory variables differ to a fair degree depending on whether hydrogens have been added explicitly.
- So I verified how much the predicted values fluctuate depending on whether the explicit-hydrogen condition matches between the input compounds used to build the prediction model and those given at prediction time.
What is RDKit's Morgan fingerprint in the first place? In code, it looks like this.
from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles("CCC")
mol = Chem.AddHs(mol)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
`Chem.MolFromSmiles` creates a compound object from the SMILES string "CCC" (propane), `Chem.AddHs` then adds hydrogens to that compound explicitly, and `AllChem.GetMorganFingerprintAsBitVect` calculates the descriptor. The result is a 2048-bit vector in which each bit is 0 or 1.
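To see the effect on a single molecule, the following minimal sketch computes the fingerprint for the same SMILES with and without explicit hydrogens and compares the two. The helper name `morgan_bits` and the use of Tanimoto similarity for the comparison are my own choices for illustration and are not part of the original experiment.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_bits(smiles, add_hs):
    """Return the 2048-bit Morgan fingerprint, optionally after adding explicit hydrogens."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    if add_hs:
        mol = Chem.AddHs(mol)
    # Same parameters as the snippet above.
    return AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=3, nBits=2048, useFeatures=False, useChirality=False
    )


fp_without_h = morgan_bits("CCC", add_hs=False)
fp_with_h = morgan_bits("CCC", add_hs=True)

print("on bits without H:", fp_without_h.GetNumOnBits())
print("on bits with H:   ", fp_with_h.GetNumOnBits())
print("Tanimoto similarity:", DataStructs.TanimotoSimilarity(fp_without_h, fp_with_h))
```

A similarity below 1.0 here means the model would see different explanatory variables for the same compound, depending only on the hydrogen handling.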
I built prediction models and made predictions with this fingerprint, and then checked how the prediction results differ between the case where `Chem.AddHs(mol)` is applied neither when building the model nor when predicting, and the case where it is applied only when building the model.
Using about 100 training compounds and about 10,000 prediction target compounds, I made predictions under the following three combinations and summarized the correlations between the predicted values in a table.
- Prediction model creation: without hydrogen, prediction: with hydrogen
- Prediction model creation: with hydrogen, prediction: with hydrogen
- Prediction model creation: with hydrogen, prediction: without hydrogen
The results are as follows.
For the prediction model built from descriptors calculated after explicitly adding hydrogens to the training data, the predicted values obtained when hydrogens are omitted from the prediction target data have a correlation of only about 0.48 with the predicted values obtained when hydrogens are explicitly added. The plot of the relationship between the two is as follows. That is a considerable error.
This value of 0.48 is lower than the correlation of 0.58 between the predictions made with hydrogens and those made without hydrogens when the conditions were kept the same at model-building time and at prediction time. There is some debate about which is the more appropriate input for the Morgan fingerprint, with or without hydrogens (in some cases it is not specified at all), but before anything else, it seems important to align the input conditions properly.
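The text does not say which correlation coefficient was used. Assuming a Pearson correlation, the comparison between two sets of predictions for the same compounds could be computed as in this sketch. The function name `pearson` and the toy numbers are illustrative only and are not the data from this experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy predictions for the same five compounds under two preprocessing conditions
# (made-up numbers, only to show the call).
pred_with_h = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_without_h = [1.2, 1.9, 3.5, 3.6, 5.1]
print(round(pearson(pred_with_h, pred_without_h), 3))
```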
Make sure the preprocessing conditions are the same when building a prediction model and when making predictions. Ideally the system itself should provide the whole flow, preprocessing included, but if that is not possible for some reason, document the conditions clearly.
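One way to provide preprocessing on the system side is to bundle the descriptor step with the model, so that training and prediction go through the same code path. The sketch below is only an illustration under assumptions: the class `BundledModel`, the `settings` dictionary, and the toy stand-ins `toy_featurize` and `RatioModel` are all hypothetical. In practice `featurize` would wrap the RDKit steps, including the decision whether to call `Chem.AddHs`, and `model` would be a real regressor.

```python
class BundledModel:
    """Keeps the featurizer and the model together so both phases share one preprocessing path."""

    def __init__(self, featurize, model, settings):
        self.featurize = featurize      # callable: raw input -> feature vector
        self.model = model              # object with fit(X, y) and predict(X)
        self.settings = dict(settings)  # recorded for documentation, e.g. {"add_hs": True}

    def fit(self, raw_inputs, targets):
        self.model.fit([self.featurize(r) for r in raw_inputs], targets)
        return self

    def predict(self, raw_inputs):
        # Same featurize call as in fit, so the conditions cannot drift apart.
        return self.model.predict([self.featurize(r) for r in raw_inputs])


class RatioModel:
    """Toy stand-in model: predicts the first feature times a learned ratio."""

    def fit(self, X, y):
        self.ratio = sum(y) / sum(x[0] for x in X)

    def predict(self, X):
        return [x[0] * self.ratio for x in X]


def toy_featurize(smiles):
    # Hypothetical stand-in for descriptor calculation: counts "C" characters.
    return [smiles.count("C")]


bundle = BundledModel(toy_featurize, RatioModel(), {"add_hs": True})
bundle.fit(["C", "CC", "CCC"], [2.0, 4.0, 6.0])
print(bundle.predict(["CCCC"]), bundle.settings)
```

Because callers only ever pass raw inputs, they have no opportunity to skip or change the preprocessing, and the recorded `settings` can be copied into the documentation as is.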