In the article "AI drug discovery started free of charge using papers and public databases", I visualized the applicability domain (scope of application) of the prediction model as a bonus. There, the data to be predicted was projected with PCA and UMAP dimensionality-reduction models that had been fitted only on the training data, and the result was plotted. At first glance I concluded that **the data to be predicted seems to be within the range of the training data**, but I wondered whether that was really the case, so I will verify it here.
In that article I wrote: "The figure shows that the data to be predicted does not seem to deviate significantly from the range of the training data, but this may be an effect of reducing the dimensions with a model built from the training data alone." Therefore, this time I will try **dimensionality reduction using all of the training data and the prediction target data**, and compare the result with the figure obtained when the reduction is fitted only on the training data.
Below is the previous source, modified so that the dimensionality reduction is fitted on all of the training data and the prediction target data.
import argparse
import csv

import matplotlib.pyplot as plt
import numpy as np
import umap
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA


def smiles_to_fp(smiles):
    # Morgan fingerprint (radius 3, 2048 bits) as a NumPy array.
    # Returns None when RDKit cannot parse the SMILES.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
    arr = np.zeros((2048,), dtype=np.uint8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-train", type=str, required=True)
    parser.add_argument("-predict", type=str)
    parser.add_argument("-result", type=str)
    parser.add_argument("-method", type=str, default="PCA", choices=["PCA", "UMAP"])
    args = parser.parse_args()

    # all data (training + prediction target), used to fit the model
    all_datas = []

    # training data: load csv, calculate fingerprints
    train_datas_active = []
    train_datas_inactive = []
    with open(args.train, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            fp = smiles_to_fp(row["canonical_smiles"])
            if fp is None:
                continue
            all_datas.append(fp)
            if int(row["outcome"]) == 1:
                train_datas_active.append(fp)
            else:
                train_datas_inactive.append(fp)

    predict_datas_active = []
    predict_datas_inactive = []
    if args.predict and args.result:
        # read the prediction results (1 = predicted active, 0 = otherwise)
        result_outcomes = []
        with open(args.result, "r", encoding="utf-8_sig") as f:
            reader = csv.DictReader(f)
            for row in reader:
                result_outcomes.append(1 if row["Prediction"] == "Active" else 0)

        # prediction target data (DrugBank): load csv, calculate fingerprints
        with open(args.predict, "r") as f:
            reader = csv.DictReader(f)
            for i, row in enumerate(reader):
                fp = smiles_to_fp(row["smiles"])
                if fp is None:
                    continue
                all_datas.append(fp)
                if result_outcomes[i] == 1:
                    predict_datas_active.append(fp)
                else:
                    predict_datas_inactive.append(fp)

    if args.method == "PCA":
        model = PCA(n_components=2)
    else:
        model = umap.UMAP()

    # The change from the previous article: fit on all data
    # (training + prediction target) instead of the training data only.
    model.fit(np.array(all_datas))

    def scatter(datas, color, alpha, marker):
        # Project one group with the fitted model and plot it (skip empty groups).
        if datas:
            result = model.transform(np.array(datas))
            plt.scatter(result[:, 0], result[:, 1], c=color, alpha=alpha, marker=marker)

    # training data: blue (o = active, x = inactive)
    scatter(train_datas_active, "blue", 0.5, "o")
    scatter(train_datas_inactive, "blue", 0.5, "x")
    # prediction target data: red (o = predicted active, x = predicted inactive)
    scatter(predict_datas_active, "red", 0.1, "o")
    scatter(predict_datas_inactive, "red", 0.1, "x")
    plt.show()


if __name__ == "__main__":
    main()
The only change to the program is that the argument passed when fitting the model was changed from the training data to all data (training data + prediction target data).
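To make the effect of that one-line change concrete, here is a minimal sketch on toy data rather than on real fingerprints. Everything in it is an assumption for illustration: the five-dimensional random data, the helper names `fit_pca_2d`, `project`, and `centroid_gap`, and the use of a plain SVD-based PCA in NumPy instead of scikit-learn. The toy "prediction" set differs from the toy "training" set only along a dimension in which the training data never varies.

```python
import numpy as np


def fit_pca_2d(X):
    """Fit a 2-component PCA via SVD; return (mean, components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:2]


def project(X, mean, components):
    """Project X onto the fitted 2-D PCA plane."""
    return (X - mean) @ components.T


def centroid_gap(a, b):
    """Distance between the centroids of two projected point clouds."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


rng = np.random.default_rng(0)

# Toy training set: varies only along the first two of five dimensions.
train = np.zeros((200, 5))
train[:, :2] = rng.normal(size=(200, 2))

# Toy prediction set: same spread, but shifted far along the fifth dimension.
predict = np.zeros((200, 5))
predict[:, :2] = rng.normal(size=(200, 2))
predict[:, 4] = 10.0

# (a) Fit on the training data only, then project both sets.
mean_a, comp_a = fit_pca_2d(train)
gap_train_only = centroid_gap(project(train, mean_a, comp_a),
                              project(predict, mean_a, comp_a))

# (b) Fit on all data (training + prediction), then project both sets.
mean_b, comp_b = fit_pca_2d(np.vstack([train, predict]))
gap_all = centroid_gap(project(train, mean_b, comp_b),
                       project(predict, mean_b, comp_b))

print(f"centroid gap, fitted on training only: {gap_train_only:.2f}")
print(f"centroid gap, fitted on all data:      {gap_all:.2f}")
```

In this toy setting the two clouds sit almost on top of each other when the model is fitted on the training set only, because the axis along which they differ is invisible to that model. Fitting on all data pulls them apart. Real fingerprint data is far messier, so this only illustrates the mechanism.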
To repeat, blue is the training data and red is the prediction target data. Please forgive the overlapping points, which make the plots hard to read.
- With both PCA and UMAP, when the model is fitted only on the training data, much of the prediction target data appears to fall within the applicability domain.
- However, when the model is fitted on all of the training data plus the prediction target data, a large number of points lie well outside the region covered by the training data.
- In other words, it is **extremely dangerous** to judge whether the prediction target data is within the applicability domain by looking at the former figure.
- The reason is that the former figure comes from a dimensionality-reduction model that reflects only the overall trends of the training data. When data that does not follow those trends is projected with it, the model cannot capture how that data differs.
- So how should the applicability domain be judged? One option is to define some formula that measures the distance to the training data, and to rely on numerical values rather than figures.
- The other is to build the dimensionality-reduction model from a larger set of compounds that covers both the training data and the prediction target data.
- In the latter case, you might think you should prepare every compound that exists in nature. But then there would be too much data to build a model in practice, and the space would be so large that training data and prediction data could no longer be told apart. So I think it is necessary to prepare an appropriate compound set for the domain of the prediction model (I believe there are papers on this).
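As one possible reading of "define some formula for measuring the distance to the training data", here is a minimal sketch that scores each query compound by its highest Tanimoto similarity to any training compound. This is only an illustration, not the method used in the article. The fingerprints are assumed to be given as Python sets of on-bit indices (with RDKit, these could be derived from a Morgan fingerprint). The function names, the toy bit sets, and the threshold of 0.3 are all hypothetical and would need tuning for a real model.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity_to_training(query, training):
    """Highest Tanimoto similarity between the query and any training compound."""
    return max(tanimoto(query, t) for t in training)


def in_domain(query, training, threshold=0.3):
    """Treat a query as in-domain if its nearest training neighbour is similar enough.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return max_similarity_to_training(query, training) >= threshold


# Toy fingerprints as sets of on-bit indices (made-up values).
training_fps = [{1, 5, 9, 42}, {1, 5, 77}, {3, 9, 42, 100}]
near_query = {1, 5, 9, 43}     # shares most bits with the first training compound
far_query = {500, 501, 502}    # shares no bits with any training compound

near = max_similarity_to_training(near_query, training_fps)
far = max_similarity_to_training(far_query, training_fps)
print(near, in_domain(near_query, training_fps))
print(far, in_domain(far_query, training_fps))
```

A score like this does not depend on how a 2-D projection was fitted, so it avoids the trap described above. The cost is that a similarity measure and a cut-off have to be chosen.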