I would like to study materials informatics. This time, I will use RDKit, an open-source cheminformatics toolkit that can convert organic compounds into feature vectors, to build a classifier that determines whether a given organic compound is an alcohol.
Environment: Python 3.6.5, scikit-learn 0.20.3, rdkit 2019.03.1.0
# Import the required libraries
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
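from sklearn.externals import joblib  # not used below; this path was removed in scikit-learn 0.23, where "import joblib" is used instead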
import numpy as np
# Prepare the data.
# Each compound is labeled as an alcohol (= 1) or anything else (= 0).
# The compounds are written in SMILES notation.
smiles = ['CO', 'C(=O)O', 'CCO', 'C=O', 'CCCO', 'CCC', 'C(C)CO', 'C(=O)', 'CC(=O)', 'CC(=O)', 'C', 'CC(=O)C', 'CCCCO', 'C(C)CO']
ans = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
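Before vectorizing, it can help to confirm that the labels line up with the SMILES strings and to look at the class balance. The following is a minimal sketch using only the standard library; it repeats the two lists from above so that it runs on its own. It compares SMILES strings literally, so different spellings of the same molecule (for example 'C=O' and 'C(=O)', both formaldehyde) are not detected as duplicates.

```python
from collections import Counter

# Same data as in the article, repeated so this snippet is self-contained.
smiles = ['CO', 'C(=O)O', 'CCO', 'C=O', 'CCCO', 'CCC', 'C(C)CO', 'C(=O)',
          'CC(=O)', 'CC(=O)', 'C', 'CC(=O)C', 'CCCCO', 'C(C)CO']
ans = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1]

# Every compound needs exactly one label.
assert len(smiles) == len(ans)

# How many alcohols (1) and non-alcohols (0) are there?
label_counts = Counter(ans)

# SMILES strings that appear more than once (literal string comparison only).
duplicates = sorted(s for s, n in Counter(smiles).items() if n > 1)

print(label_counts)
print(duplicates)
```

Repeated compounds matter because the same molecule can land in both the training data and the test data after a random split, which can make the test score look better than it should.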
# Convert each compound into a vector (Morgan fingerprint, radius 2, 1024 bits).
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
finger_print = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, 1024) for mol in mols]
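In the call above, 2 is the radius of the atom environments considered and 1024 is the length of the bit vector. Roughly speaking, each environment is hashed to an integer and that integer is folded into one of the 1024 positions. The sketch below illustrates only this folding step, assuming the identifiers are already computed; the identifiers are made up, and `fold_to_bits` is a hypothetical helper, not an RDKit function.

```python
def fold_to_bits(feature_ids, n_bits=1024):
    """Fold integer feature identifiers into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for feature_id in feature_ids:
        # Different identifiers can land on the same position (a collision).
        bits[feature_id % n_bits] = 1
    return bits

# Made-up identifiers for illustration only; 5 and 1029 collide when n_bits is 1024.
feature_ids = [5, 1029, 2000, 70000]
bits = fold_to_bits(feature_ids)
on_bits = [i for i, b in enumerate(bits) if b]
print(on_bits)
```

A shorter vector means more collisions, so the bit length is a trade-off between vector size and how well distinct substructures are kept apart.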
# Split the data into training data and test data
X = np.array(finger_print)
y = ans
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
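With only 14 compounds, a 30% test set holds about five of them, so a purely random split can easily leave very few alcohols on one side, and the result changes on every run. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this. The sketch below shows the idea with the standard library only; `stratified_split` is a hypothetical helper and does not reproduce scikit-learn's exact algorithm.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) with similar class ratios in both."""
    rng = random.Random(seed)  # fixed seed, so the split is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take at least one sample per class for the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

ans = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
train_idx, test_idx = stratified_split(ans)
print(train_idx)
print(test_idx)
```

The returned indices can be used to select the matching rows of `X` and entries of `y`.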
# Train a random forest, a machine learning algorithm
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
# Check the accuracy of the model on the training data and on the test data
print(forest.score(X_train, y_train))
print(forest.score(X_test, y_test))
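For a classifier, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below computes the same quantity by hand; the labels and predictions are hypothetical values chosen for illustration, not results from the model above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test labels and predictions (five samples, one mistake).
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
result = accuracy(y_true, y_pred)
print(result)
```

With a test set of about five compounds, one wrong prediction already changes the score by 0.2, so the test accuracy here is only a rough indication.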
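To summarize the flow: write the compounds as SMILES strings, convert them to Morgan fingerprints with RDKit, split the data into training and test sets, train a random forest, and check its accuracy.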
The Jupyter notebook file is uploaded to GitHub.