This is a write-up of my first semi-serious Python script, built on RDKit, a library that parses chemical compounds and returns a large number of numeric descriptors. I'm still working out how I want to organize shared functions and classes, and the script still has several restrictions carried over from the other day, but I think I'm starting to see a direction.
The script creates a CSV file from an SDF file, a file format that holds compound information. RDKit computes about 200 columns of descriptor values, and the script adds 10 more columns (names and so on), for about 210 columns in total. However, the code is not fully generalized yet, so it only works with SDF files that meet the conditions below. I plan to improve that later.
- Every compound in the SDF file must have the following properties: ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']
- Unusual compounds are not supported (multi-fragment structures, ions, etc.). Every compound must be something RDKit can process without a calculation error.
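The two restrictions above can be checked before running the conversion. The following is only a minimal sketch: `find_problem_records` and `check_sdf` are hypothetical helper names that are not part of the original script, and treating a `.` in the `SMILES` property as the sign of a multi-fragment structure is an assumption that will not catch every unusual compound.

```python
REQUIRED_PROPS = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']


def find_problem_records(mols, required_props=REQUIRED_PROPS):
    """Return (record_number, reason) pairs for records the script cannot handle.

    `mols` is any iterable of RDKit-style molecules (or None for records
    that failed to parse), for example a Chem.SDMolSupplier.
    """
    problems = []
    for i, mol in enumerate(mols, 1):
        if mol is None:
            # SDMolSupplier yields None when a record cannot be parsed
            problems.append((i, 'could not be parsed'))
            continue
        missing = [name for name in required_props if not mol.HasProp(name)]
        if missing:
            problems.append((i, 'missing properties: ' + ', '.join(missing)))
        elif '.' in mol.GetProp('SMILES'):
            # Assumption: a '.' in the SMILES property marks a multi-fragment structure
            problems.append((i, 'multiple fragments in SMILES'))
    return problems


def check_sdf(sdfpath):
    """Hypothetical helper: run the check on a real SDF file."""
    from rdkit import Chem  # imported here so the check above stays RDKit-free
    return find_problem_records(Chem.SDMolSupplier(sdfpath))
```

Calling `check_sdf('solubility.test.sdf')` would then return an empty list when every record satisfies these checks.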
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors


def get_basevalues(sampleid, mol):
    # Basic values taken from the molecule itself
    tmps = list()
    tmps.append(('SampleID', sampleid))
    tmps.append(('SampleName', mol.GetProp('_Name')))
    tmps.append(('Structure', Chem.MolToMolBlock(mol)))
    tmps.append(('Atoms', len(mol.GetAtoms())))
    tmps.append(('Bonds', len(mol.GetBonds())))
    names = [tmp[0] for tmp in tmps]
    values = [tmp[1] for tmp in tmps]
    return names, values


def get_exvalues(sampleid, mol):
    # External properties stored in the SDF record (currently a fixed set of 5)
    names = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']
    values = list()
    for name in names:
        values.append(mol.GetProp(name))
    return names, values


# Calculate descriptors from an SDF file and output them as CSV
# sdfpath: path of the input compound (SDF) file
# csvpath: path of the output CSV file
def ExportCSVFromSDF(sdfpath, csvpath):
    # Get the compounds
    mols = Chem.SDMolSupplier(sdfpath)

    # Prepare the RDKit descriptor calculator
    descLists = [desc_name[0] for desc_name in Descriptors._descList]
    desc_calc = MoleculeDescriptors.MolecularDescriptorCalculator(descLists)

    # IDs assigned as serial numbers
    sampleids = list()
    # Compound name, etc.
    values_base = list()
    # External parameters (currently a fixed set of 5)
    values_ex = list()

    # Get the values of each compound
    for i, mol in enumerate(mols, 1):
        sampleids.append(i)
        names_base, values = get_basevalues(i, mol)
        values_base.append(values)
        names_ex, values = get_exvalues(i, mol)
        values_ex.append(values)

    # Calculate the RDKit descriptors
    values_rdkit = [desc_calc.CalcDescriptors(mol) for mol in mols]

    # Convert to DataFrames
    df_base = pd.DataFrame(values_base, columns=names_base, index=sampleids)
    df_ex = pd.DataFrame(values_ex, columns=names_ex, index=sampleids)
    df_rdkit = pd.DataFrame(values_rdkit, columns=descLists, index=sampleids)

    # Combine everything
    df = pd.concat([df_base, df_ex, df_rdkit], axis=1)

    # Print for confirmation
    print(df)

    # Output to CSV
    df.to_csv(csvpath, index=False)


def main():
    sdfpath = 'solubility.test.sdf'
    csvpath = 'solubility.test.csv'
    ExportCSVFromSDF(sdfpath, csvpath)


if __name__ == '__main__':
    main()
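One possible way to lift the "fixed 5 properties" restriction is to read whatever property names each record actually carries instead of hard-coding them. This is a rough sketch and not the script's current behavior: `collect_properties` is a hypothetical name, and it relies on RDKit molecules exposing `GetPropNames()` and `GetProp()`.

```python
def collect_properties(mols):
    """Gather every SDF property found on the molecules.

    Returns (names, rows): `names` lists property names in first-seen order,
    and each row holds one molecule's values, with None where a property
    is absent from that record.
    """
    names = []
    records = []
    for mol in mols:
        # One dict per molecule, keyed by the property names on that record
        record = {name: mol.GetProp(name) for name in mol.GetPropNames()}
        for name in record:
            if name not in names:
                names.append(name)
        records.append(record)
    # Align every record to the full list of names
    rows = [[record.get(name) for name in names] for record in records]
    return names, rows
```

The result could stand in for `names_ex` and `values_ex`, for example `pd.DataFrame(rows, columns=names, index=sampleids)`, so files with a different set of properties would no longer raise an error in `GetProp`.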
SampleID | SampleName | Structure | Atoms | Bonds | ID | NAME | SOL | SMILES | SOL_classification | MaxEStateIndex | MinEStateIndex |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3-methylpentane | (mol block) | 6 | 5 | 5 | 3-methylpentane | -3.68 | CCC(C)CC | (A) low | 2.2777777777777777 | 0.9351851851851851 |
2 | 2,4-dimethylpentane | (mol block) | 7 | 6 | 10 | 2,4-dimethylpentane | -4.26 | CC(C)CC(C)C | (A) low | 2.263888888888889 | 0.8749999999999998 |
3 | ... | | | | | | | | | | |
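One thing to keep in mind about the output: the Structure column holds the multi-line text produced by `Chem.MolToMolBlock`, so a single compound's record spans several physical lines in the CSV file. The sketch below uses only the standard `csv` module and a shortened, made-up stand-in for a mol block (not real output) to show that a CSV-aware reader still returns one row per compound.

```python
import csv
import io

# Shortened stand-in for a mol block; real ones are longer but also multi-line
structure = 'ethane\n  demo\n\n  2  1  0  0  0  0  0  0  0  0999 V2000\nM  END'

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['SampleID', 'Structure', 'Atoms'])
writer.writerow([1, structure, 2])

# The Structure field is quoted, so one record spans several physical lines
physical_lines = buffer.getvalue().splitlines()

# A CSV-aware reader still returns one row per compound
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline='')))
```

Counting lines in a text editor will therefore not match the number of compounds, so it is safer to read the file back with a CSV parser such as `pd.read_csv`.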
Yup. I think I've gotten a little better at pandas. I'll keep extending this in various ways from here. Maybe...