What is this article?

This is the story of writing some slightly more serious Python around RDKit, a library that understands chemical compounds and returns a large number of numerical values (descriptors). I am still working out my common functions and how to organize them into classes, and some of the restrictions from the other day remain, but I think I can see a direction now.

So what does it do?

It creates a CSV file from an SDF file, a file format that holds compound information. RDKit produces 200 columns of numbers, and the script adds 10 more columns such as names, so it outputs 210 columns in total. However, the code is not fully generalized yet, so for now it only works with a specific file. I plan to improve that later.

Limitations

・ Each compound in the SDF file must have the following properties: ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']

・ Unusual compounds are not supported. (Separated fragments, ions, etc. Compounds that confuse RDKit cannot be calculated correctly.)
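The two limitations above could be checked up front instead of letting the script crash partway through. The following is only a minimal sketch: `missing_props` and `usable_mols` are hypothetical helper names that are not part of the script below. It relies on `Chem.SDMolSupplier` yielding `None` for records it cannot parse and on `mol.HasProp()` reporting whether a property is present.

```python
REQUIRED_PROPS = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']


def missing_props(mol, required=REQUIRED_PROPS):
    """Return the required property names that mol lacks (all of them if mol is None)."""
    if mol is None:  # SDMolSupplier yields None for records it cannot parse
        return list(required)
    return [name for name in required if not mol.HasProp(name)]


def usable_mols(sdfpath):
    """Yield (record_number, mol) for records that parse and carry every required property."""
    # Imported here so that missing_props can be used (and tested) without RDKit
    from rdkit import Chem
    for i, mol in enumerate(Chem.SDMolSupplier(sdfpath), 1):
        problems = missing_props(mol)
        if problems:
            print(f'skipping record {i}: missing {problems}')
            continue
        yield i, mol
```

Looping over `usable_mols(sdfpath)` instead of the raw supplier would skip the problem records and report which ones were dropped. Whether skipping is the right policy (rather than stopping) is a design choice I have assumed here.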

Code

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors


def get_basevalues(sampleid, mol):
    tmps = list()
    tmps.append(('SampleID', sampleid))
    tmps.append(('SampleName', mol.GetProp('_Name')))
    tmps.append(('Structure', Chem.MolToMolBlock(mol)))
    tmps.append(('Atoms', mol.GetNumAtoms()))
    tmps.append(('Bonds', mol.GetNumBonds()))
    names = [tmp[0] for tmp in tmps]
    values = [tmp[1] for tmp in tmps]
    return names, values


def get_exvalues(sampleid, mol):
    names = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']
    values = list()
    for name in names:
        values.append(mol.GetProp(name))
    return names, values


#Calculate descriptors from an SDF file and output them as CSV
# I: Compound (SDF) file path
# O: CSV file path
def ExportCSVFromSDF(sdfpath, csvpath):

    #Get compound
    mols = Chem.SDMolSupplier(sdfpath)

    #Preparing for RDKit descriptor calculation
    descLists = [desc_name[0] for desc_name in Descriptors._descList]
    desc_calc = MoleculeDescriptors.MolecularDescriptorCalculator(descLists)

    #Give ID with serial number
    sampleids = list()
    #Compound name, etc.
    values_base = list()
    #External properties (currently a fixed set of 5)
    values_ex = list()

    #Get the value of each compound
    for i, mol in enumerate(mols, 1):
        sampleids.append(i)
        names_base, values = get_basevalues(i, mol)
        values_base.append(values)
        names_ex, values = get_exvalues(i, mol)
        values_ex.append(values)

    #Calculate RDKit descriptor
    values_rdkit = [desc_calc.CalcDescriptors(mol) for mol in mols]

    #Convert to DataFrame
    df_base = pd.DataFrame(values_base, columns=names_base, index=sampleids)
    df_ex = pd.DataFrame(values_ex, columns=names_ex, index=sampleids)
    df_rdkit = pd.DataFrame(values_rdkit, columns=descLists, index=sampleids)

    #Combine all
    df = pd.concat([df_base, df_ex, df_rdkit], axis=1)

    #Print for confirmation
    print(df)

    #Output to CSV
    df.to_csv(csvpath, index=False)


def main():
    sdfpath = 'solubility.test.sdf'
    csvpath = 'solubility.test.csv'
    ExportCSVFromSDF(sdfpath, csvpath)


if __name__ == '__main__':
    main()
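
The fixed list of five property names in get_exvalues is what ties the script to one specific file. One possible way to loosen that, shown only as a sketch, is to pass the names in as an argument and fill in a default when a property is absent. `get_exvalues_generic` and its `default` parameter are hypothetical and not part of the code above. The sketch relies on `mol.HasProp()` to avoid the `KeyError` that `mol.GetProp()` raises for a missing property.

```python
def get_exvalues_generic(mol, names, default=''):
    """Return (names, values) for the requested SDF properties of mol.

    names can be any list of property names. A property that mol does
    not carry is filled with default instead of raising KeyError.
    """
    values = [mol.GetProp(name) if mol.HasProp(name) else default
              for name in names]
    return list(names), values
```

With this, a call such as get_exvalues_generic(mol, ['ID', 'SOL']) would return only those two columns, so the list of names could come from a setting or a command-line argument instead of being hard-coded.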

Output example

SampleID	SampleName	Atoms	Bonds	ID	NAME	SOL	SMILES	SOL_classification	MaxEStateIndex	MinEStateIndex
1	3-methylpentane	6	5	5	3-methylpentane	-3.68	CCC(C)CC	(A) low	2.2777777777777777	0.9351851851851851
2	2,4-dimethylpentane	7	6	10	2,4-dimethylpentane	-4.26	CC(C)CC(C)C	(A) low	2.263888888888889	0.8749999999999998
3	...
4
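
The Structure column written by the script holds a multi-line MOL block (that is what Chem.MolToMolBlock returns), which makes the raw CSV hard to read by eye. As a rough sketch, a small helper could show the first few rows without that column, using only the standard csv module. The name `preview_csv`, its parameters, and their defaults are assumptions for illustration.

```python
import csv


def preview_csv(csvpath, drop=('Structure',), max_rows=5, max_cols=11):
    """Return the header and first rows of a CSV, leaving out the columns in drop."""
    with open(csvpath, newline='') as f:
        reader = csv.reader(f)  # copes with quoted multi-line fields
        header = next(reader)
        # Indexes of the columns to keep, limited to the first max_cols of them
        keep = [i for i, name in enumerate(header) if name not in drop][:max_cols]
        rows = [[header[i] for i in keep]]
        for row in reader:
            if len(rows) > max_rows:  # header plus max_rows data rows collected
                break
            rows.append([row[i] for i in keep])
    return rows
```

Printing each returned row joined with tabs, for example with print('\t'.join(row)) over preview_csv('solubility.test.csv'), would give a compact tab-separated view of the file.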

Impressions

Yes, I think I have gotten a little better at pandas. From here I will probably extend this in various ways.