Introduction

I wrote a quick script that converts SDF, a file format for chemical compound data, to CSV.

Specification

- Read the properties in the SDF and output them as CSV columns.
- Compounds do not necessarily share the same set of properties; a property that a compound lacks is left empty.
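
To make the second point concrete, here is a minimal sketch of the intended output shape. It uses only the standard library and hypothetical records in place of real SDF properties, so it illustrates the behaviour rather than the script's actual method. The names `records`, `collect_columns`, and `to_csv_text` are made up for this example.

```python
import csv
import io

# Hypothetical records standing in for the properties of three compounds.
# Each compound carries a different subset of properties.
records = [
    {"ID": "5", "SOL": "-3.68"},
    {"ID": "10", "SMILES": "CC(C)CC(C)C"},
    {"ID": "15", "SOL": "-2.68", "SMILES": "CCCC=C"},
]


def collect_columns(rows):
    """Return every key seen in rows, in first-seen order."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns


def to_csv_text(rows):
    """Write rows as CSV text, leaving missing properties empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=collect_columns(rows),
        restval="",  # value written when a row lacks a column
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(records)
print(csv_text)
# ID,SOL,SMILES
# 5,-3.68,
# 10,,CC(C)CC(C)C
# 15,-2.68,CCCC=C
```

The header is the union of all property names, and each row has an empty field wherever that compound has no value.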

Source

`SDF2CSVConvert.py`


import pandas as pd
from rdkit import Chem
import argparse
from collections import defaultdict


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)
    parser.add_argument("-save_name", action='store_true', help="store header line as _Name")
    args = parser.parse_args()

    # First pass over the SDF: collect every property name in first-seen order
    sdf_sup = Chem.SDMolSupplier(args.input)
    Props = []
    if args.save_name:
        Props.append("_Name")

    for mol in sdf_sup:
        # SDMolSupplier yields None for records RDKit cannot parse; skip them
        if mol is None:
            continue
        for name in mol.GetPropNames():
            if name not in Props:
                Props.append(name)

    # Dictionary to store data (one list of values per property)
    param_dict = defaultdict(list)

    # Second pass over the SDF: the supplier is created again to read each compound's values
    sdf_sup = Chem.SDMolSupplier(args.input)
    for mol in sdf_sup:
        if mol is None:
            continue
        # Store the value of each property, or None if the compound lacks it
        for name in Props:
            if mol.HasProp(name):
                param_dict[name].append(mol.GetProp(name))
            else:
                param_dict[name].append(None)

    #Convert at once with pandas
    df = pd.DataFrame(data=param_dict)
    df.to_csv(args.output, index=False)


if __name__ == "__main__":
    main()
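
The command-line interface can be tried without an SDF file at hand. The sketch below copies the three `add_argument` calls from the script into a helper (the name `build_parser` is an assumption for this example) and parses an explicit argument list; the file names `solubility.sdf` and `solubility.csv` are placeholders.

```python
import argparse


def build_parser():
    """Build the same three options that the script defines in main()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)
    parser.add_argument("-save_name", action="store_true",
                        help="store header line as _Name")
    return parser


# Placeholder file names; passing a list avoids touching sys.argv.
args = build_parser().parse_args(
    ["-input", "solubility.sdf", "-output", "solubility.csv", "-save_name"]
)
print(args.input, args.output, args.save_name)

# Without the flag, save_name falls back to False.
args_default = build_parser().parse_args(["-input", "a.sdf", "-output", "a.csv"])
print(args_default.save_name)
```

On the command line this corresponds to something like `python SDF2CSVConvert.py -input solubility.sdf -output solubility.csv -save_name`.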

Commentary

The SDF is read twice. The first pass collects the property names that occur across all compounds. The second pass reads each compound's value for every property; if a compound lacks a property, None is stored instead, which becomes an empty field in the CSV. Finally, the dictionary holding the values is passed to pandas and written out as CSV. With -save_name, the first line of each SDF record (the molecule title) is also saved under the property name "_Name". See the source for the other arguments.
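
The last step can be checked in isolation. The following sketch builds a small dictionary of lists by hand (hypothetical values, not read from an SDF) and hands it to pandas the same way the script does, to show how None ends up in the CSV.

```python
import pandas as pd

# Hypothetical stand-in for param_dict: one list per property,
# with None where a compound has no value for that property.
param_dict = {
    "ID": ["5", "10"],
    "SOL": ["-3.68", None],
}

df = pd.DataFrame(data=param_dict)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The second data row is written as `10,` because pandas writes missing values as empty strings by default.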

Output example

Converting the solubility dataset that ships with RDKit, with -save_name enabled, gives output like this.

_Name,ID,NAME,SOL,SMILES,SOL_classification
3-methylpentane,5,3-methylpentane,-3.68,CCC(C)CC,(A) low
"2,4-dimethylpentane",10,"2,4-dimethylpentane",-4.26,CC(C)CC(C)C,(A) low
1-pentene,15,1-pentene,-2.68,CCCC=C,(B) medium
cyclohexene,20,cyclohexene,-2.59,C1CC=CCC1,(B) medium
"1,4-pentadiene",25,"1,4-pentadiene",-2.09,C=CCC=C,(B) medium
cycloheptatriene,30,cycloheptatriene,-2.15,C1=CC=CC=CC1,(B) medium
1-octyne,35,1-octyne,-3.66,CCCCCCC#C,(A) low
ethylbenzene,40,ethylbenzene,-2.77,c1ccccc1CC,(B) medium
"1,3,5-trimethylbenzene",45,"1,3,5-trimethylbenzene",-3.4,c1c(C)cc(C)cc1C,(A) low
indane,50,indane,-3.04,c(c(ccc1)CC2)(c1)C2,(A) low
isobutylbenzene,55,isobutylbenzene,-4.12,c1ccccc1CC(C)C,(A) low
n-hexylbenzene,60,n-hexylbenzene,-5.21,c1ccccc1CCCCCC,(A) low
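
Names that contain commas, such as 2,4-dimethylpentane, are quoted in the output, so a CSV parser is needed to read the file back correctly. As a small sketch, the standard library's `csv.DictReader` handles the first two data rows above as follows (the variable names `sample_lines` and `rows` are chosen for this example).

```python
import csv
import io

# The header and the first two data rows of the output shown above.
sample_lines = [
    "_Name,ID,NAME,SOL,SMILES,SOL_classification",
    "3-methylpentane,5,3-methylpentane,-3.68,CCC(C)CC,(A) low",
    '"2,4-dimethylpentane",10,"2,4-dimethylpentane",-4.26,CC(C)CC(C)C,(A) low',
]

rows = list(csv.DictReader(io.StringIO("\n".join(sample_lines))))

# The quoted name is returned as a single field, commas included.
print(rows[1]["_Name"])       # 2,4-dimethylpentane
# Every value is read back as a string; convert where a number is needed.
print(float(rows[1]["SOL"]))  # -4.26
```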
