Introduction

This article summarizes the compound preprocessing steps that are commonly used in Python.

Environment

This time I used the following libraries. See References for installation instructions. MolVS is a library specializing in compound preprocessing, and a port of it is also included in RDKit (rdkit.Chem.MolStandardize).

rdkit 2020.03.5
molvs 0.1.1

Preprocessing steps

RDKit: SanitizeMol

Reference: http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html

SanitizeMol performs Kekulization, valence checks, and the assignment of aromaticity, conjugation, hybridization, and so on.

When you create a mol object from SMILES with RDKit, sanitization appears to run by default. My impression is that you call it explicitly mainly after editing a mol object yourself.
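To see what sanitization actually adds, here is a minimal sketch that parses a SMILES with sanitize=False and then calls Chem.SanitizeMol by hand. The benzene and five-bond carbon inputs and the variable names are my own illustration, not taken from the RDKit documentation.

```python
from rdkit import Chem

# Kekule-form benzene, parsed WITHOUT sanitization (illustrative input).
mol = Chem.MolFromSmiles("C1=CC=CC=C1", sanitize=False)
aromatic_before = [atom.GetIsAromatic() for atom in mol.GetAtoms()]

# SanitizeMol modifies the molecule in place (valence check, aromaticity, ...).
Chem.SanitizeMol(mol)
aromatic_after = [atom.GetIsAromatic() for atom in mol.GetAtoms()]
smiles_after = Chem.MolToSmiles(mol)

print("any aromatic atom before:", any(aromatic_before))
print("all atoms aromatic after:", all(aromatic_after))
print("SMILES after:", smiles_after)

# A carbon with five bonds cannot pass the valence check. With
# catchErrors=True the failing step is returned as a flag instead of
# an exception being raised.
bad = Chem.MolFromSmiles("CC(C)(C)(C)C", sanitize=False)
failed_step = Chem.SanitizeMol(bad, catchErrors=True)
print("failed step:", failed_step)
```

This is the situation where an explicit call seems useful: the molecule was built or edited without the default sanitization, so the derived properties have to be recomputed.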

MolVS: Normalize

Reference: https://molvs.readthedocs.io/en/latest/guide/standardize.html

A series of transformations that fix common drawing errors and standardize functional groups. As far as I can tell, much of it is about correcting how charges are drawn.

Let's try it.

from rdkit import Chem
from molvs.normalize import Normalizer, Normalization

old_smiles = "[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1"
print("PREV:" + old_smiles)
old_mol = Chem.MolFromSmiles(old_smiles)

# Build a Normalizer that applies only one rule, given as (name, reaction SMARTS).
sulfone_rule = Normalization('Sulfone to S(=O)(=O)', '[S+2:1]([O-:2])([O-:3])>>[S+0:1](=[O-0:2])(=[O-0:3])')
normalizer = Normalizer(normalizations=[sulfone_rule])
new_mol = normalizer.normalize(old_mol)
new_smiles = Chem.MolToSmiles(new_mol)
print("NEW:" + new_smiles)

Above, only the normalization rule named "Sulfone to S(=O)(=O)" is applied. The result is shown below: the charges on the sulfur and oxygen atoms have changed. If you create a Normalizer with no arguments, all the normalization rules predefined in MolVS are applied.

PREV:[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1
NEW:O=C(O[Na])c1ccc(C[S](=O)=O)cc1
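For comparison, here is a rough sketch of normalization with the default rule set. Note the swap: it uses the RDKit port (rdkit.Chem.MolStandardize.rdMolStandardize) rather than the molvs package itself, and I am assuming its built-in rules behave like the MolVS ones for this case. The charge-separated dimethyl sulfone input is my own example.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Dimethyl sulfone drawn in charge-separated form (illustrative input).
mol = Chem.MolFromSmiles("C[S+2]([O-])([O-])C")

# No arguments: the normalizer uses the library's built-in rule list.
normalizer = rdMolStandardize.Normalizer()
normalized = normalizer.normalize(mol)

charged_before = [a.GetSymbol() for a in mol.GetAtoms() if a.GetFormalCharge() != 0]
charged_after = [a.GetSymbol() for a in normalized.GetAtoms() if a.GetFormalCharge() != 0]
normalized_smiles = Chem.MolToSmiles(normalized)

print("charged atoms before:", charged_before)
print("charged atoms after:", charged_after)
print("normalized:", normalized_smiles)
```

Because the default list contains many rules, it is probably worth checking a few representative molecules from your own data set before applying it in bulk.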

MolVS: TautomerCanonicalizer

Reference: https://molvs.readthedocs.io/en/latest/guide/tautomer.html

Tautomers are a set of molecules that readily interconvert through the movement of a hydrogen atom. Phenol is said to be a typical example.
(Example of phenol) https://en.wikipedia.org/wiki/File:Phenol_tautomers.svg

Let's try it.

from rdkit import Chem
from molvs.tautomer import TautomerCanonicalizer, TautomerTransform

# Restrict the canonicalizer to a single transform rule, given as (name, SMARTS).
tautomer_canonicalizer = TautomerCanonicalizer((
    TautomerTransform('1,7 aromatic heteroatom H shift r', '[#7,S,O,Se,Te,CX4;!H0]-[#6,#7X2]=[#6]-[#6,#7X2]=[#6,#7X2]-[#6,#7X2]=[NX2,S,O,Se,Te]'),
    ))

# Keto tautomer of phenol
mol = Chem.MolFromSmiles("O=C1CC=CC=C1")
print("prev:" + Chem.MolToSmiles(mol))
mol2 = tautomer_canonicalizer.canonicalize(mol)
print("after: " + Chem.MolToSmiles(mol2))

Above, only the tautomer transform defined by the rule '1,7 aromatic heteroatom H shift r' is applied, to the keto tautomer of phenol. As a result, phenol is produced, as shown below. If TautomerCanonicalizer is created without any arguments, all the tautomer transforms predefined in MolVS are applied.

prev:O=C1C=CC=CC1
after: Oc1ccccc1
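Since a tautomer is really a set of structures, it can help to list the whole set before picking the canonical one. The sketch below does this with the default rules. Again it uses the RDKit port (rdMolStandardize.TautomerEnumerator) instead of the molvs classes, which is my substitution, and the variable names are illustrative.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Keto tautomer of phenol, the same input as above.
mol = Chem.MolFromSmiles("O=C1CC=CC=C1")

# Default constructor: built-in transform rules and scoring.
enumerator = rdMolStandardize.TautomerEnumerator()

# All tautomers reachable with the built-in rules, as unique SMILES.
tautomer_smiles = sorted({Chem.MolToSmiles(t) for t in enumerator.Enumerate(mol)})

# The single tautomer chosen as the canonical representative.
canonical_smiles = Chem.MolToSmiles(enumerator.Canonicalize(mol))

for smi in tautomer_smiles:
    print("tautomer:", smi)
print("canonical:", canonical_smiles)
```

The canonical tautomer is a consistent choice for deduplication and is not necessarily the dominant form under real conditions.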

MolVS: LargestFragmentChooser

Reference: https://molvs.readthedocs.io/en/latest/api.html#molvs-fragment

Roughly speaking, when the input contains multiple molecules, the largest one is returned.

Let's try it.

from rdkit import Chem
from molvs.fragment import LargestFragmentChooser

fragment_chooser = LargestFragmentChooser()
# An organic cation with a bromide counterion, written as two "."-separated fragments
old_smiles = "O=S(=O)(Cc1[nH]c(-c2ccc(Cl)s2)c[s+]1)c1cccs1.[Br-]"
print("prev:" + old_smiles)
mol = Chem.MolFromSmiles(old_smiles)
mol2 = fragment_chooser(mol)
print("after:" + Chem.MolToSmiles(mol2))

Above, LargestFragmentChooser is applied to a salt made of a bromide ion and an organic cation. As shown below, the result is the molecule with the bromide ion removed.

prev:O=S(=O)(Cc1[nH]c(-c2ccc(Cl)s2)c[s+]1)c1cccs1.[Br-]
after:O=S(=O)(Cc1[nH]c(-c2ccc(Cl)s2)c[s+]1)c1cccs1
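To make "largest" more concrete, here is a hand-rolled approximation that uses only RDKit: split the molecule into fragments and keep the one with the most heavy atoms. This is not how MolVS implements it. The real chooser has extra logic (for example a prefer_organic option), so treat this only as an illustration of the idea.

```python
from rdkit import Chem

smiles = "O=S(=O)(Cc1[nH]c(-c2ccc(Cl)s2)c[s+]1)c1cccs1.[Br-]"
mol = Chem.MolFromSmiles(smiles)

# Each disconnected component becomes its own Mol object.
fragments = Chem.GetMolFrags(mol, asMols=True)
for frag in fragments:
    print(frag.GetNumAtoms(), Chem.MolToSmiles(frag))

# Simplified selection rule (assumption): most heavy atoms wins.
largest = max(fragments, key=lambda frag: frag.GetNumAtoms())
largest_smiles = Chem.MolToSmiles(largest)
print("largest:", largest_smiles)
```

Note that the positive charge on the cation is still there after the counterion is dropped, which matches the output above.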

MolVS: Uncharger

Reference: https://molvs.readthedocs.io/en/latest/api.html#molvs-charge

Uncharger attempts to neutralize ionized acids and bases in the molecule. Let's try it.

from rdkit import Chem
from molvs.charge import Uncharger

uncharger = Uncharger()
# Pyridinium: a protonated base
mol = Chem.MolFromSmiles("c1cccc[nH+]1")
print("prev:" + Chem.MolToSmiles(mol))
mol2 = uncharger(mol)
print("after:" + Chem.MolToSmiles(mol2))

The input is pyridinium, a protonated base. When Uncharger is applied, it is neutralized to pyridine, as shown below.

prev:c1cc[nH+]cc1
after:c1ccncc1
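The example above only covers a protonated base, so here is a small sketch for the acid case, plus a cation that cannot be neutralized. It uses the Uncharger from the RDKit port (rdMolStandardize) rather than molvs.charge, which is my substitution, and the acetate and tetramethylammonium inputs are my own examples.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

uncharger = rdMolStandardize.Uncharger()

# Ionized acid: acetate gains a proton and becomes acetic acid.
acetate = Chem.MolFromSmiles("CC(=O)[O-]")
acid_smiles = Chem.MolToSmiles(uncharger.uncharge(acetate))
print("acetate ->", acid_smiles)

# Quaternary ammonium: there is no N-H proton to remove, so the
# positive charge is expected to remain.
quaternary = Chem.MolFromSmiles("C[N+](C)(C)C")
quaternary_smiles = Chem.MolToSmiles(uncharger.uncharge(quaternary))
print("tetramethylammonium ->", quaternary_smiles)
```

In other words, neutralization works by adding or removing protons, so charges that do not come from protonation state are left alone.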

Conclusion

Steps that I could not cover this time include "MolVS: reionization" and "MolVS: Disconnect metals". I omitted them because I could not think of suitable target compounds. See References for details.

References

http://www.rdkit.org/
https://molvs.readthedocs.io/en/latest/index.html