I wanted to know how long it takes to search a target database (just an SDF file) for compounds similar to a query compound with RDKit, so I wrote a small command-line script.
The usual way to compute similarity is to generate a fingerprint for each compound and score each pair with the Tanimoto coefficient. A fingerprint encodes a chemical structure as a bit vector, and there are various methods for generating one. Here I used MACCS keys, a widely used fingerprint with a small number of bits.
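To make the scoring step concrete, here is a minimal sketch of the Tanimoto coefficient in plain Python, with each fingerprint represented as a set of on-bit positions. The function name `tanimoto` and the example bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is an assumption of this sketch; the script itself relies on RDKit's `DataStructs.TanimotoSimilarity`.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a = set(bits_a)
    b = set(bits_b)
    shared = len(a & b)                # bits set in both fingerprints
    union = len(a) + len(b) - shared   # bits set in at least one fingerprint
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return shared / union if union else 0.0


# Hypothetical on-bit positions for two small fingerprints.
fp_query = {1, 4, 7, 9}
fp_target = {1, 4, 8, 9, 12}
print(tanimoto(fp_query, fp_target))  # 3 shared bits / 6 bits in the union = 0.5
```

The score ranges from 0 (no bits in common) to 1 (identical bit patterns).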
import argparse

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-query", type=str, required=True)
    parser.add_argument("-target_db", type=str, required=True)
    args = parser.parse_args()

    # Read the query (MOL file)
    with open(args.query) as f:
        mol_block = f.read()
    query_mol = Chem.MolFromMolBlock(mol_block)
    if query_mol is None:
        parser.error("could not parse the query MOL file")

    # Load the target SDF
    target_sdf_sup = Chem.SDMolSupplier(args.target_db)

    # Fingerprint calculation (query)
    query_fp = AllChem.GetMACCSKeysFingerprint(query_mol)

    # Fingerprint calculation (targets); the supplier yields None for
    # records it cannot parse, so skip those but keep the SDF index
    target_fps = [
        (i, AllChem.GetMACCSKeysFingerprint(mol))
        for i, mol in enumerate(target_sdf_sup)
        if mol is not None
    ]

    # Tanimoto similarity between the query and each target
    for i, target_fp in target_fps:
        result = DataStructs.TanimotoSimilarity(query_fp, target_fp)
        print(i, result)


if __name__ == "__main__":
    main()
The usage looks like this. Thanks, argparse.
usage: StructureSimilaritySearch.py [-h] -query QUERY -target_db TARGET_DB

optional arguments:
  -h, --help            show this help message and exit
  -query QUERY(mol)
  -target_db TARGET_DB(sdf)
As usual, I searched against the 1024 compounds of the solubility training data that ships with RDKit, using an arbitrarily chosen query. The results came back in about 1 second. For a database of around 10,000 compounds, this approach looks fast enough as is.
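For a rough feel of how the comparison loop scales, the sketch below times a pure-Python stand-in on synthetic fingerprints. Everything in it is hypothetical: the helpers `timed`, `search`, and `random_fp`, the 166-bit width, and the 40 on-bits per fingerprint are illustrative choices, not measurements from the script above. It times only the similarity loop, not SDF parsing or fingerprint generation, which likely account for a good part of the real run time.

```python
import random
import time


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def search(query, targets):
    """Score every target fingerprint against the query."""
    return [tanimoto(query, target) for target in targets]


def random_fp(rng, n_bits=166, n_on=40):
    """Hypothetical fingerprint: n_on random on-bits out of n_bits."""
    return frozenset(rng.sample(range(n_bits), n_on))


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
query = random_fp(rng)
for n_targets in (1024, 10000):
    targets = [random_fp(rng) for _ in range(n_targets)]
    scores, elapsed = timed(search, query, targets)
    print(f"{n_targets} targets: {elapsed:.4f} s")
```

To time the real script end to end, the same `time.perf_counter()` pattern can be wrapped around the call to `main()`.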