Predicting the activity and physical properties of compounds on a computer has been done for a long time, but in recent years, with the growth of accumulated data and the development of deep learning technology, the phrase "AI drug discovery" has come to be heard everywhere.
So this time, I tried to experience AI drug discovery "for free" by reproducing for myself the contents of a paper on predicting therapeutic agents for the new coronavirus (SARS-CoV-2), "[Computational Models Identify Several FDA Approved or Experimental Drugs as Putative Agents Against SARS-CoV-2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7252448/)", and I would like to share the experience.
I chose this paper for the following three reasons.
First of all, I settled on a working policy for trying out the contents of the paper (or rather, to be honest, the policy below is what emerged through trial and error).
In fact, all of the collected data and analysis programs for this paper are published on GitHub (https://github.com/alvesvm/sars-cov-mpro/tree/master/mol-inf-2020), and if you use them, the contents of the paper can be reproduced without any trouble.
However, simply downloading all of that and hitting the return key repeatedly in Jupyter would end with a vague feeling of "I see", and nothing would stick.
It is by moving your own hands, getting stuck, investigating, and thinking, over and over, that you notice the gaps in your understanding and gradually come to understand what you could not understand before.
Therefore, we decided to proceed with the following policy.
- Collect the data from the public databases and **verify it myself**.
- Using the collected data, **write the prediction program myself**.
- **Compare the final results with the paper** and discuss them where possible.
- Refer to the GitHub repository for **checking details** of the paper, **getting hints when stuck**, and **comparing results**.
Next, I describe the outline of the paper and the public databases used this time.
The paper takes the main protease involved in viral replication of the new coronavirus as the drug's target molecule, and builds a model that predicts inhibitory activity against the main protease of the SARS coronavirus that spread in 2003, whose structure is very similar.
Furthermore, by applying the constructed prediction model to the approved / withdrawn / experimental and investigational drug data obtained from DrugBank, 41 drugs are finally presented as candidate therapeutic agents for the new coronavirus.
The inhibitory activity data against the SARS coronavirus used for training is collected from public databases such as ChEMBL and PDB.
Incidentally, the main protease targeted here is explained in detail in "Introduction to PDBj 242: Coronavirus Proteases" on PDBj's biopolymer learning portal site.
Next, I will briefly summarize the public databases used in this paper.

#### ChEMBL (https://www.ebi.ac.uk/chembl/)
ChEMBL is a database containing manually curated information on the bioactivity of drugs and small molecules that can be drug candidates. It is managed by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL).
The number of entries is as follows (counted from the number of records in the downloaded version 27 tables).

Item | Number of entries |
---|---|
Number of compounds | 2,444,828 |
Number of bioactivity values | 16,066,124 |
Number of assays | 1,221,361 |
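For reference, once the database has been loaded (see the steps later in this article), counts like these can be reproduced with simple SQL. A minimal sketch, assuming molecule_dictionary, activities, and assays are the relevant tables (the table choice is my assumption based on the schema documentation discussed below):

```sql
-- Rough reproduction of the listing counts (table choice is an assumption)
select count(*) from molecule_dictionary; -- number of compounds
select count(*) from activities;          -- number of bioactivity values
select count(*) from assays;              -- number of assays
```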
#### PDB (https://www.rcsb.org/)
PDB (Protein Data Bank) is a database containing 3D structural data of proteins and nucleic acids. It is managed by Worldwide Protein Data Bank.
The number of entries is as follows (as of September 12, 2020).

Item | Number of entries |
---|---|
Structures | 49,012 |
Structures of Human Sequences | 12,216 |
Nucleic Acid Containing Structures | 12,216 |
#### DrugBank (https://www.drugbank.ca/)
DrugBank is a database that contains information on drugs and drug targets. It is managed by The Metabolomics Innovation Centre (TMIC). The number of entries in the latest release (version 5.1.7, released on 2020-07-02) is as follows.

Item | Number of entries |
---|---|
Drugs | 13,715 |
Approved small-molecule drugs | 2,649 |
Approved biologics (proteins, peptides, vaccines, and allergens) | 1,405 |
Dietary supplements | 131 |
Experimental (discovery-stage) drugs | 6,417 |
Linked protein sequences | 5,234 |
Now, let us finally proceed to the work of testing the contents of the paper.
Here, we collect the experimental data on inhibitors of the SARS coronavirus main protease described in the paper (ChEMBL: 91 records, PDB: 22 records).
ChEMBL can be used from a web browser, but the data can also be downloaded, loaded into a local database, and queried with SQL. This time, we will install PostgreSQL and load and use the dump provided for PostgreSQL.
#### ① Download ChEMBL

Open ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/ to display the list of files for download, then download and extract chembl_27_postgresql.tar.gz.
#### ② Install PostgreSQL
Next, install PostgreSQL by referring to [Install PostgreSQL on Ubuntu 18.04](https://qiita.com/eighty8/items/82063beab09ab9e41692).
#### ③ Load ChEMBL
Log in to PostgreSQL and create a database.
```sql
pgdb=# create database chembl_27;
```

Then move to the directory where you extracted the data and run the following. It takes some time, but the data will be imported into PostgreSQL.

```
pg_restore --no-owner -U postgres -d chembl_27 chembl_27_postgresql.dmp
```
Next, SQL can be issued from pgAdmin or the like to retrieve the data. First, though, take a quick look at ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/schema_documentation.txt to see what kind of data is stored in which table. There are also SQL examples in "Schema Questions and SQL Examples", so pick out the tables that look important from there. The table below summarizes what schema_documentation.txt says about those tables.
table name | Description |
---|---|
compound_structures | Stores the various structural representations (Molfile, InChI, etc.) of compounds |
molecule_dictionary | Non-redundant list of compounds/biotherapeutics with their associated identifiers |
compound_records | Represents each compound as extracted from the scientific literature |
docs | Holds all scientific documents (papers or patents) from which assays were extracted |
activities | Stores the activity values or endpoints that are the results of assays recorded in the literature |
assays | List of the assays reported in each document; the same assay reported in different documents appears as separate assays |
target_dictionary | Dictionary of all curated targets in ChEMBL, including both protein and non-protein targets (e.g. organisms, tissues, cell lines) |
In ChEMBL, every piece of data is assigned a ChEMBL ID, so first try searching for the ChEMBL ID of the SARS coronavirus. To do so, throw the following SQL at the target_dictionary table examined above.
```sql
select * from target_dictionary where upper(pref_name) like '%SARS%'
```
Four rows like the following are returned.
Since we want to collect inhibition data for the main protease (also known as the 3C-like protease) this time, we can tell from the pref_name that the first row, CHEMBL3927, is the one we want.
Once the ChEMBL ID of the target molecule is known, go back to "Schema Questions and SQL Examples". The query near the bottom, "Retrieve compound activity details for all targets containing a protein of interest", returns the list of molecules that have activity against a target molecule, so it is enough to change the CHEMBL ID part of this SQL.
```sql
SELECT m.chembl_id AS compound_chembl_id,
s.canonical_smiles,
r.compound_key,
d.pubmed_id,
d.doi,
a.description,
act.standard_type,
act.standard_relation,
act.standard_value,
act.standard_units,
act.activity_comment
FROM compound_structures s,
molecule_dictionary m,
compound_records r,
docs d,
activities act,
assays a,
target_dictionary t
WHERE s.molregno = m.molregno
AND m.molregno = r.molregno
AND r.record_id = act.record_id
AND r.doc_id = d.doc_id
AND act.assay_id = a.assay_id
AND a.tid = t.tid
AND t.chembl_id = 'CHEMBL3927'
AND (act.standard_type ='IC50' or act.standard_type ='Ki')
order by compound_chembl_id;
```
As a caveat: since only inhibition data is wanted this time, IC50 or Ki is specified as the standard_type. Strictly speaking, IC50 and Ki are defined differently, but since the paper appeared to use both as training data, both are adopted here as well.
Export the result from pgAdmin to csv.
By the way, this kind of muddy, hands-on work is exactly what being a data scientist is like.
Data like the following is obtained as csv.
canonical_smiles is the data representing the structure of the compound, and the explanatory variables will need to be generated from it. standard_value is the value of the standard_type (hereafter called the inhibition value), and standard_units is its unit. Since standard_units is nM for every record this time, there is no need to convert values to align the units.
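Whether the units really are uniform can be checked quickly. A minimal sketch with pandas on the exported csv (the filename is a placeholder):

```python
import pandas as pd

# Sanity check (sketch): confirm all records share the same unit
df = pd.read_csv("chembl_export.csv")  # hypothetical name of the pgAdmin export
print(df["standard_units"].value_counts())  # expect a single value: nM
```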
The following processing was performed to create training data for the paper's prediction model.

- When standard_relation contains an inequality sign, a common response is to substitute a conservative numerical value for safety (for example, data reported as ">10" might be treated as roughly 20); this time, however, the inequality sign was simply ignored and standard_value was adopted as the inhibition value.
- When the same ChEMBL ID appeared more than once because of multiple experiments, the inhibition value was taken as the average of those measurements.
- For the final objective variable, the paper sets the threshold of the inhibition value at 10 uM, so values smaller than 10 uM were output as 1 (Active) and values larger than 10 uM as 0 (Inactive).

A fragment of the script that performs this processing is shown below.
prev_chembl.py
```python
import argparse
import csv
from collections import defaultdict

parser = argparse.ArgumentParser()
parser.add_argument("-input", type=str, required=True)
parser.add_argument("-output", type=str, required=True)
args = parser.parse_args()

rows_by_id = defaultdict(list)
# Read the csv and group the rows by ChEMBL ID
with open(args.input, "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        rows_by_id[row["compound_chembl_id"]].append(row)

# Value below which a compound is considered active (10 uM = 10000 nM)
threshold = 10000

# Process the values for each ChEMBL ID and write them out
with open(args.output, "w") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(["chembl_id", "canonical_smiles", "value", "outcome"])
    for id in rows_by_id:
        # Sum the inhibition values over the duplicate measurements
        total = 0.0
        for row in rows_by_id[id]:
            value = row["standard_value"]
            total += float(value)
        # Take the average
        mean = total / len(rows_by_id[id])
        print(f'{id},{mean}')
        outcome = 0
        if mean < threshold:
            outcome = 1
        writer.writerow([id, rows_by_id[id][0]["canonical_smiles"], mean, outcome])
```
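For reference, a hypothetical invocation would be `python prev_chembl.py -input chembl_export.csv -output chembl_mpro.csv` (both filenames are placeholders).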
Running this program yielded 91 training records, the same number as in the paper, although the Active/Inactive counts differed by one. Since the acquisition procedure was not described in the paper, **being able to almost exactly reproduce this data collection was the thing I was happiest about**.
Next, data is acquired from the PDB. The procedure is almost the same as in my earlier article "Collecting machine learning data by scraping from a bio-based public database". The only difference is that an item called macromolecule is also collected, in order to determine whether the entry is the main protease. A scrapy-free version of the collection program is reprinted below.
get_pdb.py
```python
import requests
import json
import time
import lxml.html
import argparse
import csv


def get_ligand(ligand_id):
    tmp_url = "https://www.rcsb.org" + ligand_id
    response = requests.get(tmp_url)
    if response.status_code != 200:
        return response.status_code, []
    html = response.text
    root = lxml.html.fromstring(html)
    #print(html)
    print(tmp_url)
    # Extract the SMILES, InChI and InChI key from the ligand page
    smiles = root.xpath("//tr[@id='chemicalIsomeric']/td[1]/text()")[0]
    inchi = root.xpath("//tr[@id='chemicalInChI']/td[1]/text()")[0]
    inchi_key = root.xpath("//tr[@id='chemicalInChIKey']/td[1]/text()")[0]
    return response.status_code, [smiles, inchi, inchi_key]


def get_structure(structure_id):
    structure_url = "https://www.rcsb.org/structure/"
    tmp_url = structure_url + structure_id
    print(tmp_url)
    html = requests.get(tmp_url).text
    root = lxml.html.fromstring(html)
    # Collect the macromolecule names so the main protease can be identified later
    macromolecule_trs = root.xpath("//tr[contains(@id,'macromolecule-entityId-')]")
    macromolecule = ""
    for tr in macromolecule_trs:
        print(tr.xpath("@id"))
        macromolecule += tr.xpath("td[position()=1]/text()")[0] + ","
    print(f"macro={macromolecule}")
    # Collect the binding affinity rows (ligands with IC50/Ki values)
    binding_trs = root.xpath("//tr[contains(@id,'binding_row')]")
    datas = []
    ids = []
    for tr in binding_trs:
        print(tr.xpath("@id"))
        d1 = tr.xpath("td[position()=1]/a/@href")
        if d1[0] in ids:
            continue
        ids.append(d1[0])
        status_code, values = get_ligand(d1[0])
        ligand_id = d1[0][(d1[0].rfind("/") + 1):]
        print(ligand_id)
        if status_code == 200:
            smiles, inchi, inchi_key = values
            #item = tr.xpath("a/td[position()=2]/text()")[0]
            item = tr.xpath("td[position()=2]/a/text()")[0]
            item = item.strip()
            value = tr.xpath("td[position()=2]/text()")[0]
            value = value.replace(":", "")
            value = value.replace(";", "")
            value = value.replace(" ", "")
            value = value.replace("\n", "")
            print(value)
            values = value.split(" ", 1)
            print(values)
            value = values[0].strip()
            unit = values[1].strip()
            datas.append([ligand_id, smiles, inchi, inchi_key, item, value, unit, macromolecule])
        time.sleep(1)
    return datas


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()
    base_url = "https://www.rcsb.org/search/data"
    # Search query: entries with IC50 or Ki binding data whose source organism is
    # "Severe acute respiratory syndrome-related coronavirus"
    payloads = {"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"or","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":0},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"IC50"},"node_id":1}],"label":"nested-attribute"},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":2},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"Ki"},"node_id":3}],"label":"nested-attribute"}]},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"operator":"exact_match","negation":False,"value":"Severe acute respiratory syndrome-related coronavirus","attribute":"rcsb_entity_source_organism.taxonomy_lineage.name"},"node_id":4}]}],"label":"text"}],"label":"query-builder"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":100},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]},"request_info":{"src":"ui","query_id":"e757fdfd5f9fb0efa272769c5966e3f4"}}
    print(json.dumps(payloads))
    response = requests.post(
        base_url,
        json.dumps(payloads),
        headers={'Content-Type': 'application/json'})
    datas = []
    for a in response.json()["result_set"]:
        structure_id = a["identifier"]
        datas.extend(get_structure(structure_id))
        time.sleep(1)
    with open(args.output, "w") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["ligand_id", "canonical_smiles", "inchi", "inchi_key", "item", "value", "unit", "macromolecule"])
        for data in datas:
            writer.writerow(data)


if __name__ == "__main__":
    main()
```
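For reference, a hypothetical invocation would be `python get_pdb.py -output pdb_raw.csv` (the filename is a placeholder).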
When this is executed, 21 records in the following format are obtained. Since the rightmost field holds the protein names, records that did not contain a phrase such as "3C-like proteinase" or "main proteinase" were excluded, and the remaining 10 records were finally used as training data (a sketch of this filtering follows the sample below).
```
ligand_id,canonical_smiles,inchi,inchi_key,item,value,unit,macromolecule
PMA,OC(=O)c1cc(C(O)=O)c(cc1C(O)=O)C(O)=O,"InChI=1S/C10H6O8/c11-7(12)3-1-4(8(13)14)6(10(17)18)2-5(3)9(15)16/h1-2H,(H,11,12)(H,13,14)(H,15,16)(H,17,18)",CYIDZMCFTVVTJO-UHFFFAOYSA-N,Ki,700,nM,"3C-like proteinase,"
TLD,Cc1ccc(S)c(S)c1,"InChI=1S/C7H8S2/c1-5-2-3-6(8)7(9)4-5/h2-4,8-9H,1H3",NIAAGQAEVGMHPM-UHFFFAOYSA-N,Ki,1400,nM,"Replicase polyprotein 1ab,"
ZU5,CC(C)C[C@H](NC(=O)[C@@H](NC(=O)OCc1ccccc1)[C@@H](C)OC(C)(C)C)C(=O)N[C@H](CCC(=O)C2CC2)C[C@@H]3CCNC3=O,"InChI=1S/C34H52N4O7/c1-21(2)18-27(31(41)36-26(14-15-28(39)24-12-13-24)19-25-16-17-35-30(25)40)37-32(42)29(22(3)45-34(4,5)6)38-33(43)44-20-23-10-8-7-9-11-23/h7-11,21-22,24-27,29H,12-20H2,1-6H3,(H,35,40)(H,36,41)(H,37,42)(H,38,43)/t22-,25+,26-,27+,29+/m1/s1",QIMPWBPEAHOISN-XSLDCGIXSA-N,Ki,99,nM,"3C-like proteinase,"
ZU3,CC(C)C[C@H](NC(=O)[C@H](CNC(=O)C(C)(C)C)NC(=O)OCc1ccccc1)C(=O)N[C@H](CCC(C)=O)C[C@@H]2CCNC2=O,"InChI=1S/C32H49N5O7/c1-20(2)16-25(28(40)35-24(13-12-21(3)38)17-23-14-15-33-27(23)39)36-29(41)26(18-34-30(42)32(4,5)6)37-31(43)44-19-22-10-8-7-9-11-22/h7-11,20,23-26H,12-19H2,1-6H3,(H,33,39)(H,34,42)(H,35,40)(H,36,41)(H,37,43)/t23-,24+,25-,26-/m0/s1",IEQRDAZPCPYZAJ-QYOOZWMWSA-N,Ki,38,nM,"3C-like proteinase,"
S89,OC[C@H](Cc1ccccc1)NC(=O)[C@H](Cc2ccccc2)NC(=O)/C=C/c3ccccc3,"InChI=1S/C27H28N2O3/c30-20-24(18-22-12-6-2-7-13-22)28-27(32)25(19-23-14-8-3-9-15-23)29-26(31)17-16-21-10-4-1-5-11-21/h1-17,24-25,30H,18-20H2,(H,28,32)(H,29,31)/b17-16+/t24-,25-/m0/s1",GEVQDXBVGFGWFA-KQRRRSJSSA-N,Ki,2240,nM,"3C-like proteinase,"
```
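A minimal sketch of the filtering mentioned above, assuming pandas and the hypothetical filenames used earlier (the phrase list follows the text):

```python
import pandas as pd

# Keep only records whose macromolecule field mentions the main protease (sketch)
df = pd.read_csv("pdb_raw.csv")
phrases = "3C-like proteinase|3C like proteinase|main proteinase"
mask = df["macromolecule"].str.contains(phrases, case=False, na=False)
df[mask].to_csv("pdb_mpro.csv", index=False)
```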
Unfortunately, this did not go as well as ChEMBL: the 22 records used in the paper could not be obtained. Moreover, some of the 10 records above were not among the paper's 22. I believe this is due to one, or a combination, of the following:

- There was an error in how the PDB search was specified.
- Only ligand data linked to the protein structures was collected, whereas the data may actually have been present in the linked papers even when not linked to a structure.

Although the cause remains unknown, I had no reason to think the data obtained by this procedure was wrong, so I decided to adopt these 10 records for the PDB side.
The data obtained in 5.1.1 and 5.1.2 were merged, and the final 101 records were output to csv as the training data. The script itself is omitted, but a minimal sketch of the merge follows.
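A sketch of that merge, assuming pandas, the hypothetical filenames used above, and that the PDB values are in nM (as in the sample):

```python
import pandas as pd

# ChEMBL side already has an outcome column; derive one for the PDB side
chembl = pd.read_csv("chembl_mpro.csv").rename(columns={"chembl_id": "id"})
pdb = pd.read_csv("pdb_mpro.csv").rename(columns={"ligand_id": "id"})
pdb["outcome"] = (pdb["value"].astype(float) < 10000).astype(int)  # 10 uM = 10000 nM

cols = ["id", "canonical_smiles", "value", "outcome"]
merged = pd.concat([chembl[cols], pdb[cols]], ignore_index=True)
merged.to_csv("training_data.csv", index=False)  # 101 records expected
```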
The paper applies the constructed prediction model to approved, withdrawn, experimental, and investigational drugs. The structure data of those drugs needed for this purpose is obtained from DrugBank. The paper's GitHub has the already formatted data, but **stubbornly, I will do it myself again**.
First, download the XML data from the link labeled "DOWNLOAD (XML)" at https://www.drugbank.ca/releases/latest, and then download the sdf from https://www.drugbank.ca/releases/latest#structures. (An account may need to be created in advance.)
Save these as "full database.xml" and "open structures.sdf".
The structure data is stored in the sdf, and the prediction model could be applied to that alone, but since the groups such as approved / withdrawn / experimental are held in the xml, the two are joined and saved as csv. A fragment of the script is shown below.
make_drugbank_data.py
```python
import argparse
import xml.etree.ElementTree as ET
import pandas as pd
from rdkit import Chem


def get_group(groups_node, ns):
    # Concatenate the group names (approved, withdrawn, experimental, ...) with commas
    ret = ""
    if groups_node is not None:
        for i, child in enumerate(groups_node.iter(f"{ns}group")):
            if i > 0:
                ret += ","
            ret += child.text
    return ret


def get_id(drug_node, ns):
    # Return the primary DrugBank ID of the drug
    for child in drug_node.iter(f"{ns}drugbank-id"):
        for attr in child.attrib:
            if attr == "primary":
                return child.text
    return None


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input_xml", type=str, required=True)
    parser.add_argument("-input_sdf", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    # Read the sdf and build DRUGBANK_ID -> name / SMILES dictionaries
    name_dict = {}
    smiles_dict = {}
    sdf_sup = Chem.SDMolSupplier(args.input_sdf)
    for i, mol in enumerate(sdf_sup):
        if not mol:
            continue
        if mol.HasProp("DRUGBANK_ID"):
            id = mol.GetProp("DRUGBANK_ID")
            if mol.HasProp("COMMON_NAME"):
                name = mol.GetProp("COMMON_NAME")
            smiles = Chem.MolToSmiles(mol)
            name_dict[id] = name
            smiles_dict[id] = smiles
            print(f"{i} {id} {name} {smiles}")

    # Read the xml and join the group information to each structure
    tree = ET.parse(args.input_xml)
    root = tree.getroot()
    ns = "{http://www.drugbank.ca}"
    ids = []
    datas = []
    for i, drug in enumerate(root.iter(f"{ns}drug")):
        id = get_id(drug, ns)
        category = get_group(drug.find(f"{ns}groups"), ns)
        if id and id in smiles_dict:
            print(f"{i}, {id}, {category}")
            ids.append(id)
            datas.append([name_dict[id], category, smiles_dict[id]])

    df = pd.DataFrame(datas, index=ids, columns=["name", "status", "smiles"])
    df.index.name = "id"
    df.to_csv(args.output)


if __name__ == "__main__":
    main()
```
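For reference, a hypothetical invocation would be `python make_drugbank_data.py -input_xml "full database.xml" -input_sdf "open structures.sdf" -output drugbank.csv` (the output filename is a placeholder).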
The data obtained by executing this script looks like the following. The structure data is stored as SMILES.
```
id,name,status,smiles
DB00006,Bivalirudin,"approved,investigational",CC[C@H](C)[C@H](NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(=O)O)NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(=N)N)NC(=O)[C@@H]1CCCN1C(=O)[C@H](N)Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CC(C)C)C(=O)O
DB00007,Leuprolide,"approved,investigational",CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](CC(C)C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@@H]1CCC(=O)N1
DB00014,Goserelin,approved,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](Cc1cnc[nH]1)NC(=O)[C@@H]1CCC(=O)N1)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N1CCC[C@H]1C(=O)NNC(N)=O
DB00027,Gramicidin D,approved,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](C(=O)N[C@H](C(=O)N[C@@H](C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)NCCO)C(C)C)C(C)C)C(C)C
DB00035,Desmopressin,approved,N=C(N)NCCC[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@@H]1CSSCCC(=O)N[C@@H](Cc2ccc(O)cc2)C(=O)N[C@@H](Cc2ccccc2)C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](CC(N)=O)C(=O)N1)C(=O)NCC(N)=O
```
Next, a prediction model is constructed using the training data.
Here, the explanatory variables for machine learning are generated from the csv file created in 5.1.3. Specifically, the SMILES in the "canonical_smiles" column of the csv file is read and converted to a mol object with RDKit, and then a 2048-bit fingerprint is generated with RDKit's AllChem.GetMorganFingerprintAsBitVect. Roughly speaking, a fingerprint is a collection of bits in which, if a compound contains a particular substructure, the corresponding bit is 1 and otherwise 0. This time a bit string of 2048 zeros and ones is generated per compound.
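As a small standalone illustration of this idea (not part of the pipeline), the following generates a fingerprint for phenol and shows which bits are set:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Morgan fingerprint of phenol: only the bits for its substructures are 1
mol = Chem.MolFromSmiles("c1ccccc1O")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
print(fp.GetNumOnBits())         # number of bits set to 1
print(list(fp.GetOnBits())[:5])  # indices of the first few set bits
```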
calc_descriptor.py
```python
from rdkit import Chem
from rdkit.Chem import AllChem
from molvs.normalize import Normalizer, Normalization
from molvs.tautomer import TAUTOMER_TRANSFORMS, TAUTOMER_SCORES, MAX_TAUTOMERS, TautomerCanonicalizer, TautomerEnumerator, TautomerTransform
from molvs.fragment import LargestFragmentChooser
from molvs.charge import Reionizer, Uncharger
import argparse
import csv
import pandas as pd
import numpy as np


def normalize(smiles):
    # Generate Mol
    mol = Chem.MolFromSmiles(smiles)
    # Uncharge
    uncharger = Uncharger()
    mol = uncharger.uncharge(mol)
    # Keep only the largest fragment (removes salts and solvents)
    fragmentChooser = LargestFragmentChooser()
    mol = fragmentChooser(mol)
    # Sanitize
    Chem.SanitizeMol(mol)
    # Normalize
    normalizer = Normalizer()
    mol = normalizer.normalize(mol)
    # Canonicalize the tautomer
    tautomerCanonicalizer = TautomerCanonicalizer()
    mol = tautomerCanonicalizer.canonicalize(mol)
    return Chem.MolToSmiles(mol)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    # Read the training data
    smiles_list = list()
    datas = []
    with open(args.input, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            org_smiles = row["canonical_smiles"]
            new_smiles = normalize(org_smiles)
            # Skip the record if the normalized SMILES has already been seen
            existFlag = False
            for tmp_smiles in smiles_list:
                if new_smiles == tmp_smiles:
                    print("exist!")
                    existFlag = True
                    break
            if not existFlag:
                smiles_list.append(new_smiles)
                mol = Chem.MolFromSmiles(new_smiles)
                # 2048-bit Morgan fingerprint with radius 3
                fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
                fp = pd.Series(np.asarray(fp)).values
                data = []
                data.append(row["id"])
                data.append(int(row["outcome"]))
                data.extend(fp)
                datas.append(data)

    columns = list()
    columns.append("id")
    columns.append("outcome")
    columns.extend(["Bit_" + str(i + 1) for i in range(2048)])
    df = pd.DataFrame(data=datas, columns=columns)
    df.set_index("id", inplace=True, drop=True)
    # Save
    df.to_csv(args.output)


if __name__ == "__main__":
    main()
```
The compounds were preprocessed and standardized, and any duplicates were removed just in case. One could also generate 3D coordinates for each compound, but the fingerprint used this time does not require them, so that step is skipped. After execution, data like the following is obtained (the middle bit columns are abbreviated here): the first column is the ID, the second column is the objective variable, and the third and subsequent columns are the explanatory variables from the fingerprint.
```
id,outcome,Bit_1,Bit_2,Bit_3,Bit_4,Bit_5,...,Bit_2046,Bit_2047,Bit_2048
CHEMBL365134,1,0,0,0,0,0,...,0,0,0
CHEMBL187579,1,0,0,0,0,0,...,0,0,0
```
Finally, the generated explanatory-variable (and objective-variable) data is read and the prediction model is created. Random forest was used, following the paper. Since I wanted to reproduce the paper as closely as possible in the hyperparameter grid search (GridSearchCV), the 5-fold cross-validation, and the evaluation metrics, I reused part of the GitHub source. The GitHub code also performs Y-randomization with scikit-learn's permutation_test_score, so that was reused as well.
create_model.py
import argparse
import csv
import pandas as pd
import numpy as np
import gzip
import _pickle as cPickle
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.model_selection import permutation_test_score, StratifiedKFold
def calc_metrics_derived_from_confusion_matrix(metrics_name, y_true, y_predict):
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_predict).ravel()
# PPV, precision
# TP / TP + FP
if metrics_name in ["PPV", "precision"]:
return tp / (tp + fp)
# NPV
# TN / TN + FN
if metrics_name in ["NPV"]:
return tn / (tn + fn)
# sensitivity, recall, TPR
# TP / TP + FN
if metrics_name in ["sensitivity", "recall", "TPR"]:
return tp / (tp + fn)
# specificity
# TN / TN + FP
if metrics_name in ["specificity"]:
return tn / (tn + fp)
def calc_metrics(metrics_name, y_true, y_predict):
if metrics_name == "accuracy":
return metrics.accuracy_score(y_true, y_predict)
if metrics_name == "ba":
return metrics.balanced_accuracy_score(y_true, y_predict)
if metrics_name == "roc_auc":
return metrics.roc_auc_score(y_true, y_predict)
if metrics_name == "kappa":
return metrics.cohen_kappa_score(y_true, y_predict)
if metrics_name == "mcc":
return metrics.matthews_corrcoef(y_true, y_predict)
if metrics_name == "precision":
return metrics.precision_score(y_true, y_predict)
if metrics_name == "recall":
return metrics.recall_score(y_true, y_predict)
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-input", type=str, required=True)
parser.add_argument("-output_model", type=str, required=True)
parser.add_argument("-output_report", type=str, required=True)
args = parser.parse_args()
df = pd.read_csv(args.input, index_col=0)
print(df.shape)
y_train = df['outcome'].to_numpy()
print(y_train)
X_train = df.iloc[:, 1:]
print(y_train.shape)
print(X_train.shape)
# Number of trees in random forest
n_estimators = [100, 250, 500, 750, 1000]
max_features = ['auto', 'sqrt']
criterion = ['gini', 'entropy']
class_weight = [None,'balanced',
{0:.9, 1:.1}, {0:.8, 1:.2}, {0:.7, 1:.3}, {0:.6, 1:.4},
{0:.4, 1:.6}, {0:.3, 1:.7}, {0:.2, 1:.8}, {0:.1, 1:.9}]
random_state = [24]
# Create the random grid
param_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'criterion': criterion,
'random_state': random_state,
'class_weight': class_weight}
# setup model building
rf = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=-1, cv=5, verbose=1)
rf.fit(X_train, y_train)
print()
print('Best params: %s' % rf.best_params_)
print('Score: %.2f' % rf.best_score_)
rf_best = RandomForestClassifier(**rf.best_params_, n_jobs=-1)
rf_best.fit(X_train, y_train)
#Save the model once created
with gzip.GzipFile(args.output_model, 'w') as f:
cPickle.dump(rf_best, f)
with gzip.GzipFile(args.output_model, 'r') as f:
rf_best = cPickle.load(f)
# Params
pred = []
ad = []
pred_proba = []
index = []
cross_val = StratifiedKFold(n_splits=5)
# Do 5-fold loop
for train_index, test_index in cross_val.split(X_train, y_train):
fold_model = rf_best.fit(X_train.iloc[train_index], y_train[train_index])
fold_pred = rf_best.predict(X_train.iloc[test_index])
fold_ad = rf_best.predict_proba(X_train.iloc[test_index])
pred.append(fold_pred)
ad.append(fold_ad)
pred_proba.append(fold_ad[:, 1])
index.append(test_index)
threshold_ad = 0.70
# Prepare results to export
fold_index = np.concatenate(index)
fold_pred = np.concatenate(pred)
fold_ad = np.concatenate(ad)
fold_pred_proba = np.concatenate(pred_proba)
fold_ad = (np.amax(fold_ad, axis=1) >= threshold_ad).astype(str)
five_fold_morgan = pd.DataFrame({'Prediction': fold_pred, 'AD': fold_ad, 'Proba': fold_pred_proba}, index=list(fold_index))
five_fold_morgan.AD[five_fold_morgan.AD == 'False'] = np.nan
five_fold_morgan.AD[five_fold_morgan.AD == 'True'] = five_fold_morgan.Prediction
five_fold_morgan.sort_index(inplace=True)
five_fold_morgan['y_train'] = pd.DataFrame(y_train)
# morgan stats
all_datas = []
datas = []
datas.append(calc_metrics("accuracy", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics("ba", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics("roc_auc", five_fold_morgan['y_train'], five_fold_morgan['Proba']))
datas.append(calc_metrics("kappa", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics("mcc", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics("precision", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics("recall", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics_derived_from_confusion_matrix("sensitivity", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics_derived_from_confusion_matrix("PPV", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics_derived_from_confusion_matrix("specificity", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(calc_metrics_derived_from_confusion_matrix("NPV", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
datas.append(1)
all_datas.append(datas)
# morgan AD stats
morgan_ad = five_fold_morgan.dropna(subset=['AD'])
morgan_ad['AD'] = morgan_ad['AD'].astype(int)
coverage_morgan_ad = len(morgan_ad['AD']) / len(five_fold_morgan['y_train'])
datas = []
datas.append(calc_metrics("accuracy", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics("ba", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics("roc_auc", morgan_ad['y_train'], morgan_ad['Proba']))
datas.append(calc_metrics("kappa", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics("mcc", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics("precision", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics("recall", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics_derived_from_confusion_matrix("sensitivity", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics_derived_from_confusion_matrix("PPV", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics_derived_from_confusion_matrix("specificity", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(calc_metrics_derived_from_confusion_matrix("NPV", morgan_ad['y_train'], morgan_ad['AD']))
datas.append(coverage_morgan_ad)
all_datas.append(datas)
# print stats
print('\033[1m' + '5-fold External Cross Validation Statistical Characteristics of the QSAR model developed with Morgan fingerprints' + '\n' + '\033[0m')
morgan_5f_stats = pd.DataFrame(all_datas,
                               columns=["accuracy", "ba", "roc_auc", "kappa", "mcc", "precision", "recall",
                                        "sensitivity", "PPV", "specificity", "NPV", "Coverage"])
morgan_5f_stats.to_csv(args.output_report)
print(morgan_5f_stats)
# Y-randomization (permutation test)
permutations = 20
score, permutation_scores, pvalue = permutation_test_score(
    rf_best, X_train, y_train,
    cv=5, scoring='balanced_accuracy',
    n_permutations=permutations,
    n_jobs=-1, verbose=1, random_state=24)
print('True score = ', score.round(2),
      '\nY-randomization = ', np.mean(permutation_scores).round(2),
      '\np-value = ', pvalue.round(4))
if __name__ == "__main__":
    main()
When this script is executed, the trained model is saved to the path specified by `output_model`, and the evaluation metrics of the constructed model are written to the path specified by `output_report`. The results are as follows.
accuracy | balanced_accuracy | roc_auc | kappa | mcc | precision | recall | sensitivity | PPV | specificity | NPV | Coverage |
---|---|---|---|---|---|---|---|---|---|---|---|
0.801980 | 0.713126 | 0.807507 | 0.490156 | 0.551662 | 0.9375 | 0.441176 | 0.441176 | 0.9375 | 0.985075 | 0.776471 | 1.000000 |
0.852941 | 0.736842 | 0.812030 | 0.564661 | 0.627215 | 1.0000 | 0.473684 | 0.473684 | 1.0000 | 1.000000 | 0.830508 | 0.673267 |
True score = 0.71
Y-randomization = 0.48
p-value = 0.0476
The first row of the table is the cross-validation score over all the training data. Recall and sensitivity are the same metric; both are shown to confirm that the scikit-learn implementation and the function I built myself give matching values.
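As a quick sanity check that they really are the same quantity, here is a minimal sketch (with illustrative arrays, not the actual data):

```python
# Sensitivity = TP / (TP + FN), which is exactly scikit-learn's recall_score.
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn), recall_score(y_true, y_pred))  # both print 0.75
```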
The second row follows the paper: it is the cross-validation score restricted to compounds whose prediction confidence (the value scikit-learn's `predict_proba` returns for the predicted class) is at or above the threshold of 0.7. Because it is based only on the more reliable predictions, the metrics improve slightly (at the cost of the coverage dropping to about 67%).
The last block is the result of Y-randomization. According to the scikit-learn documentation, "True score" is the score obtained without permuting the targets, "Y-randomization" here is the mean score over permuted targets, and the p-value is the probability that the true score could have been obtained by chance; the best possible p-value is `1 / (n_permutations + 1)` and the worst is 1.0. Here, 1 / (n_permutations + 1) = 1 / (20 + 1) ≈ 0.0476, the best possible value, so the balanced accuracy of 0.71 does not appear to have arisen by chance.
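Incidentally, what `permutation_test_score` does can be sketched roughly as follows (an illustrative reimplementation, not the paper's code; `X_train`, `y_train`, and `rf_best` are the objects from the script above):

```python
# Rough sketch of Y-randomization: shuffle the labels, rescore, compare.
import numpy as np
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(24)
n_permutations = 20
perm_scores = [cross_val_score(rf_best, X_train, rng.permutation(y_train),
                               cv=5, scoring='balanced_accuracy').mean()
               for _ in range(n_permutations)]
true_score = cross_val_score(rf_best, X_train, y_train,
                             cv=5, scoring='balanced_accuracy').mean()
# p-value with +1 smoothing, as scikit-learn computes it
pvalue = (np.sum(np.array(perm_scores) >= true_score) + 1) / (n_permutations + 1)
print(round(true_score, 2), round(np.mean(perm_scores), 2), round(pvalue, 4))
```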
Next, the following script (again, only a fragment is shown) runs the prediction, given the model created in 5.3, the training data (101 compounds) extracted from ChEMBL and PDB, and the virtual screening data (10,750 compounds) extracted from DrugBank.
predict_drugbank.py
import argparse
import csv
import gzip
import _pickle as cPickle

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem

# note: normalize() is a SMILES standardization helper defined outside this fragment
parser = argparse.ArgumentParser()
parser.add_argument("-input_train", type=str, required=True)
parser.add_argument("-input_predict", type=str, required=True)
parser.add_argument("-input_model", type=str, required=True)
parser.add_argument("-output_csv", type=str, required=True)
parser.add_argument("-output_report", type=str, required=True)
args = parser.parse_args()
# Read the training data (used for duplicate checking)
trains = list()
with open(args.input_train, "r") as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        smiles = row["canonical_smiles"]
        new_smiles = normalize(smiles)
        trains.append((row["id"], new_smiles))
# Read the prediction data and compute fingerprints
smiles_list = list()
datas = []
datas_other = []
with open(args.input_predict, "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["id"])
        smiles = row["smiles"]
        new_smiles = normalize(smiles)
        dup_pred = None
        dup_train = None
        if new_smiles is None or Chem.MolFromSmiles(new_smiles) is None:
            print(f"error! {row['id']}")
            new_smiles = smiles
        # Check for a duplicate structure within the prediction data
        for db_id, tmp_smiles in smiles_list:
            if new_smiles == tmp_smiles:
                dup_pred = db_id
                print(f"{row['id']}: same structure exists in predict data! - {db_id}, {tmp_smiles}")
                break
        # Check for a duplicate structure in the training data
        for db_id, tmp_smiles in trains:
            if new_smiles == tmp_smiles:
                dup_train = db_id
                print(f"{row['id']}: same structure exists in train data! - {db_id}, {tmp_smiles}")
                break
        smiles_list.append((row["id"], new_smiles))
        mol = Chem.MolFromSmiles(new_smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
        fp = pd.Series(np.asarray(fp)).values
        data = []
        data.append(row["id"])
        data.extend(fp)
        datas.append(data)
        datas_other.append((row["id"], row["name"], row["status"], dup_pred, dup_train))
columns = list()
columns.append("id")
columns.extend(["Bit_" + str(i+1) for i in range(2048)])
df = pd.DataFrame(data=datas, columns=columns)
df.set_index("id", inplace=True, drop=True)
df_other = pd.DataFrame(data=datas_other, columns=["id", "name", "status", "dup_predict", "dup_train"])
df_other.set_index("id", inplace=True, drop=True)
# Save the fingerprints once
df.to_csv(args.output_csv)
X_vs = df.iloc[:, 0:]
# Load the trained model
with gzip.GzipFile(args.input_model, 'r') as f:
    model = cPickle.load(f)
ad_threshold = 0.70
y_pred = model.predict(X_vs)
confidence = model.predict_proba(X_vs)
confidence = np.amax(confidence, axis=1).round(2)
ad = confidence >= ad_threshold
pred = pd.DataFrame({'Prediction': y_pred, 'AD': ad, 'Confidence': confidence})
pred.loc[pred.AD == False, 'AD'] = np.nan
pred.loc[pred.AD == True, 'AD'] = pred.Prediction.astype(int)
pred_ad = pred.dropna().astype(int)
coverage_ad = len(pred_ad) * 100 / len(pred)
print('VS pred: %s' % pred.Prediction)
print('VS pred AD: %s' % pred_ad.Prediction)
print('Coverage of AD: %.2f%%' % coverage_ad)
pred.index = X_vs.index
predictions = pd.concat([df_other, pred], axis=1)
for col in ['Prediction', 'AD']:
    predictions[col].replace(0, 'Inactive', inplace=True)
    predictions[col].replace(1, 'Active', inplace=True)
print(predictions.head())
predictions.to_csv(args.output_report)
The computed explanatory variables (fingerprints) of the prediction data are written to the path given by `output_csv`. The file given by `output_report` contains, for each DrugBank drug, its DrugBank ID, common name, status (marketed / withdrawn / experimental / investigational), the prediction result, the prediction result restricted to cases whose confidence exceeds the threshold, and the prediction confidence itself. `output_report` also records, for each compound, the ID of any duplicate structure found within the prediction data and within the training data: there were 321 duplicates within the prediction data, and 6 compounds overlapped with the training data.
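For example, the compounds predicted Active within the applicability domain can be pulled out of the report like this (a minimal sketch; the file name is illustrative, and the column names follow the script above):

```python
# Minimal sketch: extract Active predictions inside the AD from the report.
import pandas as pd

rep = pd.read_csv("output_report.csv", index_col=0)  # file name is illustrative
hits = rep[(rep["Prediction"] == "Active") & (rep["AD"] == "Active")]
print(hits[["name", "status", "Confidence"]].head())
```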
The virtual screening results for DrugBank reported in the paper are published at https://github.com/alvesvm/sars-cov-mpro/tree/master/mol-inf-2020/datasets/screened_compounds (drugbank_hits_qsar_consensus.xlsx). I compared the DrugBank drugs listed there, and their results, with the prediction results produced in the previous section.
In the paper, prediction models were built from three descriptor sets: SiRMS, Dragon, and the RDKit Morgan fingerprints also used here, and a compound is judged Active in the final consensus when multiple models predict it Active. I made an overall comparison, plus individual comparisons against the SiRMS and Dragon models. I did not compare against the paper's Morgan model, because I noticed something that bothered me about its results on GitHub.
The paper reports 41 final candidates, but in the Excel file on GitHub the final number of Active compounds is 51. Of those 51, 19 were also predicted Active by the model created here.
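The overlap count itself can be computed roughly as follows (a sketch under assumptions: the file names, sheet layout, and the `DrugBank ID` column name are hypothetical):

```python
# Hypothetical sketch: count paper hits that my model also predicts Active.
import pandas as pd

paper = pd.read_excel("drugbank_hits_qsar_consensus.xlsx")   # 51 consensus Actives
mine = pd.read_csv("output_report.csv", index_col=0)         # my predictions
mine_active = set(mine[mine["Prediction"] == "Active"].index)
paper_active = set(paper["DrugBank ID"])                     # column name assumed
print(len(paper_active & mine_active))                       # 19 in my run
```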
The numbers of Active / Inactive predictions from the SiRMS model, the Dragon model, and my model are as follows.
MODEL | ACTIVE | INACTIVE | TOTAL |
---|---|---|---|
SiRMS | 309 | 9305 | 9614 |
DRAGON | 864 | 8750 | 9614 |
My model | 314 | 9300 | 9614 |
Next, I counted the number of Active predictions shared between each pair of models (SiRMS, Dragon, and mine).
MODEL | SiRMS | DRAGON | My model |
---|---|---|---|
SiRMS | - | 39 | 79 |
DRAGON | 39 | - | 35 |
My model | 79 | 35 | - |
Although DRAGON produces the most Actives, my model shares 79 Actives with SiRMS, more than double the 35 it shares with Dragon. This suggests that my model behaves more like the SiRMS model than like the Dragon model.
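Counting the pairwise overlaps is the same set operation repeated (a sketch; `active_ids` is a hypothetical dict mapping each model name to its set of Active DrugBank IDs):

```python
from itertools import combinations

# active_ids: hypothetical, e.g. {"SiRMS": {...}, "DRAGON": {...}, "My model": {...}}
for a, b in combinations(active_ids, 2):
    print(a, b, len(active_ids[a] & active_ids[b]))
```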
Although this was not done in the paper, I also visualized how the prediction data are distributed relative to the training data, because in machine learning the reliability of a prediction becomes questionable when the input is not similar to the training data. The approach: fit a dimensionality reduction model on the training data, then use it to project the prediction data into two dimensions and plot them. I tried PCA and UMAP as the dimensionality reduction methods, using the same Morgan fingerprints as the model. Blue points are training data and red points are prediction data; circles (〇) are Active and crosses (×) are Inactive. Where Active and Inactive points overlap in the training data, UMAP appears to separate them better.
The source of the visualization script is shown below. A PCA or UMAP model is fitted on the training data, and each prediction data point is transformed with that fitted model and displayed. Judging from the figure, the prediction data do not deviate greatly from the range of the training data, although this may be influenced by the fact that the dimensionality reduction was fitted on the training data alone.
view_ad.py
import argparse
import csv

import matplotlib.pyplot as plt
import umap
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

parser = argparse.ArgumentParser()
parser.add_argument("-train", type=str, required=True)
parser.add_argument("-predict", type=str)
parser.add_argument("-result", type=str)
parser.add_argument("-method", type=str, default="PCA", choices=["PCA", "UMAP"])
args = parser.parse_args()
# Load the training CSV and compute fingerprints
train_datas = []
train_datas_active = []
train_datas_inactive = []
with open(args.train, "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        smiles = row["canonical_smiles"]
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
        train_datas.append(fp)
        if int(row["outcome"]) == 1:
            train_datas_active.append(fp)
        else:
            train_datas_inactive.append(fp)
if args.predict and args.result:
    result_outcomes = []
    result_ads = []
    # Load the prediction results
    with open(args.result, "r", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            if row["Prediction"] == "Active":
                result_outcomes.append(1)
            else:
                result_outcomes.append(0)
            result_ads.append(row["Confidence"])
    # Load the DrugBank CSV and compute fingerprints
    predict_datas = []
    predict_datas_active = []
    predict_datas_inactive = []
    predict_ads = []
    with open(args.predict, "r") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            print(i)
            smiles = row["smiles"]
            mol = Chem.MolFromSmiles(smiles)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
            predict_datas.append(fp)
            if result_outcomes[i] == 1:
                predict_datas_active.append(fp)
            else:
                predict_datas_inactive.append(fp)
# Dimensionality reduction: fit on the training data only
model = None
if args.method == "PCA":
    model = PCA(n_components=2)
    model.fit(train_datas)
if args.method == "UMAP":
    model = umap.UMAP()
    model.fit(train_datas)
result_train = model.transform(train_datas)
result_train_active = model.transform(train_datas_active)
result_train_inactive = model.transform(train_datas_inactive)
plt.title(args.method)
#plt.scatter(result_train[:, 0], result_train[:, 1], c="blue", alpha=0.1, marker="o")
plt.scatter(result_train_active[:, 0], result_train_active[:, 1], c="blue", alpha=0.5, marker="o")
plt.scatter(result_train_inactive[:, 0], result_train_inactive[:, 1], c="blue", alpha=0.5, marker="x")
# Project the prediction data with the model fitted on the training data
if args.predict and args.result:
    result_predict = model.transform(predict_datas)
    result_predict_active = model.transform(predict_datas_active)
    result_predict_inactive = model.transform(predict_datas_inactive)
    #plt.scatter(result_predict[:, 0], result_predict[:, 1], c=result_ads, alpha=0.1, cmap='viridis_r')
    plt.scatter(result_predict_active[:, 0], result_predict_active[:, 1], c="red", alpha=0.1, marker="o")
    plt.scatter(result_predict_inactive[:, 0], result_predict_inactive[:, 1], c="red", alpha=0.1, marker="x")
plt.show()
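With hypothetical file names, the script can be run as, for example, `python view_ad.py -train all_train.csv -predict drugbank.csv -result output_report.csv -method UMAP`.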
Through the process of working through this paper, I gained the following. I think these experiences will be useful when analyzing my own data or reading other papers in the future.
――I was able to experience the whole pipeline in the drug discovery field, from data collection through preprocessing, model building, and model evaluation. In particular, I learned firsthand how much time is spent collecting and preprocessing data before a model can be built, and applying and evaluating the model afterwards.
――I was able to deepen my understanding of the public databases, including what they contain and how to collect data from them.
This time I worked through a paper based on conventional machine learning methods, but papers using deep learning, such as graph convolutional networks (GCN), are appearing one after another, and I would like to try one in the near future.
While searching for papers, I came across two other papers on coronavirus-related prediction models, noted below. Both provide public datasets, so they may be worth trying as well.
The first builds a prediction model for a target different from the one in this article, the papain-like protease (PLpro).
The second builds a prediction model for the main protease, the same target as in this article. The model is evaluated according to the OECD guidelines (https://www.oecd.org/chemicalsafety/risk-assessment/37849783.pdf), which may offer insights into how to evaluate predictive models for compounds.
The main libraries used in this article are RDKit, scikit-learn, pandas, NumPy, umap-learn, and Matplotlib.
The programs listed in this article are available at https://github.com/kimisyo/sars_trace_1.