Getting started with AI drug discovery for free, using papers and public databases

1. Introduction

Predicting the activity and physical properties of compounds on a computer has been done for some time, but in recent years, as accumulated data has grown and deep learning technology has advanced, the phrase "AI drug discovery" has become increasingly common.

So this time, I tried to experience AI drug discovery "for free" by working through the paper "Computational Models Identify Several FDA Approved or Experimental Drugs as Putative Agents Against SARS-CoV-2" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7252448/), which predicts therapeutic agents for the novel coronavirus (SARS-CoV-2), and I would like to share the experience here.

I chose this paper for the following three reasons.

  1. It uses public data, and the analysis procedure and programs are published on GitHub, so it is easy to reproduce.
  2. It uses conventional analysis methods (fingerprints, random forest), which are relatively easy to reproduce.
  3. Therapeutic drugs for the novel coronavirus are being developed all over the world, and I was interested in the topic.

2. Policy for this work

First, I worked out a policy for reproducing the contents of the paper (to be honest, the policy below emerged through trial and error).

Actually, all the data collected for this paper and the analysis programs are published on GitHub (https://github.com/alvesvm/sars-cov-mpro/tree/master/mol-inf-2020), and if you use them, the contents of the paper can be reproduced without any trouble.

However, simply downloading it and hitting the return key repeatedly in Jupyter would end with me nodding "I see" as if I understood, and nothing would stick.

It is by moving your own hands, getting stuck, investigating, and thinking, over and over, that you notice the gaps in your understanding and gradually come to grasp what you could not before.

Therefore, we decided to proceed with the following policy.

- **Try to collect the data myself** from the public databases.
- **Write my own prediction program** using the collected data.
- **Compare the final results with the paper** and discuss them where possible.
- Refer to the paper's GitHub for **checking details** of the paper, for **hints when stuck**, and for **comparing results**.

The next sections describe the outline of the paper and the public databases used this time.

3. Outline of the reference paper

The paper takes the main protease involved in viral replication of the novel coronavirus as the drug target, and builds a model that predicts inhibitory activity against the main protease of the SARS coronavirus, which caused the 2003 epidemic and whose protease structure is very similar.

Furthermore, by applying the constructed prediction model to the approved / withdrawn / experimental / investigational drug data obtained from DrugBank, the paper finally presents 41 drugs as candidate therapeutic agents for the novel coronavirus.

The inhibitory-activity data against the SARS coronavirus used for training is collected from public databases such as ChEMBL and PDB.

Incidentally, the main protease used as the target molecule is explained in detail in "242: Coronavirus Proteases" on PDBj's biopolymer learning portal site (Introduction to PDBj).

4. Overview of public database

Next, I briefly summarize the public databases used in this paper.

ChEMBL (https://www.ebi.ac.uk/chembl/)

ChEMBL is a database of manually curated bioactivity data for drugs and drug-candidate small molecules. It is maintained by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL).

The number of entries is as follows (calculated from the record counts of the downloaded version 27 tables).

| Item | Count |
| --- | --- |
| Compounds | 2,444,828 |
| Bioactivity values | 16,066,124 |
| Assays | 1,221,361 |

PDB (https://www.rcsb.org/)

PDB (Protein Data Bank) is a database of 3D structural data of proteins and nucleic acids. It is managed by the Worldwide Protein Data Bank.

The number of entries is as follows (as of September 12, 2020).

| Item | Count |
| --- | --- |
| Structures | 49,012 |
| Structures of Human Sequences | 12,216 |
| Nucleic Acid Containing Structures | 12,216 |

DrugBank (https://www.drugbank.ca/)


DrugBank is a database containing information on drugs and drug targets. It is managed by The Metabolomics Innovation Centre (TMIC). The number of entries in the latest release (version 5.1.7, released 2020-07-02) is as follows.

| Item | Count |
| --- | --- |
| Drugs | 13,715 |
| Approved small-molecule drugs | 2,649 |
| Approved biologics (proteins, peptides, vaccines, and allergens) | 1,405 |
| Dietary supplements | 131 |
| Experimental (discovery-stage) drugs | 6,417 |
| Linked protein sequences | 5,234 |

Now, let us finally proceed to the work of reproducing the contents of the paper.

5. Data preparation

5.1 Preparation of training data

Here, we try to collect the experimental data on inhibitors of the SARS coronavirus main protease described in the paper (91 entries from ChEMBL, 22 entries from PDB).

5.1.1 Acquisition of ChEMBL data

ChEMBL can be used from a web browser, but you can also download the data, load it into a local database, and query it with SQL. This time, we install PostgreSQL and load the PostgreSQL dump.

#### ① Download the ChEMBL data

Open ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/ to display the list of downloadable files, then download and extract chembl_27_postgresql.tar.gz.



#### ② Install PostgreSQL
 Next, install PostgreSQL by referring to [Install PostgreSQL on Ubuntu 18.04](https://qiita.com/eighty8/items/82063beab09ab9e41692).

#### ③ Load the ChEMBL data
 Log in to PostgreSQL and create a database.

```sql
pgdb=# create database chembl_27;
```

Then move to the directory where you extracted the data and run the following. It takes a while, but the data will be imported into PostgreSQL.

```
pg_restore --no-owner -U postgres -d chembl_27 chembl_27_postgresql.dmp
```

#### ④ Understanding the ChEMBL tables

Next, we issue SQL from pgAdmin or the like to get the data. First, to understand which table stores what kind of data, take a quick look at `ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/schema_documentation.txt`. There are also SQL examples in "Schema Questions and SQL Examples", so pick out the tables that look important from those. The table below summarizes what schema_documentation.txt says about the tables found this way.

| Table name | Description |
| --- | --- |
| compound_structures | Stores various structural representations of compounds (Molfile, InChI, etc.) |
| molecule_dictionary | Non-redundant list of compounds/biotherapeutics with associated identifiers |
| compound_records | Represents each compound as extracted from a scientific document |
| docs | Holds all the scientific documents (papers or patents) from which assays were extracted |
| activities | Stores activity values or endpoints, i.e. the assay results recorded in the documents |
| assays | Stores the list of assays reported in each document; the same assay reported in different documents appears as a separate assay |
| target_dictionary | Dictionary of all curated targets in ChEMBL, including both protein and non-protein targets (e.g., organisms, tissues, cell lines) |

#### ⑤ Issue SQL against the database to get the SARS inhibition data

Every record in ChEMBL is assigned a CHEMBL ID, so we first look up the CHEMBL ID of the SARS coronavirus by throwing the following SQL at the target_dictionary table examined in the previous section.

```sql
select * from target_dictionary where upper(pref_name) like '%SARS%';
```

Four targets are returned. Since this time we want to collect inhibition data for the main protease (also known as the 3C-like protease), we can see from pref_name that the ID we want is CHEMBL3927.

Once the CHEMBL ID of the target molecule is known, go back to "Schema Questions and SQL Examples". The query "Retrieve compound activity details for all targets containing a protein of interest" near the bottom returns the list of molecules with activity against a target molecule, so we only need to change the CHEMBL ID part of that SQL.

```sql
SELECT m.chembl_id AS compound_chembl_id,
       s.canonical_smiles,
       r.compound_key,
       d.pubmed_id,
       d.doi,
       a.description,
       act.standard_type,
       act.standard_relation,
       act.standard_value,
       act.standard_units,
       act.activity_comment
FROM   compound_structures s,
       molecule_dictionary m,
       compound_records r,
       docs d,
       activities act,
       assays a,
       target_dictionary t
WHERE  s.molregno          = m.molregno
       AND m.molregno      = r.molregno
       AND r.record_id     = act.record_id
       AND r.doc_id        = d.doc_id
       AND act.assay_id    = a.assay_id
       AND a.tid           = t.tid
       AND t.chembl_id     = 'CHEMBL3927'
       AND (act.standard_type = 'IC50' or act.standard_type = 'Ki')
ORDER BY compound_chembl_id;
```

One caveat: since only inhibition data is wanted this time, IC50 or Ki is specified as standard_type. Strictly speaking, IC50 and Ki are defined differently, but since the paper appeared to use both as training data, both are adopted here as well.

Export the result from pgAdmin to csv.

#### ⑥ Data organization

Incidentally, this is where the messy, data-scientist-like work begins.

The exported csv contains columns such as compound_chembl_id, canonical_smiles, standard_type, standard_value, and standard_units.

canonical_smiles represents the structure of the compound, so the explanatory variables must be generated from it. standard_value is the value of the standard_type (hereafter called the inhibition value), and standard_units is its unit. Since standard_units is nM for every row this time, there is no need to convert the values to a common unit; a quick check like the one below confirms this.
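A minimal sketch of such a check, assuming the export was saved as chembl_export.csv (the file name is hypothetical):

```python
# Verify that all rows share the same unit before skipping unit conversion.
import pandas as pd

df = pd.read_csv("chembl_export.csv")
print(df["standard_units"].value_counts())  # expect a single entry: nM
```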

The following processing was performed to create data for the prediction model of the paper.

- When standard_relation in the csv contains an inequality sign, the safe approach is often to adjust the number accordingly (for example, treating a value reported as "greater than 10" as some larger value such as 20); this time, however, the inequality sign was ignored and standard_value was adopted as the inhibition value as-is.
- When the same CHEMBL ID appeared multiple times because of multiple experiments, the inhibition value was taken as the average of those values.
- For the final objective variable, the paper sets the inhibition-value threshold at 10 uM, so values smaller than 10 uM were output as 1 (Active) and values of 10 uM or larger as 0 (Inactive).

A fragment of the script that performs these steps is shown below.

prev_chembl.py

```python
import argparse
import csv
from collections import defaultdict


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    # Read the csv and group the rows by CHEMBL ID
    rows_by_id = defaultdict(list)
    with open(args.input, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows_by_id[row["compound_chembl_id"]].append(row)

    # Value (in nM) below which a compound is considered active (10 uM)
    threshold = 10000

    # Process the values for each CHEMBL ID and write them out
    with open(args.output, "w") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["chembl_id", "canonical_smiles", "value", "outcome"])

        for id, rows in rows_by_id.items():
            # Average the inhibition value over duplicate entries
            mean = sum(float(row["standard_value"]) for row in rows) / len(rows)
            print(f"{id},{mean}")
            outcome = 1 if mean < threshold else 0
            writer.writerow([id, rows[0]["canonical_smiles"], mean, outcome])


if __name__ == "__main__":
    main()
```

Running this program yielded 91 training examples, the same number as the paper, although the Active/Inactive counts differed by one. Since the acquisition procedure was not described in the paper, **what pleased me most was being able to almost exactly reproduce the collection of this data**.

5.1.2 Data acquisition from PDB

Next, we get data from the PDB. The acquisition procedure is almost the same as in my earlier article "Collecting machine learning data by scraping from a bio-based public database". The only difference is that an item called macromolecule is also collected, in order to judge whether the entry is the main protease. A version of the collection program that does not use Scrapy is reprinted below.

get_pdb.py

```python
import requests
import json
import time
import lxml.html
import argparse
import csv


def get_ligand(ligand_id):
    tmp_url = "https://www.rcsb.org" + ligand_id

    response = requests.get(tmp_url)
    if response.status_code != 200:
        return response.status_code, []

    html = response.text
    root = lxml.html.fromstring(html)
    #print(html)
    print(tmp_url)

    smiles = root.xpath("//tr[@id='chemicalIsomeric']/td[1]/text()")[0]
    inchi = root.xpath("//tr[@id='chemicalInChI']/td[1]/text()")[0]
    inchi_key = root.xpath("//tr[@id='chemicalInChIKey']/td[1]/text()")[0]

    return response.status_code, [smiles, inchi, inchi_key]


def get_structure(structure_id):

    structure_url = "https://www.rcsb.org/structure/"
    tmp_url = structure_url + structure_id
    print(tmp_url)
    html = requests.get(tmp_url).text
    root = lxml.html.fromstring(html)

    macromolecule_trs = root.xpath("//tr[contains(@id,'macromolecule-entityId-')]")
    macromolecule = ""
    for tr in macromolecule_trs:
        print(tr.xpath("@id"))
        macromolecule += tr.xpath("td[position()=1]/text()")[0] + ","
        print(f"macro={macromolecule}")

    binding_trs = root.xpath("//tr[contains(@id,'binding_row')]")

    datas = []
    ids = []

    for tr in binding_trs:

        print(tr.xpath("@id"))
        d1 = tr.xpath("td[position()=1]/a/@href")
        if d1[0] in ids:
            continue

        ids.append(d1[0])
        status_code, values = get_ligand(d1[0])
        ligand_id = d1[0][(d1[0].rfind("/") + 1):]
        print(ligand_id)
        if status_code == 200:
            smiles, inchi, inchi_key = values
            #item = tr.xpath("a/td[position()=2]/text()")[0]
            item = tr.xpath("td[position()=2]/a/text()")[0]
            item = item.strip()
            value = tr.xpath("td[position()=2]/text()")[0]
            value = value.replace(":", "")
            value = value.replace(";", "")
            value = value.replace("&nbsp", "")
            value = value.replace("\n", "")
            print(value)
            values = value.split(" ", 1)
            print(values)
            value = values[0].strip()
            unit = values[1].strip()

            datas.append([ligand_id, smiles, inchi, inchi_key, item, value, unit, macromolecule])

        time.sleep(1)

    return datas


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    base_url = "https://www.rcsb.org/search/data"
    payloads = {"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"or","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":0},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"IC50"},"node_id":1}],"label":"nested-attribute"},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":2},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"Ki"},"node_id":3}],"label":"nested-attribute"}]},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"operator":"exact_match","negation":False,"value":"Severe acute respiratory syndrome-related coronavirus","attribute":"rcsb_entity_source_organism.taxonomy_lineage.name"},"node_id":4}]}],"label":"text"}],"label":"query-builder"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":100},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]},"request_info":{"src":"ui","query_id":"e757fdfd5f9fb0efa272769c5966e3f4"}}
    print(json.dumps(payloads))

    response = requests.post(
        base_url,
        json.dumps(payloads),
        headers={'Content-Type': 'application/json'})

    datas = []
    for a in response.json()["result_set"]:
        structure_id = a["identifier"]
        datas.extend(get_structure(structure_id))
        time.sleep(1)

    with open(args.output, "w") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["ligand_id", "canonical_smiles", "inchi", "inchi_key", "item", "value", "unit"])

        for data in datas:
            writer.writerow(data)


if __name__ == "__main__":
    main()
```

Running this yields 21 entries in the format below. Since the rightmost column contains the protein name, entries that did not contain phrases such as "3C-like proteinase" or "main proteinase" were excluded, and the remaining 10 entries were used as training data (a sketch of this filtering step follows the sample).

```
ligand_id,canonical_smiles,inchi,inchi_key,item,value,unit,macromolecule
PMA,OC(=O)c1cc(C(O)=O)c(cc1C(O)=O)C(O)=O,"InChI=1S/C10H6O8/c11-7(12)3-1-4(8(13)14)6(10(17)18)2-5(3)9(15)16/h1-2H,(H,11,12)(H,13,14)(H,15,16)(H,17,18)",CYIDZMCFTVVTJO-UHFFFAOYSA-N,Ki,700,nM,"3C-like proteinase,"
TLD,Cc1ccc(S)c(S)c1,"InChI=1S/C7H8S2/c1-5-2-3-6(8)7(9)4-5/h2-4,8-9H,1H3",NIAAGQAEVGMHPM-UHFFFAOYSA-N,Ki,1400,nM,"Replicase polyprotein 1ab,"
ZU5,CC(C)C[C@H](NC(=O)[C@@H](NC(=O)OCc1ccccc1)[C@@H](C)OC(C)(C)C)C(=O)N[C@H](CCC(=O)C2CC2)C[C@@H]3CCNC3=O,"InChI=1S/C34H52N4O7/c1-21(2)18-27(31(41)36-26(14-15-28(39)24-12-13-24)19-25-16-17-35-30(25)40)37-32(42)29(22(3)45-34(4,5)6)38-33(43)44-20-23-10-8-7-9-11-23/h7-11,21-22,24-27,29H,12-20H2,1-6H3,(H,35,40)(H,36,41)(H,37,42)(H,38,43)/t22-,25+,26-,27+,29+/m1/s1",QIMPWBPEAHOISN-XSLDCGIXSA-N,Ki,99,nM,"3C-like proteinase,"
ZU3,CC(C)C[C@H](NC(=O)[C@H](CNC(=O)C(C)(C)C)NC(=O)OCc1ccccc1)C(=O)N[C@H](CCC(C)=O)C[C@@H]2CCNC2=O,"InChI=1S/C32H49N5O7/c1-20(2)16-25(28(40)35-24(13-12-21(3)38)17-23-14-15-33-27(23)39)36-29(41)26(18-34-30(42)32(4,5)6)37-31(43)44-19-22-10-8-7-9-11-22/h7-11,20,23-26H,12-19H2,1-6H3,(H,33,39)(H,34,42)(H,35,40)(H,36,41)(H,37,43)/t23-,24+,25-,26-/m0/s1",IEQRDAZPCPYZAJ-QYOOZWMWSA-N,Ki,38,nM,"3C-like proteinase,"
S89,OC[C@H](Cc1ccccc1)NC(=O)[C@H](Cc2ccccc2)NC(=O)/C=C/c3ccccc3,"InChI=1S/C27H28N2O3/c30-20-24(18-22-12-6-2-7-13-22)28-27(32)25(19-23-14-8-3-9-15-23)29-26(31)17-16-21-10-4-1-5-11-21/h1-17,24-25,30H,18-20H2,(H,28,32)(H,29,31)/b17-16+/t24-,25-/m0/s1",GEVQDXBVGFGWFA-KQRRRSJSSA-N,Ki,2240,nM,"3C-like proteinase,"
```
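A minimal sketch of the filtering, assuming the output of get_pdb.py was saved as pdb_data.csv (the file name is hypothetical):

```python
# Keep only ligands whose macromolecule column mentions the main protease.
import pandas as pd

df = pd.read_csv("pdb_data.csv")
mask = df["macromolecule"].str.contains(
    "3C-like proteinase|main proteinase", case=False, na=False)
print(df[mask])  # 10 entries remained in my run
```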

Unfortunately, this did not go as smoothly as ChEMBL, and I could not obtain the 22 entries used in the paper. Moreover, some of the 10 entries above were not among the paper's 22. I believe this is due to one, or a combination, of the following:

- There was an error in how I specified the PDB search.
- I collected only the ligand data linked to protein structures, but the paper may also have used data that appears in the linked papers without being linked to a structure.

Although the cause is unknown, I had no reason to think that the data obtained by this procedure was wrong, so I decided to adopt these 10 entries for the PDB.

5.1.3 Merging ChEMBL and PDB data

The data obtained in 5.1.1 and 5.1.2 were merged, and the final 101 entries were output to csv as training data. The script is omitted, but a sketch of what the merge might look like is shown below.
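A minimal sketch of the merge, assuming the output file formats of the previous steps (all file names are hypothetical):

```python
# merge_data.py: a sketch of the omitted merging step.
import pandas as pd

chembl = pd.read_csv("chembl_training.csv")  # chembl_id, canonical_smiles, value, outcome
pdb = pd.read_csv("pdb_training.csv")        # ligand_id, canonical_smiles, ..., value, unit, ...

# Align the PDB data with the ChEMBL format: rename the ID column and derive
# the outcome with the same 10 uM (= 10000 nM) threshold used earlier.
pdb = pdb.rename(columns={"ligand_id": "id"})
pdb["outcome"] = (pdb["value"].astype(float) < 10000).astype(int)

chembl = chembl.rename(columns={"chembl_id": "id"})
cols = ["id", "canonical_smiles", "value", "outcome"]
merged = pd.concat([chembl[cols], pdb[cols]], ignore_index=True)
merged.to_csv("training_data.csv", index=False)  # 101 rows: 91 from ChEMBL + 10 from PDB
```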

5.2 Preparation of data for virtual screening

The paper applies the constructed prediction model to approved / withdrawn / experimental / investigational drugs. The structural data of these drugs, needed for that step, is obtained from DrugBank. The paper's GitHub already has the formatted data, but **I stubbornly try it myself** here as well.

5.2.1 Data acquisition from DrugBank

First, download the XML data from the link labeled "DOWNLOAD (XML)" at https://www.drugbank.ca/releases/latest.

Then download sdf from https://www.drugbank.ca/releases/latest#structures.

You may need to create an account in advance.

Save these files as "full database.xml" and "open structures.sdf".

5.2.2 DrugBank Data Formatting

The structural data is stored in the sdf, and the prediction model could be applied to that alone, but the drug status (approved / withdrawn / experimental, etc.) is in the xml, so the two are joined and saved to csv. A fragment of the script is shown below.

make_drugbank_data.py

```python
import argparse
import xml.etree.ElementTree as ET

import pandas as pd
from rdkit import Chem


def get_group(groups_node, ns):
    ret = ""
    if groups_node is not None:
        for i, child in enumerate(groups_node.iter(f"{ns}group")):
            if i > 0 :
                ret += ","
            ret += child.text

    return ret


def get_id(drug_node, ns):

    for i, child in enumerate(drug_node.iter(f"{ns}drugbank-id")):
        for attr in child.attrib:
            if attr == "primary":
                return child.text

    return None


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input_xml", type=str, required=True)
    parser.add_argument("-input_sdf", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    name_dict = {}
    smiles_dict = {}

    sdf_sup = Chem.SDMolSupplier(args.input_sdf)
    for i, mol in enumerate(sdf_sup):
        if not mol:
            continue
        if mol.HasProp("DRUGBANK_ID"):
            id = mol.GetProp("DRUGBANK_ID")
            # Fall back to an empty name when COMMON_NAME is missing
            name = mol.GetProp("COMMON_NAME") if mol.HasProp("COMMON_NAME") else ""
            smiles = Chem.MolToSmiles(mol)
            name_dict[id] = name
            smiles_dict[id] = smiles
            print(f"{i} {id} {name} {smiles}")

    tree = ET.parse(args.input_xml)
    root = tree.getroot()

    ns = "{http://www.drugbank.ca}"
    ids = []
    datas = []
    for i, drug in enumerate(root.iter(f"{ns}drug")):
        id = get_id(drug, ns)
        category = get_group(drug.find(f"{ns}groups"), ns)
        if id and id in smiles_dict:
            print(f"{i}, {id}, {category}")
            ids.append(id)
            datas.append([name_dict[id], category, smiles_dict[id]])

    df = pd.DataFrame(datas, index=ids, columns=["name", "status", "smiles"])
    df.to_csv(args.output)


if __name__ == "__main__":
    main()
```

The data obtained by executing this script looks like this. Structural data is stored as SMILES.

```
id,name,status,smiles
DB00006,Bivalirudin,"approved,investigational",CC[C@H](C)[C@H](NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CC(=O)O)NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(=N)N)NC(=O)[C@@H]1CCCN1C(=O)[C@H](N)Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CC(C)C)C(=O)O
DB00007,Leuprolide,"approved,investigational",CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](CC(C)C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@@H]1CCC(=O)N1
DB00014,Goserelin,approved,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](Cc1cnc[nH]1)NC(=O)[C@@H]1CCC(=O)N1)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N1CCC[C@H]1C(=O)NNC(N)=O
DB00027,Gramicidin D,approved,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](C(=O)N[C@H](C(=O)N[C@@H](C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)NCCO)C(C)C)C(C)C)C(C)C
DB00035,Desmopressin,approved,N=C(N)NCCC[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@@H]1CSSCCC(=O)N[C@@H](Cc2ccc(O)cc2)C(=O)N[C@@H](Cc2ccccc2)C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](CC(N)=O)C(=O)N1)C(=O)NCC(N)=O
```

5.3 Building a prediction model

Next, a prediction model is constructed using the training data.

5.3.1 Fingerprint generation from structural data

Here, the explanatory variables for machine learning are generated from the csv file created in 5.1.3. Specifically, the SMILES in the "canonical_smiles" column of the csv file is read and converted to a mol object using RDKit, and then RDKit's AllChem.GetMorganFingerprintAsBitVect is used to generate a 2048-bit fingerprint. Roughly speaking, a fingerprint is a bit vector in which a bit is 1 if the compound contains the corresponding substructure and 0 otherwise; here, a bit string of 2048 zeros and ones is generated. A minimal illustration follows, and then the full script.
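A minimal illustration of such a fingerprint, using aspirin as an arbitrary example molecule:

```python
# A 2048-bit Morgan fingerprint for a single molecule (aspirin).
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
print(fp.GetNumOnBits())         # how many of the 2048 bits are set to 1
print(list(fp.GetOnBits())[:5])  # indices of the first few set bits
```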

calc_descriptor.py

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from molvs.normalize import Normalizer, Normalization
from molvs.tautomer import TAUTOMER_TRANSFORMS, TAUTOMER_SCORES, MAX_TAUTOMERS, TautomerCanonicalizer, TautomerEnumerator, TautomerTransform
from molvs.fragment import LargestFragmentChooser
from molvs.charge import Reionizer, Uncharger
import argparse
import csv
import pandas as pd
import numpy as np


def normalize(smiles):
    # Generate Mol
    mol = Chem.MolFromSmiles(smiles)

    # Uncharge
    uncharger = Uncharger()
    mol = uncharger.uncharge(mol)

    # LargestFragmentChooser
    fragmentChooser = LargestFragmentChooser()
    mol = fragmentChooser(mol)

    # Sanitize
    Chem.SanitizeMol(mol)

    # Normalize
    normalizer = Normalizer()
    mol = normalizer.normalize(mol)

    tautomerCanonicalizer = TautomerCanonicalizer()
    mol = tautomerCanonicalizer.canonicalize(mol)

    return Chem.MolToSmiles(mol)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-output", type=str, required=True)

    args = parser.parse_args()

    #Reading training data
    smiles_list = list()
    datas = []

    with open(args.input, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            org_smiles = row["canonical_smiles"]
            new_smiles = normalize(org_smiles)

            existFlag = False
            for tmp_smiles in smiles_list:
                if new_smiles == tmp_smiles:
                    print("exist!")
                    existFlag = True
                    break

            if not existFlag:
                smiles_list.append(new_smiles)
                mol = Chem.MolFromSmiles(new_smiles)
                fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
                fp = pd.Series(np.asarray(fp)).values

                data = []
                data.append(row["id"])
                data.append(int(row["outcome"]))
                data.extend(fp)
                datas.append(data)

        columns = list()
        columns.append("id")
        columns.append("outcome")
        columns.extend(["Bit_" + str(i + 1) for i in range(2048)])
        df = pd.DataFrame(data=datas, columns=columns)
        df.set_index("id", inplace=True, drop=True)

        #Save
        df.to_csv(args.output)

if __name__ == "__main__":
    main()
```

The compounds were preprocessed and standardized, and duplicates of the same compound were removed just in case. One could also generate 3D coordinates for each compound, but the fingerprint used this time does not need 3D coordinates, so that step is skipped. After execution, data like the following is obtained: the first column is the ID, the second column is the objective variable, and the third and subsequent columns are the fingerprint explanatory variables.

```
id,outcome,Bit_1,Bit_2,Bit_3,...,Bit_2046,Bit_2047,Bit_2048
CHEMBL365134,1,0,0,0,...,0,0,0
CHEMBL187579,1,0,0,0,...,0,0,0
```

(The header and rows are abbreviated here; each row actually contains all 2048 fingerprint bits.)

5.3.2 Building a predictive model

Finally, the generated explanatory-variable (and objective-variable) data is read and a prediction model is created. A random forest was used, following the paper. Since I wanted to reproduce the paper as closely as possible in terms of hyperparameter search with grid search, 5-fold cross-validation, and evaluation metrics, I reused part of the GitHub source. The GitHub code also performs Y-randomization with scikit-learn's permutation_test_score method, so that was reused as well.
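As a point of reference, here is a self-contained sketch of what a Y-randomization check with permutation_test_score looks like. Random data stands in for the fingerprint table; in the actual script the tuned model and the real training data are used instead:

```python
# Y-randomization: repeatedly shuffle the labels, refit, and compare scores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 64))  # stand-in for 2048-bit fingerprints
y = rng.integers(0, 2, size=100)        # stand-in for the Active/Inactive labels

score, perm_scores, p_value = permutation_test_score(
    RandomForestClassifier(n_estimators=100, random_state=24),
    X, y,
    cv=StratifiedKFold(n_splits=5),
    n_permutations=20,  # each permutation shuffles y and refits the model
    n_jobs=-1,
)
# A real signal shows up as a score well above the permuted scores (small p-value);
# on random data like this, the score and permuted scores should be similar.
print(f"score={score:.2f}, permuted mean={perm_scores.mean():.2f}, p={p_value:.3f}")
```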

create_model.py

```python
import argparse
import csv
import pandas as pd
import numpy as np
import gzip
import _pickle as cPickle

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.model_selection import permutation_test_score, StratifiedKFold


def calc_metrics_derived_from_confusion_matrix(metrics_name, y_true, y_predict):
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_predict).ravel()
    #  PPV, precision
    #  TP / TP + FP
    if metrics_name in ["PPV", "precision"]:
        return tp / (tp + fp)

    #  NPV
    #  TN / TN + FN
    if metrics_name in ["NPV"]:
        return tn / (tn + fn)

    # sensitivity, recall, TPR
    #  TP / TP + FN
    if metrics_name in ["sensitivity", "recall", "TPR"]:
        return tp / (tp + fn)

    # specificity
    #  TN / TN + FP
    if metrics_name in ["specificity"]:
        return tn / (tn + fp)


def calc_metrics(metrics_name, y_true, y_predict):

    if metrics_name == "accuracy":
        return metrics.accuracy_score(y_true, y_predict)

    if metrics_name == "ba":
        return metrics.balanced_accuracy_score(y_true, y_predict)

    if metrics_name == "roc_auc":
        return metrics.roc_auc_score(y_true, y_predict)

    if metrics_name == "kappa":
        return metrics.cohen_kappa_score(y_true, y_predict)

    if metrics_name == "mcc":
        return metrics.matthews_corrcoef(y_true, y_predict)

    if metrics_name == "precision":
        return metrics.precision_score(y_true, y_predict)

    if metrics_name == "recall":
        return metrics.recall_score(y_true, y_predict)


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-output_model", type=str, required=True)
    parser.add_argument("-output_report", type=str, required=True)
    args = parser.parse_args()

    df = pd.read_csv(args.input, index_col=0)
    print(df.shape)

    y_train = df['outcome'].to_numpy()
    print(y_train)
    X_train = df.iloc[:, 1:]
    print(y_train.shape)
    print(X_train.shape)

    # Number of trees in random forest
    n_estimators = [100, 250, 500, 750, 1000]
    max_features = ['auto', 'sqrt']
    criterion = ['gini', 'entropy']
    class_weight = [None,'balanced',
                        {0:.9, 1:.1}, {0:.8, 1:.2}, {0:.7, 1:.3}, {0:.6, 1:.4},
                        {0:.4, 1:.6}, {0:.3, 1:.7}, {0:.2, 1:.8}, {0:.1, 1:.9}]
    random_state = [24]

    # Create the random grid
    param_grid = {'n_estimators': n_estimators,
                  'max_features': max_features,
                  'criterion': criterion,
                  'random_state': random_state,
                  'class_weight': class_weight}

    # setup model building
    rf = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=-1, cv=5, verbose=1)
    rf.fit(X_train, y_train)
    print()
    print('Best params: %s' % rf.best_params_)
    print('Score: %.2f' % rf.best_score_)

    rf_best = RandomForestClassifier(**rf.best_params_, n_jobs=-1)
    rf_best.fit(X_train, y_train)

    # Save the created model, then reload it to check that it restores correctly
    with gzip.GzipFile(args.output_model, 'w') as f:
        cPickle.dump(rf_best, f)

    with gzip.GzipFile(args.output_model, 'r') as f:
        rf_best = cPickle.load(f)

    # Params
    pred = []
    ad = []
    pred_proba = []
    index = []
    cross_val = StratifiedKFold(n_splits=5)

    # 5-fold loop: refit the tuned model on each fold and score the held-out fold
    for train_index, test_index in cross_val.split(X_train, y_train):
        rf_best.fit(X_train.iloc[train_index], y_train[train_index])
        fold_pred = rf_best.predict(X_train.iloc[test_index])
        fold_ad = rf_best.predict_proba(X_train.iloc[test_index])

        pred.append(fold_pred)
        ad.append(fold_ad)
        pred_proba.append(fold_ad[:, 1])
        index.append(test_index)

    threshold_ad = 0.70

    # Prepare results to export
    fold_index = np.concatenate(index)
    fold_pred = np.concatenate(pred)
    fold_ad = np.concatenate(ad)
    fold_pred_proba = np.concatenate(pred_proba)

    fold_ad = (np.amax(fold_ad, axis=1) >= threshold_ad).astype(str)
    five_fold_morgan = pd.DataFrame({'Prediction': fold_pred, 'AD': fold_ad, 'Proba': fold_pred_proba}, index=list(fold_index))
    five_fold_morgan.AD[five_fold_morgan.AD == 'False'] = np.nan
    five_fold_morgan.AD[five_fold_morgan.AD == 'True'] = five_fold_morgan.Prediction
    five_fold_morgan.sort_index(inplace=True)
    five_fold_morgan['y_train'] = pd.DataFrame(y_train)

    # morgan stats
    all_datas = []
    datas = []
    datas.append(calc_metrics("accuracy", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics("ba", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics("roc_auc", five_fold_morgan['y_train'], five_fold_morgan['Proba']))
    datas.append(calc_metrics("kappa", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics("mcc", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics("precision", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics("recall", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics_derived_from_confusion_matrix("sensitivity", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics_derived_from_confusion_matrix("PPV", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics_derived_from_confusion_matrix("specificity", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))
    datas.append(calc_metrics_derived_from_confusion_matrix("NPV", five_fold_morgan['y_train'], five_fold_morgan['Prediction']))

    datas.append(1)
    all_datas.append(datas)

    # morgan AD stats
    morgan_ad = five_fold_morgan.dropna(subset=['AD'])
    morgan_ad['AD'] = morgan_ad['AD'].astype(int)
    coverage_morgan_ad = len(morgan_ad['AD']) / len(five_fold_morgan['y_train'])

    datas = []
    datas.append(calc_metrics("accuracy", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics("ba", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics("roc_auc", morgan_ad['y_train'], morgan_ad['Proba']))
    datas.append(calc_metrics("kappa", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics("mcc", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics("precision", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics("recall", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics_derived_from_confusion_matrix("sensitivity", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics_derived_from_confusion_matrix("PPV", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics_derived_from_confusion_matrix("specificity", morgan_ad['y_train'], morgan_ad['AD']))
    datas.append(calc_metrics_derived_from_confusion_matrix("NPV", morgan_ad['y_train'], morgan_ad['AD']))

    datas.append(coverage_morgan_ad)
    all_datas.append(datas)

    # print stats
    print('\033[1m' + '5-fold External Cross Validation Statistical Characteristics of QSAR models developed (Morgan)' + '\n' + '\033[0m')
    morgan_5f_stats = pd.DataFrame(all_datas, columns=["accuracy", "ba", "roc_auc", "kappa", "mcc", "precision", "recall",
                                                       "sensitivity", "PPV", "specificity", "NPV", "Coverage"])
    morgan_5f_stats.to_csv(args.output_report)
    print(morgan_5f_stats)

    # Y-randomization (permutation test)
    permutations = 20
    score, permutation_scores, pvalue = permutation_test_score(rf_best, X_train, y_train,
                                                           cv=5, scoring='balanced_accuracy',
                                                           n_permutations=permutations,
                                                           n_jobs=-1,
                                                           verbose=1,
                                                           random_state=24)
    print('True score = ', score.round(2),
          '\nY-randomization = ', np.mean(permutation_scores).round(2),
          '\np-value = ', pvalue.round(4))


if __name__ == "__main__":
    main()

When this is executed, the prediction model itself is saved to the path specified by output_model, and the evaluation metrics of the constructed model are written to the path specified by output_report. The results are as follows.

   accuracy  balanced_accuracy   roc_auc     kappa       mcc  precision    recall  sensitivity     PPV  specificity       NPV  Coverage
0  0.801980           0.713126  0.807507  0.490156  0.551662     0.9375  0.441176     0.441176  0.9375     0.985075  0.776471  1.000000
1  0.852941           0.736842  0.812030  0.564661  0.627215     1.0000  0.473684     0.473684  1.0000     1.000000  0.830508  0.673267

True score =  0.71
Y-randomization =  0.48
p-value =  0.0476

The first row of the first half is the cross-validation score over all the training data. Recall and sensitivity are the same metric; both are shown to check that the scikit-learn functions and the confusion-matrix functions I built myself agree.
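As a quick sanity check of that equivalence (a minimal sketch with made-up labels, not data from this article):

import numpy as np
from sklearn import metrics

# Made-up labels, only to illustrate that recall_score matches TP / (TP + FN)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
assert metrics.recall_score(y_true, y_pred) == tp / (tp + fn)  # recall == sensitivity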

The second row of the first half corresponds to a consideration in the paper: it is the cross-validation score restricted to data whose prediction reliability (the value from scikit-learn's predict_proba for the predicted class) is at or above the threshold of 0.7. As an index based only on the more reliable predictions, the metrics come out slightly better.
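Stripped of the DataFrame bookkeeping, the AD filtering in the script above boils down to the following (a minimal sketch on toy data from make_classification, not the actual fingerprints):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the fingerprint matrix and activity labels
X, y = make_classification(n_samples=200, n_features=32, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

confidence = model.predict_proba(X).max(axis=1)  # reliability of the predicted class
inside_ad = confidence >= 0.70                   # applicability-domain mask

y_pred = model.predict(X)
coverage = inside_ad.mean()                                # fraction of data retained
ad_accuracy = (y_pred[inside_ad] == y[inside_ad]).mean()   # accuracy on the reliable subset
print(coverage, ad_accuracy)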

The second half is the result of Y-randomization. According to the scikit-learn documentation, True score is the score without permuting the target, Y-randomization is the mean score with the target permuted, and p-value is the probability that the true score was obtained by chance; the best possible value is `1 / (n_permutations + 1)` and the worst is 1.0. Here, 1 / (n_permutations + 1) = 1 / (20 + 1) ≈ 0.0476, the best possible value, so the balanced_accuracy of 0.71 does not appear to have been obtained by chance.
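As far as I can tell from the scikit-learn documentation, the reported p-value is computed essentially as follows (a sketch of the idea, not the library source):

import numpy as np

def permutation_pvalue(true_score, permutation_scores):
    # Fraction of permuted scores at least as good as the true score,
    # with +1 smoothing so the p-value can never be exactly zero
    n_permutations = len(permutation_scores)
    n_better = np.sum(np.asarray(permutation_scores) >= true_score)
    return (n_better + 1) / (n_permutations + 1)

# 20 permutations, none beating the true score -> the best-case p-value
print(permutation_pvalue(0.71, np.full(20, 0.48)))  # 1 / 21 = 0.0476...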

5.4 Screening of candidate drugs using a predictive model

5.4.1 Running predictions on the DrugBank data

Predictions are made with the following script (only a fragment is shown), specifying the prediction model created in 5.3, the training data (101 compounds) extracted from ChEMBL and PDB, and the virtual-screening data (10,750 compounds) extracted from DrugBank.

predict_drugbank.py


# Fragment from inside main(). The full script also imports argparse, csv,
# gzip, _pickle as cPickle, numpy as np, pandas as pd, and RDKit's Chem and
# AllChem, and defines the normalize() SMILES standardization helper used below.

    parser = argparse.ArgumentParser()
    parser.add_argument("-input_train", type=str, required=True)
    parser.add_argument("-input_predict", type=str, required=True)
    parser.add_argument("-input_model", type=str, required=True)
    parser.add_argument("-output_csv", type=str, required=True)
    parser.add_argument("-output_report", type=str, required=True)
    args = parser.parse_args()

    #Read training data
    trains = list()
    with open(args.input_train, "r") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            smiles = row["canonical_smiles"]
            new_smiles = normalize(smiles)
            trains.append((row["id"], new_smiles))


    #Reading forecast data
    smiles_list = list()
    datas = []
    datas_other = []

    with open(args.input_predict, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            print(row["id"])
            smiles = row["smiles"]
            new_smiles = normalize(smiles)
            dup_pred = None
            dup_train = None

            if new_smiles is None or Chem.MolFromSmiles(new_smiles) is None:
                print(f"error! {id}")
                new_smiles = smiles

            existFlag = False
            for db_id, tmp_smiles in smiles_list:
                if new_smiles == tmp_smiles:
                    dup_pred = db_id
                    print(f"{id}: same structure exist in predict data! - {db_id}, {tmp_smiles}")
                    break

            for db_id, tmp_smiles in trains:
                if new_smiles == tmp_smiles:
                    dup_train = db_id
                    print(f"{id}: same structure exist in train data! - {db_id}, {tmp_smiles}")
                    break

            smiles_list.append((row["id"], new_smiles))
            mol = Chem.MolFromSmiles(new_smiles)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
            fp = pd.Series(np.asarray(fp)).values
            data = []
            data.append(row["id"])
            data.extend(fp)
            datas.append(data)
            datas_other.append((row["id"], row["name"], row["status"], dup_pred, dup_train))

        columns = list()
        columns.append("id")
        columns.extend(["Bit_" + str(i+1) for i in range(2048)])
        df = pd.DataFrame(data=datas, columns=columns)
        df.set_index("id", inplace=True, drop=True)

        df_other = pd.DataFrame(data=datas_other, columns=["id", "name", "status", "dup_predict", "dup_train"])
        df_other.set_index("id", inplace=True, drop=True)

        #Save once
        df.to_csv(args.output_csv)

    #df = pd.read_csv(args.input, index_col=0)
    X_vs = df.iloc[:, 0:]

    with gzip.GzipFile(args.input_model, 'r') as f:
        model = cPickle.load(f)

    ad_threshold = 0.70

    y_pred = model.predict(X_vs)
    confidence = model.predict_proba(X_vs)
    confidence = np.amax(confidence, axis=1).round(2)
    ad = confidence >= ad_threshold

    pred = pd.DataFrame({'Prediction': y_pred, 'AD': ad, 'Confidence': confidence}, index=None)
    pred.AD[pred.AD == False] = np.nan
    pred.AD[pred.AD == True] = pred.Prediction.astype(int)

    pred_ad = pred.dropna().astype(int)
    coverage_ad = len(pred_ad) * 100 / len(pred)

    print('VS pred: %s' % pred.Prediction)
    print('VS pred AD: %s' % pred_ad.Prediction)
    print('Coverage of AD: %.2f%%' % coverage_ad)

    pred.index = X_vs.index
    predictions = pd.concat([df_other, pred], axis=1)

    for col in ['Prediction', 'AD']:
        predictions[col].replace(0, 'Inactive', inplace=True)
        predictions[col].replace(1, 'Active', inplace=True)
    print(predictions.head())

    predictions.to_csv(args.output_report)

The explanatory-variable (descriptor) values computed for the prediction data are output to output_csv. To output_report are written, for each drug: its DrugBank ID, common name, status (approved / withdrawn / experimental, etc.), the prediction result, the prediction restricted to cases whose reliability exceeds the threshold, and the prediction reliability itself. In addition, output_report records, for each prediction record, the ID of any duplicate structure within the prediction data and of any duplicate structure in the training data. There were 321 duplicates within the prediction data and 6 overlaps with the training data.
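For reference, those duplicate counts can be tallied from the report with a few lines of pandas (a sketch; "report.csv" is a placeholder for the path passed to -output_report):

import pandas as pd

report = pd.read_csv("report.csv", index_col=0)  # placeholder path

# dup_predict / dup_train hold the ID of the duplicated structure, or are empty
print(report["dup_predict"].notna().sum())  # duplicates within the prediction data
print(report["dup_train"].notna().sum())    # overlaps with the training data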

5.4.2 Comparison with the paper

The paper's virtual-screening results for DrugBank are published at https://github.com/alvesvm/sars-cov-mpro/tree/master/mol-inf-2020/datasets/screened_compounds as drugbank_hits_qsar_consensus.xlsx. I compared the DrugBank drugs listed there, and their reported outcomes, with the prediction results output in the previous section.

In the paper, prediction models were built from three descriptor sets: SiRMS, Dragon, and the RDKit Morgan fingerprints recreated in this article, and a compound was finally judged Active when multiple models predicted it Active. I made an overall comparison and separate comparisons against SiRMS and Dragon. I noticed something questionable about the Morgan results on GitHub, so I did not compare against them.

Comparison with the overall results

The paper reports 41 candidate drugs, but in the Excel file on GitHub the final number of Active compounds is 51. Of those 51, 19 were also predicted Active by the model created this time.

Comparison with the SiRMS and Dragon models

The numbers of Active / Inactive compounds for the SiRMS model, the Dragon model, and the model created this time ("My model" below) are as follows.

MODEL      ACTIVE  INACTIVE  TOTAL
SiRMS         309      9305   9614
DRAGON        864      8750   9614
My model      314      9300   9614

Next, I counted the Actives shared between each pair of models.

MODEL      SiRMS  DRAGON  My model
SiRMS          -      39        79
DRAGON        39       -        35
My model      79      35         -

Although DRAGON has by far the most Actives, My model shares 79 Actives with SiRMS, more than double the 35 it shares with Dragon. This suggests that the model created this time behaves more like SiRMS than like Dragon.
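Overlap counts like these reduce to set intersections once each model's Active IDs are collected (a sketch; the column names and layout assumed for the consensus Excel file are guesses, not its verified format):

import pandas as pd

# Assumed layout: one prediction column per model, indexed by DrugBank ID
paper = pd.read_excel("drugbank_hits_qsar_consensus.xlsx", index_col=0)
mine = pd.read_csv("report.csv", index_col=0)  # report from predict_drugbank.py

actives = {m: set(paper.index[paper[m] == "Active"]) for m in ["SiRMS", "Dragon"]}
actives["mine"] = set(mine.index[mine["Prediction"] == "Active"])

for a in actives:
    for b in actives:
        if a < b:
            print(a, b, len(actives[a] & actives[b]))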

5.4.3 Bonus: Visualizing the applicability range of the prediction model

Although not covered in the paper, I also visualized how the prediction data are distributed relative to the training data. In machine learning, prediction reliability becomes questionable when the input is not similar to the training data. First, a dimensionality-reduction model was fitted on the training data; the prediction data were then transformed into two-dimensional coordinates with that model and plotted. I tried PCA and UMAP as reduction methods, using the same Morgan fingerprints as the prediction model. Blue is the training data and red is the prediction data; ○ is Active and × is Inactive. Where ○ and × overlap in the training data, UMAP appears to separate them better.

image.png

image.png

The source of the visualization script is below. A PCA or UMAP model is created by fitting on the training data, and the transformed coordinates of each prediction record are then displayed. From the figures, the prediction data do not appear to deviate significantly from the range of the training data, though this impression may be affected by the fact that the reduction was fitted on the training data alone.

view_ad.py

# Fragment from inside main(). The full script also imports argparse, csv,
# matplotlib.pyplot as plt, umap, PCA from sklearn.decomposition, and RDKit's
# Chem and AllChem.

    parser = argparse.ArgumentParser()
    parser.add_argument("-train", type=str, required=True)
    parser.add_argument("-predict", type=str)
    parser.add_argument("-result", type=str)
    parser.add_argument("-method", type=str, default="PCA", choices=["PCA", "UMAP"])

    args = parser.parse_args()

    # Load the training data (all_train.csv) and compute fingerprints
    train_datas = []
    train_datas_active = []
    train_datas_inactive = []

    with open(args.train, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            smiles = row["canonical_smiles"]

            mol = Chem.MolFromSmiles(smiles)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
            train_datas.append(fp)

            if int(row["outcome"]) == 1:
                 train_datas_active.append(fp)
            else:
                 train_datas_inactive.append(fp)

    if args.predict and args.result:
        result_outcomes = []
        result_ads = []

        # Read the prediction results
        with open(args.result, "r",encoding="utf-8_sig") as f:
            reader = csv.DictReader(f)
            for i, row in enumerate(reader):
                #print(row)
                if row["Prediction"] == "Active":
                    result_outcomes.append(1)
                else:
                    result_outcomes.append(0)

                result_ads.append(row["Confidence"])

        # Load the prediction data (drugbank.csv) and compute fingerprints
        predict_datas = []
        predict_datas_active = []
        predict_datas_inactive = []
        predict_ads = []
        with open(args.predict, "r") as f:
            reader = csv.DictReader(f)
            for i, row in enumerate(reader):
                print(i)
                smiles = row["smiles"]
                mol = Chem.MolFromSmiles(smiles)
                fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
                predict_datas.append(fp)

                if result_outcomes[i] == 1:
                    predict_datas_active.append(fp)
                else:
                    predict_datas_inactive.append(fp)


    # Fit the dimensionality-reduction model (PCA or UMAP) on the training data
    model = None
    if args.method == "PCA":
        model = PCA(n_components=2)
        model.fit(train_datas)

    if args.method == "UMAP":
        model = umap.UMAP()
        model.fit(train_datas)

    result_train = model.transform(train_datas)
    result_train_active = model.transform(train_datas_active)
    result_train_inactive = model.transform(train_datas_inactive)

    plt.title(args.method)
    #plt.scatter(result_train[:, 0], result_train[:, 1], c="blue", alpha=0.1, marker="o")
    plt.scatter(result_train_active[:, 0], result_train_active[:, 1], c="blue", alpha=0.5, marker="o")
    plt.scatter(result_train_inactive[:, 0], result_train_inactive[:, 1], c="blue", alpha=0.5, marker="x")

    # Plot the prediction data
    if args.predict and args.result:

        result_predict = model.transform(predict_datas)
        result_predict_active = model.transform(predict_datas_active)
        result_predict_inactive = model.transform(predict_datas_inactive)

        #plt.scatter(result_predict[:, 0], result_predict[:, 1], c=result_ads, alpha=0.1, cmap='viridis_r')
        plt.scatter(result_predict_active[:, 0], result_predict_active[:, 1], c="red", alpha=0.1, marker="o")
        plt.scatter(result_predict_inactive[:, 0], result_predict_inactive[:, 1], c="red", alpha=0.1, marker="x")

    plt.show()

6. Conclusion

6.1 Impressions

Through working through the contents of this paper, I gained the following. I think it will prove useful when I analyze data myself or read other papers in the future.

――I experienced the whole flow of the drug discovery field, from data collection through preprocessing, prediction model building, and model evaluation. In particular, I felt firsthand how much time data collection and preprocessing take before a model can be built, and how much work applying and evaluating the model takes afterwards.
――I deepened my understanding of the public databases, including what they contain and how to collect data from them.

This time I reproduced a paper that uses conventional methods, but papers using deep learning, such as graph convolutional networks (GCN), are appearing one after another, and I would like to try those in the near future.

6.2 Other recommended papers related to this paper

Below are two other papers on coronavirus-related prediction models that I found while searching. Both provide public datasets, so I think they are worth trying.

The first builds a prediction model for papain-like protease (PLpro), a target different from the paper covered in this article.

The second builds a prediction model for the main protease, the same target as in this article. Its model is evaluated according to the OECD guidelines (https://www.oecd.org/chemicalsafety/risk-assessment/37849783.pdf), which may offer insights into how to evaluate prediction models for compounds.

6.3 Libraries used in this article

The main libraries used in this article are as follows.

6.5 About the program in this article

The programs listed in this article are available at https://github.com/kimisyo/sars_trace_1.
