Introduction

DeepChem contains data from MolculeNet (http://moleculenet.ai/), which allows you to try machine learning with compounds. Among the various data of Molecule Net, the data called "PDB Bind" shines exceptionally brightly. The reason why it has changed is that while other data is "compound data" stored in sdf and smiles as training data, this is the binding state of protein and ligand (compound that specifically binds to protein). This is because it is training data. Of course, Featurizer is also different from the others, and it uses something different from the one for normal data sets such as "RDKitGridFeaturizer".

This time, I downloaded the PDBBind dataset locally and tried to Featurize it with RDKitGridFeaturier.

environment

deepchem 2.2.1.dev54
pdbfixer

In addition to Deep Chem, you need a PDB Fixer to run it. The installation method of PDB Fixer is as follows.

$ conda install -c omnia pdbfixer

Advance preparation

Click the "PDBBind" link at http://moleculenet.ai/datasets-1 to download the dataset, so unzip it to a suitable location. The capacity is quite large. By the way, the format of each data is sdf (and mol2) for the ligand and pdb format for the protein.

Source

The file called load_pdbdataset.py of deepchem is diverted, and unnecessary parts are deleted. Since the original source is used almost as it is, not only RdkitGridFeaturizer but also ComplexNeighborListFragmentAtomicCoordinates, AtomicConvFeaturizer, etc. can be specified. RDKitGridFeaturizer can specify'ecfp','splif','hbond','salt_bridge','pi_stack','cation_pi', and'charge' as feature_type.

Please read the path etc. in your own environment.

`deepchem_test.py`


import os
import time
import numpy as np
from deepchem.feat import rdkit_grid_featurizer as rgf
from deepchem.feat.atomic_coordinates import ComplexNeighborListFragmentAtomicCoordinates
from deepchem.feat.graph_features import AtomicConvFeaturizer

def main():

    split = "random",
    featurizer = "grid"
    subset = "core"
    load_binding_pocket = True
    pdbbind_tasks = ["-logKd/Ki"]
    data_folder = "../data/v2015"

    if subset == "core":
        index_labels_file = os.path.join(data_folder, "INDEX_core_data.2013_small")
    elif subset == "refined":
        index_labels_file = os.path.join(data_folder, "INDEX_refined_data.2015")
    else:
        raise ValueError("Other subsets not supported")

    # Extract locations of data
    with open(index_labels_file, "r") as g:
        pdbs = [line[:4] for line in g.readlines() if line[0] != "#"]
        if load_binding_pocket:
            protein_files = [
            os.path.join(data_folder, pdb, "%s_pocket.pdb" % pdb) for pdb in pdbs
        ]
        else:
            protein_files = [
            os.path.join(data_folder, pdb, "%s_protein.pdb" % pdb) for pdb in pdbs
            ]
        ligand_files = [
        os.path.join(data_folder, pdb, "%s_ligand.sdf" % pdb) for pdb in pdbs
        ]

    # Extract labels
    with open(index_labels_file, "r") as g:
        labels = np.array([
            # Lines have format
            # PDB code, resolution, release year, -logKd/Ki, Kd/Ki, reference, ligand name
            # The base-10 logarithm, -log kd/pk
            float(line.split()[3]) for line in g.readlines() if line[0] != "#"
        ])

    # Featurize Data
    if featurizer == "grid":
        featurizer = rgf.RdkitGridFeaturizer(
        voxel_width=2.0,
         feature_types=[
           'ecfp', 'splif', 'hbond', 'salt_bridge', 'pi_stack', 'cation_pi', 'charge'
        ],
        flatten=False)
    elif featurizer == "atomic" or featurizer == "atomic_conv":
    # Pulled from PDB files. For larger datasets with more PDBs, would use
    # max num atoms instead of exact.

        frag1_num_atoms = 70  # for ligand atoms
        if load_binding_pocket:
            frag2_num_atoms = 1000
            complex_num_atoms = 1070
        else:
            frag2_num_atoms = 24000  # for protein atoms
            complex_num_atoms = 24070  # in total

        max_num_neighbors = 4
        # Cutoff in angstroms
        neighbor_cutoff = 4
        if featurizer == "atomic":
            featurizer = ComplexNeighborListFragmentAtomicCoordinates(
            frag1_num_atoms=frag1_num_atoms,
            frag2_num_atoms=frag2_num_atoms,
            complex_num_atoms=complex_num_atoms,
            max_num_neighbors=max_num_neighbors,
            neighbor_cutoff=neighbor_cutoff)
        if featurizer == "atomic_conv":
            featurizer = AtomicConvFeaturizer(
            labels=labels,
            frag1_num_atoms=frag1_num_atoms,
            frag2_num_atoms=frag2_num_atoms,
            complex_num_atoms=complex_num_atoms,
            neighbor_cutoff=neighbor_cutoff,
            max_num_neighbors=max_num_neighbors,
            batch_size=64)
    else:
        raise ValueError("Featurizer not supported")

    print("\nFeaturizing Complexes for \"%s\" ...\n" % data_folder)
    feat_t1 = time.time()
    features, failures = featurizer.featurize_complexes(ligand_files, protein_files)
    print(features.shape)
    print(features)
    feat_t2 = time.time()
    print("\nFeaturization finished, took %0.3f s." % (feat_t2 - feat_t1))


if __name__ == '__main__':
    main()

Run

When executed, it will be displayed in features.shape, features like this. This shows that there are five training data, and one training data has the characteristics of 18944 dimensions.

(5, 18944)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

When the number of dimensions for each feature_type was examined by changing the specification of feature_types, it was as follows.

'ecfp' 4096
'splif' 12288
'hbond 1536,
'salt_bridge 512
'pi_stack 0
'cation_pi 0
'charge' 512

Actually, it is flattened to one dimension, but you can see the original dimension by setting the flatten argument of rgf.RdkitGridFeaturizer to False. Salt_bridge etc. may be related to 3D coordinates because 512 dimensions are originally 8 x 8 x 8.

Up to here for this time.

in conclusion

This time, I went to the point of preparing the data and seeing the contents of the features, but unfortunately I could not go to the point of "playing" with my own data.

The biggest question seems to be that the binding state of the protein and the ligand depends on the atoms and bonds of the protein / ligand around the binding, but these are Featurize of all the training data, although the types and numbers differ depending on the training data. Why are the results in the same dimension? The only answer would be to look at the RDKitGridFeaturizer source.

If it is one-dimensional learning data, it can be predicted by other than Deep Learning, so it may be interesting to try to predict it by the conventional machine learning method against MoleculeNet. It may also be interesting to create a Featurizer that reflects your ideas by referring to RDKitGridFeaturizer.

Play with MoleculeNet's PDBBind and DeepChem's RDKitGridFeaturizer