DeepChem contains data from MolculeNet (http://moleculenet.ai/), which allows you to try machine learning with compounds. Among the various data of Molecule Net, the data called "PDB Bind" shines exceptionally brightly. The reason why it has changed is that while other data is "compound data" stored in sdf and smiles as training data, this is the binding state of protein and ligand (compound that specifically binds to protein). This is because it is training data. Of course, Featurizer is also different from the others, and it uses something different from the one for normal data sets such as "RDKitGridFeaturizer".
This time, I downloaded the PDBBind dataset locally and tried to Featurize it with RDKitGridFeaturier.
In addition to Deep Chem, you need a PDB Fixer to run it. The installation method of PDB Fixer is as follows.
$ conda install -c omnia pdbfixer
Click the "PDBBind" link at http://moleculenet.ai/datasets-1 to download the dataset, so unzip it to a suitable location. The capacity is quite large. By the way, the format of each data is sdf (and mol2) for the ligand and pdb format for the protein.
The file called load_pdbdataset.py of deepchem is diverted, and unnecessary parts are deleted. Since the original source is used almost as it is, not only RdkitGridFeaturizer but also ComplexNeighborListFragmentAtomicCoordinates, AtomicConvFeaturizer, etc. can be specified. RDKitGridFeaturizer can specify'ecfp','splif','hbond','salt_bridge','pi_stack','cation_pi', and'charge' as feature_type.
Please read the path etc. in your own environment.
deepchem_test.py
import os
import time
import numpy as np
from deepchem.feat import rdkit_grid_featurizer as rgf
from deepchem.feat.atomic_coordinates import ComplexNeighborListFragmentAtomicCoordinates
from deepchem.feat.graph_features import AtomicConvFeaturizer
def main():
split = "random",
featurizer = "grid"
subset = "core"
load_binding_pocket = True
pdbbind_tasks = ["-logKd/Ki"]
data_folder = "../data/v2015"
if subset == "core":
index_labels_file = os.path.join(data_folder, "INDEX_core_data.2013_small")
elif subset == "refined":
index_labels_file = os.path.join(data_folder, "INDEX_refined_data.2015")
else:
raise ValueError("Other subsets not supported")
# Extract locations of data
with open(index_labels_file, "r") as g:
pdbs = [line[:4] for line in g.readlines() if line[0] != "#"]
if load_binding_pocket:
protein_files = [
os.path.join(data_folder, pdb, "%s_pocket.pdb" % pdb) for pdb in pdbs
]
else:
protein_files = [
os.path.join(data_folder, pdb, "%s_protein.pdb" % pdb) for pdb in pdbs
]
ligand_files = [
os.path.join(data_folder, pdb, "%s_ligand.sdf" % pdb) for pdb in pdbs
]
# Extract labels
with open(index_labels_file, "r") as g:
labels = np.array([
# Lines have format
# PDB code, resolution, release year, -logKd/Ki, Kd/Ki, reference, ligand name
# The base-10 logarithm, -log kd/pk
float(line.split()[3]) for line in g.readlines() if line[0] != "#"
])
# Featurize Data
if featurizer == "grid":
featurizer = rgf.RdkitGridFeaturizer(
voxel_width=2.0,
feature_types=[
'ecfp', 'splif', 'hbond', 'salt_bridge', 'pi_stack', 'cation_pi', 'charge'
],
flatten=False)
elif featurizer == "atomic" or featurizer == "atomic_conv":
# Pulled from PDB files. For larger datasets with more PDBs, would use
# max num atoms instead of exact.
frag1_num_atoms = 70 # for ligand atoms
if load_binding_pocket:
frag2_num_atoms = 1000
complex_num_atoms = 1070
else:
frag2_num_atoms = 24000 # for protein atoms
complex_num_atoms = 24070 # in total
max_num_neighbors = 4
# Cutoff in angstroms
neighbor_cutoff = 4
if featurizer == "atomic":
featurizer = ComplexNeighborListFragmentAtomicCoordinates(
frag1_num_atoms=frag1_num_atoms,
frag2_num_atoms=frag2_num_atoms,
complex_num_atoms=complex_num_atoms,
max_num_neighbors=max_num_neighbors,
neighbor_cutoff=neighbor_cutoff)
if featurizer == "atomic_conv":
featurizer = AtomicConvFeaturizer(
labels=labels,
frag1_num_atoms=frag1_num_atoms,
frag2_num_atoms=frag2_num_atoms,
complex_num_atoms=complex_num_atoms,
neighbor_cutoff=neighbor_cutoff,
max_num_neighbors=max_num_neighbors,
batch_size=64)
else:
raise ValueError("Featurizer not supported")
print("\nFeaturizing Complexes for \"%s\" ...\n" % data_folder)
feat_t1 = time.time()
features, failures = featurizer.featurize_complexes(ligand_files, protein_files)
print(features.shape)
print(features)
feat_t2 = time.time()
print("\nFeaturization finished, took %0.3f s." % (feat_t2 - feat_t1))
if __name__ == '__main__':
main()
When executed, it will be displayed in features.shape, features like this. This shows that there are five training data, and one training data has the characteristics of 18944 dimensions.
(5, 18944)
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
When the number of dimensions for each feature_type was examined by changing the specification of feature_types, it was as follows.
Actually, it is flattened to one dimension, but you can see the original dimension by setting the flatten argument of rgf.RdkitGridFeaturizer to False. Salt_bridge etc. may be related to 3D coordinates because 512 dimensions are originally 8 x 8 x 8.
Up to here for this time.
This time, I went to the point of preparing the data and seeing the contents of the features, but unfortunately I could not go to the point of "playing" with my own data.
The biggest question seems to be that the binding state of the protein and the ligand depends on the atoms and bonds of the protein / ligand around the binding, but these are Featurize of all the training data, although the types and numbers differ depending on the training data. Why are the results in the same dimension? The only answer would be to look at the RDKitGridFeaturizer source.
If it is one-dimensional learning data, it can be predicted by other than Deep Learning, so it may be interesting to try to predict it by the conventional machine learning method against MoleculeNet. It may also be interesting to create a Featurizer that reflects your ideas by referring to RDKitGridFeaturizer.
Recommended Posts