Introduction

This is a repost of something I wrote a while ago. It reads more like a set of rough notes than a finished article.

New molecule generation?

The goal is to create new molecules. In particular, how have people so far gone about designing "useful new molecules with the desired physical properties"? In drug discovery, for example, I believe candidates are designed using not only the basic theory of chemistry but also empirical rules, the Tanimoto coefficient, quantum chemistry calculations, and so on. (I expect there are other methods too; I will look into it myself.)

Lately there seems to be a big push to have machine learning do this.

Among these efforts, the one I focused on this time comes from the Aspuru-Guzik group at Harvard University: "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" (1), which had a large citation count of 213. The program built on this paper is "Chemical VAE" (2).
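For readers who have not met the Tanimoto coefficient mentioned above: it measures how similar two molecules are by comparing their fingerprints, treated as sets of "on" bits. The sketch below computes it for two hand-made bit sets. The fingerprints are made-up values for illustration, not output from any real fingerprinting tool, and the function name tanimoto is my own.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Hypothetical fingerprints (indices of the bits that are set)
fp_known = {1, 4, 7, 9, 15}
fp_candidate = {1, 4, 8, 9, 20}

# 3 shared bits out of 7 distinct bits in total
print(tanimoto(fp_known, fp_candidate))
```

A value of 1.0 means identical fingerprints and 0.0 means no bits in common.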

What is Chemical VAE?

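As the name suggests, Chemical VAE is a variational autoencoder (VAE) trained on SMILES strings. The idea is related to the word2vec (Seq2Seq) style of technique called SMILES2vec; I would appreciate it if you could read my previous article on that.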

The following is the flow of new molecule generation as I understood it from reading the paper.

First, the encoder converts the string that represents a compound into a vector, which defines a latent space (a vector space). Each position in this space corresponds to a SMILES string, and it seems that the closer two positions (written z below) are, the more similar the structures. The decoder then turns the vector back into a string that is as close to the original as possible. The encoder and decoder are trained so that this encode-decode round trip works.

After that (or perhaps "at the same time"), a neural network learns the physical property values of the molecules that correspond to points in the vector space, which gives the function f(z) shown in the figure below.
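To make the "string to vector" step concrete, here is a minimal sketch of the one-hot encoding that typically sits in front of such an encoder, together with its inverse. The character set CHARSET, the padded length MAX_LEN, and both function names are assumptions made for this example; the real model uses its own character set and length.

```python
# Hypothetical, tiny character set; the trailing space is used as padding.
CHARSET = ['C', 'N', 'O', 'S', 'c', 'n', 'o', '1', '(', ')', '=', ' ']
MAX_LEN = 40  # assumed fixed length of the padded string


def smiles_to_one_hot(smiles):
    """Return a MAX_LEN x len(CHARSET) matrix (list of lists) of 0/1 values."""
    if len(smiles) > MAX_LEN:
        raise ValueError('SMILES longer than MAX_LEN')
    padded = smiles.ljust(MAX_LEN)
    index = {ch: i for i, ch in enumerate(CHARSET)}
    matrix = []
    for ch in padded:
        row = [0] * len(CHARSET)
        row[index[ch]] = 1  # raises KeyError for characters outside CHARSET
        matrix.append(row)
    return matrix


def one_hot_to_smiles(matrix):
    """Pick the largest entry in each row and strip the padding."""
    chars = [CHARSET[row.index(max(row))] for row in matrix]
    return ''.join(chars).rstrip()


smiles = 'CSCC(=O)NNC(=O)c1c(C)oc(C)c1C'
one_hot = smiles_to_one_hot(smiles)
print(len(one_hot), len(one_hot[0]))
print(one_hot_to_smiles(one_hot) == smiles)
```

A decoder outputs a row of scores per position rather than exact 0/1 values, which is why the inverse takes the largest entry in each row.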

[Figure: Screenshot 2018-12-31 17.15.12.png]
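One way to read "at the same time" is that the autoencoder and the property predictor are trained against a single combined objective. The form below is only a sketch under stated assumptions: $x$ is the one-hot SMILES input, $y$ the vector of property values, $q(z \mid x)$ the encoder, $p(x \mid z)$ the decoder, $p(z)$ a standard normal prior, and $\beta$ and $\lambda$ are weights introduced here for illustration, not values taken from the paper.

```latex
\mathcal{L}(x, y) =
  \underbrace{-\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)}_{\text{latent regularization}}
  + \lambda \, \underbrace{\big\lVert f(z) - y \big\rVert^{2}}_{\text{property prediction}}
```

The last term is what ties positions in the latent space to property values, so that nearby z tend to have similar f(z).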

(Note) I found the paper especially hard to follow from this point on, so what follows is close to my own guess.

A new molecule is meaningless if it does not have the physical properties you want. Although the technique introduced here generates new molecules, it seems to rest on a pharmaceutical rule of thumb: molecules whose structures resemble known molecules with good physical properties may (often?) have good properties as well. In other words, the process seems to be: encode a known molecule with good properties into the learned latent space, search the neighborhood of that molecule's position, and look for new molecules there. The trained decoder then generates a SMILES string for each candidate. Finally, RDKit is used to check whether the generated string is valid as a molecule.
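The latent-space search described above can be sketched as drawing random points within a fixed radius of the known molecule's latent vector. This is a standalone illustration using only the standard library; the names sample_around and distance and the all-zero 196-dimensional vector are assumptions for the example, and this is not the chemical_vae implementation.

```python
import math
import random


def sample_around(z, max_distance, rng=random):
    """Return a copy of z moved by a random step of length <= max_distance."""
    # Random direction: normalize a vector of Gaussian draws.
    direction = [rng.gauss(0.0, 1.0) for _ in z]
    length = math.sqrt(sum(d * d for d in direction))
    # Random step size between 0 and max_distance.
    step = rng.uniform(0.0, max_distance)
    return [zi + step * d / length for zi, d in zip(z, direction)]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)   # fixed seed so the run is repeatable
z_known = [0.0] * 196    # dummy latent vector standing in for an encoded molecule
candidates = [sample_around(z_known, 3.0, rng) for _ in range(5)]
for z_new in candidates:
    print(round(distance(z_known, z_new), 3))
```

Each candidate would then be passed to the decoder, and the decoded string checked for validity.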

The figure below seems to show the result. The central molecule is the one enclosed in a square, and the other molecules are laid out according to their positions relative to it in the latent space.

That is the flow as far as I understand it.

I then tried the example from the GitHub repository.

To run the example, install chemical_vae with conda or pip.

First, the imports:

`intro_to_chemvae.ipynb`


# tensorflow backend
from os import environ
environ['KERAS_BACKEND'] = 'tensorflow'
# vae stuff
from chemvae.vae_utils import VAEUtils
from chemvae import mol_utils as mu
# import scientific py
import numpy as np
import pandas as pd
# rdkit stuff
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import PandasTools
# plotting stuff
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import SVG, display
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

The dataset is the ZINC dataset. It contains SMILES strings and physical properties: QED (a drug-likeness score), SAS (synthetic accessibility score), and logP (octanol-water partition coefficient).

In addition:
・ smiles_1 specifies the central molecule.
・ noise is the distance (z) from the central molecule in the latent space.
・ Random sampling is done within that z range. A single trial may not find any valid SMILES, so I repeated it 500 times with a for statement.
・ Reconstruction is the central molecule after it has been vectorized by the trained encoder and output again by the trained decoder. (The result below does not look right; changing the training method or parameters would probably help.)

# Load the trained model and canonicalize the central molecule
vae = VAEUtils(directory='../models/zinc_properties')
smiles_1 = mu.canon_smiles('CSCC(=O)NNC(=O)c1c(C)oc(C)c1C')

for i in range(500):
   # Encode the central molecule to z_1, then decode it again
   X_1 = vae.smiles_to_hot(smiles_1,canonize_smiles=True)
   z_1 = vae.encode(X_1)
   X_r = vae.decode(z_1)

   print('{:20s} : {}'.format('Input',smiles_1))
   print('{:20s} : {}'.format('Reconstruction',vae.hot_to_smiles(X_r,strip=True)[0]))

   print('{:20s} : {} with norm {:.3f}'.format('Z representation',z_1.shape, np.linalg.norm(z_1)))

   # Properties predicted from the latent vector
   print('Properties (qed,SAS,logP):')
   y_1 = vae.predict_prop_Z(z_1)[0]
   print(y_1)
   # Distance from z_1 used when sampling (plain spaces here; the original line used full-width spaces)
   noise=3.0
   print('Searching molecules randomly sampled from {:.2f} std (z-distance) from the point'.format(noise))

・ Output result

Using TensorFlow backend.
Standarization: estimating mu and std values ...done!
Input                : CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
Reconstruction       : CH1nCNc1Cs)Nccccc(CCc1)c3
Z representation     : (1, 196) with norm 9.901
Properties (qed,SAS,logP):
[0.72396696 2.1183593  2.1463375 ]
Searching molecules randomly sampled from 3.00 std (z-distance) from the point

Finally, we collect the molecules that were found, keeping only the unique ones that RDKit accepts as valid.

   
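   # (continuation of the for loop body)
   df = vae.z_to_smiles( z_1,decode_attempts=100,noise_norm=noise)
   print('Found {:d} unique mols, out of {:d}'.format(len(set(df['smiles'])),sum(df['count'])))
   print('SMILES\n',df.smiles)
   # Append to the CSV only when at least one molecule was found;
   # otherwise df1 would be undefined (or stale from an earlier iteration)
   if sum(df['count']) !=0:
      df1=pd.DataFrame(df.smiles)
      df1.to_csv("result1.csv",mode='a',index=False,header=False)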

Output result below

`result1.csv`


ON cCO=COCC(O)ccN2cs2c
CCCCCCNc-1cO-SCOCCcccc1
CC1CCcC(-nOcc1ccccCCC1)c1 O
OC (C)C(=Occc3cccccccc)CB
CCC1oNCc2cCcccccc2cccc1 1 1
CO CC(1c(O=O1O(1cO)nC))1
C=C1nn(=O)SnNccccccocc1C
CC C@Cs(=CN=11cccc2cc1Cc)2c1
C OcCCc(CO)c1nccc=Occccc1C
O1Cc(c1CCO)CNCC=BBOCCCN
CC ON(FCNN(C)ccc(Ocn1)1)l
C1ccnccnccccccccncccscc1
CC(CScc(c1cOn1nc1CCl)C)1
CCCCc(-ncccc21nc1c1c2)1CCC
C cnc(Cnncncc(C())Cl)Cl1 1

Some of these probably are not real molecules even though they passed RDKit's check. Still, RDKit does seem to filter out many syntax errors, so it is doing its job.
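Because the loop appends to result1.csv on every pass, the same SMILES string can end up in the file more than once. Here is a minimal sketch for counting how often each line appears, using only the standard library. The helper name count_smiles is made up for this example, and it assumes one SMILES string per line, as in the listing above.

```python
from collections import Counter
from pathlib import Path


def count_smiles(csv_path):
    """Count occurrences of each non-empty line in a one-column file."""
    lines = Path(csv_path).read_text().splitlines()
    return Counter(line.strip() for line in lines if line.strip())


# Only runs when the file produced by the loop is present.
if Path('result1.csv').exists():
    counts = count_smiles('result1.csv')
    print('{:d} lines, {:d} distinct SMILES'.format(sum(counts.values()), len(counts)))
    for smi, n in counts.most_common(5):
        print(n, smi)
```

Strings that recur across many iterations are the ones the decoder produces most consistently near the central molecule, so they may be a reasonable place to start looking.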

That's all.

References

1) Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules https://pubs.acs.org/doi/abs/10.1021/acscentsci.7b00572
2) chemical_vae https://github.com/aspuru-guzik-group/chemical_vae
3) Compound formation by deep learning (drugs, organic luminescent molecules) https://ritsuan.com/blog/8480/
