Deep Learning for Molecules: Introduction to Chainer Chemistry
1. Kenta Oono (oono@preferred.jp, github: delta2323)
Kosuke Nakago (nakago@preferred.jp, github: corochann)
Deep learning for molecules
Introduction to Chainer Chemistry
2. Table of contents
1. What is machine learning?
a. Data driven approach
b. Primer of deep learning (MLP / CNN)
2. Prediction of chemical characteristics
a. Rule-based approach vs. Learning-based approach
b. Neural Message passing (NFP / GGNN etc.)
3. Chainer Chemistry
a. Primer of Chainer
b. Coding examples
4. Other topics
a. Generation of chemical compounds
b. Automatic chemical synthesis
3. Why machine learning?
Example: predicting age from pictures
Challenges
● What criteria can we use?
○ Height, hair, clothes, physique, etc.?
○ No single criterion is perfect.
● Even if we have good criteria, how can we extract them?
○ People in pictures can appear at different positions, scales, and postures.
○ How can we detect each part (face, hair, etc.) within a body?
⇒ It is very difficult to enumerate such rules manually.
Picture: irasutoya (https://www.irasutoya.com)
4. Approach by machine learning
Provide machines with a vast number of images annotated with age, and let them discover trends characteristic of each generation.
Humans do not explicitly tell the machine where in the image to look.
Photo: Flickr
5. Application of machine learning
| Task                                 | Input                       | Output                                |
| Chemical prediction                  | Molecule                    | Chemical characteristics (HOMO etc.)  |
| Mail classification                  | E-mail (sentences, header)  | Spam / Normal / Important             |
| Data center electricity optimization | Packets of each server      | Estimated electricity demand          |
| Web marketing                        | Access history, ad contents | Click or not                          |
| Surveillance camera                  | Video                       | Suspicious behavior or not            |
6. Categorization of machine learning algorithms
● By dataset type
● Supervised learning (with ground-truth labels)
● Unsupervised learning (without ground-truth labels)
● Semi-supervised learning (only part of the samples has ground-truth labels)
● Reinforcement learning (rewards instead of labels)
● By method
● Classification, regression, clustering, nearest neighbors
● Others
● Discriminative vs. generative models / Bayesian vs. frequentist etc.
7. Deep Learning
A general term for the subcategory of machine learning that uses models
consisting of (typically many) simple, differentiable transformations.
http://www.wsdm-conference.org/2016/slides/WSDM2016-Jeff-Dean.pdf
8. Multi Layer Perceptron (MLP)
[Figure: an MLP with input x = (x1, …, xN), hidden layers h = (h1, …, hH) and k = (k1, …, kM), output y = (y1, …, yM), and ground truth t = (t1, …, tM); arrows indicate the forward and backward passes]
Learnable parameters
• W1, W2: parameter matrices
• b1, b2: bias vectors
Forward propagation
• h = f1(x) = Sigmoid(W1 x + b1)
• k = f2(h) = Sigmoid(W2 h + b2)
• y = f3(k) = SoftMax(k) (equivalently, yi = exp(ki) / Σj exp(kj))
Training dataset
• Feature vectors: x1, x2, …, xN
• Ground-truth labels: t1, t2, …, tN
Each transformation consists of a fully-connected layer and an activation function. The output y is compared with the ground truth t to evaluate the difference (the loss).
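As a concrete illustration, here is a minimal NumPy sketch of the forward propagation above; all sizes and parameter values are hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

N, H, M = 4, 8, 3                        # hypothetical layer sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(H, N)), np.zeros(H)
W2, b2 = rng.normal(size=(M, H)), np.zeros(M)

x = rng.normal(size=N)                   # input feature vector
h = sigmoid(W1 @ x + b1)                 # f1: fully-connected layer + Sigmoid
k = sigmoid(W2 @ h + b2)                 # f2: fully-connected layer + Sigmoid
y = softmax(k)                           # f3: SoftMax over class scores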
9. Fully connected layer
● Learnable parameters:
● W (weight matrix of size M × N)
● b (bias vector of size M)
● Input: vector x of size N
● Output: vector y = Wx + b of size M (affine transformation)
[Figure: inputs x1, …, xN fully connected through W and b to outputs y1, …, yM]
10. Activation function
● A function, usually without learnable parameters, that introduces non-linearity
● Input: vector (or tensor) x = (x1, …, xn)
● Output: vector (or tensor) y = (y1, …, yn), where yi = σ(xi) (i = 1, …, n)
Examples of σ
● Sigmoid(x) = 1 / (1 + exp(-x))
● tanh(x)
● ReLU(x) = max(0, x)
● LeakyReLU(x) = x (x > 0), ax (x ≤ 0)
○ a is a small fixed constant (e.g., a = 0.01)
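For reference, these example activations can be written directly in NumPy; the value of a is assumed to be a small positive constant such as 0.01.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):   # a: small fixed positive constant
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), np.tanh(x), relu(x), leaky_relu(x))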
11. Convolutional Neural Network (CNN)[LeCun+98]
• A neural network consisting of convolutional layers and pooling layers
• Many variants: AlexNet, VGG, Inception (GoogLeNet), ResNet etc.
• Widely used in image recognition and recently applied to biology and chemistry
LeNet-5 [LeCun+98]
LeCun, Yann, et al. "Gradient-based learning applied to
document recognition." Proceedings of the IEEE 86.11
(1998): 2278-2324.
18. How can we generalize convolution operations to arbitrary graphs?
Images: grid graphs. Molecules: arbitrary graphs.
19. Table of contents
1. What is machine learning?
a. Data driven approach
b. Primer of deep learning (MLP / CNN)
2. Prediction of chemical characteristics
a. Rule-based approach vs. Learning-based approach
b. Neural Message passing (NFP / GGNN etc.)
3. Chainer Chemistry
a. Primer of Chainer
b. Coding examples
4. Other topics
a. Generation of chemical compounds
b. Automatic chemical synthesis
20. Chemical prediction: two approaches
Quantum simulation: a theory-based approach, e.g. DFT (Density Functional Theory).
→ Pros: precision is guaranteed. Cons: high computational cost.
Machine learning: a data-based approach; learn from known compounds' properties and predict new compounds' properties.
→ Pros: low cost, fast computation. Cons: precision is not guaranteed.
"Neural message passing for quantum chemistry", Gilmer et al.
21. Extended Connectivity Fingerprint (ECFP)
ECFP converts a molecule into a fixed-length bit representation.
Pros
- Fast to calculate
- Shows the presence of particular substructures
Cons
- Bit collision: two (or more) different substructural features can be represented by the same bit position
https://chembioinfo.com/2011/10/30/revisiting-molecular-hashed-fingerprints/
https://docs.chemaxon.com/display/docs/Extended+Connectivity+Fingerprint+ECFP
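For example, RDKit's Morgan fingerprint (an ECFP-like implementation) computes such a bit vector; the molecule and parameters below are arbitrary choices for illustration.

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')   # aspirin
# radius 2 corresponds to ECFP4; nBits is the fixed bit-vector length
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits(), 'bits set out of', fp.GetNumBits())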
22. Problems of conventional methods
1. The input representation is not unique, so the result depends on how the input is represented.
E.g., in the SMILES representation, CC#C and C#CC denote the same molecule (see the example below).
2. Order invariance is not guaranteed: the representation is not guaranteed to be invariant to relabeling (i.e., permutation of atom indexes) of molecules.
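The first problem is easy to see with RDKit: canonicalization maps both strings to one representation, while a method fed raw strings would see two different inputs.

from rdkit import Chem

# Two SMILES strings for the same molecule (propyne)
for smi in ('CC#C', 'C#CC'):
    print(smi, '->', Chem.MolToSmiles(Chem.MolFromSmiles(smi)))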
23. How graph convolution works
CNN on images: image → class label
Graph convolution: molecule → chemical property
24. Atom feature embedding: 1 Man-made features
Each atom is mapped to a hand-crafted feature vector (atom-type one-hot, charge, chirality, etc.):
C → 1.0 0.0 0.0 | 6.0 | 1.0
N → 0.0 1.0 0.0 | 7.0 | 1.0
O → 0.0 0.0 1.0 | 8.0 | 1.0
(atom type | charge | chirality)
Molecular Graph Convolutions: Moving Beyond Fingerprints
Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, Patrick Riley arXiv:1603.00856
25. Atom feature embedding: 2 Embed in vector space
Each atom is assigned (initially at random) a position in a vector space through a learnable parameter matrix W:
C → 0.5 1.2 1.0 1.0 1.8
N → 0.8 1.0 1.3 0.1 1.5
O → 0.5 1.0 0.5 2.0 0.0
A Chainer sketch of this embedding follows below.
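A minimal sketch of such an embedding in Chainer, using L.EmbedID as the learnable matrix W; the sizes here are arbitrary.

import numpy as np
import chainer.links as L

embed = L.EmbedID(in_size=100, out_size=5)        # learnable embedding matrix W
atom_ids = np.array([6, 7, 8], dtype=np.int32)    # e.g. atomic numbers of C, N, O
features = embed(atom_ids)                        # one 5-dim vector per atom
print(features.shape)                             # (3, 5)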
26. Graph Convolution: update each node's (atom's) feature
The feature of each node is updated (several times) by the graph convolution operation.
Han Altae-Tran, Bharath Ramsundar, Aneesh S. Pappu, & Vijay Pande (2017). Low Data Drug
Discovery with One-Shot Learning. ACS Cent. Sci., 3 (4)
27. Graph Gather: extract the whole-graph (molecule) feature
The updated features of all nodes are finally combined into the graph's (molecule's) feature by the graph gather operation.
Han Altae-Tran, Bharath Ramsundar, Aneesh S. Pappu, & Vijay Pande (2017). Low Data Drug
Discovery with One-Shot Learning. ACS Cent. Sci., 3 (4)
28. Unified view of graph convolution
Many message-passing algorithms (NFP, GGNN, Weave etc.) are formulated as the
iterative application of Update and Readout functions [Gilmer et al. 17].
Update: aggregates neighborhood information and updates the node representations.
Readout: aggregates all node representations and computes the final output.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message
passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
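Schematically, this unified view can be sketched as the loop below; the update and readout stand-ins are hypothetical placeholders, not any particular published model.

import numpy as np

def message_passing(h, adj, update, readout, n_steps=3):
    # h: (n_atoms, dim) node features; adj: (n_atoms, n_atoms) adjacency matrix
    for _ in range(n_steps):
        messages = adj @ h        # aggregate neighborhood information
        h = update(h, messages)   # Update: new node representations
    return readout(h)             # Readout: aggregate nodes into a graph feature

update = lambda h, m: np.tanh(h + m)     # placeholder Update function
readout = lambda h: h.sum(axis=0)        # placeholder Readout function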
29. Graph convolution neural network variants
- NFP: Neural Fingerprint
- GGNN: Gated-Graph Neural Network
- WeaveNet: Molecular Graph Convolutions
- SchNet: A continuous-filter convolutional NN
“Convolutional Networks on Graphs for
Learning Molecular Fingerprints”
https://arxiv.org/abs/1509.09292
30. NFP: Neural Fingerprint
Message passing (convolution): update the atom features r
Readout: extract the output f from r
David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik,
and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints.
31. NFP: Neural Fingerprint
[Figure: a molecule whose atoms carry features h1, …, h10; the updates of atoms 3 and 7 are highlighted]
Update: sum an atom's feature with those of its neighbors, then apply the weight matrix Wd assigned to its degree d:
h'7 = σ( W3 (h7 + h6 + h8 + h9) )   (atom 7 has 3 neighbors)
h'3 = σ( W2 (h3 + h2 + h4) )        (atom 3 has 2 neighbors)
The graph convolution operation depends only on the degree of each atom
→ bond type information is not utilized.
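A minimal NumPy sketch of this update rule, assuming W_deg maps each degree d to a weight matrix; the variable names are ours, not the paper's.

import numpy as np

def nfp_update(h, adj, W_deg, sigma=np.tanh):
    # h: (n_atoms, dim); adj: (n_atoms, n_atoms); W_deg[d]: matrix for degree d
    agg = h + adj @ h                        # own feature plus neighbor sum
    deg = adj.sum(axis=1).astype(int)        # degree of each atom
    return np.stack([sigma(W_deg[deg[i]] @ agg[i]) for i in range(len(h))])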
32. NFP: Neural Fingerprint
[Figure: a molecule whose atoms carry features h1, …, h10]
Readout: R = Σi softmax( W hi )
The readout operation is basically a simple sum over the atoms
→ no selective operation / attention mechanism is adopted.
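The readout is correspondingly simple; a sketch with our own variable names:

import numpy as np

def nfp_readout(h, W):
    # R = sum_i softmax(W h_i): a plain sum with no gating or attention
    z = h @ W.T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).sum(axis=0)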
33. GGNN: Gated Graph Neural Network
[Figure: a molecule whose atoms carry features h1, …, h10; single bonds use W1, double bonds use W2]
Update: combine neighbor features through bond-type-specific weight matrices and feed the result into a GRU (Gated Recurrent Unit):
h'7 = GRU( h7, W1 h6 + W2 h8 + W1 h9 )
h'3 = GRU( h3, W1 h2 + W2 h4 )
The graph convolution operation depends on the bond type of each atom pair.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
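A sketch of this update with an explicitly written GRU cell; we assume one adjacency matrix and one weight matrix per bond type, and the gates follow the standard GRU equations.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_update(h, adjs, W_bond, gru_params):
    # adjs[b]: adjacency matrix for bond type b; W_bond[b]: its weight matrix
    m = sum(adjs[b] @ h @ W_bond[b].T for b in range(len(adjs)))
    Wz, Uz, Wr, Ur, W, U = gru_params
    z = sigmoid(m @ Wz.T + h @ Uz.T)           # update gate
    r = sigmoid(m @ Wr.T + h @ Ur.T)           # reset gate
    h_tilde = np.tanh(m @ W.T + (r * h) @ U.T)
    return (1 - z) * h + z * h_tilde           # h' = GRU(h, m)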
34. GGNN: Gated Graph Neural Network
[Figure: a molecule whose atoms carry features h1, …, h10]
Readout: R = Σv σ( i(hv, hv0) ) ⊙ j(hv)
Simplified version: R = Σv σ( Wi hv ) ⊙ Wj hv
Here i and j are functions (neural networks), σ is the sigmoid non-linearity, and ⊙ denotes element-wise multiplication.
The readout operation contains a selective operation (gating).
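The simplified gated readout in NumPy; the names are ours.

import numpy as np

def ggnn_readout(h, Wi, Wj):
    # R = sum_v sigmoid(Wi h_v) ⊙ (Wj h_v): the gate softly selects nodes
    gate = 1.0 / (1.0 + np.exp(-(h @ Wi.T)))
    return (gate * (h @ Wj.T)).sum(axis=0)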
35. Weave: Molecular Graph Convolutions
● The Weave module updates each atom's feature using the features of all atom pairs that involve it.
A: atom feature, P: feature of an atom pair
● P → A operation:
g() is a function chosen for order invariance; sum() is used in the paper.
Molecular Graph Convolutions: Moving Beyond Fingerprints
Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, Patrick Riley arXiv:1603.00856
36. SchNet: A continuous-filter convolutional neural network
Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions.
1. All pairwise atomic distances ||ri − rj|| are used as input.
2. An energy-conservation condition can additionally be used to constrain the model for the energy prediction task.
37. Comparison between graph convolution networks
|                            | NFP                 | GGNN                | Weave                                             | SchNet              |
| Atom feature extraction    | Man-made or embed   | Man-made or embed   | Man-made or embed                                 | Man-made or embed   |
| Graph convolution strategy | Adjacent atoms only | Adjacent atoms only | All atom-atom pairs                               | All atom-atom pairs |
| Connection information     | Degree              | Bond type           | Man-made pair features (bond type, distance etc.) | Distance            |
38. Example: IT Drug Discovery Contest
Task
• Find new seed compounds for a target protein (Sirtuin 1) among 2.5 million compounds using IT technologies.
Rules
• Each team prepares its own data, such as training datasets.
• Each team can submit up to 400 candidate compounds.
• The judges check all submitted compounds with a 2-stage biological experiment:
– Thermal Shift Assay (TSA)
– Inhibitory assay → IC50 measurement
Contest website (Japanese)
http://www.ipab.org/eventschedule/contest/contest4
39. Our result
|                      | Ours             | Average (18 teams in total) |
| 1st screening (TSA)  | 23 / 200 (11.5%) | 69 / 3559 (1.9%)            |
| 2nd screening (IC50) | 1                | 5                           |
We found one hit compound and won one of the grand prizes (the IPAB prize).
40. Extension to semi-supervised learning
Compute representations of subgraphs inductively with neural message passing (→).
Optimize the representations in an unsupervised manner, in the same way as Paragraph Vector (↓).
Nguyen, H., Maeda, S. I., & Oono, K.
(2017). Semi-supervised learning of
hierarchical representations of molecules
using neural message passing. arXiv
preprint arXiv:1711.10168.
41. Table of contents
1. What is machine learning?
a. Data driven approach
b. Primer of deep learning (MLP / CNN / Graph convolution network)
2. Prediction of chemical characteristics
a. Rule-based approach vs. Learning-based approach
b. Neural Message passing (NFP / GGNN etc.)
3. Chainer Chemistry
a. Primer of Chainer
b. Coding examples
4. Other topics
a. Generation of chemical compounds
b. Automatic chemical synthesis
42. How can we incorporate ML into chemistry and biology?
Problems
• Optimized graph convolution algorithms are hard to implement from scratch.
• ML and chemistry/biology researchers sometimes speak different “languages”.
Solution: create tools so that …
• chemistry/biology researchers need not bother with the details of DL algorithms and can concentrate on their research;
• ML and chemistry researchers can work in collaboration.
→ We are developing Chainer Chemistry
Picture: irasutoya (https://www.irasutoya.com)
43. A Python framework that lets researchers quickly implement, train, and evaluate deep learning models.
[Figure: dataset → designing a network → training and evaluation]
44. Speed up research and development of deep learning and its applications.
(https://chainer.org)
Features
• Build DL models as a Python program
→ Can write complex networks (loops, branches etc.) easily
• Define-by-Run: dynamic model construction
→ Can make full use of Python stack traces in debugging
→ Can support data-dependent neural networks natively
• CuPy: NumPy-like GPU array library
→ Can write CPU/GPU-agnostic code (see the example below)
Basic information
• First release: June 2015
• Version
– v3.3.0 (stable)
– v4.0.0b3 (develop)
• License: MIT
• Language: Python
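For example, CPU/GPU-agnostic code can be written by asking Chainer which array module (NumPy or CuPy) a given array belongs to; a minimal sketch:

import chainer
import numpy as np

def vector_norm(x):
    # get_array_module returns numpy for CPU arrays and cupy for GPU arrays,
    # so the same function runs unchanged on both devices.
    xp = chainer.cuda.get_array_module(x)
    return xp.sqrt((x * x).sum())

print(vector_norm(np.arange(5, dtype=np.float32)))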
45. Example: Build and train convolutional Network
import chainer
import chainer.links as L
import chainer.functions as F

class LeNet5(chainer.Chain):
    def __init__(self):
        super(LeNet5, self).__init__()
        with self.init_scope():
            # Convolution2D(in_channels, out_channels, ksize, stride)
            self.conv1 = L.Convolution2D(1, 6, 5, 1)
            self.conv2 = L.Convolution2D(6, 16, 5, 1)
            self.conv3 = L.Convolution2D(16, 120, 4, 1)
            # Linear(None, ...) infers the input size at the first call
            self.fc4 = L.Linear(None, 84)
            self.fc5 = L.Linear(84, 10)

    def __call__(self, x):
        h = F.sigmoid(self.conv1(x))
        h = F.max_pooling_2d(h, 2, 2)   # 2x2 max pooling with stride 2
        h = F.sigmoid(self.conv2(h))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv3(h))
        h = F.sigmoid(self.fc4(h))
        return self.fc5(h)              # scores for the 10 classes
46. Example: Build and train convolutional Network
from chainer import iterators, optimizers, training

model = LeNet5()
model = L.Classifier(model)  # adds a softmax cross-entropy loss on top

# Dataset is a list! ([] to access, having __len__)
dataset = [(x1, t1), (x2, t2), ...]

# Iterator that returns mini-batches retrieved from the dataset
it = iterators.SerialIterator(dataset, batch_size=32)

# Optimization method (you can easily try various methods by changing SGD to
# MomentumSGD, Adam, RMSprop, AdaGrad, etc.)
opt = optimizers.SGD(lr=0.01)
opt.setup(model)

updater = training.StandardUpdater(it, opt, device=0)  # device=-1 if you use CPU
trainer = training.Trainer(updater, stop_trigger=(100, 'epoch'))
trainer.run()
50. Technological Stack
[Diagram: the Chainer Chemistry stack, from low level to high level]
• File Parser: SDF files, CSV files
• Preprocessor (Feature Extractor): tailored to each model (NFP, GGNN, SchNet)
• Dataset: QM9, Tox21 datasets
• Layer/Function: GraphLinear etc.
• Model: graph convolution NNs
• Pretrained Model (TBD)
• Example: training and prediction with the QM9/Tox21 datasets
51. Chainer Chemistry
Chainer extension library for Biology and Chemistry
Basic information
First release: 2017-12-14; version: v0.1.0; license: MIT; language: Python
Features
• State-of-the-art deep learning neural network models (especially graph
convolutions) for chemical molecules (NFP, GGNN, Weave, SchNet etc.)
• Preprocessors of molecules tailored for these models
• Parsers for several standard file formats (CSV, SDF etc.)
• Loaders for several well-known datasets (QM9, Tox21 etc.)
(http://chainer-chemistry.readthedocs.io/)
54. Example: HOMO Prediction by NFP with QM9 dataset
Dataset preprocessing (for the NFP network)
# Imports as used in the chainer_chemistry examples
from chainer.datasets import split_dataset_random
from chainer_chemistry import datasets as D
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict
from chainer_chemistry.datasets import NumpyTupleDataset

preprocessor = preprocess_method_dict['nfp']()
dataset = D.get_qm9(preprocessor, labels='homo')

# Cache the preprocessed dataset for later runs
NumpyTupleDataset.save('input/nfp_homo/data.npz', dataset)

train_data_ratio = 0.7  # e.g. 70% train / 30% validation
train_data_size = int(len(dataset) * train_data_ratio)
train, val = split_dataset_random(dataset, train_data_size)
55. Example: HOMO Prediction by NFP with QM9 dataset
Model definition
from chainer_chemistry.models import MLP, NFP

class GraphConvPredictor(chainer.Chain):
    def __init__(self, graph_conv, mlp):
        super(GraphConvPredictor, self).__init__()
        with self.init_scope():
            self.graph_conv = graph_conv  # graph convolution feature extractor
            self.mlp = mlp                # regression head

    def __call__(self, atoms, adjs):
        x = self.graph_conv(atoms, adjs)  # molecule-level feature vector
        x = self.mlp(x)
        return x

model = GraphConvPredictor(NFP(16, 16, 4), MLP(16, 1))

Once a graph neural network is built, training is the same as for ordinary Chainer models.
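For instance, a hedged sketch of such a training loop, reusing the pattern from the LeNet example above; the loss choice and hyperparameters here are illustrative, not the official example's.

from chainer import iterators, optimizers, training
import chainer.functions as F
import chainer.links as L

regressor = L.Classifier(model, lossfun=F.mean_squared_error)
regressor.compute_accuracy = False   # accuracy is meaningless for regression

it = iterators.SerialIterator(train, batch_size=32)
opt = optimizers.Adam()
opt.setup(regressor)
updater = training.StandardUpdater(it, opt, device=-1)  # device=0 for GPU
trainer = training.Trainer(updater, stop_trigger=(20, 'epoch'))
trainer.run()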
56. Future work
• Primitive operations
– GraphConv, GraphPool, GraphGather
• Graph convolution models
– Follow state-of-the-art graph convolutional neural networks
• Pretrained models
– We do not plan to guarantee exact reproduction of the papers' results, though.
• Off-the-shelf models
– Neural message passing, 3D convolution, generative models etc.
• Datasets
– MUTAG, MoleculeNet etc.
57. Table of contents
1. What is machine learning?
a. Data driven approach
b. Primer of deep learning (MLP / CNN / Graph convolution network)
2. Prediction of chemical characteristics
a. Rule-based approach vs. Learning-based approach
b. Neural Message passing (NFP / GGNN etc.)
3. Chainer Chemistry
a. Primer of Chainer
b. Coding examples
4. Other topics (5 min.)
a. Generation of chemical compounds
b. Automatic chemical synthesis
58. From prediction to generation of molecules
Prediction: find molecules with desired properties from given compound libraries.
Generation: produce molecules with desired properties that are not in the libraries.
59. Molecule generation with VAE [Gómez-Bombarelli+16]
● Encode and decode molecules represented as SMILES with a VAE, in a seq2seq manner.
● The latent representation can be used for semi-supervised learning.
● The learned model can be used to find molecules with a desired property by optimizing a representation in the latent space and decoding it.
Generated molecules are not guaranteed to be syntactically valid :(
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., ... & Aspuru-Guzik, A. (2016). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science.
60. Grammar VAE [Kusner+17]
Encode: convert a molecule to a parse tree to obtain a sequence of production rules, and feed the sequence to an RNN-VAE.
Decode: generate a sequence of production rules of the SMILES grammar, represented as a CFG.
Generated molecules are guaranteed to be syntactically valid!
Kusner, M. J., Paige, B., & Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. arXiv preprint arXiv:1703.01925.
61. Conclusion
• Data-based approaches to chemical property prediction are attracting growing attention.
• New material/drug discovery research may be accelerated by deep learning technology.