Abstract

Motivation

Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance cheminformatics, drug discovery, biotechnology and material science. In silico molecular design remains challenging, primarily due to the complexity of the chemical space and the non-trivial relationship between chemical structures and biological properties. Deep generative models that learn directly from data are intriguing, but they have yet to demonstrate interpretability in the learned representation, which would allow us to learn more about the relationship between the chemical and the biological space. In this article, we advance research on disentangled representation learning for small molecule generation. We build on recent work by us and others on deep graph generative frameworks, which capture atomic interactions via a graph-based representation of a small molecule. The methodological novelty is how we leverage the concept of disentanglement in the graph variational autoencoder framework both to generate biologically relevant small molecules and to enhance model interpretability.

Results

Extensive qualitative and quantitative experimental evaluation in comparison with state-of-the-art models demonstrates the superiority of our disentanglement framework. We believe this work is an important step toward addressing key challenges in small molecule generation with deep generative frameworks.

Availability and implementation

Training and generated data are made available at https://ieee-dataport.org/documents/dataset-disentangled-representation-learning-interpretable-molecule-generation. All code is made available at https://anonymous.4open.science/r/D-MolVAE-2799/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance drug discovery, biotechnology and material science (Whitesides, 2015). In-silico molecule design is central to cheminformatics research but remains challenging (Schneider and Schneider, 2016). Studies estimate that 10^60 drug-like molecules are synthetically accessible (Reymond et al., 2012). A chemical space of this size is beyond the scope of even high-throughput wet-laboratory technologies.

A multi-decade journey in cheminformatics research informs us of several challenges for small molecule generation. The first concerns the poorly understood and complex relationship between chemical and biological space. Not all molecules in the vast chemical space meet desired biological/functional properties of interest, such as water solubility, drug-likeness and more (Ramakrishnan et al., 2014). Moreover, changes to the chemical structure made to optimize along one biological criterion may worsen other criteria; the search space that links chemical and biological space may be rich in barriers separating neighboring local optima.

Until a decade ago, molecule generation, widely referred to as computational screening, was dominated by similarity search methods (Stumpfe and Bajorath, 2011). While conceptually straightforward, these methods were limited in their ability to generate novel small molecules. Advances in machine learning expedited progress. Shallow models were not very effective (Ellman, 1996; Renz et al., 2020; Xue et al., 2019; Yoshikawa et al., 2018), as they relied heavily on domain insight to formulate and construct meaningful representations of small molecules. Due to their inherent ability to learn directly from data, deep generative models then made their debut. Initial efforts utilized a linear representation of molecules, known as SMILES (Weininger, 1988), which stands for ‘simplified molecular-input line-entry system’. SMILES is a formal grammar that describes molecules with an alphabet of characters; aromatic and aliphatic carbon atoms are denoted by ‘c’ and ‘C’, respectively, oxygen atoms by ‘O’, single bonds by ‘–’, double bonds by ‘=’, and so on. The SMILES representation allows addressing molecule generation as a string generation problem. Deep learning methods based on the recurrent neural network (RNN) framework thus became readily applicable (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Segler et al., 2018). However, SMILES-based deep models could generate only few valid molecules. In response, later works (Dai et al., 2018; Kusner et al., 2017) added syntactic and semantic constraints. In other works, models were guided to generate valid SMILES through active learning, reinforcement learning and additional training signals (Guimaraes et al., 2017; Janz et al., 2017). While some improvements were observed, generating valid molecules remained challenging.

Graph-generative deep models leverage a more expressive representation of a molecule via the concept of a molecular graph. The atoms are represented as vertices and the bonds as edges connecting the vertices. In the deep learning literature, graph-generative models are based on the variational autoencoder (VAE) (Blaschke et al., 2018; Dai et al., 2018; De Samanta et al., 2018; Jin et al., 2018; Simonovsky and Komodakis, 2018) or on generative adversarial networks (Bojchevski et al., 2018; Guo et al., 2018). For instance, GraphRNN (You et al., 2018) builds an autoregressive generative model based on a generative RNN that generates the graph one vertex at a time. In contrast, GraphVAE (Simonovsky and Komodakis, 2018) represents each graph in terms of its adjacency matrix and the feature vectors of its vertices. A VAE model is then utilized to learn the distribution of the graphs conditioned on a latent representation at the graph level. Other works (Grover et al., 2019; Kipf and Welling, 2016) encode the vertices into vertex-level embeddings and predict the edges between each pair of vertices to generate a graph.

The adoption of graph-generative models for small molecule generation has been rapid. Current graph-generative models for molecule generation leverage the VAE framework to address two subtasks: (i) encoding: learning a low-dimensional, latent code/representation of a molecular graph; (ii) decoding: learning to map the latent representation back into a (reconstructed) molecular graph. For instance, work in Simonovsky and Komodakis (2018) generates molecular graphs by predicting their adjacency matrices. Work in Liu et al. (2018) generates molecules through a constrained graph-generative model that enforces validity by generating a molecule one atom at a time. These works generate more valid molecules than SMILES-based models and additionally subject generated molecules to the sanitization checks in RDKit (RDKit: Open-source cheminformatics; http://www.rdkit.org).
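As a concrete illustration of such a sanitization check (a minimal sketch of our own, not code taken from any of the cited models), RDKit can be used to test whether a generated SMILES string corresponds to a chemically valid molecule:

```python
# A minimal sketch (not code from the cited models): check whether a SMILES
# string corresponds to a molecule that passes RDKit's sanitization.
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return False
    try:
        Chem.SanitizeMol(mol)  # valence, aromaticity and kekulization checks
        return True
    except Exception:
        return False

print(is_valid_smiles("c1ccccc1O"))   # phenol -> True
print(is_valid_smiles("C1CC"))        # unclosed ring -> False
```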

Graph-generative VAEs represent a promising platform that we leverage in this article, but current graph-generative VAEs for small molecule generation fall short. The learned latent representation has all the latent factors entangled, which limits model transparency and interpretability. Specifically, these models do not facilitate linking the chemical space to the biological space and so do not advance our understanding of the complex relationship between chemical and biological space for small molecules. Facilitating this linking is central not only for molecule generation but also for molecule optimization (Alemi et al., 2017), an important and related task that is beyond the scope of this article.

In this article, we advance research on small molecule representation learning for molecule generation via disentanglement enhancement. Disentangled representation learning is an active research area, particularly in image representation learning (Alemi et al., 2017; Chen et al., 2018; Guo et al., 2021; Higgins et al., 2017; Kim and Mnih, 2018), and has been shown to be key to improving model generalizability and robustness against adversarial attacks, and even to facilitate debugging and auditing (Alemi et al., 2017; Doshi-Velez and Kim, 2017). While a comprehensive review is beyond the scope of this article, we point to recent approaches that modify the VAE objective by adding, removing or altering the weight of individual terms in the loss function to improve disentanglement (Alemi et al., 2017; Chen et al., 2018; Du et al., 2021a; Esmaeili et al., 2019; Guo et al., 2020; Kim and Mnih, 2018; Kumar et al., 2018; Lopez et al., 2018; Zhao et al., 2019). Currently, however, we do not know the best approach to learn disentangled representations of graph data. This includes the small molecule generation domain. In a recent workshop paper (Du et al., 2020), we demonstrated that learning disentangled representations results in better molecule generation than methods that do not leverage disentanglement. However, as our goal was a proof-of-concept demonstration that VAEs for disentangled representation learning perform well for small molecule generation, the study was limited to classic disentanglement and focused on a few datasets of known small molecules of the same size.

Here we propose a graph-generative VAE framework that learns a disentangled code/representation, so that we may additionally elucidate how the factors that encode chemical structure control biological properties. Specifically, we design and evaluate the D-MolVAE framework, which stands for Disentangled Molecule VAE. The framework permits various mechanisms for disentanglement, resulting in several novel deep graph-generative models, which we compare to one another and many other state-of-the-art methods on benchmark datasets across several metrics.

Our experiments show that the D-MolVAE framework is effective and superior to other methods at generating valid, novel and unique small molecules. The framework also accommodates variable-size molecules, which improves its scope and applicability. Our experiments additionally show that disentangled representation learning is valuable for better interpretation and understanding of the relationship between the chemical space and the biological space; the proposed D-MolVAE models are better able to capture the underlying graph statistics and distributions of various biological properties.

The D-MolVAE models effectively implement a trade-off between the disentanglement enhancement and the reconstruction. Our experiments show that explicit disentanglement enforcement does not hurt performance. In fact, the models are superior over many methods. Taken altogether, our findings suggest that the disentangled factors provide an advantage with respect to the quality of generated molecules, as well as the linking of the chemical and biological space. Our experiments suggest several models as promising platforms for further exploring disentangled representations for improving small molecule generation.

2 Materials and methods

We first define and formalize the problem. Then we describe the graph-generative models based on the VAE framework, namely D-MolVAE, focusing the description on the variants of disentanglement terms proposed to obtain different disentangled graph-generative VAE models.

2.1 Problem formulation

Let us represent a molecule as a graph G = (V, E, E, F). The N atoms of the molecule constitute the N vertices V of graph G. The M bonds connecting pairs of atoms in the molecule constitute the edges E ⊆ V × V, where e_{i,j} ∈ E is an edge connecting vertices v_i ∈ V and v_j ∈ V. G = (V, E, E, F) also contains E and F. E ∈ R^{N×N×K} is the edge-type tensor that records the K bond types. Specifically, E_{i,j} ∈ R^{1×K} is a one-hot vector encoding the type of edge e_{i,j}. F ∈ R^{N×K} is the vertex-type feature matrix that records the K atom types. Specifically, F_i ∈ R^{1×K} is the one-hot vector denoting the type of atom v_i.

The objective in graph-generative disentangled representation learning is to learn the joint distribution of G and a set of generative, disentangled latent factors/variables Z ∈ R^{N×L}, such that the observed graph G can be generated as p(G|Z). Note that L is the dimensionality of the latent factors. Disentanglement denotes the additional constraint that the individual variables in Z be independent of one another.
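For illustration, the following minimal sketch (our own; the atom and bond vocabularies are assumptions in the spirit of QM9, not the exact ones used by D-MolVAE) shows how the one-hot tensors E and F of this formulation can be built from an RDKit molecule:

```python
# A minimal sketch (assumed QM9-style vocabularies, not the authors' code) that
# builds the one-hot edge-type tensor E and vertex-type matrix F of Section 2.1.
import numpy as np
from rdkit import Chem

ATOM_TYPES = ["C", "N", "O", "F"]                                   # assumption
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE, Chem.BondType.TRIPLE]

def mol_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    F = np.zeros((n, len(ATOM_TYPES)))        # vertex-type feature matrix
    E = np.zeros((n, n, len(BOND_TYPES)))     # edge-type tensor
    for atom in mol.GetAtoms():
        F[atom.GetIdx(), ATOM_TYPES.index(atom.GetSymbol())] = 1.0
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        t = BOND_TYPES.index(bond.GetBondType())
        E[i, j, t] = E[j, i, t] = 1.0         # molecular graphs are undirected
    return E, F

E, F = mol_to_graph("N#CC=O")                 # a small QM9-like molecule
print(E.shape, F.shape)                       # (4, 4, 3) (4, 4)
```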

2.2 D-MolVAE framework

Two challenges present themselves with the above formulation: (i) how to integrate the disentanglement constraint and the reconstruction quality constraint in the loss function that guides learning; (ii) how to efficiently encode and decode molecules/graphs of different sizes. We first show how the first challenge is addressed in the D-MolVAE framework via a generative objective function. We show in this context that different approaches here can result in different models. Then we show how the second challenge is addressed via variable-size edge-to-edge and edge-to-vertex convolution operators in D-MolVAE.

We are inspired by disentangled representation learning in the image domain (Higgins et al., 2017), where a suitable objective in learning p(G|Z) is to maximize the marginal (log-)likelihood of the observed graph G in expectation over the whole distribution of the latent variable set Z ∈ R^{N×L}, as max_θ E_{pθ(Z)}[pθ(G|Z)], where θ explicitly denotes the parameters characterizing this distribution.

Learning pθ(G|Z) requires the inference of its posterior pθ(Z|G), which is intractable. So, one defines instead an approximated posterior qϕ(Z|G) that is computationally tractable. In disentangled representation learning, one needs to additionally ensure that the inferred latent variables Z from qϕ(Z|G) capture all the generative factors in a disentangled manner. This is achieved by introducing a constraint to match qϕ(Z|G) to a well-disentangled prior p(Z) that controls the capacity of the latent information bottleneck and embodies the statistical independence mentioned above. An isotropic unit Gaussian suffices; that is, p(Z) = N(0, I), where I is the identity matrix. This leads to the following constrained optimization problem:

max_{θ,ϕ} E_{G∼D}[E_{qϕ(Z|G)}[log pθ(G|Z)]]   subject to   DKL(qϕ(Z|G) ∥ p(Z)) < ϵ   (1)

In the above equation, D refers to the observed set of graphs (corresponding to molecules in the training dataset), DKL(·) is the Kullback–Leibler divergence (KLD) that allows comparing two probability distributions and ϵ is a parameter that specifies the strength of the applied constraint; that is, ϵ allows weighting how much we want the disentanglement constraint to be enforced.

Unfortunately, the above constraint formulated to achieve disentanglement is intractable. So, an aggregate objective (loss) function is formulated instead, where the above constraint and the reconstruction error in a VAE are combined together as in:

L(θ, ϕ; G, Z, β) = E_{qϕ(Z|G)}[log pθ(G|Z)] − β DKL(qϕ(Z|G) ∥ p(Z))   (2)

This aggregation is similar to the beta-VAE (Higgins et al., 2017) that first introduced the notion of disentanglement (though not for graph data). Note that β weighs how important it is to enforce the disentanglement constraint. Specifically, when β = 1, one obtains a vanilla VAE (Kingma and Welling, 2013). We direct the interested reader to work in Higgins et al. (2017) to understand the effects of β.
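To make Equation (2) concrete, the following is a minimal PyTorch-style sketch (ours, not the released D-MolVAE implementation) of the β-weighted objective for a diagonal-Gaussian approximate posterior:

```python
# A minimal PyTorch sketch (ours, not the released D-MolVAE code) of the
# beta-weighted objective in Equation (2) for a diagonal-Gaussian posterior.
import torch

def beta_vae_loss(recon_log_likelihood, mu, logvar, beta=1.0):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    # Negative ELBO: maximize the reconstruction likelihood, weight the KL by beta.
    return (-recon_log_likelihood + beta * kld).mean()

# Toy usage with random tensors standing in for encoder/decoder outputs.
mu, logvar = torch.randn(16, 100), torch.randn(16, 100)
print(beta_vae_loss(torch.randn(16), mu, logvar, beta=4.0).item())
```

Setting beta=1.0 recovers the vanilla VAE objective, as noted above.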

2.3 Disentanglement-enhanced models

By considering different approaches to enforce disentanglement, we obtain different instantiations of our D-MolVAE framework, namely, D-MolVAE-V, D-MolVAE-β, D-MolVAE-DIP-I, D-MolVAE-DIP-II and D-MolVAE-VIB.

D-MolVAE-V: We extend the previous work on disentangled variational autoencoders (Esmaeili et al., 2019; Kingma and Welling, 2013) to graph-structured data, as follows:
(3)

In the above, terms ③ and ④ enforce consistency between the marginal distributions over G and Z. Specifically, minimizing the KLD in term ③ maximizes the marginal likelihood Eq(G)[log pθ(G)]; maximizing the disentangled inferred-priors term ④ penalizes the distance between qϕ(Z) and p(Z). Terms ① and ② enforce consistency between the conditional distributions. Specifically, term ① maximizes the correlation between each Z and the graph Gn it generates; when Z∼qϕ(Z|Gn) is sampled, the likelihood pθ(Gn|Z) should be higher than the marginal likelihood pθ(Gn). Meanwhile, term ② regularizes term ① by minimizing the mutual information I(Z, G) in the inference model.

The D-MolVAE-V objective is defined as:
(4)
D-MolVAE-β: The penalty coefficient β>1 has proven useful to enforce the disentanglement of the latent variables without worsening reconstruction performance (Higgins et al., 2017). We emphasize that β allows balancing between the reconstruction loss and the KLD loss. So, our first model that introduces disentanglement for graph-based representation learning for small molecule generation is D-MolVAE-β. Its objective is similar to that of D-MolVAE-V. The only difference is that the KLD terms are weighted by β, i.e. ① + ② + β(③ + ④), as follows:
(5)
D-MolVAE-DIP-I: It is important to note that the β-weighted KLD term may lead to poor reconstruction when the disentanglement is heavily enforced by setting high values for the β parameter. To address this, the ‘Disentangled Inferred Prior Variational Autoencoder’ (DIPVAE) model introduces a hyperparameter λ in term ④ (Kumar et al., 2018). This term is also referred to as the ‘inferred priors’ term and penalizes the distance between qϕ(Z) and p(Z). The hyperparameter allows controlling the trade-off between the reconstruction loss and the KLD term. We incorporate this idea to obtain D-MolVAE-DIP-I, whose objective function is now ① + ② + ③ + λ④, as in:
(6)
D-MolVAE-DIP-II: Note that term ② corresponds to the mutual information I(Z, G) between the latent representation Z and the molecule G; minimizing this mutual information may lead to poor reconstruction. An alternative approach to balance disentanglement and reconstruction is to discard term ②, thus obtaining D-MolVAE-DIP-II, whose objective function now is:
(7)
D-MolVAE-VIB: The Variational Information Bottleneck (VIB) approach interprets the capacity of the KLD as the information bottleneck of the network (Alemi et al., 2017). It proposes to add a controllable value C and a hyperparameter γ over the KLD term to control the information flowing through it. Later work demonstrates that by slowly increasing the value of C, the latent representation is able to gradually capture the semantic factors (Locatello et al., 2018). Inspired by these works, we obtain our final model D-MolVAE-VIB, whose objective function is:
(8)
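The following sketch illustrates one plausible reading of this capacity-controlled penalty (our own simplification in the spirit of Alemi et al. (2017); not necessarily the exact form of Equation (8)):

```python
# A sketch (our reading, not necessarily the exact Equation (8)) of a
# capacity-controlled KL penalty in the spirit of D-MolVAE-VIB: the KL term
# is pushed toward a target capacity C that is slowly increased during training.
import torch

def vib_style_loss(recon_log_likelihood, kld, gamma=10.0, capacity=0.0):
    # kld: KL(q_phi(Z|G) || p(Z)) per graph; capacity C is annealed upward.
    return (-recon_log_likelihood + gamma * (kld - capacity).abs()).mean()

def capacity_schedule(step, max_capacity=25.0, anneal_steps=10_000):
    # Linear annealing of C; slowly raising C lets the latent code gradually
    # absorb more information, as discussed above.
    return min(max_capacity, max_capacity * step / anneal_steps)
```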

2.4 Implementation details

The variants are summarized in terms of their objectives in Table 1. The encoder and decoder architecture are summarized in Table 2. Finally, the hyperparameters used for training are related in Table 3. The rows refer to the different benchmark datasets, which we describe in Section 3. We observe that increasing β leads to a better disentangled representation, as later shown in Table 7.

Table 1.

Summary of D-MolVAE variants in terms of their disentanglement objectives

Model | Objective
D-MolVAE-V | ① + ② + ③ + ④
D-MolVAE-β | ① + ② + β(③ + ④)
D-MolVAE-DIP-I | ① + ② + ③ + λ④
D-MolVAE-DIP-II | ① + ③ + ④
D-MolVAE-VIB | ① + ② + γ|③ + ④ − C|
Table 2.

Encoder and decoder architectures

Encoder | Decoder
Input: G = (V, E, E, F) | Input: z ∈ R^100
FC.100 ReLU | FC.100 ReLU
GGNN.100 ReLU | GGNN.100 ReLU
GGNN.100 ReLU | GGNN.100 ReLU
FC.100 | FC.bv (batch node size), FC.3 (edge)

Note: Each layer is expressed in the format <layer_type>.<num_channels> <activation_function>. FC refers to the fully connected layers.


Table 3.

Hyperparameters used for training

Dataset | Learning_rate | Batch_size | λ | Num_iteration
QM9 | 5e−4 | 64 | 1 | 10
ZINC | 5e−4 | 8 | 1 | 5
MOSES | 5e−4 | 4 | 1 | 5
CHEMBL | 5e−4 | 4 | 1 | 5

3 Results

3.1 Datasets and experimental setup

We employ four benchmark datasets: QM9, ZINC, MOSES and ChEMBL (Du et al., 2021b). QM9 (Ramakrishnan et al., 2014; Ruddigkeit et al., 2012) contains around 134k stable small organic molecules with up to nine heavy atoms [e.g. carbon (C), oxygen (O), nitrogen (N) and fluorine (F)]. ZINC (Irwin et al., 2012) contains approximately 250k drug-like chemical compounds with an average of 23 heavy atoms. The molecules in this dataset are more complex than those in QM9. MOSES (Polykovskiy et al., 2020) contains about 1.9M larger molecules with up to 30 heavy atoms. ChEMBL (Gaulton et al., 2017) contains about 1.8M manually curated bioactive molecules with drug-like properties. For QM9, we use the entire dataset, while for ZINC, MOSES and ChEMBL, which have larger molecules, we randomly sample 70k molecules from each dataset and split them 6:1 into training and validation sets. During testing, we generate 30k molecules for our experiments.
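For concreteness, the subsampling and split described above can be reproduced along the following lines (a sketch; the file name is a placeholder, not part of our released data):

```python
# A minimal sketch (the file name is a placeholder) of the 70k subsample and
# 6:1 train/validation split used for ZINC, MOSES and ChEMBL.
import random

with open("zinc_smiles.txt") as f:            # placeholder input file
    all_smiles = [line.strip() for line in f]

random.seed(0)
subset = random.sample(all_smiles, 70_000)    # 70k molecules
n_train = int(len(subset) * 6 / 7)            # 6:1 split
train, valid = subset[:n_train], subset[n_train:]
print(len(train), len(valid))                 # 60000 10000
```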

We utilize qualitative and quantitative experiments that evaluate the proposed D-MolVAE-V, D-MolVAE-β, D-MolVAE-DIP-I, D-MolVAE-DIP-II and D-MolVAE-VIB. The models are pitched against nine state-of-the-art deep generative models for molecule generation: ChemVAE (Gómez-Bombarelli et al., 2018), GrammarVAE (Kusner et al., 2017), GraphVAE (Simonovsky and Komodakis, 2018), GraphGMG (Li et al., 2018), SMILES-LSTM (Sundermeyer et al., 2012), GraphNVP (Madhawa et al., 2019), GRF (Honda et al., 2019), GraphAF (Shi et al., 2019) and CGVAE (Liu et al., 2018). In the interest of brevity, summaries of the main computational ingredients in each of these models are related in the Supplementary Material. All experiments are conducted on a 64-bit machine with a 6 core Intel CPU i9-9820X, 32 GB RAM and an NVIDIA GPU (GeForce RTX 2080ti, 1545 MHz, 11 GB GDDR6).

3.2 Evaluating the quality of generated molecules

Table 4 relates the comparative analysis. Each trained model is used to generate 30k molecules. For GraphGMG, we obtain 20k generated molecules from the GraphGMG authors. Results for ChemVAE, GrammarVAE, GraphVAE and SMILES-LSTM are obtained from Liu et al. (2018). The quality of the generated dataset is evaluated via the three common metrics of Novelty, Uniqueness and Validity. Novelty measures the fraction of generated molecules that are not in the training dataset. Uniqueness measures the ratio of the number of generated molecules after removing duplicates to the number before. Validity measures the fraction of generated molecules that are chemically valid.
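For reference, the three metrics can be computed along the following lines (a minimal sketch of our own, using RDKit canonical SMILES for duplicate detection; exact implementation details may differ, e.g. novelty is computed here over the unique set):

```python
# A minimal sketch (ours) of the three metrics: validity via RDKit parsing,
# uniqueness via canonical SMILES, and novelty against the training set.
from rdkit import Chem

def evaluate(generated_smiles, training_smiles):
    canonical = []
    for s in generated_smiles:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:                            # chemically valid
            canonical.append(Chem.MolToSmiles(mol))    # canonical form
    validity = len(canonical) / len(generated_smiles)
    unique = set(canonical)
    uniqueness = len(unique) / max(len(canonical), 1)
    train = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    novelty = sum(s not in train for s in unique) / max(len(unique), 1)
    return validity, uniqueness, novelty

print(evaluate(["CCO", "CCO", "c1ccccc1", "C1CC"], ["CCO"]))
```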

Table 4.

Novelty, uniqueness and validity, shown in %, are measured on a generated dataset

Model | QM9 Validity | QM9 Novelty | QM9 Unique | ZINC Validity | ZINC Novelty | ZINC Unique
ChemVAE | 10.00 | 90.00 | 67.50 | 17.00 | 98.00 | 30.98
GrammarVAE | 30.00 | 95.44 | 9.30 | 31.00 | 100.00 | 10.76
GraphVAE | 61.00 | 85.00 | 40.90 | 14.00 | 100.00 | 31.60
GraphGMG | 89.20 | 89.10 | 99.41 | — | — | —
SMILES-LSTM | 94.78 | 82.98 | 96.94 | 96.80 | 100.00 | 99.97
GraphNVP | 90.10 | 54.00 | 97.30 | 74.40 | 100.00 | 94.80
GRF | 84.50 | 58.60 | 66.00 | 73.40 | 100.00 | 53.70
GraphAF | 100.00 | 88.83 | 94.51 | 100.00 | 100.00 | 99.10
CGVAE | 100.00 | 96.33 | 98.03 | 100.00 | 100.00 | 99.95
D-MolVAE-V | 100.00 | 96.10 | 99.15 | 100.00 | 100.00 | 99.95
D-MolVAE-β | 100.00 | 95.35 | 96.62 | 100.00 | 100.00 | 99.72
D-MolVAE-DIP-I | 100.00 | 97.36 | 97.80 | 100.00 | 99.99 | 99.88
D-MolVAE-DIP-II | 100.00 | 98.31 | 72.36 | 100.00 | 100.00 | 51.42
D-MolVAE-VIB | 100.00 | 95.85 | 98.66 | 100.00 | 100.00 | 99.18

Note: The highest value achieved on a metric is highlighted in boldface.


Table 4 allows making several observations. ChemVAE, GrammarVAE and GraphVAE have the lowest performance. The D-MolVAE models achieve superior performance over the other models. In particular, all D-MolVAE models achieve 100% validity on all datasets. Similar performance is observed on uniqueness as well. Varied performance is observed on novelty, though all D-MolVAE models consistently outperform or match the performance of the other models; CGVAE is the only other model with consistently good performance across all metrics on all datasets. This is not surprising, as our proposed models build over the CGVAE architecture but additionally enforce disentanglement. The explicit disentanglement enforcement seems to provide some benefit in higher novelty over CGVAE, in particular on the QM9 dataset. Taken altogether, these results suggest that the disentanglement enforcement does not reduce and actually improves performance; adding the disentanglement regularization does not influence the reconstruction error and so does not sacrifice the quality of generated molecules. It is worth noting that some of the proposed models, such as D-MolVAE-DIP-I and D-MolVAE-DIP-II, generate more novel molecules. Between the two, D-MolVAE-DIP-II generates more novel (nearly 100%) yet less unique (50–70%) molecules due to the stronger constraint exerted by the KL divergence term. In Table 5, we further evaluate the performance of our proposed methods and the strongest baseline, CGVAE, on two additional datasets, MOSES and ChEMBL. On the MOSES dataset, all models achieve 100% validity and novelty, while D-MolVAE-VIB and D-MolVAE-DIP-I also achieve 100% uniqueness. On the ChEMBL dataset, all models achieve comparable results, with the exception of D-MolVAE-V on uniqueness.

Table 5.

Novelty, uniqueness and validity, shown in %, are measured on a generated dataset

Model | MOSES Validity | MOSES Novelty | MOSES Unique | CHEMBL Validity | CHEMBL Novelty | CHEMBL Unique
CGVAE | 99.97 | 99.97 | 95.33 | 100.00 | 99.97 | 99.85
D-MolVAE-V | 100.00 | 100.00 | 99.70 | 100.00 | 100.00 | 14.85
D-MolVAE-β | 100.00 | 100.00 | 99.73 | 100.00 | 100.00 | 99.35
D-MolVAE-DIP-I | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 99.96
D-MolVAE-DIP-II | 100.00 | 100.00 | 56.53 | 100.00 | 100.00 | 99.93
D-MolVAE-VIB | 100.00 | 100.00 | 100.00 | 100.00 | 99.97 | 99.88

Note: The highest value achieved on a metric is highlighted in boldface.


3.3 Comparing the learned distribution to the training distribution

Given the above results, we now focus the comparison of our models against CGVAE. We measure the distance between the generated and the training datasets in terms of molecular properties and graph statistics, as shown in Table 6, utilizing two popular metrics, the maximum mean discrepancy (MMD) (You et al., 2018) and the KL divergence (KLD) (You et al., 2018). MMD is used when comparing distributions of graph statistics, and KLD is used when comparing distributions of molecular properties; the molecular properties of interest are selected due to their low correlation, which is ideal for the disentanglement experiment setting that requires independent semantic factors. The correlation heatmap between commonly used molecular properties evaluated on the QM9 dataset is shown in Supplementary Figure S1. All these statistics are described in detail in the Supplementary Material, where we also draw randomly selected QM9 molecules over the generated dataset for each of the models.
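As an illustration of the property-distribution comparison, the KLD between a property computed on the training and the generated sets can be estimated from histograms as in the following sketch (ours; PSA is computed here via RDKit's TPSA descriptor, and the direction of the KL is an assumption):

```python
# A minimal sketch (ours) of the property-distribution comparison: a histogram
# estimate of KLD between a property on the training and the generated sets.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def property_kld(train_smiles, gen_smiles, prop=Descriptors.TPSA, bins=50):
    t = [prop(Chem.MolFromSmiles(s)) for s in train_smiles]
    g = [prop(Chem.MolFromSmiles(s)) for s in gen_smiles]
    lo, hi = min(t + g), max(t + g)
    p, _ = np.histogram(t, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(g, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-10, q + 1e-10               # smooth empty bins
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))   # KL(train || generated)
```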

Table 6.

Comparing the difference between the training and generated distributions of graph properties via MMD and KLD

Dataset | Metric | CGVAE | Mol-V | Mol-β | Mol-DI | Mol-DII | Mol-VIB
QM9 | MMD(Deg) | 0.0167 | 0.0258 | 0.0541 | 0.0838 | 0.0238 | 0.0232
QM9 | MMD(CC) | 0.0097 | 0.0051 | 0.0259 | 0.0175 | 0.0095 | 0.0045
QM9 | MMD(Orbit) | 0.0018 | 0.0210 | 0.0021 | 0.0079 | 0.0031 | 0.0017
QM9 | KLD(cLogP) | 0.08 | 0.41 | 0.44 | 0.35 | 0.46 | 0.01
QM9 | KLD(cLogS) | 0.06 | 0.27 | 0.26 | 0.18 | 1.23 | 0.13
QM9 | KLD(Drug) | 0.07 | 0.15 | 0.08 | 0.18 | 0.22 | 0.04
QM9 | KLD(RPSA) | 0.04 | 0.29 | 0.11 | 0.18 | 0.51 | 0.04
QM9 | KLD(PSA) | 0.03 | 0.07 | 0.07 | 0.30 | 0.09 | 0.03
QM9 | KLD(SA) | 0.44 | 0.21 | 0.50 | 0.89 | 0.16 | 0.20
ZINC | MMD(Deg) | 0.0023 | 0.0005 | 0.0043 | 0.0034 | 0.7962 | 0.0111
ZINC | MMD(CC) | 0.0013 | 0.0002 | 0.0013 | 0.0005 | 0.0316 | 0.0363
ZINC | MMD(Orbit) | 0.0005 | 0.0731 | 0.0001 | 0.0001 | 0.0001 | 0.0006
ZINC | KLD(cLogP) | 0.67 | 0.59 | 0.09 | 0.67 | 0.30 | 0.23
ZINC | KLD(cLogS) | 0.74 | 0.04 | 0.09 | 0.74 | 0.58 | 0.10
ZINC | KLD(Drug) | 1.29 | 1.63 | 0.97 | 1.29 | 1.52 | 0.01
ZINC | KLD(RPSA) | 0.78 | 0.47 | 0.31 | 0.79 | 1.17 | 0.08
ZINC | KLD(PSA) | 0.56 | 0.06 | 0.14 | 0.59 | 0.01 | 0.12
ZINC | KLD(SA) | 0.56 | 0.75 | 0.79 | 0.76 | 2.29 | 0.82
MOSES | MMD(Deg) | 0.0052 | 0.0032 | 0.0031 | 0.0220 | 0.4520 | 0.0024
MOSES | MMD(CC) | 0.0003 | 0.0027 | 0.0004 | 0.0005 | 0.0000 | 0.0002
MOSES | MMD(Orbit) | 0.0009 | 0.0013 | 0.0002 | 0.0006 | 0.0217 | 0.0005
MOSES | KLD(cLogP) | 0.47 | 0.01 | 0.96 | 0.12 | 0.37 | 0.25
MOSES | KLD(cLogS) | 0.22 | 0.21 | 0.17 | 1.01 | 0.50 | 0.16
MOSES | KLD(Drug) | 0.35 | 0.56 | 0.84 | 1.41 | 0.33 | 0.48
MOSES | KLD(RPSA) | 0.04 | 0.01 | 0.18 | 0.93 | 0.97 | 0.05
MOSES | KLD(PSA) | 0.07 | 0.22 | 0.36 | 0.71 | 0.58 | 0.07
MOSES | KLD(SA) | 1.57 | 1.76 | 1.85 | 1.25 | 3.57 | 1.09
CHEMBL | MMD(Deg) | 0.0028 | 0.6634 | 0.0022 | 0.0015 | 0.0013 | 0.0025
CHEMBL | MMD(CC) | 0.0002 | 0.0010 | 0.0004 | 0.0001 | 0.0002 | 0.0001
CHEMBL | MMD(Orbit) | 0.0004 | 0.0424 | 0.0010 | 0.0002 | 0.0002 | 0.0004
CHEMBL | KLD(cLogP) | 0.03 | 0.05 | 0.31 | 0.04 | 0.04 | 0.03
CHEMBL | KLD(cLogS) | 0.04 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04
CHEMBL | KLD(Drug) | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01
CHEMBL | KLD(RPSA) | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01
CHEMBL | KLD(PSA) | 0.23 | 0.24 | 0.25 | 0.24 | 0.25 | 0.23
CHEMBL | KLD(SA) | 0.07 | 0.08 | 0.09 | 0.08 | 0.08 | 0.08

Note: We abbreviate D-MolVAE by Mol, DIP by D, degree by Deg, clustering coefficient by CC, drug-likeness by Drug and relative PSA by RPSA. The best value per row is in boldface.


Table 7.

Evaluation of disentanglement across all top models on each of the datasets

Dataset | Model | β-M (%)↑ | F-M (%)↑ | DCI↑ | Mod↑
QM9 | CGVAE | 100 | 57.0 | 0.055 | 0.239
QM9 | Mol-V | 100 | 50.0 | 0.019 | 0.233
QM9 | Mol-β | 100 | 56.0 | 0.0466 | 0.223
QM9 | Mol-DI | 100 | 61.2 | 0.023 | 0.261
QM9 | Mol-DII | 100 | 62.0 | 0.0972 | 0.241
QM9 | Mol-VIB | 100 | 72.0 | 0.1282 | 0.243
ZINC | CGVAE | 100 | 48.0 | 0.011 | 0.195
ZINC | Mol-V | 100 | 44.0 | 0.016 | 0.163
ZINC | Mol-β | 100 | 52.0 | 0.016 | 0.151
ZINC | Mol-DI | 100 | 52.4 | 0.010 | 0.197
ZINC | Mol-DII | 100 | 50.0 | 0.019 | 0.188
ZINC | Mol-VIB | 100 | 58.0 | 0.036 | 0.189
MOSES | CGVAE | 100 | 38.0 | 0.059 | 0.184
MOSES | Mol-V | 100 | 44.0 | 0.060 | 0.189
MOSES | Mol-β | 100 | 46.0 | 0.061 | 0.186
MOSES | Mol-DI | 100 | 58.0 | 0.062 | 0.209
MOSES | Mol-DII | 100 | 50.0 | 0.071 | 0.212
MOSES | Mol-VIB | 100 | 54.0 | 0.078 | 0.253
CHEMBL | CGVAE | 82.0 | 61.3 | 0.181 | 0.500
CHEMBL | Mol-V | 80.0 | 62.0 | 0.202 | 0.499
CHEMBL | Mol-β | 82.6 | 62.3 | 0.219 | 0.491
CHEMBL | Mol-DI | 84.0 | 62.0 | 0.209 | 0.481
CHEMBL | Mol-DII | 80.0 | 64.0 | 0.213 | 0.456
CHEMBL | Mol-VIB | 85.3 | 64.6 | 0.183 | 0.504

Note: ↑ indicates that a higher value on a metric is better. Best performances are bolded.


In Table 6, the smaller the value, the more similar the generated set is to the training set on the property under comparison. Table 6 shows that all models reasonably preserve the distributions of properties in the training set. In comparison with CGVAE, our D-MolVAE models preserve these distributions better on the ZINC and MOSES datasets and less well on the QM9 dataset. However, our models consistently perform well on all four datasets. The only dataset where CGVAE performs better than any of our models on about half of the properties (4/9) is QM9. CGVAE also performs comparably on KLD to at least one of our models on the ChEMBL dataset, but it is outperformed on MMD. On both the ZINC and the MOSES datasets, our models outperform CGVAE. In particular, D-MolVAE-VIB performs consistently well across all four datasets. The KLD between the training and the generated datasets is small, and this is further confirmed visually by plotting the distributions of the molecular properties cLogP, cLogS, PSA, rPSA and drug-likeness for each model in Supplementary Figures S2–S7. These results make clear that our D-MolVAE models capture well the distributions of the molecular properties in the training dataset.

Altogether, these results suggest that the proposed models capture the underlying property distribution of the training dataset. Overall, all models balance well between information preservation and novelty in the generated molecules. Among all our D-MolVAE models, D-MolVAE-VIB outperforms all the others along most metrics. Interestingly, even though the disentanglement-enhanced models do not outperform the baselines in terms of capturing the synthesis accessibility (SA) score distribution, they generate novel molecules with higher SA scores, e.g. D-MolVAE-VIB. This observation demonstrates the exploration power of the disentangled models and the better trade-off they allow us to achieve between exploration and exploitation. It is worth noting that one can choose between the disentangled models and the base models according to a preference for exploration or exploitation.

3.3.1 Quantitative evaluation of disentanglement learning

Table 7 relates the evaluation of our models’ disentanglement scores via β-M, F-M, MOD and DCI, which are four popular metrics to evaluate disentanglement. Briefly, β-M (Higgins et al., 2017) measures disentanglement by examining the accuracy of a linear classifier that predicts the index of a fixed factor of variation. F-M (Kim and Mnih, 2018) addresses several issues of β-M by using a majority-vote classifier on a different feature vector, which handles a corner case of β-M. The β-M and F-M metrics are formulated as follows:
(9)
 
(10)
 
(11)
 
(12)
 
(13)
 
(14)
MOD (Ridgeway and Mozer, 2018) measures whether each latent variable depends on at most one factor describing the maximum variation, using their mutual information. We first calculate the mutual information between the latent representations and the values of the factors of variation in a matrix m. Then, we compute a vector t_i for each dimension i of the representation. Finally, we average over the dimensions of the representation with N factors, as follows:
(15)
 
(16)
DCI (Eastwood and Williams, 2018) computes the entropy of the distribution obtained by normalizing the importance of each dimension of the learned representation for predicting the value of a factor of variation. For DCI, we first take the importance weights for each factor by fitting gradient boosted trees and form an importance matrix R. We then compute the relative importance of each dimension ρi and disentanglement score DCI as follows:
(17)
 
(18)

All implementation details are as in Locatello et al. (2018).
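As an illustration of one of these metrics, the following sketch (our own reading of Ridgeway and Mozer (2018), not the evaluation code itself) computes the MOD score from a mutual-information matrix m between latent dimensions and factors:

```python
# A minimal sketch (our reading of Ridgeway and Mozer, 2018) of the MOD score
# from a mutual-information matrix m of shape (num_latent_dims, num_factors).
import numpy as np

def modularity(m, eps=1e-12):
    theta = m.max(axis=1)                              # best factor per latent dim
    template = np.zeros_like(m)                        # the "one factor" template t_i
    template[np.arange(m.shape[0]), m.argmax(axis=1)] = theta
    # Normalized squared deviation of m from the ideal modular template.
    delta = ((m - template) ** 2).sum(axis=1) / (theta ** 2 * (m.shape[1] - 1) + eps)
    return float(np.mean(1.0 - delta))

print(modularity(np.eye(3)))        # a perfectly modular code scores 1.0
print(modularity(np.ones((3, 3))))  # a fully mixed code scores 0.0
```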

Table 7 shows that our models achieve better overall disentanglement scores than CGVAE. Specifically, on the QM9 dataset with smaller molecules, D-MolVAE-DIP-I, D-MolVAE-DIP-II and D-MolVAE-VIB achieve F-M scores of 61.2%, 62.0% and 72.0%, respectively, whereas CGVAE achieves only 57.0%. All models achieve comparable MOD scores, with D-MolVAE-DIP-I achieving the highest. All models achieve a β-M of 100%. D-MolVAE-VIB outperforms all others on the DCI score, and this observation holds on the ZINC and MOSES datasets as well. Interestingly, all models perform worse on the ZINC dataset, which contains larger molecules than the QM9 dataset. Similarly, on the MOSES dataset, all models perform worse than on QM9 but better than on ZINC. Specifically, D-MolVAE-DIP-I and D-MolVAE-VIB rank as the top two on the F-M metric, and D-MolVAE-VIB achieves the best performance on the DCI and Mod metrics, with an up to 16% improvement over the second-best model, D-MolVAE-DIP-II. On the CHEMBL dataset, D-MolVAE-VIB performs the best across the β-M, F-M and MOD metrics. D-MolVAE-DIP-I achieves the second-best β-M (84.0%), while CGVAE achieves only 82.0%. Nevertheless, D-MolVAE-β performs slightly better than D-MolVAE-DIP-II on the DCI metric and achieves the best DCI performance on this dataset. Altogether, these results show that the proposed disentanglement-enhanced models improve a model’s ability at disentanglement learning, with D-MolVAE-VIB standing out.

3.3.2 Relating disentangled factors to molecular properties

In Figure 1, we show how the learned disentangled factors relate to the biological properties computed on each generated molecule. The mutual information is calculated between each of the disentangled factors learned by CGVAE and the D-MolVAE models and the molecular properties computed on generated molecules. We focus the comparison here on the MOSES-trained CGVAE and D-MolVAE-VIB models but show all models on all datasets in the Supplementary Material.

Fig. 1. The mutual information is calculated between each of the disentangled factors and the molecular properties computed on generated molecules.

Figure 1 clearly shows that the factors learned by CGVAE relate only weakly to the molecular properties. The relationship is stronger for the disentangled factors learned by our D-MolVAE models, even though all models are unsupervised. Moreover, thanks to the disentanglement enhancement, different disentangled factors from D-MolVAE-VIB tend to correlate more clearly with different properties than the factors learned by CGVAE.

Figure 2 allows digging deeper into the impact on a property of interest by visualizing the change in the property over generated molecules when a particular latent factor is varied in a range and all others are kept fixed. We focus on one of our top models, D-MolVAE-VIB, and on PSA, which is a crucial consideration when generating molecules, as it directly relates to our ability to actually synthesize them in wet laboratories. We can clearly see that essentially only one factor is strongly related to PSA, thanks to our disentanglement enhancement, which strengthens the independence among different factors and hence minimizes the number of different factors correlated to a property (e.g. PSA). Figure 2 shows that one of the latent factors impacts PSA, and this is more clearly visible on the QM9 and MOSES datasets.

Fig. 2. Change in PSA is tracked as a latent factor is varied in a range while keeping all others fixed. Focus here is on the latent factors learned by D-MolVAE-VIB.
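The traversal behind Figure 2 can be sketched as follows (our own illustration; decode_to_smiles stands in for a trained D-MolVAE decoder and is a hypothetical callable):

```python
# A minimal sketch (ours) of the latent traversal behind Figure 2: vary one
# latent dimension over a range, keep the others fixed, decode, and track the
# PSA of the decoded molecules. decode_to_smiles is a hypothetical stand-in
# for a trained D-MolVAE decoder.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def traverse_factor(decode_to_smiles, z, dim, values):
    psa_trace = []
    for v in values:
        z_mod = z.copy()
        z_mod[dim] = v                         # vary only the chosen factor
        mol = Chem.MolFromSmiles(decode_to_smiles(z_mod))
        psa_trace.append(Descriptors.TPSA(mol) if mol is not None else None)
    return psa_trace

# Example call, assuming `model.decode` maps a 100-dim latent vector to SMILES:
# trace = traverse_factor(model.decode, np.zeros(100), dim=7, values=np.linspace(-3, 3, 13))
```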

4 Conclusion

The evaluation presented in this article suggests that the proposed disentanglement framework D-MolVAE is effective at generating valid, novel and unique small molecules and outperforms several state-of-the-art generative models. This performance is due to the sequence decoding process and, specifically, valence checking and the stop-checking mechanism. Other graph-based generative models that lack this process (for instance, GraphVAE) suffer in this respect and generate invalid molecules. The variational inference in D-MolVAE also allows better capturing the distribution of the input dataset and so sampling novel and unique molecules from the learned distribution.

It is important to note that the loss functions in the models we propose here effectively implement a trade-off between the disentanglement enhancement and the reconstruction. The distributions of specific properties (for instance, synthesis accessibility) show the exploration-exploitation trade-off in the various disentangled models. Our analysis shows that explicit disentanglement enforcement does not hurt the proposed models; indeed, like CGVAE, the proposed models generate novel and unique molecules and even surpass CGVAE on some of the datasets; the disentangled factors provide an advantage. Moreover, the proposed D-MolVAE models better capture the underlying graph statistics and distributions of various biological properties. Our evaluation also reveals that different types of disentangled models have different abilities. In particular, the experiments suggest that D-MolVAE-VIB is a promising model for exploring disentangled representations.

We consider the proposed work to be a first step toward addressing remaining challenges in small molecule generation. Beyond interpreting the generation process, it is important to precisely control the properties of generated molecules. The disentangled representation learning in this article falls under the umbrella of unsupervised learning. Therefore, specific control and correspondence of latent factors to molecular properties of interest is not expected to be strong. Our analysis shows that, in principle, one can build over the models proposed here for such precise control. Ideally, given specific target values for several properties of interest, one could decode the latent variables back into a molecule that achieves the target property values. Our future work will address such models.

We also note that current models, including those proposed and evaluated in this article, are only concerned with global properties of molecules (or their graph representations), such as cLogP, drug-likeness and others. Preserving local properties of an atom or a cluster of atoms (e.g. an aromatic hydrocarbon) has not been explored so far. Doing both can be helpful in designing novel molecules while improving our understanding of the contribution of each element to the overall molecular properties of interest. We caution, however, that supervised representation learning, while useful in many specific applications, may also bias toward a known, target set of molecular properties and miss possibly interesting new discoveries. In our future work, we hope to advance both unsupervised and supervised representation learning in small molecule generation.

Funding

This work was supported in part by the National Science Foundation [grant numbers 1942594, 1755850, 1907805]. This material was additionally based upon work by A.S. supported by (while serving at) the National Science Foundation. Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Conflict of Interest: none declared.

References

Alemi A.A. et al. (2017) Deep variational information bottleneck. In: 5th International Conference on Learning Representations, ICLR, Toulon, France.
Blaschke T. et al. (2018) Application of generative autoencoder in de novo molecular design. Mol. Inf., 37, 1700123.
Bojchevski A. et al. (2018) NetGAN: generating graphs via random walks. In: International Conference on Machine Learning, Stockholm, Sweden, pp. 609–618.
Chen T.Q. et al. (2018) Isolating sources of disentanglement in variational autoencoders. In: Advances in Neural Information Processing Systems, Montréal, Canada, pp. 2610–2620.
Dai H. et al. (2018) Syntax-directed variational autoencoder for structured data. In: International Conference on Learning Representations, Vancouver, Canada.
De Samanta B.A. et al. (2018) Designing random graph models using variational autoencoders with applications to chemical design. arXiv preprint arXiv:1802.05283.
Doshi-Velez F., Kim B. (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Du Y. et al. (2020) Interpretable molecule generation via disentanglement learning. In: 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics Workshops: Computational Structural Biology Workshop (CSBW), Baltimore-Washington, DC Area, USA, pp. 1–8.
Du Y. et al. (2021a) Deep latent-variable models for controllable molecule generation. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Virtual, pp. 372–375. IEEE.
Du Y. et al. (2021b) GraphGT: machine learning datasets for graph generation and transformation. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Virtual.
Eastwood C., Williams C.K. (2018) A framework for the quantitative evaluation of disentangled representations. In: 6th International Conference on Learning Representations, ICLR, Vancouver, Canada.
Ellman J.A. (1996) Design, synthesis, and evaluation of small-molecule libraries. Acc. Chem. Res., 29, 132–143.
Esmaeili B. et al. (2019) Structured disentangled representations. Proc. Mach. Learn. Res., 89, 2525–2534.
Gaulton A. et al. (2017) The ChEMBL database in 2017. Nucleic Acids Res., 45, D945–D954.
Gómez-Bombarelli R. et al. (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci., 4, 268–276.
Grover A. et al. (2019) Graphite: iterative generative modeling of graphs. In: Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, Vol. 97, pp. 2434–2444.
Guimaraes G.L. et al. (2017) Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843.
Guo X. et al. (2018) Deep graph translation. arXiv preprint arXiv:1805.09980.
Guo X. et al. (2020) Property controllable variational autoencoder via invertible mutual dependence. In: International Conference on Learning Representations, Virtual.
Guo X. et al. (2021) Deep generative model for spatial networks. In: 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Singapore.
Higgins I. et al. (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. In: 5th International Conference on Learning Representations, ICLR, Toulon, France.
Honda S. et al. (2019) Graph residual flow for molecular graph generation. arXiv preprint arXiv:1909.13521.
Irwin J.J. et al. (2012) ZINC: a free tool to discover chemistry for biology. J. Chem. Inf. Model., 52, 1757–1768.
Janz D. et al. (2017) Actively learning what makes a discrete sequence valid. arXiv preprint arXiv:1708.04465.
Jin H. et al. (2018) Discriminative graph autoencoder. In: International Conference on Big Knowledge (ICBK), Singapore. IEEE.
Kim H., Mnih A. (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983.
Kingma D.P., Welling M. (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kipf T.N., Welling M. (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Kumar A. et al. (2018) Variational inference of disentangled latent concepts from unlabeled observations. In: 6th International Conference on Learning Representations, ICLR, Vancouver, Canada.
Kusner M.J. et al. (2017) Grammar variational autoencoder. In: Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, Vol. 70, pp. 1945–1954.
Li Y. et al. (2018) Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.
Liu Q. et al. (2018) Constrained graph variational autoencoders for molecule design. In: Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., Garnett R. (eds.) Advances in Neural Information Processing Systems, Vol. 31, pp. 7795–7804. Curran Associates, Inc., Red Hook, NY.
Locatello F. et al. (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
Lopez R. et al. (2018) Information constraints on auto-encoding variational Bayes. In: Thirty-second Conference on Neural Information Processing Systems, Montréal, Canada.
Madhawa K. et al. (2019) GraphNVP: an invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600.
Polykovskiy D. et al. (2020) Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol., 11.
Ramakrishnan R. et al. (2014) Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data, 1, 140022.
Renz P. et al. (2020) On failure modes in molecule generation and optimization. Drug Discov. Today Technol., 32–33, 55–63.
Reymond J. et al. (2012) The enumeration of chemical space. WIREs Comput. Mol. Sci., 2, 717–733.
Ridgeway K., Mozer M.C. (2018) Learning deep disentangled embeddings with the F-statistic loss. In: Advances in Neural Information Processing Systems, Montréal, Canada, pp. 185–194.
Ruddigkeit L. et al. (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model., 52, 2864–2875.
Schneider P., Schneider G. (2016) De novo design at the edge of chaos. J. Med. Chem., 59, 4077–4086.
Segler M.H. et al. (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci., 4, 120–131.
Shi C. et al. (2019) GraphAF: a flow-based autoregressive model for molecular graph generation. In: International Conference on Learning Representations, New Orleans, LA, USA.
Simonovsky M., Komodakis N. (2018) GraphVAE: towards generation of small graphs using variational autoencoders. In: International Conference on Artificial Neural Networks, Rhodes, Greece, pp. 412–422. Springer.
Stumpfe D., Bajorath B. (2011) Similarity searching. WIREs Comput. Mol. Sci., 1, 260–282.
Sundermeyer M. et al. (2012) LSTM neural networks for language modeling. In: Thirteenth Annual Conference of the International Speech Communication Association, Portland, OR, USA.
Weininger D. (1988) SMILES, a chemical language and information system. J. Chem. Inf. Model., 28, 31–36.
Whitesides G.M. (2015) Reinventing chemistry. Angew. Chem. Int. Ed. Engl., 54, 3196–3209.
Xue D. et al. (2019) Advances and challenges in deep generative models for de novo molecule generation. Wiley Interdiscip. Rev. Comput. Mol. Sci., 9, e1395.
Yoshikawa N. et al. (2018) Population-based de novo molecule generation, using grammatical evolution. Chem. Lett., 47, 1431–1434.
You J. et al. (2018) GraphRNN: generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773.
Zhao S. et al. (2019) InfoVAE: information maximizing variational autoencoders. In: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Hawaii, USA.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Jinbo Xu