Yuanqi Du, Xiaojie Guo, Yinkai Wang, Amarda Shehu, Liang Zhao, Small molecule generation via disentangled representation learning, Bioinformatics, Volume 38, Issue 12, June 2022, Pages 3200–3208, https://doi.org/10.1093/bioinformatics/btac296
Abstract
Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance cheminformatics, drug discovery, biotechnology and material science. In silico molecular design remains challenging, primarily due to the complexity of the chemical space and the non-trivial relationship between chemical structures and biological properties. Deep generative models that learn directly from data are intriguing, but they have yet to demonstrate interpretability in the learned representation, so that we can learn more about the relationship between the chemical and biological space. In this article, we advance research on disentangled representation learning for small molecule generation. We build on recent work by us and others on deep graph generative frameworks, which capture atomic interactions via a graph-based representation of a small molecule. The methodological novelty is how we leverage the concept of disentanglement in the graph variational autoencoder framework both to generate biologically relevant small molecules and to enhance model interpretability.
Extensive qualitative and quantitative experimental evaluation in comparison with state-of-the-art models demonstrates the superiority of our disentanglement framework. We believe this work is an important step to address key challenges in small molecule generation with deep generative frameworks.
Training and generated data are made available at https://ieee-dataport.org/documents/dataset-disentangled-representation-learning-interpretable-molecule-generation. All code is made available at https://anonymous.4open.science/r/D-MolVAE-2799/.
Supplementary data are available at Bioinformatics online.
1 Introduction
Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance drug discovery, biotechnology and material science (Whitesides, 2015). In silico molecule design is central to cheminformatics research but remains challenging (Schneider and Schneider, 2016). Studies estimate that 10^60 drug-like molecules are synthetically accessible (Reymond et al., 2012). A chemical space of this size is beyond the scope of even high-throughput wet-laboratory technologies.
A multi-decade journey in cheminformatics research informs us of several challenges for small molecule generation. The first concerns the poorly understood and complex relationship between chemical and biological space. Not all molecules in the vast chemical space meet desired biological/functional properties of interest, such as water solubility, drug-likeness and more (Ramakrishnan et al., 2014). Moreover, changes to the chemical structure to optimize along a biological criterion may worsen other criteria; the search space that links chemical and biological space may be rich with barriers separating neighboring local optima.
Until a decade ago, molecule generation, widely referred to as computational screening, was dominated by similarity search methods (Stumpfe and Bajorath, 2011). While conceptually straightforward, these methods were limited in their ability to generate novel small molecules. Advances in machine learning expedited progress. Shallow models were not very effective (Ellman, 1996; Renz et al., 2020; Xue et al., 2019; Yoshikawa et al., 2018), as they relied heavily on domain insight to formulate and construct meaningful representations of small molecules. Due to their inherent ability to learn directly from data, deep generative models then made a debut. Initial efforts utilized a linear representation of molecules, known as SMILES (Weininger, 1988), which stands for ‘molecular-input line-entry system’. SMILES is a formal grammar that describes molecules with an alphabet of characters; aromatic and aliphatic carbon atoms are denoted by ‘c’ and ‘C’, oxygen atoms by ‘O’, single bonds by ‘–’, double bonds by ‘=’, etc. The SMILES representation allows addressing molecule generation as a string generation problem. Deep learning methods based on the recurrent neural network (RNN) framework suddenly became useful (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Segler et al., 2018). However, SMILES-based deep models could generate few valid molecules. In response, later works (Dai et al., 2018; Kusner et al., 2017) added syntactic and semantic constraints. In other works, models were guided to generate valid SMILES through active learning, reinforcement learning and additional training signals (Guimaraes et al., 2017; Janz et al., 2017). While some improvements were observed, generating valid molecules remained challenging.
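To make the SMILES discussion concrete, the short sketch below parses a few strings with RDKit and reports whether each corresponds to a valid molecule. The example strings are arbitrary illustrations and RDKit is assumed to be available; this is a minimal sketch rather than any particular model's validity pipeline.

```python
# Minimal sketch: parsing SMILES strings and checking chemical validity with RDKit.
# The example strings are arbitrary; RDKit is assumed to be installed.
from rdkit import Chem

candidates = [
    "CCO",       # ethanol: two aliphatic carbons ('C') and an oxygen ('O')
    "c1ccccc1",  # benzene: six aromatic carbons ('c') closing a ring
    "C=O",       # formaldehyde: a double bond ('=') between carbon and oxygen
    "C1CC",      # invalid: ring bond 1 is opened but never closed
]

for smiles in candidates:
    mol = Chem.MolFromSmiles(smiles)  # returns None when parsing/sanitization fails
    print(f"{smiles!r:12} valid: {mol is not None}")
```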
Graph-generative deep models leverage a more expressive representation of a molecule via the concept of a molecular graph. The atoms are represented as vertices and the bonds as edges connecting the vertices. In the deep learning literature, graph-generative models are based on the variational autoencoder (VAE) (Blaschke et al., 2018; Dai et al., 2018; De Samanta et al., 2018; Jin et al., 2018; Simonovsky and Komodakis, 2018) or generative adversarial networks (Bojchevski et al., 2018; Guo et al., 2018). For instance, GraphRNN (You et al., 2018) builds an autoregressive generative model based on a generative RNN that generates the graph one vertex at a time. In contrast, GraphVAE (Simonovsky and Komodakis, 2018) represents each graph in terms of its adjacency matrix and the feature vectors of its vertices. A VAE model is then utilized to learn the distribution of the graphs conditioned on a latent representation at the graph level. Other works (Grover et al., 2019; Kipf and Welling, 2016) encode the vertices into vertex-level embeddings and predict the edges between each pair of vertices to generate a graph.
The adoption of graph-generative models for small molecule generation has been rapid. Current graph-generative models for molecule generation leverage the VAE framework to address two subtasks: (i) encoding: learning a low-dimensional, latent code/representation of a molecular graph; (ii) decoding: learning to map the latent representation back into a (reconstructed) molecular graph. For instance, work in Simonovsky and Komodakis (2018) generates molecular graphs by predicting their adjacency matrices. Work in Liu et al. (2018) generates molecules through a constrained graph-generative model that enforces validity by generating a molecule one atom at a time. These works generate more valid molecules than SMILES-based models and additionally subject generated molecules to the sanitization checks in RDKit (RDKit: Open-source cheminformatics; http://www.rdkit.org).
Graph-generative VAEs represent a promising platform that we leverage in this article, but current graph-generative VAEs for small molecule generation fall short. The learned latent representation has all the latent factors entangled, which limits model transparency and interpretability. Specifically, these models do not facilitate linking the chemical space to the biological space and so do not advance our understanding of the complex relationship between chemical and biological space for small molecules. Facilitating this linking is central not only for molecule generation but also for molecule optimization (Alemi et al., 2017), an important and related task that is beyond the scope of this article.
In this article, we advance research on small molecule representation learning for molecule generation by disentanglement enhancement. Disentangled representation learning is an active research area, particularly in image representation learning (Alemi et al., 2017; Chen et al., 2018; Guo et al., 2021; Higgins et al., 2017; Kim and Mnih, 2018), and has been shown to be key to improving model generalizability and robustness against adversarial attacks, and even to facilitating debugging and auditing (Alemi et al., 2017; Doshi-Velez and Kim, 2017). While a comprehensive review is beyond the scope of this article, we point to recent approaches that modify the VAE objective by adding, removing or altering the weight of individual terms in the loss function to improve disentanglement (Alemi et al., 2017; Chen et al., 2018; Du et al., 2021a; Esmaeili et al., 2019; Guo et al., 2020; Kim and Mnih, 2018; Kumar et al., 2018; Lopez et al., 2018; Zhao et al., 2019). Currently, however, we do not know the best approach to learn disentangled representations of graph data. This includes the small molecule generation domain. In a recent workshop paper (Du et al., 2020), we demonstrated that learning disentangled representations results in better molecule generation over methods that do not leverage disentanglement. However, as our goal was a proof-of-concept demonstration that VAEs for disentangled representation learning perform well for small molecule generation, that study was limited to classic disentanglement and focused on a few datasets of known small molecules of the same size.
Here we propose a graph-generative VAE framework that learns a disentangled code/representation, so that we may additionally elucidate how the factors that encode chemical structure control biological properties. Specifically, we design and evaluate the D-MolVAE framework, which stands for Disentangled Molecule VAE. The framework permits various mechanisms for disentanglement, resulting in several novel deep graph-generative models, which we compare to one another and many other state-of-the-art methods on benchmark datasets across several metrics.
Our experiments show that the D-MolVAE framework is effective and superior at generating valid, novel and unique small molecules over other methods. The framework also accommodates variable-size molecules, which improves its scope and applicability. Our experiments additionally show that disentangled representation learning is valuable for better interpretation and understanding of the relationship between the chemical space and the biological space; the proposed D-MolVAE models are better able to capture the underlying graph statistics and distributions of various biological properties.
The D-MolVAE models effectively implement a trade-off between the disentanglement enhancement and the reconstruction. Our experiments show that explicit disentanglement enforcement does not hurt performance. In fact, the models are superior over many methods. Taken altogether, our findings suggest that the disentangled factors provide an advantage with respect to the quality of generated molecules, as well as the linking of the chemical and biological space. Our experiments suggest several models as promising platforms for further exploring disentangled representations for improving small molecule generation.
2 Materials and methods
We first define and formalize the problem. Then we describe the graph-generative models based on the VAE framework, namely D-MolVAE, focusing the description on the variants of disentanglement terms proposed to obtain different disentangled graph-generative VAE models.
2.1 Problem formulation
Let us represent a molecule as a graph G = (V, E, F). The N atoms of the molecule constitute the N vertices V of graph G. The M bonds connecting pairs of atoms in the molecule constitute the edges {e_{i,j}}, where e_{i,j} is an edge connecting vertices v_i and v_j. In addition to V, G also contains E and F. E ∈ {0, 1}^{N×N×K} is the edge-type tensor that records the K bond types. Specifically, E_{i,j} is a one-hot vector encoding the type of edge e_{i,j}. F ∈ {0, 1}^{N×D} is the vertex-type feature matrix that records the D atom types. Specifically, F_i is the one-hot encoding vector denoting the type of atom v_i.
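As a rough illustration of this formulation, the sketch below assembles the one-hot vertex-type matrix F and edge-type tensor E for a molecule parsed with RDKit; the atom and bond vocabularies are assumptions made for illustration and are not necessarily the ones used by D-MolVAE.

```python
# Sketch: building the one-hot vertex-type matrix F and edge-type tensor E for a molecule.
# The atom/bond vocabularies are illustrative (QM9-like); aromatic bonds are not handled here.
import numpy as np
from rdkit import Chem

ATOM_TYPES = ["C", "N", "O", "F"]                                                 # D atom types (assumed)
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE, Chem.BondType.TRIPLE]   # K bond types (assumed)

def mol_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    F = np.zeros((n, len(ATOM_TYPES)))          # vertex-type feature matrix
    E = np.zeros((n, n, len(BOND_TYPES)))       # edge-type tensor
    for atom in mol.GetAtoms():
        F[atom.GetIdx(), ATOM_TYPES.index(atom.GetSymbol())] = 1.0
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        k = BOND_TYPES.index(bond.GetBondType())
        E[i, j, k] = E[j, i, k] = 1.0           # undirected molecular graph
    return F, E

F, E = mol_to_graph("C=O")   # formaldehyde: one carbon, one oxygen, one double bond
```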
The objective in graph-generative disentangled representation learning is to learn the joint distribution of G and a set of generative disentangled latent factors/variables Z = (z_1, …, z_L), such that the observed graph G can be generated as p_θ(G|Z). Note that L is the dimensionality of the latent factors. Disentanglement denotes the additional constraint that the individual variables in Z be independent from one another.
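In standard VAE notation, which we assume carries over here, this amounts to the factorization below, with disentanglement asking that the distribution over the latent factors factorize across its L dimensions.

```latex
% Assumed VAE-style factorization underlying the formulation above:
% the graph is generated from the latent factors, and disentanglement asks that
% the distribution over Z factorizes across its L dimensions.
p_\theta(G, Z) = p(Z)\, p_\theta(G \mid Z), \qquad
p(Z) = \prod_{l=1}^{L} p(z_l), \qquad
Z = (z_1, \dots, z_L).
```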
2.2 D-MolVAE framework
Two challenges present themselves with the above formulation: (i) how to integrate the disentanglement constraint and the reconstruction quality constraint in the loss function that guides learning; (ii) how to efficiently encode and decode molecules/graphs of different sizes. We first show how the first challenge is addressed in the D-MolVAE framework via a generative objective function. We show in this context that different approaches here can result in different models. Then we show how the second challenge is addressed via variable-size edge-to-edge and edge-to-vertex convolution operators in D-MolVAE.
We are inspired by disentangled representation learning in the image domain (Higgins et al., 2017), where a suitable learning objective is to maximize the marginal (log-)likelihood of the observed graph G in expectation over the distribution of the latent variable set Z:

\max_{\theta, \phi} \; \mathbb{E}_{G \sim D} \Big[ \mathbb{E}_{q_\phi(Z \mid G)} \big[ \log p_\theta(G \mid Z) \big] \Big] \quad \text{subject to} \quad D_{KL}\big( q_\phi(Z \mid G) \,\|\, p(Z) \big) < \epsilon,

where θ and φ explicitly denote the parameters characterizing the generative (decoder) and inference (encoder) distributions, respectively. In the above equation, D refers to the observed set of graphs (corresponding to molecules in the training dataset), D_KL is the Kullback–Leibler divergence (KLD) that allows comparing two probability distributions, and ε is a parameter that specifies the strength of the applied constraint; that is, ε allows weighting how much we want the disentanglement constraint to be enforced.
This formulation is similar to the beta-VAE (Higgins et al., 2017) that first introduced the notion of disentanglement (though not for graph data). Rewriting the constrained objective above as a Lagrangian under the KKT conditions yields the β-weighted loss

\mathcal{L}(\theta, \phi; G, Z, \beta) = \mathbb{E}_{q_\phi(Z \mid G)}\big[\log p_\theta(G \mid Z)\big] - \beta\, D_{KL}\big(q_\phi(Z \mid G) \,\|\, p(Z)\big),

where β weighs how important it is to enforce the disentanglement constraint. Specifically, when β = 1, one obtains a vanilla VAE (Kingma and Welling, 2013). We direct the interested reader to work in Higgins et al. (2017) to understand the effects of β.
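A minimal sketch of how such a β-weighted objective is typically computed for one training batch is shown below, assuming a diagonal Gaussian posterior and a generic graph decoder in PyTorch; the reconstruction term and the encoder/decoder outputs are placeholders, not the released D-MolVAE implementation.

```python
# Minimal sketch of the beta-weighted VAE objective with a diagonal Gaussian posterior (PyTorch).
# `recon_log_prob`, `mu` and `logvar` are placeholders produced by a graph decoder/encoder;
# this is a generic beta-VAE loss, not the released D-MolVAE code.
import torch

def beta_vae_loss(recon_log_prob: torch.Tensor,
                  mu: torch.Tensor,
                  logvar: torch.Tensor,
                  beta: float = 1.0) -> torch.Tensor:
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions, averaged over the batch
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    # Negative ELBO with the KLD term weighted by beta (beta = 1 recovers the vanilla VAE)
    return -recon_log_prob.mean() + beta * kld
```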
2.3 Disentanglement-enhanced models
By considering different approaches to enforce disentanglement, we obtain different instantiations of our D-MolVAE framework, namely, D-MolVAE-V, D-MolVAE-β, D-MolVAE-DIP-I, D-MolVAE-DIP-II and D-MolVAE-VIB.
In these objectives, terms ③ and ④ enforce consistency between the marginal distributions over G and Z. Specifically, minimizing the KLD in term ③ maximizes the marginal likelihood p_θ(G); the disentangled inferred priors term ④ penalizes the distance between the aggregate posterior q_φ(Z) and the prior p(Z). Terms ① and ② enforce consistency between the conditional distributions. Specifically, term ① maximizes the correlation between each latent code Z and the graph G^n it generates; when Z ∼ q_φ(Z|G^n) is sampled, the likelihood p_θ(G^n|Z) should be higher than the marginal likelihood p_θ(G^n). Meanwhile, term ② regularizes term ① by minimizing the mutual information I(Z, G) in the inference model.
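Since two of the variants take their names from the DIP-VAE regularizers (Kumar et al., 2018), the sketch below illustrates that style of disentangled-inferred-prior penalty on the covariance of the aggregate posterior; the penalty weights and the exact way D-MolVAE combines the term with the graph reconstruction loss are assumptions of this sketch.

```python
# Sketch of a DIP-VAE-style regularizer (Kumar et al., 2018) on the covariance of the
# aggregate posterior q(Z): off-diagonal entries are pushed toward 0 and diagonal entries
# toward 1, decorrelating the inferred latent factors. The weights lambda_od/lambda_d and
# the type-I/type-II switch are illustrative; D-MolVAE's exact settings are not shown here.
import torch

def dip_regularizer(mu: torch.Tensor, logvar: torch.Tensor,
                    lambda_od: float = 10.0, lambda_d: float = 5.0,
                    dip_type: str = "ii") -> torch.Tensor:
    mu_centered = mu - mu.mean(dim=0, keepdim=True)
    cov_mu = mu_centered.t() @ mu_centered / (mu.size(0) - 1)   # covariance of posterior means
    if dip_type == "i":
        cov = cov_mu                                            # DIP-I: covariance of the means only
    else:
        cov = cov_mu + torch.diag(logvar.exp().mean(dim=0))     # DIP-II: add expected posterior covariance
    diag = torch.diagonal(cov)
    off_diag = cov - torch.diag(diag)
    return lambda_od * (off_diag ** 2).sum() + lambda_d * ((diag - 1.0) ** 2).sum()
```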
2.4 Implementation details
The variants are summarized in terms of their objectives in Table 1. The encoder and decoder architecture are summarized in Table 2. Finally, the hyperparameters used for training are related in Table 3. The rows refer to the different benchmark datasets, which we describe in Section 3. We observe that increasing β leads to a better disentangled representation, as later shown in Table 7.
| Model | Objectives |
|---|---|
| D-MolVAE-V | |
| D-MolVAE-β | |
| D-MolVAE-DIP-I | |
| D-MolVAE-DIP-II | |
| D-MolVAE-VIB | |
| Encoder | Decoder |
|---|---|
| Input | Input |
| FC.100 ReLU | FC.100 ReLU |
| GGNN.100 ReLU | GGNN.100 ReLU |
| GGNN.100 ReLU | GGNN.100 ReLU |
| FC.100 | FC.bv (batch node size), FC.3 (edge) |
Note: Each layer is expressed in the format layer_type.output_dimension, followed by its activation. FC refers to fully connected layers and GGNN to gated graph neural network layers.
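As a rough illustration of the encoder column in Table 2, the sketch below stacks a fully connected embedding and two gated graph convolution layers, using PyTorch Geometric's GatedGraphConv as a stand-in for the GGNN layers; the layer sizes follow the table, but the library choice, the per-node latent parameterization and all other details are assumptions rather than the released implementation.

```python
# Rough sketch of the Table 2 encoder (FC.100 ReLU -> GGNN.100 ReLU -> GGNN.100 ReLU -> FC.100),
# written with PyTorch Geometric's GatedGraphConv as a stand-in for the paper's GGNN layers.
# Illustrative only; the released D-MolVAE code may differ in library and details.
import torch
import torch.nn as nn
from torch_geometric.nn import GatedGraphConv

class SketchEncoder(nn.Module):
    def __init__(self, num_atom_types: int, latent_dim: int = 100):
        super().__init__()
        self.embed = nn.Linear(num_atom_types, 100)            # FC.100
        self.ggnn1 = GatedGraphConv(out_channels=100, num_layers=1)
        self.ggnn2 = GatedGraphConv(out_channels=100, num_layers=1)
        self.mu = nn.Linear(100, latent_dim)                   # FC heads parameterizing the
        self.logvar = nn.Linear(100, latent_dim)               # Gaussian posterior q(Z|G)

    def forward(self, x, edge_index):
        h = torch.relu(self.embed(x))                          # x: one-hot atom types per node
        h = torch.relu(self.ggnn1(h, edge_index))
        h = torch.relu(self.ggnn2(h, edge_index))
        return self.mu(h), self.logvar(h)                      # per-node latent parameters
```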
| Dataset | Learning_rate | Batch_size | λ | Num_iteration |
|---|---|---|---|---|
| QM9 | 5e−4 | 64 | 1 | 10 |
| ZINC | 5e−4 | 8 | 1 | 5 |
| MOSES | 5e−4 | 4 | 1 | 5 |
| CHEMBL | 5e−4 | 4 | 1 | 5 |
3 Results
3.1 Datasets and experimental setup
We employ four benchmark datasets: QM9, ZINC, MOSES and ChEMBL (Du et al., 2021b). QM9 (Ramakrishnan et al., 2014; Ruddigkeit et al., 2012) contains around 134k stable small organic molecules with up to nine heavy atoms [e.g. carbon (C), oxygen (O), nitrogen (N) and fluorine (F)]. ZINC (Irwin et al., 2012) contains approximately 250k drug-like chemical compounds with an average of 23 heavy atoms. The molecules in this dataset are more complex than in QM9. MOSES (Polykovskiy et al., 2020) contains about 1.9M larger molecules with up to 30 heavy atoms. ChEMBL (Gaulton et al., 2017) contains about 1.8M manually curated bioactive molecules with drug-like properties. For QM9, we use the entire dataset, while for ZINC, MOSES and ChEMBL, which contain larger molecules, we randomly sample 70k molecules from each dataset and split them into training and validation sets. During testing, we generate 30k molecules for our experiments.
We utilize qualitative and quantitative experiments that evaluate the proposed D-MolVAE-V, D-MolVAE-β, D-MolVAE-DIP-I, D-MolVAE-DIP-II and D-MolVAE-VIB. The models are pitched against nine state-of-the-art deep generative models for molecule generation: ChemVAE (Gómez-Bombarelli et al., 2018), GrammarVAE (Kusner et al., 2017), GraphVAE (Simonovsky and Komodakis, 2018), GraphGMG (Li et al., 2018), SMILES-LSTM (Sundermeyer et al., 2012), GraphNVP (Madhawa et al., 2019), GRF (Honda et al., 2019), GraphAF (Shi et al., 2019) and CGVAE (Liu et al., 2018). In the interest of brevity, summaries of the main computational ingredients in each of these models are related in the Supplementary Material. All experiments are conducted on a 64-bit machine with a 6 core Intel CPU i9-9820X, 32 GB RAM and an NVIDIA GPU (GeForce RTX 2080ti, 1545 MHz, 11 GB GDDR6).
3.2 Evaluating the quality of generated molecules
Table 4 relates the comparative analysis. Each trained model is used to generate 30k molecules. For GraphGMG, we obtain 20k generated molecules from the GraphGMG authors. Results for ChemVAE, GrammarVAE, GraphVAE and SMILES-LSTM are obtained from Liu et al. (2018). The quality of the generated dataset is evaluated via the three common metrics of Novelty, Uniqueness and Validity. Novelty measures the fraction of generated molecules that are not in the training dataset. Uniqueness measures the fraction of generated molecules that remain after removing duplicates. Validity measures the fraction of generated molecules that are chemically valid.
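A minimal sketch of these three metrics, computed over canonical SMILES with RDKit, is shown below; the exact denominator conventions used in the paper's evaluation scripts are assumptions here.

```python
# Minimal sketch of Validity, Uniqueness and Novelty over generated SMILES strings.
# Canonical SMILES from RDKit serves as a molecule's identity; the exact denominators
# used in the paper's evaluation are assumptions of this sketch.
from rdkit import Chem

def evaluate(generated: list, training: set) -> dict:
    mols = (Chem.MolFromSmiles(s) for s in generated)
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]   # canonicalize valid molecules
    unique = set(valid)
    novel = unique - training                                      # not seen during training
    return {
        "Validity": 100.0 * len(valid) / len(generated),
        "Uniqueness": 100.0 * len(unique) / len(valid),
        "Novelty": 100.0 * len(novel) / len(unique),
    }
```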
| Model | QM9 Validity | QM9 Novelty | QM9 Unique | ZINC Validity | ZINC Novelty | ZINC Unique |
|---|---|---|---|---|---|---|
| ChemVAE | 10.00 | 90.00 | 67.50 | 17.00 | 98.00 | 30.98 |
| GrammarVAE | 30.00 | 95.44 | 9.30 | 31.00 | 100.00 | 10.76 |
| GraphVAE | 61.00 | 85.00 | 40.90 | 14.00 | 100.00 | 31.60 |
| GraphGMG | – | – | – | 89.20 | 89.10 | 99.41 |
| SMILES-LSTM | 94.78 | 82.98 | 96.94 | 96.80 | 100.00 | 99.97 |
| GraphNVP | 90.10 | 54.00 | 97.30 | 74.40 | 100.00 | 94.80 |
| GRF | 84.50 | 58.60 | 66.00 | 73.40 | 100.00 | 53.70 |
| GraphAF | 100.00 | 88.83 | 94.51 | 100.00 | 100.00 | 99.10 |
| CGVAE | 100.00 | 96.33 | 98.03 | 100.00 | 100.00 | 99.95 |
| D-MolVAE-V | 100.00 | 96.10 | 99.15 | 100.00 | 100.00 | 99.95 |
| D-MolVAE-β | 100.00 | 95.35 | 96.62 | 100.00 | 100.00 | 99.72 |
| D-MolVAE-DIP-I | 100.00 | 97.36 | 97.80 | 100.00 | 99.99 | 99.88 |
| D-MolVAE-DIP-II | 100.00 | 98.31 | 72.36 | 100.00 | 100.00 | 51.42 |
| D-MolVAE-VIB | 100.00 | 95.85 | 98.66 | 100.00 | 100.00 | 99.18 |
Note: The highest value achieved on a metric is highlighted in boldface.
Table 4 allows making several observations. ChemVAE, GrammarVAE and GraphVAE have the lowest performance. The D-MolVAE models achieve superior performance over the other models. In particular, all D-MolVAE models achieve 100% validity on all datasets. Similar performance is observed on uniqueness as well. Varied performance is observed on novelty, though all D-MolVAE models consistently outperform or match the performance of the other models; CGVAE is the only other model with a consistently good performance across all metrics on all datasets. This is not surprising, as our proposed models build over the CGVAE architecture but additionally enforce disentanglement. The explicit disentanglement enforcement seems to provide some benefit in higher novelty over CGVAE, in particular on the QM9 dataset. Taken altogether, these results suggest that the disentanglement enforcement does not reduce and actually improves performance; adding the disentanglement regularization does not influence the reconstruction error and so does not sacrifice the quality of generated molecules. It is worth noting that some of the proposed models, such as D-MolVAE-DIP-I and D-MolVAE-DIP-II, generate more novel molecules. Between the two, D-MolVAE-DIP-II generates more novel (nearly 100%) yet less unique (50–70%) molecules due to the stronger constraint exerted by the KL divergence term. In Table 5, we further evaluate the performance of our proposed methods and the strongest baseline, CGVAE, on two new datasets, MOSES and ChEMBL. On the MOSES dataset, all models achieve 100% validity and novelty, while D-MolVAE-VIB and D-MolVAE-DIP-I also achieve 100% uniqueness. On the ChEMBL dataset, all models achieve comparable results, except D-MolVAE-V on uniqueness.
| Model | MOSES Validity | MOSES Novelty | MOSES Unique | CHEMBL Validity | CHEMBL Novelty | CHEMBL Unique |
|---|---|---|---|---|---|---|
| CGVAE | 99.97 | 99.97 | 95.33 | 100.00 | 99.97 | 99.85 |
| D-MolVAE-V | 100.00 | 100.00 | 99.70 | 100.00 | 100.00 | 14.85 |
| D-MolVAE-β | 100.00 | 100.00 | 99.73 | 100.00 | 100.00 | 99.35 |
| D-MolVAE-DIP-I | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 99.96 |
| D-MolVAE-DIP-II | 100.00 | 100.00 | 56.53 | 100.00 | 100.00 | 99.93 |
| D-MolVAE-VIB | 100.00 | 100.00 | 100.00 | 100.00 | 99.97 | 99.88 |
Note: The highest value achieved on a metric is highlighted in boldface.
3.3 Comparing the learned distribution to the training distribution
Given the above results, we now focus the comparison of our models against CGVAE. We measure the distance between the generated and the training datasets in terms of molecular properties and graph statistics, as shown in Table 6, utilizing two popular metrics, the maximum mean discrepancy (MMD) (You et al., 2018) and the KL divergence (KLD) (You et al., 2018). MMD is used when comparing distributions of graph statistics and KLD is used when comparing distributions of molecular properties; the molecular properties of interest are selected due to their low correlation, which is ideal for the disentanglement experiment setting that requires independent semantic factors. The correlation heatmap between commonly used molecular properties evaluated on the QM9 dataset is shown in Supplementary Figure S1. All these statistics are described in detail in the Supplementary Material, where we also show randomly selected molecules from the QM9-generated dataset for each of the models.
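As a concrete illustration of the KLD comparison, the sketch below compares the distribution of one molecular property between training and generated molecules; RDKit's MolLogP is used as a stand-in for cLogP, and the binning and smoothing constant are assumptions of this sketch.

```python
# Sketch: KLD between training and generated distributions of one molecular property.
# RDKit's MolLogP stands in for cLogP; binning and the smoothing constant are assumptions.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.stats import entropy

def logp_values(smiles_list):
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return np.array([Descriptors.MolLogP(m) for m in mols if m is not None])

def property_kld(train_smiles, gen_smiles, bins=50):
    train_vals, gen_vals = logp_values(train_smiles), logp_values(gen_smiles)
    lo = min(train_vals.min(), gen_vals.min())
    hi = max(train_vals.max(), gen_vals.max())
    p, _ = np.histogram(train_vals, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(gen_vals, bins=bins, range=(lo, hi), density=True)
    return entropy(p + 1e-10, q + 1e-10)   # KL(train || generated); constant avoids empty bins
```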
| Dataset | Metric | CGVAE | Mol-V | Mol-β | Mol-DI | Mol-DII | Mol-VIB |
|---|---|---|---|---|---|---|---|
| QM9 | MMD(Deg) | 0.0167 | 0.0258 | 0.0541 | 0.0838 | 0.0238 | 0.0232 |
| QM9 | MMD(CC) | 0.0097 | 0.0051 | 0.0259 | 0.0175 | 0.0095 | 0.0045 |
| QM9 | MMD(Orbit) | 0.0018 | 0.0210 | 0.0021 | 0.0079 | 0.0031 | 0.0017 |
| QM9 | KLD(cLogP) | 0.08 | 0.41 | 0.44 | 0.35 | 0.46 | 0.01 |
| QM9 | KLD(cLogS) | 0.06 | 0.27 | 0.26 | 0.18 | 1.23 | 0.13 |
| QM9 | KLD(Drug) | 0.07 | 0.15 | 0.08 | 0.18 | 0.22 | 0.04 |
| QM9 | KLD(RPSA) | 0.04 | 0.29 | 0.11 | 0.18 | 0.51 | 0.04 |
| QM9 | KLD(PSA) | 0.03 | 0.07 | 0.07 | 0.30 | 0.09 | 0.03 |
| QM9 | KLD(SA) | 0.44 | 0.21 | 0.50 | 0.89 | 0.16 | 0.20 |
| ZINC | MMD(Deg) | 0.0023 | 0.0005 | 0.0043 | 0.0034 | 0.7962 | 0.0111 |
| ZINC | MMD(CC) | 0.0013 | 0.0002 | 0.0013 | 0.0005 | 0.0316 | 0.0363 |
| ZINC | MMD(Orbit) | 0.0005 | 0.0731 | 0.0001 | 0.0001 | 0.0001 | 0.0006 |
| ZINC | KLD(cLogP) | 0.67 | 0.59 | 0.09 | 0.67 | 0.30 | 0.23 |
| ZINC | KLD(cLogS) | 0.74 | 0.04 | 0.09 | 0.74 | 0.58 | 0.10 |
| ZINC | KLD(Drug) | 1.29 | 1.63 | 0.97 | 1.29 | 1.52 | 0.01 |
| ZINC | KLD(RPSA) | 0.78 | 0.47 | 0.31 | 0.79 | 1.17 | 0.08 |
| ZINC | KLD(PSA) | 0.56 | 0.06 | 0.14 | 0.59 | 0.01 | 0.12 |
| ZINC | KLD(SA) | 0.56 | 0.75 | 0.79 | 0.76 | 2.29 | 0.82 |
| MOSES | MMD(Deg) | 0.0052 | 0.0032 | 0.0031 | 0.0220 | 0.4520 | 0.0024 |
| MOSES | MMD(CC) | 0.0003 | 0.0027 | 0.0004 | 0.0005 | 0.0000 | 0.0002 |
| MOSES | MMD(Orbit) | 0.0009 | 0.0013 | 0.0002 | 0.0006 | 0.0217 | 0.0005 |
| MOSES | KLD(cLogP) | 0.47 | 0.01 | 0.96 | 0.12 | 0.37 | 0.25 |
| MOSES | KLD(cLogS) | 0.22 | 0.21 | 0.17 | 1.01 | 0.50 | 0.16 |
| MOSES | KLD(Drug) | 0.35 | 0.56 | 0.84 | 1.41 | 0.33 | 0.48 |
| MOSES | KLD(RPSA) | 0.04 | 0.01 | 0.18 | 0.93 | 0.97 | 0.05 |
| MOSES | KLD(PSA) | 0.07 | 0.22 | 0.36 | 0.71 | 0.58 | 0.07 |
| MOSES | KLD(SA) | 1.57 | 1.76 | 1.85 | 1.25 | 3.57 | 1.09 |
| CHEMBL | MMD(Deg) | 0.0028 | 0.6634 | 0.0022 | 0.0015 | 0.0013 | 0.0025 |
| CHEMBL | MMD(CC) | 0.0002 | 0.0010 | 0.0004 | 0.0001 | 0.0002 | 0.0001 |
| CHEMBL | MMD(Orbit) | 0.0004 | 0.0424 | 0.0010 | 0.0002 | 0.0002 | 0.0004 |
| CHEMBL | KLD(cLogP) | 0.03 | 0.05 | 0.31 | 0.04 | 0.04 | 0.03 |
| CHEMBL | KLD(cLogS) | 0.04 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04 |
| CHEMBL | KLD(Drug) | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01 |
| CHEMBL | KLD(RPSA) | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 |
| CHEMBL | KLD(PSA) | 0.23 | 0.24 | 0.25 | 0.24 | 0.25 | 0.23 |
| CHEMBL | KLD(SA) | 0.07 | 0.08 | 0.09 | 0.08 | 0.08 | 0.08 |
Note: We abbreviate D-MolVAE by Mol, DIP by D, degree by Deg, clustering coefficient by CC, drug-likeness by Drug and Rel PSA by RPSA. The best value per row is in boldface.
| Dataset | Model | β-M(%)↑ | F-M(%)↑ | DCI↑ | Mod↑ |
|---|---|---|---|---|---|
| QM9 | CGVAE | 100 | 57.0 | 0.055 | 0.239 |
| QM9 | Mol-V | 100 | 50.0 | 0.019 | 0.233 |
| QM9 | Mol-β | 100 | 56.0 | 0.0466 | 0.223 |
| QM9 | Mol-DI | 100 | 61.2 | 0.023 | 0.261 |
| QM9 | Mol-DII | 100 | 62.0 | 0.0972 | 0.241 |
| QM9 | Mol-VIB | 100 | 72.0 | 0.1282 | 0.243 |
| ZINC | CGVAE | 100 | 48.0 | 0.011 | 0.195 |
| ZINC | Mol-V | 100 | 44.0 | 0.016 | 0.163 |
| ZINC | Mol-β | 100 | 52.0 | 0.016 | 0.151 |
| ZINC | Mol-DI | 100 | 52.4 | 0.010 | 0.197 |
| ZINC | Mol-DII | 100 | 50.0 | 0.019 | 0.188 |
| ZINC | Mol-VIB | 100 | 58.0 | 0.036 | 0.189 |
| MOSES | CGVAE | 100 | 38.0 | 0.059 | 0.184 |
| MOSES | Mol-V | 100 | 44.0 | 0.060 | 0.189 |
| MOSES | Mol-β | 100 | 46.0 | 0.061 | 0.186 |
| MOSES | Mol-DI | 100 | 58.0 | 0.062 | 0.209 |
| MOSES | Mol-DII | 100 | 50.0 | 0.071 | 0.212 |
| MOSES | Mol-VIB | 100 | 54.0 | 0.078 | 0.253 |
| CHEMBL | CGVAE | 82.0 | 61.3 | 0.181 | 0.500 |
| CHEMBL | Mol-V | 80.0 | 62.0 | 0.202 | 0.499 |
| CHEMBL | Mol-β | 82.6 | 62.3 | 0.219 | 0.491 |
| CHEMBL | Mol-DI | 84.0 | 62.0 | 0.209 | 0.481 |
| CHEMBL | Mol-DII | 80.0 | 64.0 | 0.213 | 0.456 |
| CHEMBL | Mol-VIB | 85.3 | 64.6 | 0.183 | 0.504 |
Note: ↑ indicates that a higher value on a metric is better. Best performances are bolded.
In Table 6, the smaller the value, the more similar the generated set is to the training set on the property under comparison. Table 6 shows that all models reasonably preserve the distributions of properties in the training set. In comparison with CGVAE, our D-MolVAE models preserve the distributions better on the ZINC and MOSES datasets and less well on the QM9 dataset. However, our models consistently perform well on all four datasets. The only dataset where CGVAE performs better than any of our models on about half of the properties (4/9) is the QM9 dataset. CGVAE also performs comparably on KLD to at least one of our models on the ChEMBL dataset, but it is outperformed on MMD. On both the ZINC and the MOSES datasets, our models outperform CGVAE. In particular, D-MolVAE-VIB performs consistently well across all four datasets. The KLD between the training and the generated datasets is small, and this is further confirmed visually by plotting the distributions of the molecular properties cLogP, cLogS, PSA, rPSA and drug-likeness for each model in Supplementary Figures S2–S7. These results make clear that our D-MolVAE models capture well the distributions of the molecular properties in the training dataset.
Altogether, these results suggest that the proposed models capture the underlying property distribution of the training dataset. Overall, all models balance well between information preservation and novelty in the generated molecules. Among all our D-MolVAE models, it is easily observed that D-MolVAE-VIB outperforms all the others along most metrics. Interestingly, even though the disentanglement-enhanced models do not outperform the baselines in terms of capturing the synthesis accessibility (SA) score distribution, they generate novel molecules with higher SA scores, e.g. D-MolVAE-VIB. This observation demonstrates the exploration power of the disentangled models and the better trade-off they allow us to achieve between exploration and exploitation. It is worth noting that one can choose between the disentangled models and the base models according to a preference for exploration or exploitation.
3.3.1 Quantitative evaluation of disentanglement learning
All implementation details are as in Locatello et al. (2018).
Table 7 shows that our models achieve better overall disentanglement scores than CGVAE. Specifically, on the QM9 dataset with smaller molecules, D-MolVAE-DIP-I, D-MolVAE-DIP-II and D-MolVAE-VIB achieve F-M scores of 61.2%, 62.0% and 72.0%, respectively, whereas CGVAE achieves only 57.0%. All models achieve comparable Mod scores, with D-MolVAE-DIP-I achieving the highest. All models achieve a β-M score of 100%. D-MolVAE-VIB outperforms all others on the DCI score, and this observation holds on the QM9, ZINC and MOSES datasets. Interestingly, all models perform worse on the ZINC dataset, which contains larger molecules than the QM9 dataset. Similarly, on the MOSES dataset, all the models perform worse than on QM9 but better than on ZINC. Specifically, D-MolVAE-DIP-I and D-MolVAE-VIB rank as the top two on the F-M metric, and D-MolVAE-VIB achieves the best performance on the DCI and Mod metrics, with an up to 16% improvement over the second best model, D-MolVAE-DIP-II. On the ChEMBL dataset, D-MolVAE-VIB performs the best across the β-M, F-M and Mod metrics. D-MolVAE-DIP-I achieves the second-best β-M score (84.0%), while CGVAE achieves only 82.0%. Nevertheless, D-MolVAE-β performs slightly better than D-MolVAE-DIP-II on the DCI metric, where it achieves the best performance. Altogether, these results show that the proposed disentanglement enhancements improve the ability of a model for disentanglement learning, especially for D-MolVAE-VIB.
3.3.2 Relating disentangled factors to molecular properties
In Figure 1, we show how the learned disentangled factors relate to the biological properties computed on each generated molecule. The mutual information is calculated between each of the disentangled factors learned by CGVAE and the D-MolVAE models and the molecular properties computed on generated molecules. We focus the comparison here to the MOSES-trained CGVAE and D-MolVAE-VIB models but show all models on all datasets in the Supplementary Material.
Figure 1 clearly shows that the factors learned by CGVAE correlate only weakly with the molecular properties. Such a relationship is stronger for the disentangled factors learned by our D-MolVAE models, even though all models are unsupervised. Moreover, different disentangled factors from D-MolVAE-VIB tend to correlate more clearly with different properties than those from CGVAE, thanks to the disentanglement enhancement.
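A minimal sketch of the mutual-information computation behind such heatmaps is given below, assuming the latent codes and property values have already been collected into arrays; the nearest-neighbour MI estimator from scikit-learn is an assumption, not necessarily the estimator used for Figure 1.

```python
# Sketch: mutual information between each latent factor and each molecular property.
# `latents` is an (n_molecules x L) array of codes and `props` an (n_molecules x P) array of
# property values; scikit-learn's MI estimator is assumed, not necessarily the paper's choice.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_heatmap(latents: np.ndarray, props: np.ndarray) -> np.ndarray:
    n_factors, n_props = latents.shape[1], props.shape[1]
    mi = np.zeros((n_factors, n_props))
    for j in range(n_props):
        mi[:, j] = mutual_info_regression(latents, props[:, j])  # MI of every factor with property j
    return mi
```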
Figure 2 allows digging deeper into the impact of a property of interest by visualizing the change in the property over generated molecules when a particular latent factor is varied over a range and the others are kept fixed. We focus on one of our top models, D-MolVAE-VIB, and on PSA, which is a crucial consideration when generating molecules, as it directly relates to our ability to actually synthesize them in wet laboratories. We can clearly see that essentially only one factor is strongly related to PSA, thanks to our disentanglement enhancement that strengthens the independence among different factors and hence minimizes the number of different factors correlated to a property (e.g. PSA). Figure 2 shows that one of the latent factors impacts PSA, and this is more clearly visible on the QM9 and MOSES datasets.
4 Conclusion
The evaluation presented in this article suggests that the proposed disentanglement framework D-MolVAE is effective at generating valid, novel and unique small molecules and outperforms several state-of-the-art generative models. This performance is due to the sequence decoding process and, specifically, valence checking and the stop-checking mechanism. Other graph-based generative models that lack this process (for instance, GraphVAE) suffer in this respect and generate invalid molecules. The variational inference in D-MolVAE also allows better capturing the distribution of the input dataset and so sampling novel and unique molecules from the learned distribution.
It is important to note that the loss functions in the models we propose here effectively implement a trade-off between the disentanglement enhancement and the reconstruction. The distributions of specific properties (for instance, synthesis accessibility) show the exploration-exploitation trade-off in the various disentangled models. Our analysis shows that explicit disentanglement enforcement does not hurt the proposed models; indeed, like CGVAE, the proposed models generate novel and unique molecules and even surpass CGVAE on some of the datasets; the disentangled factors provide an advantage. Moreover, the proposed D-MolVAE models better capture the underlying graph statistics and distributions of various biological properties. Our evaluation also reveals that different types of disentangled models have different abilities. In particular, the experiments suggest that D-MolVAE-VIB is a promising model for exploring disentangled representations.
We consider the proposed work to be a first step to address remaining challenges in small molecule generation. Beyond interpreting the generation process, it is important to precisely control the properties of generated molecules. The disentangled representation learning in this article falls under the umbrella of unsupervised learning. Therefore, specific control and correspondence of latent factors to molecular properties of interest is not expected to be strong. Our analysis shows that, in principle, one can build over the models proposed here for such precise control. Ideally, given specific target values for several properties of interest, one could decode the latent variables back into a molecule that achieves the target property values. Our future work will address such models.
We also note that current models, including those proposed and evaluated in this article, are only concerned with global properties of molecules (or their graph representations), such as cLogP, drug-likeness and others. Preserving local properties of an atom or a cluster of atoms (e.g. an aromatic hydrocarbon) has not been explored so far. Doing both can be helpful in designing novel molecules while improving our understanding of the contribution of each element to the overall molecular properties of interest. We caution, however, that supervised representation learning, while useful in many specific applications, may also bias toward a known, target set of molecular properties and miss possibly interesting new discoveries. In our future work, we hope to advance both unsupervised and supervised representation learning in small molecule generation.
Funding
This work was supported in part by the National Science Foundation [grant numbers 1942594, 1755850, 1907805]. This material was additionally based upon work by A.S. supported by (while serving at) the National Science Foundation. Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Conflict of Interest: none declared.