Probably the most common use of machine learning (ML) in drug discovery is quantitative structure–activity relationship (QSAR) modelling, where we predict the properties of potential compounds, such as toxicity or binding affinity, from their structure.

Determining every property for every compound experimentally would require an unfeasible amount of time and money, particularly when we have thousands of compounds in our study. Therefore, we want to make the best possible predictions from our data, so that we prioritise only the most promising candidates for synthesis and testing. Squeezing every last drop of performance from our models translates directly into real-world cost savings and faster, more efficient drug discovery.

In our paper ‘Improving Predictions of Molecular Properties with Graph Featurization and Heterogeneous Ensemble Models‘ published in JCIM, we explore these two key steps for maximising QSAR model performance. Taken together, these push the limit of molecular property prediction, outperforming existing state-of-the-art models. 

MetaModel: Heterogeneous ensemble modelling improves generalisability 

A well-known result in ML is the so-called “no free lunch” theorem, which states that, averaged across all possible problems, all algorithms perform equally well. In practice, this means that no single class of ML model (random forest, neural network, linear regression, etc.) will perform best on every modelling task. Experienced computational chemists are likely familiar with this already – sometimes the latest, most sophisticated deep learning model is outperformed by a more traditional random forest, despite all the ground-breaking research that’s gone into the newer model.

Drug discovery is an extremely rich field. The properties that we want to predict, and the data we use to predict them, vary enormously in terms of complexity, functional form, and quality. Therefore, we cannot expect any single algorithm to perform well across all the problems we encounter.  

To overcome this challenge, we introduce the MetaModel (Figure 1), an ensemble model constructed by combining the predictions from many different classes of ML algorithms. By weighting the individual sub-model predictions based on their validation set performance, we can ensure that we prioritise the models that perform best on any given problem.

Figure 1. MetaModel includes a diverse ensemble of sub-model classes
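As an illustration, the sketch below (in Python with NumPy) shows how a performance-weighted ensemble of this kind could combine sub-model predictions. The weighting scheme shown, normalised validation scores, is a simplifying assumption for illustration rather than the exact scheme used in the paper, and the `sub_models` and `score_fn` inputs are placeholders for any collection of fitted models and any validation metric.

```python
import numpy as np

def metamodel_predict(sub_models, X_val, y_val, X_test, score_fn):
    """Illustrative performance-weighted ensemble (assumed weighting scheme,
    not necessarily the one used in the paper).

    sub_models : list of fitted models, each exposing a .predict() method
    score_fn   : validation metric where higher is better (e.g. R^2 or ROC AUC)
    """
    # Score each sub-model on the held-out validation set
    scores = np.array([score_fn(y_val, m.predict(X_val)) for m in sub_models])

    # Turn the scores into non-negative weights that sum to one,
    # so better-performing sub-models contribute more to the final prediction
    weights = np.clip(scores, 0.0, None)
    weights = weights / weights.sum()

    # Collect test-set predictions from every sub-model: shape (n_models, n_samples)
    preds = np.array([m.predict(X_test) for m in sub_models])

    # Weighted average gives the ensemble prediction; the spread across
    # sub-models gives a simple estimate of the prediction uncertainty
    ensemble_pred = np.average(preds, axis=0, weights=weights)
    uncertainty = preds.std(axis=0)
    return ensemble_pred, uncertainty
```

Note that the spread of the sub-model predictions doubles as a simple uncertainty estimate, a point we return to below.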

Figure 2 shows the performance of the MetaModel ensemble on the MoleculeNet BACE dataset as a function of the number of sub-models in the ensemble. In this case, smaller ensembles perform worse, even when the average performance of the individual sub-models is better. The sharp drop in performance below an ensemble size of 20 corresponds to a drop in the number of sub-model classes from 6 to 2, demonstrating the importance of a diverse ensemble of models.

Figure 2. MetaModel ensemble performance (measured with the area-under-curve (AUC) of the receiver operating characteristic (ROC) curve, so higher is better) increases with the total number of sub-models and the number of sub-model classes. 

There are additional benefits to this approach: 

  • Combining different model classes reduces the inductive bias and the variance in the prediction that arises from the stochastic noise inherent to many models. 
  • We can use the ensemble to estimate the uncertainty in our predictions by looking at how closely the different sub-models agree. 

Now that we have an ensemble that should be able to tackle any modelling problem, how can we set it up for success? 

Graph featurisation: Overcoming the limits of fixed descriptors

You may already be familiar with the phrase “garbage in, garbage out” as it applies to ML: the final model is only ever as good as the data used to train it. For QSAR modelling, a big part of this is the choice of molecular descriptors. Descriptors are how we convert a compound into a set of numbers, representing different physical or chemical properties of the compound, which can then be used as input to an ML model. Many thousands of different potential descriptors exist, from basic properties like molecular weight to sophisticated 4D descriptors built from molecular dynamics simulations. 
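To make this concrete, here is a minimal sketch of how a handful of general-purpose descriptors can be computed with the open-source RDKit toolkit. The particular descriptors shown are an arbitrary illustration, not the descriptor set used in the paper.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen

def basic_descriptors(smiles: str) -> dict:
    """Turn a SMILES string into a small, fixed set of numerical descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "mol_weight": Descriptors.MolWt(mol),           # molecular weight
        "logp": Crippen.MolLogP(mol),                   # calculated lipophilicity
        "tpsa": Descriptors.TPSA(mol),                  # topological polar surface area
        "h_donors": Descriptors.NumHDonors(mol),        # hydrogen-bond donors
        "h_acceptors": Descriptors.NumHAcceptors(mol),  # hydrogen-bond acceptors
    }

# Example: descriptors for aspirin
print(basic_descriptors("CC(=O)Oc1ccccc1C(=O)O"))
```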

Given the abundance of descriptors, one might naively assume that we can shove all of them into our model and let the algorithm figure it out. However, if we try this, we run into the scourge of machine learning: overfitting. Like humans seeing faces in clouds, ML models love to find patterns where none exist. The more numbers we feed them, the more spurious correlations they will find, and the worse they will perform in the real world. 

Because of this, we have to pick a relatively small set of descriptors, typically tens to hundreds, to better our chances of identifying real patterns in the data. Unfortunately, just as no single model is optimal across all problems, no one set of descriptors will be optimal. This puts us in a bit of a bind: we can’t use all possible descriptors, and any subset we choose will sometimes be wrong. 

Deriving task-specific descriptors with graph neural networks

Fortunately, we can work around this problem by deriving our own task-specific descriptors. The most popular way to do this is using graph neural networks (GNNs). Like the convolutional networks used for image classification, these models can work directly from the raw data, which in this case are molecular graphs. 

Figure 3. How a GNN makes predictions from a molecular graph. First, graph network layers learn the properties of the molecule. This information is then compressed into a single 1D vector, a latent representation of the compound. That vector is then passed to a conventional feed-forward network (FFN), which makes the final prediction.

The GNN first derives its own numerical representation of the molecular graph, as a latent vector, before using a feed-forward network (FFN) prediction head to turn that vector into a property prediction. This numerical representation encodes all the relevant information the GNN can find, effectively a learned set of descriptors. These descriptors will be custom-made for the property we’re trying to predict, so we don’t have to worry about pushing irrelevant data into our model. 
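Schematically, the architecture looks something like the following PyTorch-style sketch. This is a simplified illustration of the general latent-vector-plus-FFN pattern, not ChemProp's actual implementation; the encoder module and the latent dimension are placeholders.

```python
import torch
import torch.nn as nn

class GNNPropertyModel(nn.Module):
    """Schematic GNN-style property model: an encoder that produces a latent
    vector, followed by a feed-forward prediction head. Purely illustrative."""

    def __init__(self, encoder: nn.Module, latent_dim: int = 300):
        super().__init__()
        self.encoder = encoder  # message-passing layers + readout -> latent vector
        self.ffn_head = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # single property prediction
        )

    def featurise(self, mol_graph) -> torch.Tensor:
        # The latent vector acts as a learned, task-specific set of descriptors
        return self.encoder(mol_graph)

    def forward(self, mol_graph) -> torch.Tensor:
        return self.ffn_head(self.featurise(mol_graph))
```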

Optimising performance with a ‘best-of-both’ strategy

While this is a big improvement, GNNs do have weaknesses. Firstly, it’s generally not possible to derive all the relevant descriptors from a limited dataset, and GNNs have their own blind spots – they’re very good at understanding local structure, but struggle with global properties. Modern GNNs like ChemProp mitigate this by supplementing their latent vector with general-purpose fixed descriptors, getting the best of both worlds.

The second weakness is inherent to the model architecture – GNNs typically rely on an FFN to turn their descriptors into a property prediction. This is effectively the same as a multi-layer perceptron model, the classic neural network for property prediction, and, as we’ve seen, no single model class will perform optimally across all tasks. Relying purely on FFNs to make predictions limits the potential of GNN-based models, meaning that their powerful molecular representations are not always fully exploited. 

In our paper, we show that we can improve on this by extracting the learned descriptors from a trained ChemProp model and using them, together with a set of general-purpose descriptors, as input to the MetaModel ensemble.

Figure 4.  The combined graph featurisation and MetaModel framework. Two sets of descriptors are calculated and combined for each compound, one using a trained GNN, and one using conventional descriptors. These descriptors are used to train the ensemble of sub-models, and their predictions are weighted and combined by the MetaModel.

In this way, we get the benefit of task-specific descriptors without the limitations of a pure neural network approach.
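A minimal sketch of this workflow is shown below, assuming a trained GNN that exposes some way of returning its latent vector for a compound. The `featurise` method and `descriptor_fn` here are hypothetical placeholders rather than ChemProp's actual API; in practice, the learned descriptors would come from a ChemProp fingerprint and the fixed descriptors from a toolkit such as RDKit.

```python
import numpy as np

def build_metamodel_inputs(smiles_list, trained_gnn, descriptor_fn):
    """Concatenate learned GNN descriptors with general-purpose descriptors.

    trained_gnn   : trained GNN exposing a .featurise(smiles) -> vector method
                    (hypothetical interface, standing in for a ChemProp fingerprint)
    descriptor_fn : function mapping SMILES -> dict of fixed descriptors (e.g. RDKit)
    """
    rows = []
    for smi in smiles_list:
        learned = np.asarray(trained_gnn.featurise(smi))       # task-specific descriptors
        fixed = np.asarray(list(descriptor_fn(smi).values()))   # general-purpose descriptors
        rows.append(np.concatenate([learned, fixed]))
    return np.vstack(rows)  # one combined feature row per compound, ready for the ensemble
```

The combined feature matrix is then used to train each sub-model in the MetaModel ensemble, whose predictions are weighted and combined as described above.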

Results

MetaModel delivers substantial improvements where neural networks alone underperform

We tested our modelling framework on the MoleculeNet datasets, benchmarking the MetaModel performance against ChemProp.

Figure 5. MetaModel outperforms ChemProp on all regression datasets tested.

This plot shows the performance of ChemProp and the MetaModel on the six regression datasets (measured using mean absolute error (MAE) or root mean squared error (RMSE), so lower is better). In each case, we test the MetaModel with a set of general-purpose descriptors from RDKit, either with or without GNN descriptors from a ChemProp message-passing neural network (MPNN). The MetaModel performs better than ChemProp on every dataset; the improvement is highly significant on five datasets and marginally significant on the sixth (Lipophilicity, p=0.05).

The effect of adding the GNN/MPNN descriptors is more nuanced: they improve the MetaModel performance on the three quantum mechanics datasets but offer little or no benefit on the physical chemistry datasets. This is likely because the general-purpose RDKit descriptors are already a good match for the PhysChem datasets, so the learned descriptors are unable to add value.

Figure 6. MetaModel outperforms ChemProp on 6 out of the 9 classification datasets tested.

The picture is similar for the nine classification datasets (performance is measured using the AUC of either the ROC curve or the precision-recall curve (PRC), so higher is better).

The MetaModel outperforms ChemProp on six datasets (of which three are highly significant), while ChemProp significantly outperforms the MetaModel on the ToxCast and Tox21 datasets. These are large, sparse datasets with many endpoints, which benefit a multi-target model like ChemProp over the single-target MetaModel. In fact, these datasets are much more suited to an imputation model like the one underpinning our Cerella™ deep learning platform!

Aside from these exceptions, where we could not reasonably expect the MetaModel to perform as well, it consistently matches or outperforms ChemProp. This is the sort of behaviour we should anticipate: where ChemProp is already close to optimal, there is no room for improvement, so the best we can do is match it. On the other hand, where ChemProp is underperforming (because the FFN is not the best model for the job), the flexibility of the MetaModel ensemble allows it to find better solutions and make better predictions.

Conclusion

Our work demonstrates the potential for pushing the limits of QSAR modelling by using diverse ensemble models, which ensure the best-performing algorithms are prioritised for every task, and by using GNNs as featurisers to extract the relevant information from molecular graphs. The paper is published in JCIM, and a pre-print version is available here.

These upgrades will be implemented into our Auto-Modeller™ module, enabling StarDrop users to build and validate their own cutting-edge QSAR models, tailored to their chemistry and data, at the press of a button. Ultimately, these improved models give chemists the tools they need to better understand their molecules and prioritise the right designs for synthesis and testing.

About the author

Michael Parker, PhD

Michael is a Principal AI Scientist at Optibrium, applying advanced AI techniques to accelerate drug discovery and improve decision-making. With a Ph.D. in Astronomy and Astrophysics from the University of Cambridge, he brings a data-driven approach to solving complex scientific challenges. Michael is also a thought leader, contributing to discussions on the impact of AI in pharmaceutical research.

LinkedIn
