The pillars of drug discovery model-building

We often get asked, “How do you know that your models work?” While it is a straightforward question, there isn’t an easy answer. The fact is that no model works for everything, and even the most accurate model has its weak points.

Here at Optibrium, we have four model-building “pillars” that help us to make sure our models behave as expected:

  • Clean data;
  • Appropriate methods;
  • Model validation (including appropriate metrics);
  • Understanding the domain of applicability.

Data curation for model building

A model can only be as good as the data it has been trained on. We therefore put a lot of emphasis on the data, manually curating our data sets and only using sources that provide detailed information on the desired property.

Results published in the literature can vary significantly due to differences in experimental protocols and methodologies between laboratories. To reduce this variability, we carefully filter out results from experiments that don’t meet our criteria. For example, for our metabolism models, we reject data from any experiments run at unphysiological substrate concentrations. The gathered data thus better represents the property we aim to model.
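As a simple illustration, a rule-based curation step of this kind might look like the sketch below. The file names, column names, and the 10 µM cut-off are purely hypothetical placeholders, not our actual curation criteria.

```python
# A minimal curation sketch: filter literature results down to those measured
# under consistent, physiologically relevant conditions.
import pandas as pd

raw = pd.read_csv("literature_metabolism_data.csv")  # hypothetical input file

curated = raw[
    (raw["substrate_conc_uM"] <= 10)                     # reject unphysiological substrate concentrations
    & (raw["assay_type"] == "human liver microsomes")    # restrict to a single, consistent protocol
].drop_duplicates(subset="smiles")                       # remove duplicate measurements of the same compound

curated.to_csv("curated_metabolism_data.csv", index=False)
```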

Choosing the right modelling method

Once the data is gathered, we decide whether building a quality model is feasible. Most of our ADMET models are built using the classical QSAR/QSPR approach: an empirical modelling method in which various molecular descriptors are used, together with statistics or machine learning, to model and predict a property of interest. However, such an approach is not viable for smaller data sets, which are simply too small for meaningful trends or patterns to be extracted mathematically.
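To make the QSAR/QSPR idea concrete, here is a minimal sketch that computes a few simple molecular descriptors with RDKit and fits a machine-learning regressor to a property of interest. The SMILES strings, target values, and descriptor choice are illustrative only; they are not our actual descriptors or models.

```python
# A toy QSAR/QSPR workflow: descriptors in, machine-learning model out.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def featurise(smiles):
    """Turn a SMILES string into a small descriptor vector."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),         # molecular weight
        Descriptors.MolLogP(mol),       # calculated logP
        Descriptors.TPSA(mol),          # topological polar surface area
        Descriptors.NumHDonors(mol),    # hydrogen-bond donors
        Descriptors.NumHAcceptors(mol), # hydrogen-bond acceptors
    ]

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # toy training compounds
y = [0.2, 1.1, 0.9]                                   # hypothetical measured property values

X = [featurise(s) for s in smiles]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.predict([featurise("CCN")]))              # predict for a new molecule
```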

Learn more about QSAR modelling in our webinar, ‘Perfecting the use of imperfect QSAR models’.

At the other end of the modelling spectrum are mechanistic approaches such as molecular mechanics, molecular dynamics, or even quantum mechanics. Because they are grounded in physical principles, such models can be built on smaller data sets. The downside of mechanistic models is the execution time, which can be too long for practical use even with modest methods.

At Optibrium, for smaller data sets, we aim for the sweet spot where we combine elements from the empirical modelling methods with aspects of the mechanistic approach. Thus, we can create physically meaningful models that strike a balance between computational cost and accuracy. As an example, you can read about our metabolism models in the Journal of Medicinal Chemistry.

Validated predictive models

When training models, testing them on data unseen by the model is crucial.

Once we have gathered the data for building a model, we set aside a part of the data (usually 15 to 20%) and won’t touch it until all models are built. We call this the external test set. We use it to run the final test on our model and report the results (e.g., in a scientific journal or in our reference guide).

For a bigger data set, we choose the external data points randomly. For smaller data sets, we use a clustering algorithm to ensure that the training data represents the test data. However, where the data is biased, e.g., towards a certain class in a category model, we might use a Y-based split, ensuring that the proportion of underrepresented data is the same in the training set and the external test set.

The remaining data, in turn, is split into training and validation sets: we train models on the former and, based on the validation results, choose the best-performing model (or tune it if needed). The danger of this approach is that we end up training the model to do well on the validation set; testing the best model on the external test set, however, lets us estimate whether it has been overtrained. A minimal sketch of this splitting scheme follows below.
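The sketch below illustrates the splitting scheme with scikit-learn. The data are toy placeholders; the 15% hold-out and the stratified (“Y-based”) option mirror the description above, but the exact splitting tools we use are not specified here.

```python
# A minimal sketch of an external test set plus training/validation split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)                              # toy descriptor matrix
y = np.random.choice([0, 1], size=1000, p=[0.8, 0.2])    # imbalanced toy classes

# 1) Set aside an external test set (here ~15%), untouched until the final model is chosen.
#    For a biased category model, stratify=y keeps class proportions consistent (a "Y-based" split).
X_rest, X_ext, y_rest, y_ext = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# 2) Split the remainder into training and validation sets for model selection and tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0
)
```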

Assessing model performance – using the right metrics

During the training process, it is also important to constantly assess the results using appropriate metrics and to visualise them. Statistics such as R² and accuracy are informative, but caution is needed so that we aren’t led astray. For example, clusters of data points can artificially boost R² values, and an unbalanced test set can give a high accuracy value to a lousy model.

Data point clusters can result in high R² values, leading us astray when R² is considered without context.

For classification models, we try to avoid accuracy and usually report balanced accuracy and Cohen’s kappa, which take bias in the data sets into account. In addition, we study the confusion matrix, which gives the true positives, true negatives, false positives, and false negatives, and from which metrics such as sensitivity and specificity follow. This matters because, for certain models, we must avoid false positives or prioritise sensitivity over specificity. Therefore, rather than simply targeting a high R² or accuracy, we aim to avoid these common pitfalls, which is what makes the models genuinely useful in practice.
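The short sketch below shows how these metrics are obtained with scikit-learn, using toy labels. The numbers themselves are meaningless; the point is how an imbalanced test set can flatter plain accuracy while balanced accuracy and Cohen’s kappa tell a more honest story.

```python
# A minimal sketch of the classification metrics discussed above.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, confusion_matrix)

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # heavily imbalanced toy labels
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # a model that mostly predicts the majority class

print("accuracy:         ", accuracy_score(y_true, y_pred))           # looks flattering (0.9)
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # accounts for the imbalance
print("Cohen's kappa:    ", cohen_kappa_score(y_true, y_pred))        # agreement beyond chance

# The confusion matrix gives the true/false positives and negatives directly,
# from which sensitivity and specificity follow.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print("sensitivity:", sensitivity, "specificity:", specificity)
```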

To further guard against such pitfalls, we often analyse outliers and influential points that disproportionately affect the model or the results.

Furthermore, for many models we also include an uncertainty estimate, which helps us understand whether the difference between two predictions is meaningful.
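One common way to attach an uncertainty to a prediction is the spread across an ensemble of models, sketched below. This is an illustrative approach under that assumption, not necessarily the specific uncertainty method used in our models.

```python
# A minimal sketch: use the spread of an ensemble's predictions as an uncertainty estimate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 5)
y = X[:, 0] * 2.0 + np.random.normal(scale=0.1, size=200)   # toy property

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = np.random.rand(1, 5)
# The individual trees give a distribution of predictions, not just a point estimate.
tree_preds = np.array([tree.predict(x_new)[0] for tree in model.estimators_])
mean, std = tree_preds.mean(), tree_preds.std()
print(f"prediction: {mean:.2f} +/- {std:.2f}")
# If two compounds' prediction intervals overlap heavily, the difference between
# their predictions is probably not meaningful.
```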

Domain of applicability

Last but not least, we mentioned at the beginning that there isn’t a model fit for every problem. Since most of our models aim to predict properties relevant to the pharmaceutical industry, we verify that they work on drug-like molecules. This might seem excessive, because the data used to train the models comes mainly from experiments with drug-like molecules, but it is important to confirm that all relevant chemistries are covered.

Furthermore, it is essential to test the applicability of models trained on smaller data sets. We have an in-house data set of 1,300 launched drugs that we use to test the models. If the predictions have high uncertainties for many of the compounds in this set, it most likely means that our model, despite having good statistics, is not suitable for a large part of the relevant chemical space. In other words, instead of a global model, we have a local model. These can be useful, but their use cases are limited.
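In spirit, this check boils down to something like the sketch below: predict across the reference set of launched drugs and see how much of it the model covers confidently. The arrays and the 0.5 threshold are hypothetical placeholders, not our actual criteria.

```python
# A minimal sketch of a domain-of-applicability check against a reference set.
import numpy as np

predictions = np.random.rand(1300)      # toy predictions for the launched-drug set
uncertainties = np.random.rand(1300)    # toy per-compound uncertainty estimates

threshold = 0.5                         # hypothetical "too uncertain" cut-off
fraction_uncertain = np.mean(uncertainties > threshold)

if fraction_uncertain > 0.5:
    print(f"{fraction_uncertain:.0%} of launched drugs are poorly covered: "
          "likely a local model with limited use cases.")
else:
    print(f"Model covers {1 - fraction_uncertain:.0%} of the reference set confidently.")
```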

In Daniel’s blog article, learn more about harnessing uncertainty to gain value from models.

Conclusions – predictive models that work

In summary, we ensure that our predictive models are trained on quality data and use various techniques to prevent overfitting during training. We avoid models that have good statistics but are of little use, either because those statistics are artificially boosted or because the domain of applicability is very limited. Finally, it is important to communicate the results as clearly as possible to our users, so that they understand the strengths and weaknesses of any given model.

About the author

Mario Öeren, PhD

Mario is a Principal Scientist at Optibrium, where he applies expertise in quantum chemistry, conformational analysis, and molecular modelling to support drug discovery research. With a PhD in Natural Sciences from Tallinn University of Technology, his academic work focused on computational studies of macrocyclic molecules. Mario has authored numerous publications and has extensive experience in chemical education and advanced spectroscopy.

LinkedIn

Dr Mario Öeren, Principal Scientist, Optibrium

More predictive modelling resources