Automatic QSAR modeling of ADME properties: blood-brain barrier penetration and aqueous solubility
Summary In this study, our researchers combined an automatic model generation process for building QSAR models with the Gaussian Processes…
Before we start troubleshooting, let’s remember why we build and run these models in the first place. Good QSAR models help you understand the properties of a compound prior to synthesis.
When used as part of a multiple-parameter optimisation strategy, they make it possible to prioritise compounds with the optimal balance of potency, ADMET and physicochemical properties. Just as importantly, they highlight situations where such a balance is not possible. In drug discovery, if you’re going to fail, you want to do it quickly, cheaply, and confidently.
Often, you can’t test every compound in every assay. There just isn’t enough time or resource. Predicting properties with QSAR models can help to fill in those gaps and avoid costly late-stage surprises.
When working with medicinal chemists and their data, we see five common explanations for discrepancies between a model’s predictions and the measured results.
The first thing to do is assess the quality of the data used to build your model.
For a successful model, you need a balance of both active and inactive compounds in your training dataset. Without this, your model can’t learn to distinguish what makes a molecule active against your target, making it worthless to your project.
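A quick way to check this balance before building anything is to count the classes in your training labels. The sketch below is only illustrative: the function names and the 20% cut-off are our own assumptions, not a recommended standard, and the labels are made up.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of the dataset that falls in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_imbalanced(labels, min_fraction=0.2):
    """Flag the dataset if a class is missing or rarer than min_fraction.

    min_fraction is an arbitrary, illustrative threshold.
    """
    fractions = class_balance(labels)
    return len(fractions) < 2 or min(fractions.values()) < min_fraction


# Made-up labels: 45 actives and only 5 inactives.
labels = ["active"] * 45 + ["inactive"] * 5
print(class_balance(labels))
print(is_imbalanced(labels))
```

If the check flags your dataset, that is a prompt to look for more examples of the rare class before trusting a model built on it.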
It’s also important to consider the reliability of your experimental data. If it’s too noisy, your model will struggle to make confident predictions that you can rely on in your research.
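If you have replicate measurements, one rough way to gauge experimental noise is to compare the pooled standard deviation of the replicates with the spread of values across the dataset. This is a minimal sketch under our own assumptions: the function name is hypothetical and the pIC50 values are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def pooled_replicate_sd(replicates):
    """Pooled standard deviation over compounds with two or more replicates."""
    weighted_variance = 0.0
    degrees_of_freedom = 0
    for values in replicates.values():
        if len(values) < 2:
            continue  # a single measurement says nothing about noise
        weighted_variance += (len(values) - 1) * stdev(values) ** 2
        degrees_of_freedom += len(values) - 1
    if degrees_of_freedom == 0:
        raise ValueError("need at least one compound with replicates")
    return sqrt(weighted_variance / degrees_of_freedom)


# Made-up replicate pIC50 values, for illustration only.
replicates = {
    "cpd1": [6.1, 6.3],
    "cpd2": [7.0, 7.4, 7.2],
    "cpd3": [5.5],  # no replicate, so it does not contribute to the noise
}
noise = pooled_replicate_sd(replicates)
spread = stdev([mean(values) for values in replicates.values()])
print(f"replicate noise: {noise:.2f}, spread across compounds: {spread:.2f}")
```

When the replicate noise approaches the spread across compounds, there is little signal left for any model to learn from.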
Predicting properties like in vivo clearance is incredibly valuable for identifying optimal drug candidates, but also incredibly complex. These outcomes often involve multiple biological mechanisms and are determined by multiple interconnected factors.
Dissecting SAR this complex would require an enormous amount of data, and unfortunately these are typically the sorts of experiments where you have the least amount of data.
When you’re trying to predict more complex biological properties, consider alternative approaches like imputation. Unlike traditional QSAR, imputation can identify relationships between different experimental endpoints as well as structural descriptors. This additional layer of information is key. If you’d like to learn more, our CEO Matthew Segall has written a helpful blog on the differences and applications of QSAR and imputation predictive models.
Sometimes you need to take a step back and ask yourself: “Should I even build a model for this property?”
For instance, enzymes with broad substrate specificity, such as cytochrome P450s and glutathione S-transferases (GSTs), will interact with so many different compounds that your model might end up predicting everything as ‘active’.
In these cases, using physicochemical properties as surrogates for activity might be a more productive approach. For example, when predicting P450 3A4 metabolism, we often observe a strong correlation with logD, which can be a more effective way to prioritise your compounds.
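To see whether a simple property such as logD tracks your measured endpoint, you could start with a rank correlation. The sketch below computes Spearman’s coefficient in plain Python. The function names are our own and the logD and clearance values are invented for illustration; they are not real measurements.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for illustration only.
logd = [0.5, 1.2, 2.0, 2.8, 3.5, 4.1]
clearance = [5, 9, 14, 30, 28, 60]
print(f"Spearman correlation: {spearman(logd, clearance):.3f}")
```

A strong rank correlation suggests that ordering compounds by logD alone may already be enough to prioritise them for this endpoint.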
QSAR models perform best when predicting properties for compounds chemically similar to their training data. We define the chemical space within which a QSAR model can make reliable predictions as the domain of applicability.
For example, when designing new compounds, once you extrapolate into new chemical space, you might notice decreased confidence in your predictions. This is a sign that these compounds differ from those used to build the model; the model has less information on which to base its prediction. This can indicate that the model is no longer applicable and it’s time to rebuild.
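One generic heuristic for spotting compounds outside the domain of applicability is to look at each new compound’s similarity to its nearest neighbour in the training set. The sketch below is a stand-in for illustration, not a description of how any particular software estimates confidence. It assumes fingerprints are already available as sets of on-bit indices, and the 0.5 threshold is arbitrary.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints held as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_similarity(query, training):
    """Similarity of the query to its most similar training compound."""
    return max(tanimoto(query, fingerprint) for fingerprint in training)


def in_domain(query, training, threshold=0.5):
    """Treat the query as in-domain if a close enough neighbour exists.

    The threshold is an illustrative assumption; tune it for your data.
    """
    return nearest_similarity(query, training) >= threshold


# Toy fingerprints, invented for illustration.
training = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11, 12}]
close_analogue = {1, 2, 3, 5}
new_chemotype = {1, 20, 21, 22}

print(in_domain(close_analogue, training))
print(in_domain(new_chemotype, training))
```

Tracking how many of your newly designed compounds fall below the threshold over time gives an early signal that a rebuild is due.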
There are some approaches to take during model building to ensure your model has a robust domain of applicability. For example, select the training set to cover the full diversity of the available data. One way to do this is by clustering the data and then putting the cluster heads and singletons into the training set. Divide the remaining cluster members between the training, validation and test sets. This will ensure that your training set doesn’t exclusively come from one cluster while your test set comes from another.
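The clustering-based split described above can be sketched as follows. We assume a simple leader-style clustering on Tanimoto similarity as a stand-in for whichever clustering method you prefer. Fingerprints are again sets of on-bit indices, and the function names, the threshold and the round-robin assignment are all illustrative choices of ours.

```python
from itertools import cycle


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints held as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold=0.5):
    """Return {head: [members]}; a compound joins the first head it resembles."""
    clusters = {}
    for compound, fingerprint in fingerprints.items():
        for head in clusters:
            if tanimoto(fingerprint, fingerprints[head]) >= threshold:
                clusters[head].append(compound)
                break
        else:
            clusters[compound] = []  # new head (a singleton until joined)
    return clusters


def cluster_split(fingerprints, threshold=0.5):
    """Heads and singletons go to training; other members are shared out."""
    split = {"train": [], "validation": [], "test": []}
    targets = cycle(["train", "validation", "test"])
    for head, members in leader_cluster(fingerprints, threshold).items():
        split["train"].append(head)
        for member in members:
            split[next(targets)].append(member)
    return split


# Toy fingerprints, invented for illustration.
fingerprints = {
    "a1": {1, 2, 3, 4}, "a2": {1, 2, 3, 5}, "a3": {1, 2, 3, 6},
    "a4": {1, 2, 3, 7}, "b1": {10, 11, 12, 13}, "b2": {10, 11, 12, 14},
    "c1": {20, 21},
}
print(cluster_split(fingerprints))
```

In practice you would probably assign the remaining members at random with your preferred set proportions. The round-robin here simply keeps the example deterministic.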
It’s also important here to consider whether a local or a global model is the more appropriate choice.
How you apply your model matters just as much as how you build it. If you’re looking to prioritise compounds for synthesis, then a categorical model can be a more practical approach. It enables you to quickly flag compounds as “make these,” “maybe,” and “reject these.”
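As a simple illustration of the three-way triage, the sketch below bins a continuous prediction using its uncertainty. Note that this is not a trained classification model. It is a shortcut we are assuming for illustration, with a hypothetical function name, invented predictions and an arbitrary threshold, for a property where higher values are better.

```python
def categorise(predicted, sd, threshold):
    """Bin a prediction for a property where higher values are better.

    A compound is only accepted or rejected outright when the whole
    interval predicted +/- sd sits on one side of the threshold.
    """
    if predicted - sd >= threshold:
        return "make these"
    if predicted + sd < threshold:
        return "reject these"
    return "maybe"


# Invented (prediction, standard deviation) pairs for illustration.
predictions = {"cpd1": (7.4, 0.3), "cpd2": (6.9, 0.4), "cpd3": (5.8, 0.3)}
for name, (value, sd) in predictions.items():
    print(name, categorise(value, sd, threshold=7.0))
```

Widening the interval, for example to two standard deviations, makes the “maybe” bin larger and the other two bins more trustworthy.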
For example, you may have two compounds (A and B) that can’t be meaningfully distinguished. Inherent uncertainties in the model and/or assay may give contradictory results, where the compounds are predicted as A>B but then measured as B>A.
There is no need to get hung up on this or lose trust in your model. A categorical approach that identifies both compounds as “make these” is still valuable for decision making.
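To put a number on “can’t be meaningfully distinguished”, you can estimate the probability that A really is better than B. The sketch below assumes each value carries an independent, normally distributed error, which is a simplification. The function name and the example values are ours.

```python
from math import erf, sqrt


def prob_a_greater_than_b(mean_a, sd_a, mean_b, sd_b):
    """P(A > B) for two independent, normally distributed estimates."""
    combined_sd = sqrt(sd_a ** 2 + sd_b ** 2)
    if combined_sd == 0:
        # No uncertainty at all: the comparison is exact.
        if mean_a == mean_b:
            return 0.5
        return 1.0 if mean_a > mean_b else 0.0
    z = (mean_a - mean_b) / combined_sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z


# Invented predictions: A looks slightly better, but both are uncertain.
p = prob_a_greater_than_b(6.5, 0.4, 6.3, 0.4)
print(f"P(A > B) = {p:.2f}")
```

A probability close to 0.5 tells you the ranking of the two compounds is essentially a coin toss, so seeing them swap places in the assay is not evidence that the model has failed.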
Understanding these common pitfalls helps to establish realistic expectations for your QSAR models and when you might need to rebuild or rethink your approach. If you’re finding that your model isn’t working, make sure to run through these five checkpoints.
With StarDrop’s ADME QSAR module, it’s easy to predict a wide range of ADME and physicochemical properties using a ready-to-implement list of models.
How can you know if StarDrop’s predictive models work? We explain the four pillars that ensure our models behave as expected in this blog.
Want to take a sneak peek at the new and updated models heading to StarDrop? Catch our on-demand webinar to hear the latest developments, including models for intrinsic clearance, and P-gp substrates and inhibitors.
Need models tailored specifically to your chemistry and data? Explore our Auto-Modeller module, which provides an intuitive workspace to build and validate custom predictive models, no matter your level of expertise.
President, Optibrium Inc. and Global Head of Application Science
Tamsin holds a PhD in Organic Chemistry from the University of East Anglia in the UK and pursued postdoctoral studies in the labs of Prof. Philip Magnus at the University of Texas, Austin.
She is an experienced drug discovery scientist, having worked as a medicinal chemist at Eli Lilly and UCB Research. Her interests lie in coupling machine learning and artificial intelligence techniques with generative chemistry approaches to explore chemistry space and guide compound design.