In this blog, Rae Lawrence, our product manager, debunks a common misconception: that numerical predictive models are always superior in the early stages of drug development. In reality, categorical models can be more practical and reliable when dealing with the sparse data typical of this phase. So, how do these models add value, and why should medicinal chemists consider using them? Read on to find out!

Introduction

In early-stage drug discovery, medicinal chemists rely on predictive models to help guide which compounds to synthesise or test next. Ideally, these models would provide highly accurate numerical predictions of properties like potency, solubility, or metabolic stability. However, early discovery is often characterised by limited experimental data, making it challenging to build robust models that generate precise numerical outputs. 

This is where categorical models – such as red / amber / green (RAG) classifications – come into play. Rather than attempting to predict an exact value, categorical models provide qualitative insights that help prioritise compounds with good properties and deprioritise weaker candidates. But how exactly do these models provide value, and why should medicinal chemists consider using them?

The challenges of numerical prediction in early discovery 

Building highly accurate numerical models requires large, high-quality datasets. However, in the early stages of discovery: 

  • Data is scarce – There are often too few measured compounds to train reliable numerical models. 
  • Experimental variability is high – Assay noise and batch-to-batch variability can lead to inconsistent numerical predictions. 
  • Extrapolation is risky – Numerical models trained on small datasets may struggle to generalise to novel chemical space.

These factors make it difficult to confidently rank compounds based on numerical scores alone. 

What are categorical models? 

Categorical models take a different approach by classifying compounds into broad categories based on predicted properties. A common example is the red / amber / green system, where:

  • Green – Compound is likely to have favourable properties and should be prioritised.
  • Amber – Compound has moderate potential, but requires further evaluation. 
  • Red – Compound is unlikely to meet desired criteria and should be deprioritised. 

Instead of struggling with uncertain numerical values, chemists can use these classifications to quickly filter, prioritise, and optimise compound selection. 
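As a minimal sketch of the idea, the snippet below maps a predicted value onto a RAG category using two cut-offs. The helper name `rag_category`, the choice of property (a predicted pIC50) and the cut-off values of 6.0 and 7.0 are illustrative assumptions rather than recommended settings; in practice the thresholds would be calibrated for each project.

```python
def rag_category(value, amber_cutoff, green_cutoff):
    """Classify a predicted value where higher is better.

    Hypothetical helper: the cut-offs are supplied by the caller and are
    not taken from any particular model or project.
    """
    if green_cutoff <= amber_cutoff:
        raise ValueError("green_cutoff must be greater than amber_cutoff")
    if value >= green_cutoff:
        return "green"
    if value >= amber_cutoff:
        return "amber"
    return "red"


# Made-up predicted pIC50 values for three made-up compounds.
predictions = {"cpd-1": 7.4, "cpd-2": 6.3, "cpd-3": 5.1}

categories = {
    name: rag_category(value, amber_cutoff=6.0, green_cutoff=7.0)
    for name, value in predictions.items()
}
print(categories)
```

With these assumed cut-offs, the three compounds fall into green, amber and red respectively. A small change in a predicted value only alters the outcome when the value sits close to a cut-off.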

How categorical models provide value

Categorical models may seem like a compromise compared to numerical predictions, but in many cases, they are actually more practical and actionable. Here’s why: 

1. More robust in low-data scenarios

Because categorical models focus on classification rather than exact prediction, they can still be useful when training data is limited. Even with small datasets, a well-calibrated RAG model can help guide decision-making with greater reliability than a weak numerical prediction.

2. Easier interpretation and faster decision-making 

Rather than presenting medicinal chemists with an arbitrary number, categorical models provide a clear go/no-go decision framework. This makes it easier to triage compounds and focus on the most promising ones without over-interpreting uncertain numerical values.

3. Aligns with real-world selection processes

In drug discovery, compound progression is rarely based on a single numerical threshold. Instead, teams typically make holistic go/no-go decisions considering multiple factors. A categorical model mirrors this qualitative decision-making approach, making it a more intuitive tool for medicinal chemists. 

4. Can be used with multi-parameter optimisation 

A single numerical score often fails to capture the complexity of drug discovery. Categorical models fit naturally into multi-parameter optimisation (MPO), where multiple properties are evaluated in parallel and problematic compounds are filtered out based on a combination of predicted factors (e.g., potency, solubility, metabolic stability).
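One simple way to combine per-property categories is a "worst category wins" rule, sketched below. This rule, the `SEVERITY` ordering and the `overall_category` helper are assumptions made for illustration; real MPO schemes often weight properties differently.

```python
# Rank order for categories: a higher number means a worse outcome.
SEVERITY = {"green": 0, "amber": 1, "red": 2}


def overall_category(property_categories):
    """Return the worst category across all properties.

    Assumed rule for illustration: a single red property makes the whole
    compound red; otherwise any amber property makes it amber.
    """
    if not property_categories:
        raise ValueError("at least one property category is required")
    return max(property_categories.values(), key=SEVERITY.__getitem__)


# Made-up per-property categories for one compound.
compound = {
    "potency": "green",
    "solubility": "amber",
    "metabolic stability": "green",
}
print(overall_category(compound))  # amber
```

A stricter or more lenient rule can be swapped in without changing how the individual property models are built.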

Real-world application: Using RAG scoring to prioritise synthesis 

Imagine a medicinal chemistry team working on a novel kinase inhibitor programme. They have 500 virtual compounds to choose from, but only enough resources to synthesise 20. 

Using a categorical model, the team scores each compound as red, amber, or green, based on predicted potency and selectivity. They: 

  • Immediately discard red compounds, saving time and effort. 
  • Focus their attention on green compounds, ensuring they prioritise high-confidence candidates. 
  • Review amber compounds carefully, considering additional data to refine their selection. 

This streamlined decision-making process enables the team to make better choices, faster, ultimately increasing the efficiency of their discovery efforts. 
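The triage steps above can be sketched in a few lines. The `triage` helper, the compound names and the capacity of three are hypothetical, and a toy set of six compounds stands in for the 500 in the scenario. Amber compounds are only collected for review here, since that step relies on a chemist's judgement rather than an automatic rule.

```python
def triage(categories, capacity):
    """Split compounds into a green shortlist, an amber review list and
    discarded reds, given a synthesis capacity.

    Hypothetical helper: input order is preserved within each group.
    """
    greens = [name for name, cat in categories.items() if cat == "green"]
    ambers = [name for name, cat in categories.items() if cat == "amber"]
    reds = [name for name, cat in categories.items() if cat == "red"]

    shortlist = greens[:capacity]
    return {
        "shortlist": shortlist,                    # synthesise first
        "review": ambers,                          # needs additional data
        "discarded": reds,                         # deprioritised
        "slots_left": capacity - len(shortlist),   # room for reviewed ambers
    }


# Made-up categories for six virtual compounds.
scored = {
    "cpd-1": "green", "cpd-2": "red", "cpd-3": "amber",
    "cpd-4": "green", "cpd-5": "amber", "cpd-6": "red",
}
result = triage(scored, capacity=3)
print(result)
```

In this toy example, two green compounds fill the shortlist and one slot remains for whichever amber compound survives review. If there were more green compounds than slots, the categories alone could not rank them, so another criterion would be needed.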

Limitations and considerations

While categorical models offer several advantages, they are not without challenges: 

  • Less granularity – Categorical models do not provide detailed rankings within each category. 
  • Dependence on threshold settings – The definitions of red, amber, and green need careful calibration based on project needs. 
  • Not a standalone solution – They should be used alongside other prioritisation strategies, such as experimental validation and medicinal chemistry intuition. 

Conclusion 

In early drug discovery, where data is sparse and uncertainty is high, categorical models offer a practical and effective alternative to numerical predictions. By providing clear, interpretable classifications, these models help medicinal chemists prioritise synthesis, filter out weak candidates, and make more confident decisions.

Rather than wait for perfect numerical models, teams can leverage categorical predictions to drive faster, smarter compound selection – accelerating the path to new drug candidates. 
