Data integration in AI-guided drug discovery
First, the easiest format to work with is a simple table of data, where each row represents a unique compound and each column represents a property or endpoint.
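As a rough sketch of that layout, here is a tiny table parsed with Python's standard library. The compound IDs, SMILES, column names and values are made up purely for illustration, and `load_table` is a hypothetical helper rather than part of any particular tool.

```python
import csv
import io

# Illustrative data only: one row per compound, one column per property or endpoint.
RAW = """compound_id,smiles,pIC50,logD
CPD-001,CCO,5.2,0.1
CPD-002,c1ccccc1,6.8,2.1
CPD-003,CC(=O)O,4.9,-0.2
"""

ENDPOINTS = ("pIC50", "logD")  # assumed endpoint columns for this example


def load_table(text):
    """Parse CSV text into a list of dicts, one dict per compound (row)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {"compound_id": record["compound_id"], "smiles": record["smiles"]}
        for endpoint in ENDPOINTS:
            row[endpoint] = float(record[endpoint])  # store endpoints as numbers
        rows.append(row)
    return rows


table = load_table(RAW)

# Each row should describe a unique compound, so the IDs must not repeat.
ids = [row["compound_id"] for row in table]
ids_are_unique = len(ids) == len(set(ids))
```

Checking that the identifiers are unique is a cheap way to catch rows that still need to be merged.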
But there are a few extra things to think about:
It’s common for an individual compound to have more than one measured value for the same endpoint. These might come from multiple experimental replicates or different published values. In most cases, you’ll want to aggregate these into a single value to include in your analysis.
We’d most commonly take the average of the values, either the arithmetic mean or the geometric mean.
As a rule of thumb, if the values of a property have similar orders of magnitude, you should take an arithmetic mean. A geometric mean is more appropriate if they span several orders of magnitude, e.g., IC50 values.
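To make the difference concrete, here is a minimal sketch that groups replicate measurements by compound and averages them either way. The compound IDs and IC50 values are hypothetical, and `aggregate` is just an illustrative helper name.

```python
from collections import defaultdict
from statistics import geometric_mean, mean

# Hypothetical replicate measurements: (compound ID, IC50 in nM).
measurements = [
    ("CPD-001", 10.0),
    ("CPD-001", 1000.0),
    ("CPD-002", 50.0),
]


def aggregate(records, how="geometric"):
    """Collapse replicate values into one value per compound."""
    grouped = defaultdict(list)
    for compound_id, value in records:
        grouped[compound_id].append(value)
    average = geometric_mean if how == "geometric" else mean
    return {cid: average(values) for cid, values in grouped.items()}


geometric = aggregate(measurements, "geometric")
arithmetic = aggregate(measurements, "arithmetic")
```

For the two replicates of CPD-001, which differ by two orders of magnitude, the geometric mean is about 100 nM, whereas the arithmetic mean is 505 nM and is dominated by the larger value.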
But there are other things to consider:
Finally, be careful if you combine data from different experimental methods. You are most likely to encounter this in public domain data sources such as ChEMBL or PubChem. The results from two experiments ostensibly measuring the same property or activity can vary dramatically depending on their experimental protocols – you can find a nice illustration of this in G.A. Landrum and S. Riniker (2024) J. Chem. Inf. Model. 64(5) pp. 1560-1567. At the least, you should look at the experimental protocols and determine if you would expect them to generate consistent results. Ideally, if the two sources contain data for some common compounds, you should compare the results empirically.
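One simple way to make that empirical comparison is to look at the compounds the two sources share and see how far their values differ. The sketch below assumes each source has already been reduced to one pIC50 value per compound; the identifiers and values are hypothetical, and what counts as an acceptable difference is a judgement for you to make.

```python
from statistics import mean

# Hypothetical pIC50 values from two sources, keyed by compound identifier.
source_a = {"CPD-001": 6.1, "CPD-002": 7.4, "CPD-003": 5.0}
source_b = {"CPD-002": 7.1, "CPD-003": 6.2, "CPD-004": 8.0}


def compare_sources(a, b):
    """Return per-compound differences (b - a) and their mean absolute value."""
    common = sorted(set(a) & set(b))  # compounds measured in both sources
    differences = {cid: b[cid] - a[cid] for cid in common}
    if not differences:
        return differences, None  # no overlap, so nothing to compare
    mean_abs = mean(abs(d) for d in differences.values())
    return differences, mean_abs


differences, mean_abs_difference = compare_sources(source_a, source_b)
```

A large or systematic difference across the shared compounds suggests the two data sets should not simply be pooled.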
You may also wish to transform your data into more convenient units or classifications. For instance, if the values for an endpoint span several orders of magnitude, it’s common to use logarithmic units. By convention, IC50 values are transformed into pIC50 by taking the negative log (base 10) of the IC50 in molar units, so that a 1 µM IC50 becomes a pIC50 of 6 and a 1 nM IC50 becomes a pIC50 of 9. This ‘spreads out’ a highly skewed distribution and makes it much easier to visualise or model variations.
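The pIC50 conversion is short enough to sketch in a few lines. The unit labels and the `pic50` function name below are assumptions for this example; the important step is converting to molar units before taking the log.

```python
import math

# Conversion factors to molar units; the unit labels are an assumption of this sketch.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50(ic50, unit="nM"):
    """Return pIC50 = -log10(IC50 in molar units)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive to take a logarithm")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])


one_micromolar = pic50(1.0, "uM")
one_nanomolar = pic50(1.0, "nM")
```

Here `one_micromolar` comes out at 6 and `one_nanomolar` at 9, up to floating-point rounding, matching the values quoted above.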
Sometimes, your data may be qualitative, or a qualitative answer may be all you need. In that case, you may wish to transform numerical values into classes, such as High/Low or High/Medium/Low. These classes can be based on cut-offs you define.
If you transform your data, keep both the raw and transformed values. If you suspect an error, you can go back and check. This is particularly important if the transformation loses information, such as changing a numerical value to a class; if you change your mind about the class definitions, you can quickly go back and do it differently.
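As a minimal sketch of this, the snippet below adds a class column next to the raw values instead of overwriting them. The cut-offs of 5 and 7 and the record layout are arbitrary assumptions for illustration; you would choose values that make sense for your own endpoint.

```python
def classify(value, low_cutoff=5.0, high_cutoff=7.0):
    """Map a numerical value to High/Medium/Low using user-defined cut-offs."""
    if value >= high_cutoff:
        return "High"
    if value >= low_cutoff:
        return "Medium"
    return "Low"


# Hypothetical records; the raw pIC50 is kept alongside the derived class.
records = [
    {"compound_id": "CPD-001", "pIC50": 4.2},
    {"compound_id": "CPD-002", "pIC50": 6.3},
    {"compound_id": "CPD-003", "pIC50": 7.8},
]

for record in records:
    record["pIC50_class"] = classify(record["pIC50"])

# Changing your mind later only means re-running with different cut-offs.
stricter = [classify(r["pIC50"], low_cutoff=6.5, high_cutoff=8.0) for r in records]
```

Because the numerical column is still there, redefining the classes is a one-line change and loses nothing.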
Of course, the whole point of cheminformatics is to relate data to chemical structures. So, we need to think carefully about how to handle them.
We want to perform SAR analyses and calculate descriptors on chemical structures. Therefore, we must represent them consistently. For example, is a nitro group represented by a pentavalent nitrogen with two double-bonded oxygens or in a charge-separated form with N+ and O-? There’s no right or wrong. You can choose your own rules, but make sure you stick to them!
Some formats for storing chemical structures, such as SMILES (see below), can represent the same compound in multiple ways. It’s important to ensure that each compound has a canonical or unique representation to make it easy to find duplicate compounds.
All good cheminformatics libraries and tools will include methods for standardising and canonicalising the representations of compounds.
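Once every structure has a canonical representation, finding duplicates is just a matter of grouping on it. The sketch below assumes the canonical strings have already been generated by your toolkit of choice; the `canonical_smiles` values shown are stand-ins for that output rather than the result of any specific library, and `find_duplicates` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical records: two different input SMILES for the same compound.
# "canonical_smiles" is assumed to come from a prior canonicalisation step.
records = [
    {"compound_id": "CPD-001", "input_smiles": "OCC", "canonical_smiles": "CCO"},
    {"compound_id": "CPD-002", "input_smiles": "CCO", "canonical_smiles": "CCO"},
    {"compound_id": "CPD-003", "input_smiles": "c1ccccc1", "canonical_smiles": "c1ccccc1"},
]


def find_duplicates(rows, key="canonical_smiles"):
    """Group compound IDs by canonical structure and keep groups with more than one ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["compound_id"])
    return {structure: ids for structure, ids in groups.items() if len(ids) > 1}


duplicates = find_duplicates(records)
```

Grouping on the raw input strings instead would miss the match between CPD-001 and CPD-002, which is why the canonicalisation step matters.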
The first question to ask about the best file format for storing chemical structures is whether you need to include 3D structural information. If so, you’ll need to use a file format such as SD or mol2. For cheminformatics applications, you often only need to store 2D information. In this case, a standard format such as a comma-separated values (CSV) file is a good option. These files can include a column containing the structure in a text notation such as SMILES or InChI.
SMILES is (slightly) more human-readable. However, InChI can contain more information, e.g., about different tautomeric forms of the same compound. Furthermore, each compound is defined by a unique InChI string. Both SMILES and InChI can store information defining stereochemistry, but to record enhanced stereochemistry information, you’ll need to use a V3000 SD or mol file.
Be careful, too, of proprietary modifications to industry-standard file or chemical structure formats. Software developers sometimes add these to extend or add features. However, they can create headaches if other applications can’t load your files and you’re locked into one platform.
Finally, when you’ve prepared your cheminformatics data, save it somewhere safe. Give it an appropriate filename and, ideally, add a README file with a description of its contents and, if relevant, the references from which you obtained the data. There’s nothing worse than trying to remember whether the ‘right’ version of a data set is the one called “my_data_set_merged_final_corrected.csv” or “my_data_set_deduped_final_final.csv” two years after you created it! It’s even more difficult for someone else to pick up your files and work out what’s in them.
CEO and Company Director
Matt holds a Master’s in Computation from the University of Oxford and a PhD in Theoretical Physics from the University of Cambridge. He led teams developing predictive ADME models and advanced decision-support tools for drug discovery at Camitro (UK), ArQule Inc., and Inpharmatica. In 2006, he took charge of Inpharmatica’s ADME business, overseeing experimental services and the StarDrop software platform. After Inpharmatica’s acquisition, he became Senior Director of BioFocus DPI’s ADMET division and, in 2009, led a management buyout to establish Optibrium.