How should I prepare and store my data for cheminformatics applications?

Author

Matthew Segall, PhD

When you’re getting your compound data ready for cheminformatics analyses, whether you want to build a QSAR model, analyse SAR or create some plots, there are some things to consider before you start. And, when you come back to your data months or even years later to repeat or extend your analyses, you’ll be grateful you got it right the first time. Believe me, we’ve learned this the hard way!

If you’re lucky, you’ll have a well-constructed database that will prepare all the data behind the scenes and present you with a nice clean data table. But many of us are not that fortunate, and if you’re working with data from public domain sources, you’ll probably have to do some manual cleaning before you get started. You’d be amazed how often published data sets contain errors, duplicates and inconsistencies!

Remember, if you put garbage data in, you’ll get garbage results out. Following these hints and tips will help you start on a strong foundation.

Structuring your cheminformatics data

First, the easiest format to work with is a simple table of data, where each row represents a unique compound, and each column represents a property or endpoint, as illustrated below.

A table is a useful structure for your data.

But there are a few extra things to think about:

Make sure you label the units of each endpoint. When you come back to your data, don’t assume you’ll remember whether you recorded the solubility in µM or mg/mL.
If you have data missing for some compounds and endpoints, decide how you want to record it; for example, leave a blank cell or use a standard annotation, e.g. “N/A”.
Consider how you want to handle qualifiers, for example, “>” or “<” symbols. Do you want to keep them in the same column as the qualified value or in separate columns, e.g. a ‘modifier’ column for each column of values? You may choose to omit qualifiers and just keep the numerical value. However, you’ll be losing information and may end up with a strange distribution for that property, with a ‘spike’ at a single value.
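
As one possible way to keep qualifiers without losing information, the sketch below splits a raw entry such as ">50" into a separate modifier and numerical value. The function name `split_qualifier`, the set of recognised qualifiers and the choice of "N/A" as the missing-data marker are assumptions for illustration, not a standard.

```python
import re
from typing import Optional, Tuple

# Assumed convention for this sketch: blank cells and "N/A" both mean "missing".
MISSING_MARKERS = {"", "N/A"}

# An optional qualifier (>=, <=, >, <, =) followed by a number.
_PATTERN = re.compile(r"^(>=|<=|>|<|=)?\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)$")


def split_qualifier(raw: str) -> Tuple[Optional[str], Optional[float]]:
    """Split a raw cell such as '>50' into (modifier, value).

    Returns (None, None) for missing data and ('=', value) for unqualified
    numbers. Raises ValueError for anything else, so errors are not hidden.
    """
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None, None
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"Cannot parse value: {raw!r}")
    modifier = match.group(1) or "="
    return modifier, float(match.group(2))


# Hypothetical raw column; the units would live in the column header,
# e.g. "Solubility (µM)".
raw_solubility = ["12.5", ">50", "<0.1", "N/A", ""]
parsed = [split_qualifier(cell) for cell in raw_solubility]
print(parsed)
```

Storing the modifier and the value in separate columns keeps the numerical column sortable while preserving the information carried by the qualifier.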

Data aggregation

It’s common for an individual compound to have more than one measured value for the same endpoint. These might come from multiple experimental replicates or different published values. In most cases, you’ll want to aggregate these into a single value to include in your analysis.

We’d most commonly take the average of the values, either the arithmetic mean or the geometric mean.

Two possible mean formulas for data aggregation.

As a rule of thumb, if the values of a property have similar orders of magnitude, you should take an arithmetic mean. A geometric mean is more appropriate if they span several orders of magnitude, e.g., IC₅₀ values.
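
To make the difference concrete, here is a minimal sketch of both means in Python. The replicate values are hypothetical, and the function names are chosen only for this example.

```python
import math


def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of n positive values.

    Computed as exp(mean(log(x))), i.e. an arithmetic mean in log space.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical replicate IC50 measurements in nM, spanning two orders of magnitude.
ic50_nM = [10.0, 100.0, 1000.0]

print(arithmetic_mean(ic50_nM))           # 370.0, dominated by the largest value
print(round(geometric_mean(ic50_nM), 6))  # 100.0, the midpoint on a log scale
```

Because the geometric mean is an average in log space, it gives the same result as averaging the pIC₅₀ values and converting back.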

But there are other things to consider:

How do you want to handle qualifiers? For example, what’s the arithmetic mean of 4, >5 and <2? You could ignore the qualifiers, which would give an answer of roughly 3.7, or you could choose to delete qualified values, giving a result of 4.
What do you want to do about outliers? One value that is very different from the others, which might be due to an error in the experiment or when recording the data, can dramatically skew the average. For example, the arithmetic mean of 10, 15, 30, 20, and 1000 is 215, which doesn’t seem representative of the data.
At what level of compound definition do you want to aggregate your data? You may wish to aggregate at the parent level (ignoring stereochemistry), for individual stereoisomers (if they are known), or for specific batches of a compound for which experimental results may vary depending on, for example, different levels of impurities. The answer will depend on the questions you want to answer.
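
The qualifier and outlier examples above can be reproduced with a short sketch. It assumes each measurement is stored as a (modifier, value) pair, with "=" marking an unqualified value. The median is shown as one common outlier-resistant alternative; it is an addition for illustration, and other approaches may suit your data better.

```python
import statistics

# Each measurement is a (modifier, value) pair; "=" means an unqualified value.
measurements = [("=", 4.0), (">", 5.0), ("<", 2.0)]


def mean_ignoring_qualifiers(pairs):
    """Treat every value as exact, discarding the qualifier."""
    return statistics.fmean([value for _, value in pairs])


def mean_excluding_qualified(pairs):
    """Average only the unqualified values; return None if none remain."""
    exact = [value for modifier, value in pairs if modifier == "="]
    return statistics.fmean(exact) if exact else None


print(round(mean_ignoring_qualifiers(measurements), 2))  # 3.67
print(mean_excluding_qualified(measurements))            # 4.0

# One extreme value dominates the arithmetic mean but not the median.
replicates = [10, 15, 30, 20, 1000]
print(statistics.fmean(replicates))   # 215.0
print(statistics.median(replicates))  # 20
```

Whichever rule you choose, apply it consistently and record it alongside the data so the aggregation can be repeated later.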

Finally, be careful if you combine data from different experimental methods. This may be encountered in public domain data sources such as ChEMBL or PubChem. The results from two experiments ostensibly measuring the same property or activity can vary dramatically depending on their experimental protocols – you can find a nice illustration of this in G.A. Landrum and S. Riniker (2024) J. Chem. Inf. Model. 64(5) pp. 1560-1567. At the least, you should look at the experimental protocols and determine if you would expect them to generate consistent results. Ideally, if the two sources contain data for some common compounds, you should compare the results empirically.
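
A simple empirical comparison might look like the sketch below. It assumes each source is held as a dictionary mapping a compound identifier to a pIC₅₀ value; the identifiers and values are hypothetical, and the mean absolute difference is just one possible summary.

```python
def compare_common_compounds(source_a, source_b):
    """Return {compound_id: value_b - value_a} for compounds in both sources."""
    common = sorted(set(source_a) & set(source_b))
    return {cid: source_b[cid] - source_a[cid] for cid in common}


def mean_absolute_difference(differences):
    """Average size of the disagreement; None if there is no overlap."""
    if not differences:
        return None
    return sum(abs(d) for d in differences.values()) / len(differences)


# Hypothetical pIC50 values from two assays of the same target.
assay_a = {"CPD-1": 6.2, "CPD-2": 7.5, "CPD-3": 5.1}
assay_b = {"CPD-2": 7.1, "CPD-3": 6.3, "CPD-4": 8.0}

differences = compare_common_compounds(assay_a, assay_b)
for compound_id, delta in differences.items():
    print(compound_id, round(delta, 2))
print("Mean absolute difference:", round(mean_absolute_difference(differences), 2))
```

If the disagreement between sources is large compared with the variation you hope to model, it may be safer to keep the data sets separate.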

Data transformations

You may also wish to transform your data into more convenient units or classifications. For instance, if the values for an endpoint span several orders of magnitude, it’s common to use logged units; for example, it’s conventional to transform IC₅₀ values into pIC₅₀ by taking the negative log of the IC₅₀ in Molar units so that a 1 µM IC₅₀ becomes a pIC₅₀ of 6 and 1 nM transforms to a pIC₅₀ of 9. This ‘spreads out’ a highly skewed distribution and makes it much easier to visualise or model variations.
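
The pIC₅₀ transformation can be sketched in a few lines of Python. The unit table and the function name `to_pic50` are assumptions made for this example.

```python
import math

# Conversion factors to Molar for the units handled in this sketch.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def to_pic50(ic50, unit):
    """Return -log10 of the IC50 expressed in Molar units."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"Unknown unit: {unit!r}")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])


print(round(to_pic50(1.0, "uM"), 2))  # 6.0
print(round(to_pic50(1.0, "nM"), 2))  # 9.0
```

Passing the unit explicitly, rather than assuming it, guards against mixing µM and nM values in the same column.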

Sometimes, you may prefer to treat your data qualitatively. In that case, you can transform numerical values into classes, such as High/Low or High/Medium/Low, based on cut-offs you define.

If you transform your data, keep both the raw and transformed values. If you suspect an error, you can go back and check. This is particularly important if the transformation loses information, such as changing a numerical value to a class; if you change your mind about the class definitions, you can quickly go back and do it differently.
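
As an illustration of keeping raw and transformed values together, the sketch below adds a class column next to the original number instead of replacing it. The cut-offs of 5 and 7 and the compound records are hypothetical.

```python
def classify(value, low_cutoff, high_cutoff):
    """Assign 'Low', 'Medium' or 'High' using two user-defined cut-offs."""
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be smaller than high_cutoff")
    if value >= high_cutoff:
        return "High"
    if value >= low_cutoff:
        return "Medium"
    return "Low"


# Hypothetical records; the raw pIC50 column is never overwritten.
records = [
    {"id": "CPD-1", "pIC50": 4.2},
    {"id": "CPD-2", "pIC50": 6.0},
    {"id": "CPD-3", "pIC50": 8.1},
]

for record in records:
    record["activity_class"] = classify(record["pIC50"], low_cutoff=5.0, high_cutoff=7.0)

print(records)
```

Because the numerical values are still present, changing the cut-offs later only means re-running the classification.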

Handling chemical structures

Of course, the whole point of cheminformatics is to relate data to chemical structures. So, we need to think carefully about how to handle them.

We want to perform SAR analyses and calculate descriptors on chemical structures. Therefore, we must represent them consistently. For example, is a nitro group represented by a pentavalent nitrogen with two double-bonded oxygens or in a charge-separated form with N+ and O-? There’s no right or wrong. You can choose your own rules, but make sure you stick to them!

In cheminformatics, you'll need to consider how your data represents chemical structures. Here, two depictions for a nitro group are shown, one displaying charges and one not.

Some formats for storing chemical structures, such as SMILES (see below), can represent the same compound in multiple ways. It’s important to ensure that each compound has a canonical or unique representation to make it easy to find duplicate compounds.

All good cheminformatics libraries and tools will include methods for standardising and canonicalising the representations of compounds.
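
Once each compound has a canonical representation, finding duplicates becomes a simple grouping step. The sketch below assumes a `canonical_smiles` column has already been filled in by whichever toolkit you use; the records and column names are hypothetical, and the canonicalisation itself is not performed here.

```python
from collections import defaultdict


def find_duplicates(records, key="canonical_smiles"):
    """Group record ids by canonical structure and keep groups with more than one id."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["id"])
    return {structure: ids for structure, ids in groups.items() if len(ids) > 1}


# "OCC" and "CCO" are two valid SMILES for ethanol. The canonical column is
# assumed to come from your toolkit's canonicalisation routine.
records = [
    {"id": "CPD-1", "input_smiles": "OCC", "canonical_smiles": "CCO"},
    {"id": "CPD-2", "input_smiles": "CCO", "canonical_smiles": "CCO"},
    {"id": "CPD-3", "input_smiles": "c1ccccc1", "canonical_smiles": "c1ccccc1"},
]

print(find_duplicates(records))  # {'CCO': ['CPD-1', 'CPD-2']}
```

Comparing the input strings directly would miss this duplicate, which is why the canonical form is used as the key.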

File formats for cheminformatics

The first question to ask about the best file format for storing chemical structures is whether you need to include 3D structural information. If so, you’ll need to use a file format such as SD or mol2. For cheminformatics applications, you often only need to store 2D information. In this case, a standard file format such as comma-separated value (CSV) files is a good option. These files can include a column containing the structure in a text notation such as SMILES or InChI.

SMILES is (slightly) more human readable. However, InChI can contain more information, e.g., about different tautomeric forms of the same compound. Furthermore, each compound is defined by a unique InChI string. Both SMILES and InChI can store information defining stereochemistry, but to record enhanced stereochemistry information, you’ll need to use a V3000 SD or mol file.
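
A minimal sketch of storing structures in a CSV file with Python's standard `csv` module is shown below. The compound identifiers and column names are hypothetical. Note that InChI strings contain commas, so it is safer to let a CSV library handle the quoting than to join fields by hand.

```python
import csv
import io

FIELDNAMES = ["id", "SMILES", "InChI"]


def write_table(rows):
    """Serialise rows to CSV text, quoting fields that contain commas."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_table(text):
    """Parse CSV text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


rows = [
    {"id": "CPD-1", "SMILES": "CCO", "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"id": "CPD-2", "SMILES": "c1ccccc1", "InChI": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"},
]

csv_text = write_table(rows)
print(csv_text)

# The round trip returns the rows unchanged, commas included.
assert read_table(csv_text) == rows
```

To write to disk instead, the same writer can be given a file opened with `open(path, "w", newline="")`.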

Finally, be careful of proprietary modifications to industry-standard file or chemical structure formats. Software developers sometimes add these to extend or add features. However, these can create headaches if other applications can’t load your files and you’re locked into one platform.

Archiving your data

Finally, when you’ve prepared your cheminformatics data, save it somewhere safe. Give it an appropriate filename and, ideally, add a README file with a description of its contents and, if relevant, the references from which you obtained the data. There’s nothing worse than trying to remember if the ‘right’ version of a data set is the one called “my_data_set_merged_final_corrected.csv” or “my_data_set_deduped_final_final.csv” two years after you created it! It’s even more difficult for someone else to pick up your files and work out what’s in them.

Matthew Segall, PhD

CEO and Company Director

Matt holds a Master’s in Computation from the University of Oxford and a PhD in Theoretical Physics from the University of Cambridge. He led teams developing predictive ADME models and advanced decision-support tools for drug discovery at Camitro (UK), ArQule Inc., and Inpharmatica. In 2006, he took charge of Inpharmatica’s ADME business, overseeing experimental services and the StarDrop software platform. After Inpharmatica’s acquisition, he became Senior Director of BioFocus DPI’s ADMET division and, in 2009, led a management buyout to establish Optibrium.
