How do I assess similarities between molecules?

Author

Mario Öeren, PhD

A central concept in chemistry is the similarity principle. It refers to the idea that chemically similar molecules tend to exhibit similar properties. Although molecular similarity is a powerful tool in drug discovery, its objective assessment is complicated by the diverse and context-specific ways in which molecular structures can be compared. In this blog, we’ll dive into how molecular similarity works, how it’s measured, when it fails, and why it matters in real-world chemistry and drug design.

What is the similarity principle?

The similarity principle is one of the key concepts in drug design. It implies that making a series of minor changes to a molecule shouldn’t drastically affect how strongly it binds to its target. The challenge, however, is that the principle doesn’t tell us how to measure or define similarity between molecules in practice. Moreover, while local structural changes may preserve activity, this continuity does not necessarily hold over a long series of small modifications¹.

Activity cliffs: The exception to the rule

Exceptions to this principle, where structurally similar compounds exhibit noticeably different activities, are referred to as activity cliffs. Mathematically, they are defined by the ratio of the difference in activity of two compounds to their distance of separation in a given chemical space²(we’ll speak more about chemical space later). In a nutshell, activity cliffs occur when structurally similar compounds targeting the same target exhibit large differences in potency. Due to their complexity, activity cliffs are difficult to model computationally and challenging to identify systematically³.

How to assess chemical similarity

Step 1: Convert your structures into molecular fingerprints

The most widely used approaches convert molecular structures into binary or count-based vectors, which are then compared using similarity metrics. The vectors are commonly known as molecular fingerprints and form the basis for many cheminformatics tools that enable rapid comparison of large compound libraries.

Fingerprints can be represented as:

Counts (e.g., number of given functional groups in the structure)

Indicator variables (e.g., 1 for the presence of one or more given functional groups, and 0 for none).

Each type of molecular fingerprint comes with its own advantages and limitations, so selecting the most appropriate one for a specific task should be guided by benchmarking studies tailored to the problem at hand¹.

Below, we’ll cover some of the most common fingerprint types. Until now, we have focused on 2D fingerprints; however, it is important to note that 3D fingerprints (e.g., 3D pharmacophore fingerprints) are also widely used, though they heavily depend on the molecular conformation. A detailed discussion of these methods, however, lies beyond the scope of this blog.

Path-based molecular fingerprints

Path-based molecular fingerprints are among the oldest types used in cheminformatics. They analyse linear paths through molecular graphs – sequences of atoms and bonds, up to a predefined maximum length (typically 5 to 7 atoms). Each unique path is hashed into a position in a fixed-size bit vector, commonly 1024 bits⁴. These fingerprints are widely used for molecular similarity searching and substructure matching due to their simplicity and computational efficiency.

A notable subcategory of path-based fingerprints is 2D pharmacophore fingerprints, where atoms are annotated with pharmacophoric features (such as hydrogen bond donors, acceptors, or aromatic rings), and the paths between these pharmacophore points are encoded.

Atom-pair molecular fingerprints

This approach creates fingerprints based on atom pairs within molecules. Each pair is defined by the atom types, along with the topological distance (the number of bonds between them). These are typically encoded as triplets: (atom type 1, atom type 2, bond distance).

To generate a fixed-length fingerprint, each unique atom pair is hashed into a numerical value and mapped to an index within a fixed-size vector. Since the number of possible atom pairs is vast, this mapping is many-to-one, meaning that different pairs can hash to the same index—a phenomenon known as a hash collision. For each pair, the corresponding index is either set to 1 (in a binary fingerprint) or incremented (in a count-based fingerprint). The result is a compact yet informative representation of a molecule’s medium-range structural features, well-suited for rapid similarity comparisons⁵.

Circular molecular fingerprints

This approach captures the local atomic environment around each atom in a molecule. It iteratively gathers information about neighbouring atoms up to a certain radius (number of bonds), effectively encoding atom-centered substructures. These neighbourhoods are then hashed into identifiers, which represent unique structural patterns. As with atom pairs, the hashed identifiers are mapped into a fixed-length vector using a hash function that assigns each feature to a specific index.

Due to the extremely large number of possible substructures, hash collisions can occur where multiple patterns map to the same index⁶. Circular fingerprints are particularly useful for separating active compounds from inactives, where they may have similar physical properties in a virtual screen.

Step 2: Quantify similarity using the Tanimoto coefficient

The Tanimoto coefficient is the most common method of quantifying molecular similarity, particularly with binary fingerprints. It measures the overlap between two fingerprint vectors by comparing the number of shared features to the total number of unique features. The result is a similarity score ranging from 0 (no similarity) to 1 (identical).

While this provides a rigorous, numerical measure of similarity, it is often useful to complement it with a visual representation. This is typically done by projecting high-dimensional molecular descriptors into a 2D chemical space using dimensionality reduction techniques such as PCA, t-SNE, or UMAP (see the image below). In this space, similar compounds tend to cluster together, allowing researchers to visually inspect relationships that might not be obvious from numerical scores alone. Thus, similarity can be both calculated and seen, providing complementary perspectives on molecular relationships.

Interpreting similarity differences

When analysing similarity, how different is meaningfully different? In other words, what size difference in score reflects a meaningful difference in biological or chemical behaviour?

For example, a Tanimoto similarity of 0.85 versus 0.75 may appear close numerically, but depending on the fingerprint type and context (e.g., target class, assay type), this difference might correspond to a substantial change in activity or none at all.

To assess statistical significance, researchers often use benchmark datasets that correlate similarity scores to known properties like bioactivity. By comparing distributions of similarities between active–active, active–inactive, and random compound pairs, it’s possible to define thresholds where similarity differences become statistically meaningful.

Summary

Molecular similarity is a fundamental concept in drug discovery: the greater the resemblance between two molecules, the more likely they are to exhibit similar biological activity. Computational techniques are commonly used to convert molecular structures into digital fingerprints—simplified representations of molecular features—that can be compared through various algorithms. Numerous fingerprinting methods exist, but no single approach is universally applicable, as molecules that appear nearly identical in chemical structure or computational chemical space may behave differently in biological systems. Therefore, determining true similarity between compounds is highly context-dependent.

References

O’Boyle, N. M., & Sayle , R. A. (2016). Comparing structural fingerprints using a literature-based similarity benchmark. Journal of Cheminformatics, 8(36), 1-14. doi:10.1186/s13321-016-0148-0

Maggiora, G. M. (2006). On Outliers and Activity CliffsWhy QSAR Often Disappoints. Journal of Chemical Information and Modeling, 46(4), 1535. doi:10.1021/ci060117s

Stumpfe, D., Hu, H., & Bajorath, J. (2019). Evolving Concept of Activity Cliffs. ACS Omega, 4(11), 14360–14368. doi:10.1021/acsomega.9b02221

Boldini, D., Ballabio, D., Consonni, V., Todeschini, R., Grisoni, F., & Sieber, S. A. (2024). Effectiveness of molecular fingerprints for exploring the chemical space of natural products. Journal of Cheminformatics, 16(35), 1-16. doi:10.1186/s13321-024-00830-3

Carhart, R. E., Smith, D. H., & Venkataraghavan, R. (1985). Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. The Journal for Chemical Information and Computer Scientists, 25, 64-73. doi:10.1021/ci00046a002
Rogers, D., & Hahn, M. (2010). Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50(5), 742–754. doi:10.1021/ci100050t

Mario Öeren, PhD

Mario is a Principal Scientist at Optibrium, where he applies expertise in quantum chemistry, conformational analysis, and molecular modeling to support drug discovery research. With a PhD in Natural Sciences from Tallinn University of Technology, his academic work focused on computational studies of macrocyclic molecules. Mario has authored numerous publications and has extensive experience in chemical education and advanced spectroscopy.

Cookies