RDKit Tutorial for Chemists: Beginner’s Guide to Python Cheminformatics

Runcell Team

Why Another RDKit Tutorial?

There are already many RDKit tutorials out there, but most of them assume one of two things:

  1. You’re already comfortable with Python.
  2. You’re a developer who happens to care about chemistry.

If you’re a chemist with little programming experience, a lot of those resources can feel:

This tutorial is designed to fill that gap:


1. What Is RDKit (and Why Should Chemists Care)?

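RDKit is a free, open-source toolkit for cheminformatics. In practice, that means it helps you read and write molecular structures, draw them, calculate their properties, and search or compare compounds by structure.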

You can think of RDKit as a Swiss army knife for chemists in Python.

Good news:
You don’t need to be a Python expert to start. If you’re willing to type a few short scripts and you understand basic chemistry, you’re ready.


2. Installation (Short & Skippable)

If you already have RDKit installed and working, skip this section.

The easiest and most reliable way to install RDKit is with conda (Anaconda or Miniconda).

2.1 Install via conda

Open a terminal (or Anaconda Prompt on Windows) and run:

# Create a new environment with RDKit installed
conda create -n my_rdkit_env -c conda-forge rdkit

# Activate the environment
conda activate my_rdkit_env

Now, in that environment, you can start Python or Jupyter Notebook and import RDKit.

python

Then in Python:

from rdkit import Chem

If that runs with no error, you’re good.
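For a slightly more informative check, the sketch below also prints the installed version and parses one molecule. It assumes only that the rdkit package exposes a __version__ string, which current releases do.

```python
import rdkit
from rdkit import Chem

# Report which RDKit build this Python session is using.
print("RDKit version:", rdkit.__version__)

# Parse a trivial molecule to confirm the chemistry core loads.
mol = Chem.MolFromSmiles("CCO")
print("Parsed ethanol:", mol is not None)
```

If the version prints and the second line says True, the environment is set up correctly.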

2.2 Quick alternatives (if conda is not an option)

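If conda is not available, RDKit is also published on PyPI, so pip install rdkit works in a regular Python environment or in a Google Colab notebook.

If you get stuck, the simplest routes for beginners are conda or Colab.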


3. Your First Molecule in RDKit

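We’ll start with SMILES, a compact text notation that writes a molecule’s atoms and bonds as a single string.

Examples: CCO (ethanol), c1ccccc1 (benzene), CC(=O)C (acetone).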

Let’s turn a SMILES string into a molecule object in RDKit.

from rdkit import Chem

# Create a molecule object from a SMILES string
ethanol = Chem.MolFromSmiles("CCO")
print(ethanol)

You’ll see something like:

<rdkit.Chem.rdchem.Mol object at 0x...>

That means RDKit successfully created an internal Mol object.

To confirm:

print(Chem.MolToSmiles(ethanol))

You should see:

CCO

RDKit writes canonical SMILES, so the output may order the atoms differently from your input, or switch a Kekulé ring such as C1=CC=CC=C1 to the aromatic lowercase form c1ccccc1. It still represents the same structure.
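To see canonicalization at work, here is a minimal sketch that parses several spellings of the same molecule and collects the SMILES RDKit writes back. The canonical helper is a name made up for this example, not an RDKit function.

```python
from rdkit import Chem

# Three ways of writing ethanol, and two ways of writing benzene.
ethanol_variants = ["CCO", "OCC", "C(O)C"]
benzene_variants = ["C1=CC=CC=C1", "c1ccccc1"]


def canonical(smiles):
    """Parse a SMILES string and return RDKit's canonical SMILES for it."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol)


# Sets drop duplicates, so each set shrinks to one entry
# if every variant canonicalizes to the same string.
ethanol_canonical = {canonical(s) for s in ethanol_variants}
benzene_canonical = {canonical(s) for s in benzene_variants}

print(ethanol_canonical)
print(benzene_canonical)
```

Each set should contain a single string, which is why canonical SMILES are handy for spotting duplicates in a compound list.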

Notebook tip: In Jupyter, just put ethanol as the last line in a cell and run it — RDKit will often show a nice 2D structure.


4. Visualizing Molecules (2D Drawings)

A picture is worth a thousand SMILES.

RDKit’s Draw module can generate images of molecules.

from rdkit.Chem import Draw

# Single molecule image
img = Draw.MolToImage(ethanol)
img.show()

In a Jupyter notebook, just:

img

and it should display inline.

4.1 Drawing multiple molecules side by side

Let’s add benzene and acetone and draw them together:

benzene = Chem.MolFromSmiles("c1ccccc1")  # benzene
acetone = Chem.MolFromSmiles("CC(=O)C")   # acetone

mols = [ethanol, benzene, acetone]
img2 = Draw.MolsToImage(
    mols,
    legends=["Ethanol", "Benzene", "Acetone"]
)
img2.show()

You should see three structures in one image, each labeled.

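If img.show() doesn’t display anything (for example, on a remote machine without a display), save the image to a file and open it from there: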

img2.save("three_molecules.png")
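Another option that needs no image viewer at all is to write an SVG file and open it in a web browser. The sketch below uses RDKit's rdMolDraw2D module; the 300x300 canvas size and the file name ethanol.svg are arbitrary choices for illustration.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

mol = Chem.MolFromSmiles("CCO")

# Create a 300x300 SVG canvas; no display or image viewer is required.
drawer = rdMolDraw2D.MolDraw2DSVG(300, 300)

# Prepare the molecule for drawing (2D coordinates etc.) and draw it with a legend.
rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol, legend="Ethanol")
drawer.FinishDrawing()

# The finished drawing is plain SVG text.
svg_text = drawer.GetDrawingText()

# Write it to disk; any web browser can open the result.
with open("ethanol.svg", "w") as handle:
    handle.write(svg_text)
```

SVG is a vector format, so the picture stays sharp at any zoom level, which is convenient for slides and reports.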

5. Calculating Basic Properties (Descriptors)

RDKit can compute many descriptors — numerical values representing molecular properties.

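Let’s start with two common ones: molecular weight (MolWt) and the estimated octanol-water partition coefficient (MolLogP).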

from rdkit.Chem import Descriptors

mw = Descriptors.MolWt(ethanol)
logp = Descriptors.MolLogP(ethanol)

print(f"Ethanol molecular weight: {mw:.2f}")
print(f"Ethanol LogP: {logp:.2f}")

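Expected output (approx.):

Ethanol molecular weight: 46.07
Ethanol LogP: -0.00

MolLogP is a calculated (Wildman-Crippen) estimate, not a measured value. For ethanol it comes out very close to zero (about -0.001), so it prints as -0.00 at two decimals.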

Now compare with benzene:

print(f"Benzene molecular weight: {Descriptors.MolWt(benzene):.2f}")
print(f"Benzene LogP: {Descriptors.MolLogP(benzene):.2f}")

Benzene has a higher MW (~78.11) and a higher LogP (~1.69), so it is more hydrophobic than ethanol.
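The same pattern extends to several molecules and several descriptors at once. The following sketch builds a small table; the hydrogen-bond donor/acceptor counts and TPSA are additional RDKit descriptors chosen here for illustration, and the compounds dictionary is just example data.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

compounds = {
    "Ethanol": "CCO",
    "Benzene": "c1ccccc1",
    "Acetone": "CC(=O)C",
}

# Compute a handful of descriptors for each compound.
rows = {}
for name, smiles in compounds.items():
    mol = Chem.MolFromSmiles(smiles)
    rows[name] = {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Descriptors.NumHDonors(mol),     # hydrogen-bond donors
        "HBA": Descriptors.NumHAcceptors(mol),  # hydrogen-bond acceptors
        "TPSA": Descriptors.TPSA(mol),          # topological polar surface area
    }

# Print a simple aligned table.
print(f"{'Name':<10}{'MolWt':>8}{'LogP':>8}{'HBD':>5}{'HBA':>5}{'TPSA':>8}")
for name, r in rows.items():
    print(
        f"{name:<10}{r['MolWt']:>8.2f}{r['LogP']:>8.2f}"
        f"{r['HBD']:>5}{r['HBA']:>5}{r['TPSA']:>8.2f}"
    )
```

Keeping the results in a dictionary per compound makes it easy to move to a spreadsheet or a pandas DataFrame later.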

5.1 Atom counts (and the hydrogen “gotcha”)

RDKit often treats hydrogens as implicit by default.

# Heavy atoms (non-hydrogen)
heavy_count = ethanol.GetNumAtoms()

# Total atoms (including hydrogens)
all_atom_count = ethanol.GetNumAtoms(onlyExplicit=False)

print("Ethanol heavy atom count:", heavy_count)
print("Ethanol total atom count (with H):", all_atom_count)

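You should see something like:

Ethanol heavy atom count: 3
Ethanol total atom count (with H): 9

Because ethanol is C2H6O: 2 carbons + 1 oxygen = 3 heavy atoms, and adding the 6 hydrogens gives 9 atoms in total.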

Key point: Many RDKit functions only count heavy atoms by default. If your counts look “too small”, check whether hydrogens are implicit.

You can add explicit hydrogens with:

ethanol_with_H = Chem.AddHs(ethanol)

but you don’t usually need this until more advanced tasks.
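To make the difference concrete, this short sketch counts atoms before and after adding hydrogens, then removes them again with Chem.RemoveHs.

```python
from rdkit import Chem

ethanol = Chem.MolFromSmiles("CCO")
ethanol_with_H = Chem.AddHs(ethanol)

# Hydrogens are implicit in the first molecule and real graph atoms in the second.
print("Implicit H:", ethanol.GetNumAtoms())
print("Explicit H:", ethanol_with_H.GetNumAtoms())

# The SMILES of the second molecule now spells out every hydrogen.
print(Chem.MolToSmiles(ethanol_with_H))

# RemoveHs goes back to the implicit-hydrogen form.
ethanol_again = Chem.RemoveHs(ethanol_with_H)
print(Chem.MolToSmiles(ethanol_again))
```

Note that AddHs returns a new molecule and leaves the original unchanged.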


6. Comparing Molecules by Similarity

A classic cheminformatics task: “Find molecules similar to this one.”

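To do that, RDKit uses two ingredients: fingerprints, which encode a molecule’s structural features as a vector of bits, and a similarity metric such as the Tanimoto coefficient, which scores the overlap between two fingerprints from 0 (no bits in common) to 1 (identical fingerprints).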

6.1 Fingerprints + Tanimoto

Let’s compare ethanol and propanol:

from rdkit import DataStructs

mol1 = ethanol
mol2 = Chem.MolFromSmiles("CCCO")  # 1-propanol

fp1 = Chem.RDKFingerprint(mol1)
fp2 = Chem.RDKFingerprint(mol2)

sim = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Similarity between ethanol and propanol: {sim:.2f}")

You’ll likely get a similarity around 0.6–0.7 (they’re quite similar).

Now compare ethanol to benzene:

fp_benzene = Chem.RDKFingerprint(benzene)
sim2 = DataStructs.TanimotoSimilarity(fp1, fp_benzene)
print(f"Similarity between ethanol and benzene: {sim2:.2f}")

You should see a much lower similarity (they’re structurally very different).

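Interpretation: a Tanimoto score of 1.0 means the two fingerprints are identical, and a score near 0 means they share almost no features. The exact number depends on which fingerprint you use, so only compare scores computed the same way.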

Later, you can explore other fingerprints (e.g. Morgan/ECFP) and similarity metrics, but this basic pattern (fingerprint + Tanimoto) is enough to build simple similarity searches.
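As a taste of that, here is a minimal sketch of the same comparison with Morgan fingerprints, using RDKit's rdFingerprintGenerator module (available in reasonably recent releases). The radius of 2 and the 2048-bit size are common illustrative settings, not values taken from this tutorial, and morgan_similarity is a helper name invented for the example.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Illustrative settings: radius 2 and 2048 bits.
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def morgan_similarity(smiles_a, smiles_b):
    """Tanimoto similarity between the Morgan fingerprints of two SMILES."""
    fp_a = morgan.GetFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = morgan.GetFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


sim_propanol = morgan_similarity("CCO", "CCCO")
sim_benzene = morgan_similarity("CCO", "c1ccccc1")

print(f"Ethanol vs 1-propanol (Morgan): {sim_propanol:.2f}")
print(f"Ethanol vs benzene (Morgan):    {sim_benzene:.2f}")
```

The absolute numbers will differ from the RDKFingerprint results, but the ranking should tell the same story: propanol is closer to ethanol than benzene is.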


7. Substructure Search (Finding Functional Groups)

Want to know if a molecule contains a particular scaffold, like a benzene ring?

We can use one molecule as a substructure pattern and see if it’s found inside another.

phenol = Chem.MolFromSmiles("c1ccccc1O")  # phenol
benzene_pattern = Chem.MolFromSmiles("c1ccccc1")

match = phenol.HasSubstructMatch(benzene_pattern)
print("Does phenol contain a benzene ring?", match)

Expected:

Does phenol contain a benzene ring? True

This can scale up: search a library of molecules for a specific functional group or pharmacophore by repeating HasSubstructMatch over many molecules.
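Scaled up to a (very small) library, that loop might look like the sketch below. The library dictionary is made-up example data.

```python
from rdkit import Chem

# A toy compound library: name -> SMILES.
library = {
    "Phenol": "c1ccccc1O",
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Ethanol": "CCO",
    "Acetone": "CC(=O)C",
}

# The substructure we are looking for: a benzene ring.
pattern = Chem.MolFromSmiles("c1ccccc1")

hits = []
for name, smiles in library.items():
    mol = Chem.MolFromSmiles(smiles)
    # Skip unparsable entries, keep the ones that contain the pattern.
    if mol is not None and mol.HasSubstructMatch(pattern):
        hits.append(name)

print("Compounds containing a benzene ring:", hits)
```

Aspirin is written in Kekulé form here, but RDKit perceives the ring as aromatic when it parses the SMILES, so it still matches the aromatic pattern.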

Note: For advanced pattern searches, RDKit supports SMARTS (like SMILES, but for patterns), but using SMILES as a pattern is enough for basic substructure checks.
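For a first look at SMARTS, the sketch below searches for a hydroxyl group. The pattern [OX2H] is one common way to write "an oxygen with two connections, one of which is a hydrogen"; treat it as an illustrative choice, since other hydroxyl patterns exist.

```python
from rdkit import Chem

# SMARTS pattern for a hydroxyl oxygen (alcohols, phenols, carboxylic acids).
hydroxyl = Chem.MolFromSmarts("[OX2H]")

examples = [("Ethanol", "CCO"), ("Phenol", "c1ccccc1O"), ("Acetone", "CC(=O)C")]

results = {}
for name, smiles in examples:
    mol = Chem.MolFromSmiles(smiles)
    # Each match is a tuple of atom indices in the molecule (0-based).
    results[name] = mol.GetSubstructMatches(hydroxyl)
    print(name, results[name])
```

GetSubstructMatches returns every match rather than just True or False, which is useful when you want to know where the group sits or how many times it occurs.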


8. Common Pitfalls for Beginners

Here are some things that often trip up new RDKit users (especially chemists new to Python):

8.1 Import and environment issues

If you see:

ModuleNotFoundError: No module named 'rdkit'

it means Python is not using the environment where RDKit is installed.

8.2 Invalid SMILES → None

mol = Chem.MolFromSmiles("C1=CC=CC=C1O")      # OK
mol_bad = Chem.MolFromSmiles("C1=CC=CC=Z1O")  # invalid

print(mol)
print(mol_bad)

mol_bad will be None if the SMILES is invalid.

Always check:

mol = Chem.MolFromSmiles(smiles)
if mol is None:
    print("Invalid SMILES:", smiles)
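When you load many SMILES at once, it helps to collect the failures instead of stopping at the first one. Here is a minimal sketch; parse_smiles_list is a hypothetical helper written for this example, not part of RDKit.

```python
from rdkit import Chem


def parse_smiles_list(smiles_list):
    """Split a list of SMILES into parsed molecules and unparsable strings."""
    valid, invalid = [], []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            invalid.append(smiles)
        else:
            # Keep the original string next to the molecule for reference.
            valid.append((smiles, mol))
    return valid, invalid


valid, invalid = parse_smiles_list(["CCO", "C1=CC=CC=Z1O", "c1ccccc1"])

print("Parsed:", [smiles for smiles, _ in valid])
print("Rejected:", invalid)
```

RDKit will still print a parse error for each bad string; the list of rejected entries just gives you something to review afterwards.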

8.3 Hydrogens (again)

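Remember: hydrogens are implicit by default, so GetNumAtoms() counts only heavy atoms. Use GetNumAtoms(onlyExplicit=False) to include them in a count, or Chem.AddHs(mol) when you need them as real atoms.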

8.4 Sanitization

RDKit usually sanitizes molecules automatically when you create them. If you start editing atoms/bonds manually, call:

Chem.SanitizeMol(mol)

This checks chemical validity (valences, aromaticity, etc.), recomputes the derived properties, and raises an error if something is inconsistent.
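To see sanitization catch a problem, the sketch below parses a deliberately impossible structure with sanitize=False and then sanitizes it by hand. The five-bonded carbon is an artificial example, and catching the broad Exception class is a simplification so the snippet works across RDKit versions.

```python
from rdkit import Chem

# A carbon with five bonds: chemically invalid.
# sanitize=False tells RDKit to build the molecule without checking it.
mol = Chem.MolFromSmiles("C(C)(C)(C)(C)C", sanitize=False)

try:
    Chem.SanitizeMol(mol)
    sanitize_error = None
    print("Molecule is chemically consistent.")
except Exception as err:  # broad on purpose; RDKit raises a valence error here
    sanitize_error = err
    print("Sanitization failed:", err)
```

With the default settings, MolFromSmiles would have run this check itself and simply returned None for the same string.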


9. Putting It All Together

Let’s combine what we’ve learned into a small, realistic workflow:

Task: You have a small set of compounds.

  1. Filter out molecules heavier than 200 Da.
  2. Among the remaining ones, find which is most similar to benzene.

9.1 Data setup

from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit import DataStructs

smiles_list = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "c1ccccc1",                  # Benzene
    "CC(=O)O",                   # Acetic acid
    "CCO"                        # Ethanol
]

molecules = [Chem.MolFromSmiles(s) for s in smiles_list]

9.2 Step 1 – Filter by molecular weight (< 200 Da)

light_mols = []
for mol in molecules:
    if mol is None:
        continue
    if Descriptors.MolWt(mol) < 200:
        light_mols.append(mol)

print(f"Molecules under 200 Da: {len(light_mols)}")
for mol in light_mols:
    print(Chem.MolToSmiles(mol), Descriptors.MolWt(mol))

You’ll see which molecules are under 200. (In this set, all are under 200, but you can add heavier ones like small peptides to see filtering in action.)

9.3 Step 2 – Find the molecule most similar to benzene

reference = Chem.MolFromSmiles("c1ccccc1")  # benzene
ref_fp = Chem.RDKFingerprint(reference)

most_sim = -1
most_sim_mol = None

for mol in light_mols:
    fp = Chem.RDKFingerprint(mol)
    sim = DataStructs.TanimotoSimilarity(fp, ref_fp)
    if sim > most_sim:
        most_sim = sim
        most_sim_mol = mol

print("Most similar to benzene:")
print("  SMILES:", Chem.MolToSmiles(most_sim_mol))
print("  Tanimoto:", f"{most_sim:.2f}")

In this tiny example, benzene itself will have similarity 1.0. If you exclude benzene from the candidates, aspirin should come out more similar than acetic acid or ethanol, because it also contains an aromatic ring.
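If you want the full ranking rather than only the top hit, one possible variation is sketched below. It leaves benzene out of the candidates and scores everything in one call with DataStructs.BulkTanimotoSimilarity; the candidates dictionary is example data.

```python
from rdkit import Chem, DataStructs

# Candidates, with benzene itself left out.
candidates = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Acetic acid": "CC(=O)O",
    "Ethanol": "CCO",
}

reference = Chem.MolFromSmiles("c1ccccc1")  # benzene
ref_fp = Chem.RDKFingerprint(reference)

# Fingerprint every candidate, keeping names and fingerprints in the same order.
names = list(candidates)
fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(candidates[n])) for n in names]

# One call scores the reference against the whole list.
scores = DataStructs.BulkTanimotoSimilarity(ref_fp, fps)

# Sort from most to least similar.
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:<12} {score:.2f}")
```

For a library of thousands of compounds, computing the fingerprints once and reusing them across queries is what keeps this kind of search fast.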

This is basically the skeleton of a similarity search pipeline.


10. Where to Go Next

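You now know how to:

  1. Create molecules from SMILES and write them back out.
  2. Draw molecules, alone or side by side.
  3. Calculate basic descriptors such as molecular weight and LogP.
  4. Compare molecules with fingerprints and Tanimoto similarity.
  5. Check whether a molecule contains a given substructure.
  6. Combine these steps into a simple filter-and-search workflow.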

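Next steps you might explore: other fingerprints such as Morgan/ECFP, other similarity metrics, and SMARTS patterns for more precise substructure searches.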
