RDKit Tutorial for Chemists: Beginner’s Guide to Python Cheminformatics
Why Another RDKit Tutorial?
There are already many RDKit tutorials out there, but most of them assume one of two things:
- You’re already comfortable with Python.
- You’re a developer who happens to care about chemistry.
If you’re a chemist with little programming experience, a lot of those resources can feel:
- Too focused on API details, not enough on why you’d use a feature.
- Light on installation steps (or they gloss over tricky parts).
- Full of examples, but short on end-to-end, practical tasks.
- Heavy on jargon: “descriptors”, “fingerprints”, “Tanimoto” without gentle explanations.
This tutorial is designed to fill that gap:
- You’re a chemist first, coder second.
- We’ll keep the Python gentle and explain what each piece does.
- We’ll focus on practical use cases: drawing molecules, computing basic properties, and comparing structures.
- We’ll highlight common pitfalls early so you avoid frustration.
1. What Is RDKit (and Why Should Chemists Care)?
RDKit is a free, open-source toolkit for cheminformatics. In practice, that means it helps you:
- Represent molecules digitally (atoms, bonds, rings, etc.).
- Compute properties like molecular weight, LogP, ring counts, etc.
- Draw molecules as 2D images in notebooks or scripts.
- Search and compare molecules (substructure search, similarity search).
- Integrate your chemistry data with Python tools (pandas, scikit-learn, etc.).
You can think of RDKit as a Swiss army knife for chemists in Python.
Good news:
You don’t need to be a Python expert to start. If you’re willing to type a few short scripts and you understand basic chemistry, you’re ready.
2. Installation (Short & Skippable)
If you already have RDKit installed and working, skip this section.
The easiest and most reliable way to install RDKit is with conda (Anaconda or Miniconda).
2.1 Install via conda
Open a terminal (or Anaconda Prompt on Windows) and run:
# Create a new environment with RDKit installed
conda create -n my_rdkit_env -c conda-forge rdkit
# Activate the environment
conda activate my_rdkit_env
Now, in that environment, you can start Python or Jupyter Notebook and import RDKit.
python
Then in Python:
from rdkit import Chem
If that runs with no error, you’re good.
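A slightly fuller check, as a minimal sketch: it prints the installed RDKit version and parses one trivial molecule, so you know the chemistry core works and not just the import. The variable name water is only illustrative.

```python
# Sanity check: confirm RDKit imports and report which version is installed.
import rdkit
from rdkit import Chem

print("RDKit version:", rdkit.__version__)

# Parsing a trivial molecule confirms the chemistry core works too.
water = Chem.MolFromSmiles("O")
print("Parsed water:", water is not None)
```

If both lines print without an error, the environment is ready for the rest of the tutorial.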
2.2 Quick alternatives (if conda is not an option)
- Google Colab: In a Colab notebook cell:
!pip install rdkit
- pip locally (depends on platform & version support):
pip install rdkit
Older guides use the package name rdkit-pypi; the PyPI package is now published simply as rdkit.
If you get stuck, the simplest way for beginners is: use conda or use Colab.
3. Your First Molecule in RDKit
We’ll start with SMILES:
- SMILES = Simplified Molecular Input Line Entry System
- It’s a text string that describes a molecule.
- Think of it as a barcode for a molecule.
Examples:
- Water: O
- Ethanol (CH₃–CH₂–OH): CCO
- Benzene: c1ccccc1
Let’s turn a SMILES string into a molecule object in RDKit.
from rdkit import Chem
# Create a molecule object from a SMILES string
ethanol = Chem.MolFromSmiles("CCO")
print(ethanol)
You’ll see something like:
<rdkit.Chem.rdchem.Mol object at 0x...>
That means RDKit successfully created an internal Mol object.
To confirm:
print(Chem.MolToSmiles(ethanol))
You should see:
CCO
RDKit writes canonical SMILES, so the output may reorder atoms or switch from Kekulé to aromatic notation (for example, C1=CC=CC=C1 comes back as c1ccccc1), but it still represents the same structure.
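To see canonicalization in action, here is a small sketch that writes ethanol three different ways and checks that RDKit maps all of them to one canonical string. The list name variants and the flag same_molecule are just illustrative names.

```python
from rdkit import Chem

# Three different ways of writing ethanol as SMILES.
variants = ["CCO", "OCC", "C(O)C"]

# Parse each string, then write it back out in canonical form.
canonical = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants]
for original, canon in zip(variants, canonical):
    print(f"{original:>6} -> {canon}")

# If all inputs describe the same molecule, they share one canonical SMILES.
same_molecule = len(set(canonical)) == 1
print("Same molecule?", same_molecule)
```

Comparing canonical SMILES like this is a simple way to spot duplicates in a compound list.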
Notebook tip: In Jupyter, just put ethanol as the last line in a cell and run it. RDKit will often show a nice 2D structure.
4. Visualizing Molecules (2D Drawings)
A picture is worth a thousand SMILES.
RDKit’s Draw module can generate images of molecules.
from rdkit.Chem import Draw
# Single molecule image
img = Draw.MolToImage(ethanol)
img.show()
In a Jupyter notebook, just put img as the last line of a cell and it should display inline.
4.1 Drawing multiple molecules side by side
Let’s add benzene and acetone and draw them together:
benzene = Chem.MolFromSmiles("c1ccccc1") # benzene
acetone = Chem.MolFromSmiles("CC(=O)C") # acetone
mols = [ethanol, benzene, acetone]
img2 = Draw.MolsToImage(
    mols,
    legends=["Ethanol", "Benzene", "Acetone"]
)
img2.show()
You should see three structures in one image, each labeled.
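For more than a handful of molecules, a grid layout is often easier to read than a single row. The sketch below uses Draw.MolsToGridImage; the file name molecule_grid.png and the extra phenol entry are assumptions added for illustration.

```python
from rdkit import Chem
from rdkit.Chem import Draw

names = ["Ethanol", "Benzene", "Acetone", "Phenol"]
smiles = ["CCO", "c1ccccc1", "CC(=O)C", "c1ccccc1O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Two molecules per row, each drawn in a 200 x 200 pixel cell.
grid = Draw.MolsToGridImage(
    mols,
    molsPerRow=2,
    subImgSize=(200, 200),
    legends=names,
    returnPNG=False,  # ask for an image object rather than raw PNG bytes
)

# In a plain Python script this is a PIL image, so it can be saved directly.
# (In Jupyter the return type can differ; there, just display `grid`.)
grid.save("molecule_grid.png")
```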
If img.show() doesn’t display anything:
- In Jupyter, just output img or img2 at the end of the cell.
- Or save to a file:
img2.save("three_molecules.png")
5. Calculating Basic Properties (Descriptors)
RDKit can compute many descriptors — numerical values representing molecular properties.
Let’s start with:
- Molecular weight (MolWt)
- LogP (MolLogP) – a hydrophobicity estimate
from rdkit.Chem import Descriptors
mw = Descriptors.MolWt(ethanol)
logp = Descriptors.MolLogP(ethanol)
print(f"Ethanol molecular weight: {mw:.2f}")
print(f"Ethanol LogP: {logp:.2f}")
Expected output (approx.):
Ethanol molecular weight: 46.07
Ethanol LogP: -0.00
MolLogP is a calculated (Wildman-Crippen) estimate, not an experimental value; for ethanol it comes out just below zero (about -0.001), which prints as -0.00. Now compare with benzene:
print(f"Benzene molecular weight: {Descriptors.MolWt(benzene):.2f}")
print(f"Benzene LogP: {Descriptors.MolLogP(benzene):.2f}")
Benzene has a higher MW (~78.11) and higher LogP (more hydrophobic).
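The same pattern extends to several molecules and several descriptors at once. Below is a minimal sketch that builds a small property table; the dictionary compounds and the choice of extra descriptors (TPSA, H-bond donors and acceptors) are illustrative additions, not part of the walkthrough above.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

compounds = {"ethanol": "CCO", "benzene": "c1ccccc1", "acetone": "CC(=O)C"}

rows = []
for name, smi in compounds.items():
    mol = Chem.MolFromSmiles(smi)
    rows.append({
        "name": name,
        "MolWt": Descriptors.MolWt(mol),         # average molecular weight
        "LogP": Descriptors.MolLogP(mol),        # Crippen LogP estimate
        "TPSA": Descriptors.TPSA(mol),           # topological polar surface area
        "HBD": Descriptors.NumHDonors(mol),      # H-bond donors
        "HBA": Descriptors.NumHAcceptors(mol),   # H-bond acceptors
    })

# Print a simple aligned table.
print(f"{'name':<10}{'MolWt':>8}{'LogP':>8}{'TPSA':>8}{'HBD':>5}{'HBA':>5}")
for r in rows:
    print(f"{r['name']:<10}{r['MolWt']:>8.2f}{r['LogP']:>8.2f}"
          f"{r['TPSA']:>8.2f}{r['HBD']:>5}{r['HBA']:>5}")
```

A list of dictionaries like rows can later be passed straight to pandas.DataFrame if you want a real table.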
5.1 Atom counts (and the hydrogen “gotcha”)
RDKit often treats hydrogens as implicit by default.
# Heavy atoms (non-hydrogen)
heavy_count = ethanol.GetNumAtoms()
# Total atoms (including hydrogens)
all_atom_count = ethanol.GetNumAtoms(onlyExplicit=False)
print("Ethanol heavy atom count:", heavy_count)
print("Ethanol total atom count (with H):", all_atom_count)
You should see something like:
Ethanol heavy atom count: 3
Ethanol total atom count (with H): 9
Because:
- Ethanol has 2 C + 1 O = 3 heavy atoms.
- Total = 2 C + 1 O + 6 H = 9 atoms.
Key point: Many RDKit functions only count heavy atoms by default. If your counts look “too small”, check whether hydrogens are implicit.
You can add explicit hydrogens with:
ethanol_with_H = Chem.AddHs(ethanol)
but you don’t usually need this until more advanced tasks.
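To make the implicit/explicit distinction concrete, this short sketch counts atoms before adding hydrogens, after adding them, and after removing them again. The variable names are illustrative.

```python
from rdkit import Chem

ethanol = Chem.MolFromSmiles("CCO")

# AddHs returns a new molecule; the original object is left unchanged.
ethanol_with_h = Chem.AddHs(ethanol)

# RemoveHs goes the other way and returns the heavy-atom-only molecule.
ethanol_stripped = Chem.RemoveHs(ethanol_with_h)

print("Implicit H (default):", ethanol.GetNumAtoms())
print("After AddHs:         ", ethanol_with_h.GetNumAtoms())
print("After RemoveHs:      ", ethanol_stripped.GetNumAtoms())
```

The counts go from 3 to 9 and back to 3, matching the heavy-atom and total-atom numbers worked out above.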
6. Comparing Molecules by Similarity
A classic cheminformatics task: “Find molecules similar to this one.”
To do that, RDKit uses:
- Fingerprints: bit vectors encoding structural features.
- Tanimoto similarity: a number between 0 (no overlap) and 1 (identical fingerprints).
6.1 Fingerprints + Tanimoto
Let’s compare ethanol and propanol:
from rdkit import DataStructs
mol1 = ethanol
mol2 = Chem.MolFromSmiles("CCCO") # 1-propanol
fp1 = Chem.RDKFingerprint(mol1)
fp2 = Chem.RDKFingerprint(mol2)
sim = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Similarity between ethanol and propanol: {sim:.2f}")
You’ll likely get a similarity around 0.6–0.7 (they’re quite similar).
Now compare ethanol to benzene:
fp_benzene = Chem.RDKFingerprint(benzene)
sim2 = DataStructs.TanimotoSimilarity(fp1, fp_benzene)
print(f"Similarity between ethanol and benzene: {sim2:.2f}")
You should see a much lower similarity (they’re structurally very different).
Interpretation (rough rules of thumb; the exact cutoffs depend on the fingerprint type):
- 0.8–1.0: very similar or identical.
- 0.5–0.8: moderately similar (often interesting in drug discovery).
- < 0.3: very different.
Later, you can explore other fingerprints (e.g. Morgan/ECFP) and similarity metrics, but this basic pattern (fingerprint + Tanimoto) is enough to build simple similarity searches.
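As a first taste of Morgan fingerprints, here is a minimal sketch of the same fingerprint + Tanimoto pattern. It assumes a reasonably recent RDKit release that provides the rdFingerprintGenerator module; the helper name morgan_similarity and the settings (radius 2, 2048 bits) are illustrative choices, not recommendations from this tutorial.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# One generator object can be reused for every molecule.
# radius=2 and fpSize=2048 are common settings, chosen here as an assumption.
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def morgan_similarity(smiles_a, smiles_b):
    """Tanimoto similarity between the Morgan fingerprints of two SMILES."""
    fp_a = morgan.GetFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = morgan.GetFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

print(f"ethanol vs 1-propanol: {morgan_similarity('CCO', 'CCCO'):.2f}")
print(f"ethanol vs benzene:    {morgan_similarity('CCO', 'c1ccccc1'):.2f}")
```

The absolute numbers will differ from the RDKFingerprint values, which is why similarity cutoffs should always be quoted together with the fingerprint used.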
7. Substructure Search (Finding Functional Groups)
Want to know if a molecule contains a particular scaffold, like a benzene ring?
We can use one molecule as a substructure pattern and see if it’s found inside another.
phenol = Chem.MolFromSmiles("c1ccccc1O") # phenol
benzene_pattern = Chem.MolFromSmiles("c1ccccc1")
match = phenol.HasSubstructMatch(benzene_pattern)
print("Does phenol contain a benzene ring?", match)
Expected:
Does phenol contain a benzene ring? True
This can scale up: search a library of molecules for a specific functional group or pharmacophore by repeating HasSubstructMatch over many molecules.
Note: For advanced pattern searches, RDKit supports SMARTS (like SMILES, but for patterns), but using SMILES as a pattern is enough for basic substructure checks.
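For a functional group rather than a whole ring, a SMARTS pattern is usually the better tool. The sketch below looks for hydroxyl groups using the pattern [OX2H] (an oxygen with two connections, one of them a hydrogen); the pattern choice and the dictionary samples are assumptions for illustration.

```python
from rdkit import Chem

# SMARTS for a hydroxyl oxygen: O with two connections, one of which is H.
hydroxyl = Chem.MolFromSmarts("[OX2H]")

samples = {
    "ethanol": "CCO",
    "phenol": "c1ccccc1O",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(=O)O",
}

matches = {}
for name, smi in samples.items():
    mol = Chem.MolFromSmiles(smi)
    # Each match is a tuple of atom indices, one index per pattern atom.
    matches[name] = mol.GetSubstructMatches(hydroxyl)
    print(f"{name:<12} hydroxyl groups: {len(matches[name])}  atoms: {matches[name]}")
```

Note that this simple pattern also matches the OH of a carboxylic acid; a more specific SMARTS would be needed to tell alcohols and acids apart.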
8. Common Pitfalls for Beginners
Here are some things that often trip up new RDKit users (especially chemists new to Python):
8.1 Import and environment issues
If you see:
ModuleNotFoundError: No module named 'rdkit'
it means Python is not using the environment where RDKit is installed.
- Make sure to run conda activate my_rdkit_env (or your environment name) before starting Python or Jupyter.
- In Jupyter, ensure the kernel is using the correct conda environment.
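When it is unclear which Python is actually running, a quick diagnostic like the following can help. It uses only the standard library, so it works even when RDKit is missing; the helper name rdkit_is_importable is hypothetical.

```python
import importlib.util
import sys

def rdkit_is_importable():
    """Return True if the current interpreter can find the rdkit package."""
    return importlib.util.find_spec("rdkit") is not None

# The path shows which environment this interpreter (or Jupyter kernel) belongs to.
print("Python executable:", sys.executable)
print("RDKit importable here:", rdkit_is_importable())
```

If the printed path does not point inside your RDKit environment, the script or notebook kernel is running in the wrong one.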
8.2 Invalid SMILES → None
mol = Chem.MolFromSmiles("C1=CC=CC=C1O") # OK
mol_bad = Chem.MolFromSmiles("C1=CC=CC=Z1O") # invalid
print(mol)
print(mol_bad)
mol_bad will be None if the SMILES is invalid.
Always check:
smiles = "C1=CC=CC=Z1O"  # any SMILES string you want to parse
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    print("Invalid SMILES:", smiles)
8.3 Hydrogens (again)
Remember:
- Many, but not all, calculations treat hydrogens implicitly.
- If you need explicit hydrogens (e.g., for 3D coordinates or detailed analyses):
mol_H = Chem.AddHs(mol)
8.4 Sanitization
RDKit usually sanitizes molecules automatically when you create them. If you start editing atoms/bonds manually, call:
Chem.SanitizeMol(mol)
This checks chemistry validity (valences, aromaticity, etc.) and fixes or raises errors if something is inconsistent.
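To see sanitization catch a problem, the sketch below parses a SMILES with a five-bonded carbon. With default settings the parser returns None; with sanitize=False you get an unchecked molecule, and calling Chem.SanitizeMol on it raises an error. The example SMILES and the variable names are illustrative.

```python
from rdkit import Chem

bad_smiles = "C(C)(C)(C)(C)C"  # central carbon with five bonds: chemically invalid

# Default parsing sanitizes automatically, so this returns None.
print("Default parse:", Chem.MolFromSmiles(bad_smiles))

# Skipping sanitization gives back a molecule object that has not been checked.
raw = Chem.MolFromSmiles(bad_smiles, sanitize=False)

try:
    Chem.SanitizeMol(raw)
    sanitized_ok = True
except Exception as err:  # RDKit raises a sanitization error for the bad valence
    sanitized_ok = False
    print("Sanitization failed:", err)

print("Sanitized OK?", sanitized_ok)
```

RDKit will also print its own error messages to the console here; those are expected.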
9. Mini Case Study: Filter & Similarity Search
Let’s combine what we’ve learned into a small, realistic workflow:
Task: You have a small set of compounds.
- Filter out molecules heavier than 200 Da.
- Among the remaining ones, find which is most similar to benzene.
9.1 Data setup
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit import DataStructs
smiles_list = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "c1ccccc1",                  # Benzene
    "CC(=O)O",                   # Acetic acid
    "CCO",                       # Ethanol
]
molecules = [Chem.MolFromSmiles(s) for s in smiles_list]
9.2 Step 1 – Filter by molecular weight (< 200 Da)
light_mols = []
for mol in molecules:
    if mol is None:
        continue  # skip SMILES that failed to parse
    if Descriptors.MolWt(mol) < 200:
        light_mols.append(mol)
print(f"Molecules under 200 Da: {len(light_mols)}")
for mol in light_mols:
    print(Chem.MolToSmiles(mol), Descriptors.MolWt(mol))
You’ll see which molecules are under 200 Da. (In this set, all four are, with aspirin the heaviest at about 180 Da, but you can add heavier ones like small peptides to see filtering in action.)
9.3 Step 2 – Find the molecule most similar to benzene
reference = Chem.MolFromSmiles("c1ccccc1") # benzene
ref_fp = Chem.RDKFingerprint(reference)
most_sim = -1
most_sim_mol = None
for mol in light_mols:
    fp = Chem.RDKFingerprint(mol)
    sim = DataStructs.TanimotoSimilarity(fp, ref_fp)
    if sim > most_sim:
        most_sim = sim
        most_sim_mol = mol
print("Most similar to benzene:")
print("  SMILES:", Chem.MolToSmiles(most_sim_mol))
print("  Tanimoto:", f"{most_sim:.2f}")
In this tiny example, benzene itself will have similarity 1.0. If you exclude benzene from the candidates, aspirin should come out more similar than acetic acid or ethanol, because it also contains an aromatic ring.
This is basically the skeleton of a similarity search pipeline.
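One way to turn that skeleton into a reusable piece is a function that returns every candidate ranked by similarity, not just the best one. This is a sketch under stated assumptions: the function name rank_by_similarity is hypothetical, it keeps the RDKFingerprint used above, and it silently skips SMILES that fail to parse.

```python
from rdkit import Chem, DataStructs

def rank_by_similarity(query_smiles, candidate_smiles):
    """Return (smiles, tanimoto) pairs sorted from most to least similar."""
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        raise ValueError(f"Invalid query SMILES: {query_smiles}")
    query_fp = Chem.RDKFingerprint(query)

    valid_smiles, fps = [], []
    for smi in candidate_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip anything RDKit cannot parse
        valid_smiles.append(smi)
        fps.append(Chem.RDKFingerprint(mol))

    # Compare the query against all candidate fingerprints in one call.
    scores = DataStructs.BulkTanimotoSimilarity(query_fp, fps)
    return sorted(zip(valid_smiles, scores), key=lambda pair: pair[1], reverse=True)

candidates = ["CC(=O)OC1=CC=CC=C1C(=O)O", "c1ccccc1", "CC(=O)O", "CCO", "not_a_smiles"]
ranked = rank_by_similarity("c1ccccc1", candidates)
for smi, score in ranked:
    print(f"{score:.2f}  {smi}")
```

Returning the full ranking makes it easy to take the top few hits or apply a similarity cutoff afterwards.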
10. Where to Go Next
You now know how to:
- Install and import RDKit.
- Turn SMILES into molecule objects.
- Visualize molecules.
- Compute basic descriptors (MW, LogP).
- Compare molecules using fingerprints and Tanimoto.
- Do simple substructure searches.
- Chain everything into a mini workflow.
Next steps you might explore:
- More descriptors: polar surface area, H-bond donors/acceptors, etc.
- Lipinski’s Rule of Five filters on a compound library.
- Morgan (ECFP) fingerprints for better similarity and ML tasks.
- Generating 3D conformers and basic geometry.
- RDKit + pandas: store molecules and properties in tables.
- Build a simple QSAR model by using RDKit descriptors as features.
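As a starting point for the Rule of Five item, here is a minimal sketch built only from descriptors. The function name passes_rule_of_five is hypothetical, and allowing one violation by default is a common reading of the rule that is assumed here, not something this tutorial defines.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def passes_rule_of_five(mol, max_violations=1):
    """Count Lipinski violations; allowing one violation is an assumption."""
    violations = sum([
        Descriptors.MolWt(mol) > 500,          # molecular weight
        Descriptors.MolLogP(mol) > 5,          # calculated LogP
        Descriptors.NumHDonors(mol) > 5,       # H-bond donors
        Descriptors.NumHAcceptors(mol) > 10,   # H-bond acceptors
    ])
    return violations <= max_violations

aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
long_alkane = Chem.MolFromSmiles("C" * 40)  # heavy and very lipophilic

print("Aspirin passes:    ", passes_rule_of_five(aspirin))
print("C40 alkane passes: ", passes_rule_of_five(long_alkane))
```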