DESeq2 in Python: Beginner's Tutorial with PyDESeq2 for RNA-seq Analysis
Audience: beginners in bioinformatics who know basic Python / data science
Goal: understand what DESeq2 does (the “real knowledge”), then run it end-to-end in Python using PyDESeq2—without getting lost in setup or platform details.
TL;DR
- What problem it solves: detect genes that change expression between conditions using a negative-binomial GLM that models RNA-seq counts, not TPM/FPKM. Shrinkage and multiple-testing control make results stable and interpretable. (BioMed Central )
- What you must feed it: a count matrix of integers (genes × samples) and a sample metadata table (conditions, batches, etc.). Don’t use TPMs/FPKMs as input. (rdrr.io )
- How it normalizes: by estimating size factors via the median-of-ratios method so samples are comparable. (RDocumentation )
- Python option: use PyDESeq2, a from-scratch Python implementation of the DESeq2 workflow (no R needed). You’ll create a
DeseqDataSet, fit the model withdeseq2(), run stats withDeseqStats, and (optionally) shrink log fold-changes. (PyDESeq2 )
If you just want “DESeq2 in Python” as a recipe: scroll to the “Minimal Python workflow with PyDESeq2” and the “A compact, reusable template” sections.
Before you start: install PyDESeq2 and organize your data
You’ll need:
- Python ≥3.10
- A recent version of PyDESeq2
- Your RNA-seq data as:
- A count matrix (
counts.csv) - A sample metadata table (
metadata.csv)
- A count matrix (
Install PyDESeq2 (quick start)
# (recommended) in a fresh conda environment
conda create -n pydeseq2 python=3.11
conda activate pydeseq2
# install from PyPI
pip install pydeseq2
# Optional but handy for this tutorial
pip install matplotlib scikit-learn pandas(You can also install via Bioconda: conda install -c bioconda pydeseq2.)
What your input files should look like
counts.csv – integer counts, samples in rows, genes in columns:
GeneA GeneB GeneC ...
sample_1 123 45 0
sample_2 98 30 5
sample_3 150 80 1
...metadata.csv – sample information, same row names as counts.csv:
condition batch
sample_1 treated batch1
sample_2 control batch1
sample_3 treated batch2
...The row names of
metadatamust match the row names ofcounts(sample IDs). This is the most common beginner pitfall.
Core ideas (in plain language)
Counts, not continuous expression
RNA-seq yields integer read counts per gene per sample. DESeq2 models these with a negative binomial distribution whose variability (“dispersion”) depends on the mean. It pools information across genes to stabilize dispersion and shrinks effect sizes (log2 fold changes, LFCs) toward zero when evidence is weak—reducing false positives and improving ranking. P-values are adjusted with Benjamini–Hochberg FDR. (BioMed Central )
Normalization via size factors
Some libraries are sequenced deeper than others. DESeq2’s median-of-ratios method estimates one scaling factor per sample and divides counts by that factor. This handles library size and composition effects without using gene length (gene length is irrelevant for between-sample testing of the same gene). (RDocumentation )
Design formulas and contrasts
You declare your experimental design (e.g., ~ batch + condition) and then choose a contrast to ask, “treated vs control, holding batch constant.” The design can include multiple factors (batch, sex, etc.) and interactions when needed. (PyDESeq2 )
A few common designs:
| Scenario | Design formula | What it means |
|---|---|---|
| Simple 2-group comparison | ~ condition | treated vs control |
| Batch-corrected comparison | ~ batch + condition | treated vs control, adjusting for batch |
| Paired / subject-level samples | ~ subject + condition | before/after per subject, blocking on subject |
| Time-course with treatment | ~ condition + time + condition:time | different time trends per condition |
Later, you define a contrast such as ["condition", "treated", "control"] to ask: “Is treated different from control?”
Independent filtering & outliers
Very low-count genes rarely show significant changes; DESeq2 performs independent filtering (and flags some adjusted p-values as NA) to improve power. It also uses Cook’s distance to detect count outliers. (Bioconductor )
Minimal Python workflow with PyDESeq2
The snippet below focuses on the analysis steps. It assumes you have a count matrix and a metadata table in memory (or load from CSV). Class and method names follow the official PyDESeq2 examples. (PyDESeq2 )
0) Inputs you need
counts_df: samples × genes (integers). Rows = sample IDs, columns = gene IDs.metadata: samples × variables. Must include your grouping variable(s), e.g.,condition(treated/control), and optional covariates likebatch.
import pandas as pd
# Example: load your own data
counts_df = pd.read_csv("counts.csv", index_col=0) # rows: samples, cols: genes, values: counts (ints)
metadata = pd.read_csv("metadata.csv", index_col=0) # rows: samples, cols include e.g. 'condition','batch'
# Quick sanity check
print(counts_df.shape, metadata.shape)
print(counts_df.index.equals(metadata.index)) # should be TrueSanity checks:
- Indices (row names) must match between
counts_dfandmetadata. - Counts should be non-negative integers.
- For light prefiltering, keep genes with, say, ≥10 total counts across all samples (removes almost-never-expressed genes). (Griffith Lab )
# Drop genes with extremely low overall counts (optional but common)
keep = counts_df.sum(axis=0) >= 10
counts_df = counts_df.loc[:, keep]1) Build and fit the DESeq2 model
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.default_inference import DefaultInference
# Example design: adjust for batch and test condition
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~ batch + condition", # change to your variables
refit_cooks=True,
inference=inference,
)
# Fit size factors, dispersions, and GLM coefficients
dds.deseq2()What just happened:
- Size factors and normalized counts were computed (median-of-ratios).
- Gene-wise dispersions were estimated, then shrunk via a dispersion prior and trend.
- GLM coefficients (including LFCs) were estimated per gene.
You can inspect pieces in the AnnData-like structure, e.g.:
dds.layers["normed_counts"]dds.var["dispersions"]- the design matrix at
dds.obsm["design_matrix"](PyDESeq2 )
2) Run statistical tests and collect results
# Define the contrast explicitly: variable, test level, reference (control) level
ds = DeseqStats(dds, contrast=["condition", "treated", "control"], inference=inference)
# Run the Wald tests, with default Cook's distance filtering and independent filtering
ds.summary()
# A tidy results table lives here:
res = ds.results_df # columns include: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
res_sig = res.query("padj < 0.05").sort_values("padj")
res_sig.head()- Wald test is used by default for pairwise contrasts; p-values are FDR-adjusted.
- Expect to see
baseMean,log2FoldChange,lfcSE, test statistic,pvalue, andpadj. - Some
padjmay beNAdue to independent filtering (low-information genes). (PyDESeq2 )
3) (Optional) Shrink log2 fold changes
Shrunken LFCs are easier to rank/interpret and produce cleaner plots for low-count genes.
# Check the coefficient name to shrink (from the design matrix columns)
print(dds.obsm["design_matrix"].columns)
# e.g., you might see something like: 'condition[T.treated]'
ds.lfc_shrink(coeff="condition[T.treated]")
res_shrunk = ds.results_dfPyDESeq2 applies MAP LFC shrinkage via lfc_shrink. It overwrites LFCs/SEs in ds with shrunken values and sets ds.shrunk_LFCs = True. (PyDESeq2 )
4) Quick visuals
MA plot (built-in)
import matplotlib.pyplot as plt
ds.plot_MA(s=20) # scatter of mean expression vs LFC; highlights DE genes
plt.show()PCA for QC (on transformed counts)
Do this on transformed counts (variance-stabilized), not raw counts.
# Variance-stabilizing transformation (VST) is available in PyDESeq2
dds.vst() # stores transformed values
vst = dds.layers["vst_counts"] # samples x genes
# Basic PCA with scikit-learn
import numpy as np
from sklearn.decomposition import PCA
X = np.asarray(vst) # make sure it's a numeric array
pca = PCA(n_components=2).fit_transform(X) # PCs across samples
# Plot, color by condition
cond = metadata["condition"].values
import matplotlib.pyplot as plt
plt.scatter(pca[:, 0], pca[:, 1], c=(cond == cond[0])) # quick demo: two colors
for i, sid in enumerate(metadata.index):
plt.text(pca[i, 0], pca[i, 1], sid, fontsize=8)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title("VST-PCA by condition")
plt.show()VST (or rlog in R) stabilizes variance across the mean so distance-based QC (PCA, clustering) behaves well. In PyDESeq2 the transformed matrix is stored in
dds.layers["vst_counts"]. (PyDESeq2 )
Interpreting the results (without hype)
- Sign of
log2FoldChange: positive means higher in the first level of your contrast (treated), negative means higher in the reference (control). padj(FDR): typical reporting threshold is 0.05, but justify your choice scientifically.- Effect size matters: rank by shrunken LFC (plus FDR) to avoid chasing tiny, unstable effects at low counts. (BioMed Central )
- Batch and other covariates: include them in the design (
~ batch + condition) to remove confounding. For stronger unwanted variation, consider methods like SVA/RUV (R packages), or equivalent covariate strategies in your metadata. (Bioconductor )
Common pitfalls (and how to avoid them)
- Using TPM/FPKM as input. Don’t. DESeq2 expects raw integer counts. Use tools like
tximport(R) or generate gene-level count matrices directly (e.g., featureCounts, Salmon quant → summarize to counts). (rdrr.io ) - Too few biological replicates. You need replicates to estimate dispersion; more is better. Power studies suggest ≥6 per group for robust detection in many settings (context-dependent; aim higher when effects are subtle). (PubMed )
- Forgetting batch or pairing. If you ignore known structure (batch, subject), you may call spurious DE. Model it in the design. (Bioconductor )
- Over-filtering or biased filtering. If you prefilter, do it unbiasedly (e.g., total count across samples) and let DESeq2’s independent filtering do its job. (Griffith Lab )
- Misreading coefficient names. Before LFC shrinkage or custom contrasts, print
dds.obsm["design_matrix"].columnsso you shrink/test the intended coefficient. (PyDESeq2 )
A compact, reusable template
from pathlib import Path
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.default_inference import DefaultInference
# 1) Load data
counts = pd.read_csv("counts.csv", index_col=0)
meta = pd.read_csv("metadata.csv", index_col=0)
# Optional: unbiased low-count filter
counts = counts.loc[:, counts.sum(axis=0) >= 10]
# 2) Fit model
dds = DeseqDataSet(
counts=counts, metadata=meta,
design="~ batch + condition",
refit_cooks=True,
inference=DefaultInference(n_cpus=4),
)
dds.deseq2()
# 3) Stats for your contrast
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary() # runs Wald tests + multiple testing + filters
res = ds.results_df
# 4) Optional LFC shrinkage for plotting and ranking
# print(dds.obsm["design_matrix"].columns)
ds.lfc_shrink(coeff="condition[T.treated]")
res_shrunk = ds.results_df
# 5) Save results
Path("out").mkdir(exist_ok=True)
res.to_csv("out/deseq2_results.csv") # raw LFCs
res_shrunk.to_csv("out/deseq2_results_shrunk.csv") # shrunken LFCsEverything in this minimal pipeline follows the PyDESeq2 examples and API (e.g., DeseqDataSet, deseq2(), DeseqStats, results_df, lfc_shrink, design matrix in dds.obsm). (PyDESeq2 )
FAQ for beginners
Q: Can I do all this without R? A: Yes. PyDESeq2 re-implements the DESeq2 pipeline in Python and exposes a consistent API. You still need count matrices as input. (PyDESeq2 )
Q: What if I have more than two groups or a time course?
A: Use a multi-factor design (e.g., ~ batch + condition + time + condition:time) and pick appropriate contrasts. (In R/DESeq2, the likelihood ratio test is often used for multi-level/time-course hypotheses; PyDESeq2 examples emphasize Wald tests and threshold-based variants.) (HBC Training )
Q: Should I use rlog or VST?
A: For QC visualization (PCA/heatmaps), use VST (or rlog in R). Don’t feed transformed counts into the DE tests—use raw counts for modeling. PyDESeq2 provides dds.vst() and stores results in dds.layers["vst_counts"]. (PyDESeq2 )
Q: How many replicates do I need? A: There’s no universal number, but at least ~6 per group is a common target for robust discovery; plan higher for subtle effects. (PubMed )
Key terms (one-liners)
- Count matrix: integer reads per gene × sample. (rdrr.io )
- Size factor: sample-level scaling for normalization (median-of-ratios). (RDocumentation )
- Dispersion: gene-specific variability beyond Poisson; estimated and shrunk. (BioMed Central )
- Design formula: model of your experiment, e.g.,
~ batch + condition. (PyDESeq2 ) - Contrast: the comparison you test (e.g., condition
treatedvscontrol). (PyDESeq2 ) - Wald test: default per-gene test for the chosen contrast; p-values FDR-adjusted. (BioMed Central )
- LFC shrinkage: stabilizes effect sizes for ranking/plotting. (BioMed Central )
- Independent filtering: removes uninformative genes to increase power; may set some
padjtoNA. (Bioconductor ) - Cook’s distance: outlier diagnostic used in filtering/refitting. (PyDESeq2 )
References & further reading
- PyDESeq2 documentation & examples (classes, workflow, plots). (PyDESeq2 )
- DESeq2 paper: Love, Huber, Anders (2014) — canonical description of dispersion/LFC shrinkage and BH FDR. (BioMed Central )
- DESeq2 vignette: guidance on independent filtering, batch handling, and practical tips. (Bioconductor )
- Median-of-ratios normalization (what size factors are). (RDocumentation )
- Replicates and power (planning perspective). (PubMed )
- VST / rlog for QC (why transform for PCA, not for testing). (PyDESeq2 )
Takeaway
If you remember only three things:
- Use raw integer counts + a thoughtful design (include known covariates). (rdrr.io )
- Trust the model’s normalization and shrinkage—don’t pre-normalize with TPM/FPKM for DE testing. (RDocumentation )
- Replicates matter far more than fancy plotting. Aim for enough samples to estimate dispersion with confidence. (PubMed )