
DESeq2 in Python: Beginner's Tutorial with PyDESeq2 for RNA-seq Analysis

Runcell Team

Audience: beginners in bioinformatics who know basic Python / data science
Goal: understand what DESeq2 does (the “real knowledge”), then run it end-to-end in Python using PyDESeq2—without getting lost in setup or platform details.


TL;DR

If you just want “DESeq2 in Python” as a recipe: scroll to the “Minimal Python workflow with PyDESeq2” and the “A compact, reusable template” sections.


Before you start: install PyDESeq2 and organize your data

You'll need a Python environment with PyDESeq2 installed, a raw gene-level count matrix (counts.csv), and a sample metadata table (metadata.csv). Both input files are described below.

Install PyDESeq2 (quick start)

# (recommended) in a fresh conda environment
conda create -n pydeseq2 python=3.11
conda activate pydeseq2

# install from PyPI
pip install pydeseq2

# Optional but handy for this tutorial
pip install matplotlib scikit-learn pandas

(You can also install via Bioconda: conda install -c bioconda pydeseq2.)

What your input files should look like

counts.csv – integer counts, samples in rows, genes in columns:

           GeneA  GeneB  GeneC  ...
sample_1     123     45      0
sample_2      98     30      5
sample_3     150     80      1
...

metadata.csv – sample information, same row names as counts.csv:

           condition  batch
sample_1   treated    batch1
sample_2   control    batch1
sample_3   treated    batch2
...

The row names of metadata must match the row names of counts (sample IDs). This is the most common beginner pitfall.
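If the two files are not already aligned, a quick way to match them up before going further (a minimal pandas sketch; it assumes the file names used above):

import pandas as pd

counts_df = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)

# Keep only the samples present in both tables, in the same order
shared = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[shared]
metadata = metadata.loc[shared]

assert counts_df.index.equals(metadata.index), "sample IDs still do not match"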


Core ideas (in plain language)

Counts, not continuous expression

RNA-seq yields integer read counts per gene per sample. DESeq2 models these with a negative binomial distribution whose variability (“dispersion”) depends on the mean. It pools information across genes to stabilize dispersion and shrinks effect sizes (log2 fold changes, LFCs) toward zero when evidence is weak—reducing false positives and improving ranking. P-values are adjusted with Benjamini–Hochberg FDR. (BioMed Central )
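For reference, the model DESeq2 fits for gene i in sample j (in the notation of the DESeq2 paper) is:

\[
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\mu_{ij} = s_j \, q_{ij}, \qquad
\log_2 q_{ij} = \sum_r x_{jr} \, \beta_{ir}
\]

where K_ij is the observed count, s_j the size factor of sample j, alpha_i the dispersion of gene i, and the beta_ir are the log2 fold changes attached to the design variables.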

Normalization via size factors

Some libraries are sequenced deeper than others. DESeq2’s median-of-ratios method estimates one scaling factor per sample and divides counts by that factor. This handles library size and composition effects without using gene length (gene length is irrelevant for between-sample testing of the same gene). (RDocumentation )
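To make the idea concrete, here is a small, illustrative re-implementation of the median-of-ratios estimator in pandas. This is a sketch for intuition only, not PyDESeq2's internal code, and it assumes the counts_df layout used in this tutorial (samples in rows, genes in columns):

import numpy as np
import pandas as pd

def median_of_ratios_size_factors(counts: pd.DataFrame) -> pd.Series:
    """Per-sample size factors from a (samples x genes) integer count matrix."""
    log_counts = np.log(counts.where(counts > 0))            # zeros become NaN
    # The reference uses only genes with nonzero counts in every sample
    usable = log_counts.notna().all(axis=0)
    log_geo_means = log_counts.loc[:, usable].mean(axis=0)   # per-gene log geometric mean
    log_ratios = log_counts.loc[:, usable].sub(log_geo_means, axis=1)
    return np.exp(log_ratios.median(axis=1))                 # per-sample median of ratios

# Normalized counts (for intuition only; PyDESeq2 does this internally):
# normed = counts_df.div(median_of_ratios_size_factors(counts_df), axis=0)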

Design formulas and contrasts

You declare your experimental design (e.g., ~ batch + condition) and then choose a contrast to ask, “treated vs control, holding batch constant.” The design can include multiple factors (batch, sex, etc.) and interactions when needed. (PyDESeq2 )

A few common designs:

Scenario                        | Design formula                        | What it means
Simple 2-group comparison       | ~ condition                           | treated vs control
Batch-corrected comparison      | ~ batch + condition                   | treated vs control, adjusting for batch
Paired / subject-level samples  | ~ subject + condition                 | before/after per subject, blocking on subject
Time-course with treatment      | ~ condition + time + condition:time   | different time trends per condition

Later, you define a contrast such as ["condition", "treated", "control"] to ask: “Is treated different from control?”
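One detail worth spelling out, since the ordering is easy to get backwards:

# contrast = [factor, tested level, reference level]
contrast = ["condition", "treated", "control"]
# With this ordering, a positive log2FoldChange means higher expression in
# "treated" than in "control"; a negative value means lower.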

Independent filtering & outliers

Very low-count genes rarely show significant changes; DESeq2 performs independent filtering (and flags some adjusted p-values as NA) to improve power. It also uses Cook’s distance to detect count outliers. (Bioconductor )
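You can see the effect of these filters directly in the results table. A small sketch, assuming res = ds.results_df from the workflow below:

# res = ds.results_df   (produced in the workflow below)
n_total = len(res)
n_na = res["padj"].isna().sum()
print(f"{n_na} of {n_total} genes have padj = NA (filtered or flagged as outliers)")

# Filtered genes are typically those with very low counts overall
print(res.loc[res["padj"].isna(), "baseMean"].describe())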


Minimal Python workflow with PyDESeq2

The snippet below focuses on the analysis steps. It assumes you have a count matrix and a metadata table in memory (or load from CSV). Class and method names follow the official PyDESeq2 examples. (PyDESeq2 )

0) Inputs you need

import pandas as pd

# Example: load your own data
counts_df = pd.read_csv("counts.csv", index_col=0)   # rows: samples, cols: genes, values: counts (ints)
metadata = pd.read_csv("metadata.csv", index_col=0)  # rows: samples, cols include e.g. 'condition', 'batch'

# Quick sanity check
print(counts_df.shape, metadata.shape)
print(counts_df.index.equals(metadata.index))  # should be True

Optional pre-filtering (unbiased, based on total counts across samples):

# Drop genes with extremely low overall counts (optional but common)
keep = counts_df.sum(axis=0) >= 10
counts_df = counts_df.loc[:, keep]

1) Build and fit the DESeq2 model

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.default_inference import DefaultInference

# Example design: adjust for batch and test condition
inference = DefaultInference(n_cpus=4)

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~ batch + condition",  # change to your variables
    refit_cooks=True,
    inference=inference,
)

# Fit size factors, dispersions, and GLM coefficients
dds.deseq2()

What just happened: deseq2() estimated one size factor per sample (median-of-ratios normalization), estimated gene-wise dispersions and shrank them toward the fitted trend, and fit a negative binomial GLM per gene to obtain log2 fold change coefficients.

You can inspect pieces in the AnnData-like structure, e.g.:
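A minimal sketch (the exact key names can differ between PyDESeq2 versions, so list the available keys first):

# The fitted DeseqDataSet behaves like an AnnData object
print(dds.obsm.keys())    # per-sample arrays, e.g. the design matrix and size factors
print(dds.varm.keys())    # per-gene arrays, e.g. dispersions and LFCs
print(dds.layers.keys())  # count-shaped matrices, e.g. normalized counts

# Then pull out individual pieces:
print(dds.obsm["design_matrix"].head())  # one row per sample
print(dds.obsm["size_factors"][:5])      # one scaling factor per sample
print(dds.varm["dispersions"][:5])       # one dispersion estimate per gene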


2) Run statistical tests and collect results

# Define the contrast explicitly: variable, test level, reference (control) level
ds = DeseqStats(dds, contrast=["condition", "treated", "control"], inference=inference)

# Run the Wald tests, with default Cook's distance filtering and independent filtering
ds.summary()

# A tidy results table lives here:
res = ds.results_df
# columns include: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

res_sig = res.query("padj < 0.05").sort_values("padj")
res_sig.head()

3) (Optional) Shrink log2 fold changes

Shrunken LFCs are easier to rank/interpret and produce cleaner plots for low-count genes.

# Check the coefficient name to shrink (from the design matrix columns)
print(dds.obsm["design_matrix"].columns)
# e.g., you might see something like: 'condition[T.treated]'

ds.lfc_shrink(coeff="condition[T.treated]")
res_shrunk = ds.results_df

PyDESeq2 applies MAP LFC shrinkage via lfc_shrink. It overwrites LFCs/SEs in ds with shrunken values and sets ds.shrunk_LFCs = True. (PyDESeq2 )


4) Quick visuals

MA plot (built-in)

import matplotlib.pyplot as plt

ds.plot_MA(s=20)  # scatter of mean expression vs LFC; highlights DE genes
plt.show()

PCA for QC (on transformed counts)

Do this on transformed counts (variance-stabilized), not raw counts.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Variance-stabilizing transformation (VST) is available in PyDESeq2
dds.vst()  # stores transformed values
vst = dds.layers["vst_counts"]  # samples x genes

# Basic PCA with scikit-learn
X = np.asarray(vst)  # make sure it's a numeric array
pca = PCA(n_components=2).fit_transform(X)  # PCs across samples

# Plot, color by condition
cond = metadata["condition"].values
plt.scatter(pca[:, 0], pca[:, 1], c=(cond == cond[0]))  # quick demo: two colors
for i, sid in enumerate(metadata.index):
    plt.text(pca[i, 0], pca[i, 1], sid, fontsize=8)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title("VST-PCA by condition")
plt.show()

VST (or rlog in R) stabilizes variance across the mean so distance-based QC (PCA, clustering) behaves well. In PyDESeq2 the transformed matrix is stored in dds.layers["vst_counts"]. (PyDESeq2 )
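Besides PCA, a sample-to-sample correlation heatmap of the VST matrix is another common QC view. A minimal matplotlib sketch, reusing vst and metadata from the snippet above:

import numpy as np
import matplotlib.pyplot as plt

# Pairwise sample correlations on VST values (samples x genes)
corr = np.corrcoef(np.asarray(vst))  # samples x samples

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr, cmap="viridis")
ax.set_xticks(range(len(metadata.index)))
ax.set_yticks(range(len(metadata.index)))
ax.set_xticklabels(metadata.index, rotation=90, fontsize=7)
ax.set_yticklabels(metadata.index, fontsize=7)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Sample-to-sample correlation (VST)")
plt.tight_layout()
plt.show()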


Interpreting the results (without hype)

baseMean: the average of normalized counts across all samples; very low values carry little statistical power.
log2FoldChange: the effect size for your contrast (positive means higher in the test level, "treated" here); after lfc_shrink these are shrunken estimates meant for ranking and plotting.
pvalue and padj: the Wald test p-value and its Benjamini–Hochberg adjusted version; filter and report on padj, not pvalue.
padj = NA: the gene was removed by independent filtering or flagged as a count outlier, not a sign that the pipeline failed.

Common pitfalls (and how to avoid them)

  1. Using TPM/FPKM as input. Don't. DESeq2 expects raw integer counts. Use tools like tximport (R) or generate gene-level count matrices directly (e.g., featureCounts, Salmon quant → summarize to counts); a quick integer check is sketched just after this list. (rdrr.io)
  2. Too few biological replicates. You need replicates to estimate dispersion; more is better. Power studies suggest ≥6 per group for robust detection in many settings (context-dependent; aim higher when effects are subtle). (PubMed )
  3. Forgetting batch or pairing. If you ignore known structure (batch, subject), you may call spurious DE. Model it in the design. (Bioconductor )
  4. Over-filtering or biased filtering. If you prefilter, do it unbiasedly (e.g., total count across samples) and let DESeq2’s independent filtering do its job. (Griffith Lab )
  5. Misreading coefficient names. Before LFC shrinkage or custom contrasts, print dds.obsm["design_matrix"].columns so you shrink/test the intended coefficient. (PyDESeq2 )
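Following up on pitfall 1, a quick way to catch non-integer input before fitting (a small sketch, assuming the counts_df loaded earlier):

import numpy as np

# Raw counts should be non-negative integers; TPM/FPKM or other normalized
# values will typically fail this check
values = counts_df.to_numpy()
if (values < 0).any() or not np.allclose(values, np.round(values)):
    raise ValueError("counts_df does not look like a raw integer count matrix")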

A compact, reusable template

from pathlib import Path

import pandas as pd

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.default_inference import DefaultInference

# 1) Load data
counts = pd.read_csv("counts.csv", index_col=0)
meta = pd.read_csv("metadata.csv", index_col=0)

# Optional: unbiased low-count filter
counts = counts.loc[:, counts.sum(axis=0) >= 10]

# 2) Fit model
dds = DeseqDataSet(
    counts=counts,
    metadata=meta,
    design="~ batch + condition",
    refit_cooks=True,
    inference=DefaultInference(n_cpus=4),
)
dds.deseq2()

# 3) Stats for your contrast
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()  # runs Wald tests + multiple testing + filters
res = ds.results_df.copy()  # copy, so the raw-LFC table survives the shrinkage step below

# 4) Optional LFC shrinkage for plotting and ranking
# print(dds.obsm["design_matrix"].columns)
ds.lfc_shrink(coeff="condition[T.treated]")
res_shrunk = ds.results_df

# 5) Save results
Path("out").mkdir(exist_ok=True)
res.to_csv("out/deseq2_results.csv")                # raw LFCs
res_shrunk.to_csv("out/deseq2_results_shrunk.csv")  # shrunken LFCs

Everything in this minimal pipeline follows the PyDESeq2 examples and API (e.g., DeseqDataSet, deseq2(), DeseqStats, results_df, lfc_shrink, design matrix in dds.obsm). (PyDESeq2 )


FAQ for beginners

Q: Can I do all this without R?
A: Yes. PyDESeq2 re-implements the DESeq2 pipeline in Python and exposes a consistent API. You still need count matrices as input. (PyDESeq2)

Q: What if I have more than two groups or a time course?
A: Use a multi-factor design (e.g., ~ batch + condition + time + condition:time) and pick appropriate contrasts. (In R/DESeq2, the likelihood ratio test is often used for multi-level/time-course hypotheses; PyDESeq2 examples emphasize Wald tests and threshold-based variants.) (HBC Training)
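As a rough sketch of what a time-course design could look like in PyDESeq2 (assuming your metadata has condition and time columns and that your PyDESeq2 version accepts formula-style designs with interaction terms; check the design matrix to find the coefficient names before testing them):

dds_tc = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,  # must contain 'condition' and 'time' columns
    design="~ batch + condition + time + condition:time",
)
dds_tc.deseq2()

# The condition:time coefficients capture condition-specific time trends;
# print the design matrix to see their exact names.
print(dds_tc.obsm["design_matrix"].columns)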

Q: Should I use rlog or VST?
A: For QC visualization (PCA/heatmaps), use VST (or rlog in R). Don't feed transformed counts into the DE tests; use raw counts for modeling. PyDESeq2 provides dds.vst() and stores results in dds.layers["vst_counts"]. (PyDESeq2)

Q: How many replicates do I need?
A: There's no universal number, but at least ~6 per group is a common target for robust discovery; plan higher for subtle effects. (PubMed)


Key terms (one-liners)

Size factor: per-sample scaling factor from the median-of-ratios method; corrects for sequencing depth and composition.
Dispersion: the per-gene variability parameter of the negative binomial model, stabilized by pooling information across genes.
Log2 fold change (LFC): the effect size of a contrast on the log2 scale; optionally shrunk toward zero for noisy, low-count genes.
Contrast: the specific comparison being tested, e.g. ["condition", "treated", "control"].
Design formula: the model of the experiment, e.g. ~ batch + condition.
padj: the Benjamini–Hochberg adjusted p-value (FDR); the quantity to threshold.
VST: variance-stabilizing transformation of counts, used for QC plots such as PCA, not for DE testing.
Independent filtering: removal of low-information genes before multiple testing to improve power; shows up as padj = NA.


Takeaway

If you remember only three things:

  1. Use raw integer counts + a thoughtful design (include known covariates). (rdrr.io )
  2. Trust the model’s normalization and shrinkage—don’t pre-normalize with TPM/FPKM for DE testing. (RDocumentation )
  3. Replicates matter far more than fancy plotting. Aim for enough samples to estimate dispersion with confidence. (PubMed )