Differential Expression Analysis: Complete Guide with R and Python Code

Runcell Team,11/6/2025

Differential expression (DE) analysis asks a simple question: which genes change their expression between conditions? The devil, as usual, is in the details—experimental design, normalization, dispersion modeling, multiple testing, and interpretation. This post gives you a clear, practical path through bulk RNA‑seq (and what changes for single‑cell), with concise definitions and ready‑to‑run code in R (DESeq2/edgeR/limma‑voom) and Python (Scanpy).

TL;DR

Plan the design (replicates, covariates, contrasts).
Get counts (gene × sample matrix) and metadata (sample table).
Filter low counts; normalize library size; model counts (NB).
Test contrasts, control FDR, shrink log2FCs.
QC and visualize (PCA, MA, volcano).
Interpret with gene set enrichment.
For single‑cell, prefer pseudobulk across biological replicates, or use cell‑level tests with care.

Key terms you’ll actually use

Feature: usually a gene; can be a transcript or exon.
Counts: integer read counts per feature per sample.
Library size: total counts per sample (sequencing depth).
Normalization: corrects sample‑level scaling (e.g., size factors, TMM).
Dispersion (α): extra‑Poisson variability modeled by Negative Binomial (NB).
Design matrix: encodes your experimental design (e.g., ~ batch + condition).
Contrast: the specific comparison you test (e.g., treated vs control).
log2 fold change (log2FC): magnitude of effect; often shrunk to stabilize.
p‑value / adjusted p‑value (FDR): significance after multiple testing control.
Batch effect: unwanted variation (date, kit, lane, donor) you model as covariates.
Pseudobulk (scRNA‑seq): aggregate counts per sample × cell type to use bulk methods.

0) Experimental design first (don’t skip!)

Replicates: ≥3 biological replicates per group is a practical minimum.
Covariates: record anything systematic (batch/date/donor/sex/RIN). Add them to the model.
Factorial designs: interactions test whether an effect differs by another factor e.g., ~ genotype * treatment lets you examine genotype:treatment interaction.
Paired designs: repeated measures? Use ~ subject + condition (block on subject).

1) From FASTQ to counts (what you need to know)

Alignment-based: STAR/Hisat2 → featureCounts/HTSeq to gene counts.
Alignment-free: Salmon/Kallisto (quasi‑mapping) → transcript quantification; sum to genes via tximport.
Do not use TPM/FPKM directly for DE testing—use raw counts (normalized internally by the method).

2) Bulk RNA‑seq with DESeq2 (R)

Below is a lean, reliable template using the Bioconductor airway dataset (glucocorticoid treatment). It shows filtering, modeling with a batch covariate, contrasts, LFC shrinkage, and common plots.

Install once (uncomment if needed):


# install.packages("BiocManager")
# BiocManager::install(c("DESeq2", "airway", "apeglm"))
# install.packages("ggplot2")


library(DESeq2)
library(airway)
library(ggplot2)
 
# 1) Load example counts + metadata
data(airway)
se <- airway
# Factors: cell line (~batch-like) and dex (treated vs untrt)
se$dex  <- relevel(se$dex, ref = "untrt")
 
# 2) Build DESeqDataSet, filter very low counts
dds <- DESeqDataSet(se, design = ~ cell + dex)
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]
 
# 3) Fit model (NB with empirical dispersion)
dds <- DESeq(dds)
 
# 4) Extract results for the contrast of interest
res <- results(dds, contrast = c("dex", "trt", "untrt"))  # treated vs control
res <- lfcShrink(dds, coef = "dex_trt_vs_untrt", type = "apeglm")  # effect-size shrinkage
 
# 5) QC & visualization
vsd <- vst(dds, blind = FALSE)
 
# PCA colored by condition
plotPCA(vsd, intgroup = "dex")
 
# MA plot of shrunk effects
plotMA(res, ylim = c(-4,4))
 
# Volcano with ggplot2
res_df <- as.data.frame(res)
res_df$neglog10padj <- -log10(res_df$padj)
res_df$hit <- with(res_df, !is.na(padj) & padj < 0.05 & abs(log2FoldChange) > 1)
 
ggplot(res_df, aes(x = log2FoldChange, y = neglog10padj)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = -log10(0.05), linetype = 2) +
  geom_vline(xintercept = c(-1, 1), linetype = 2) +
  labs(x = "log2 fold change (shrunk)", y = "-log10(FDR)", title = "Volcano plot")
 
# 6) Tidy significant table
sig <- subset(res_df, padj < 0.05)
sig <- sig[order(sig$padj), c("log2FoldChange","lfcSE","stat","pvalue","padj")]
head(sig, 10)

Notes on this workflow

design = ~ cell + dex treats cell line as a batch‑like covariate to reduce false positives.
vst (or rlog) is for visualization/QC, not for the DE fit itself.
lfcShrink(..., type="apeglm") gives more stable effect sizes for ranking and plotting.

3) Alternative bulk frameworks

edgeR (R, quasi‑likelihood GLM; great for multifactor designs)


# install.packages("BiocManager"); BiocManager::install("edgeR")
library(edgeR)
 
counts_mat <- counts(dds)         # reuse counts from above for illustration
group <- colData(dds)$dex
batch <- colData(dds)$cell
 
y <- DGEList(counts = counts_mat)
keep <- filterByExpr(y, group = group)  # adaptive low-count filter
y <- y[keep,, keep.lib.sizes=FALSE]
y <- calcNormFactors(y, method = "TMM")
 
design <- model.matrix(~ batch + group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "grouptrt")   # treated vs control
 
topTags(qlf, n = 10)

limma‑voom (R, fast and flexible; works exceptionally well with many samples)


# BiocManager::install(c("limma"))
library(limma)
 
v <- voom(counts_mat, design = design)  # uses the same 'design' from edgeR block
fit <- lmFit(v, design)
fit <- eBayes(fit)
topTable(fit, coef = "grouptrt", number = 10)

4) Going beyond a simple two‑group comparison

Multiple groups: specify design = ~ batch + condition and test pairwise or custom contrasts.
Interactions: design = ~ batch + genotype * treatment then test the interaction term (e.g., results(..., name = "genotypeB.treatmenttreated") in DESeq2).
Paired: design = ~ subject + condition; the contrast tests within‑subject changes.
Time course: model time as a factor or spline; in limma you can use duplicateCorrelation for repeated measures.

5) Common QC you should always look at

Library sizes & complexity (barplots; proportion of reads mapped).
Sample‑to‑sample distances (Euclidean on VST/rlog; heatmap).
PCA/UMAP on transformed counts—should separate by biology, not by batch.
Mean–variance trend (voom plot) to ensure modeling assumptions are OK.


# Sample distance heatmap (VST) example
library(pheatmap)
d <- dist(t(assay(vsd)))
pheatmap(as.matrix(d), clustering_distance_rows = d, clustering_distance_cols = d)

6) Single‑cell RNA‑seq: do it right

Two solid strategies

Pseudobulk (recommended when you have replicates) Aggregate counts per sample × cell type (or cluster), then run bulk DE (DESeq2/edgeR/limma).

Captures between‑sample variability → valid p‑values.
Easy to add covariates (donor, batch).
Use the same steps as bulk, just do them per cell type.

Sketch in R (using a SingleCellExperiment sce):


# BiocManager::install(c("SingleCellExperiment","scuttle"))
library(SingleCellExperiment); library(scuttle)
# ids says how to group cells; here by sample and cell_type
pb <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "cell_type")])
# pb assays now contain summed counts; build DE model per cell_type or with interaction

Cell‑level tests (when pseudobulk is impossible) Use methods like Wilcoxon rank‑sum (Scanpy), or model‑based (MAST). Control for library size and latent factors.

Scanpy example (Python, cluster markers):


import scanpy as sc
adata = sc.read_h5ad("my_data.h5ad")         # adata.X should be counts or use adata.layers['counts']
sc.pp.normalize_total(adata, target_sum=1e4) # for ranking/visualization
sc.pp.log1p(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20)

Condition DE within a cluster:


cluster = "2"
ad = adata[adata.obs["leiden"] == cluster].copy()
sc.tl.rank_genes_groups(ad, groupby="condition", groups=["treated"], reference="control", method="wilcoxon")
sc.get.rank_genes_groups_df(ad, group="treated").head()

Tip: If you have multiple donors per condition, pseudobulk generally yields the most reliable inferences for scRNA‑seq DE because it respects biological replication.

7) Multiple testing, thresholds, and effect sizes

Report both FDR (e.g., padj < 0.05) and effect size (|log2FC| ≥ 1 is a common, not universal, threshold).
Prefer shrunk LFCs (DESeq2: lfcShrink) for ranking/plotting; use unshrunk for exact algebraic contrasts if needed.
Avoid hard CPM/TPM cutoffs after modeling; do expression filtering before fitting using robust rules (e.g., edgeR filterByExpr).

8) Gene set enrichment (quick start)

Once you have a ranked list, run GSEA/ORA against pathways (GO/KEGG/MSigDB). In R, fgsea works well:


# BiocManager::install("fgsea")
library(fgsea)
# ranks: named vector of gene scores (e.g., signed -log10 p; or log2FC)
ranks <- res_df$stat; names(ranks) <- rownames(res_df)
# pathways <- your list of gene sets as a named list
fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize=500)
fgseaRes[order(fgseaRes$pval),][1:10, c("pathway","NES","padj")]

9) Frequent pitfalls & how to dodge them

No replicates → you can’t estimate dispersion well. Get replicates or treat as exploratory only.
Batch‑confounded design (all cases sequenced on one lane, all controls on another) → you cannot disentangle effects.
Using TPM/FPKM for DE tests → don’t. Provide raw counts to DE tools; they handle normalization.
Filtering after testing → inflates false positives. Filter low expression before modeling.
Over‑interpreting p‑values → focus on biological effect sizes and pathway context.
Mixing technical with biological replicates → do not average technical replicates; sum counts or include them as nested factors.

10) Handy snippets (copy/paste)

DESeq2 custom contrast (multi‑group):


# Suppose condition has levels: ctrl, drugA, drugB
res_AB <- results(dds, contrast = c("condition", "drugA", "drugB"))

DESeq2 interaction test (does treatment effect differ by genotype?):


# design = ~ batch + genotype * treatment
# Test the interaction term directly (naming may vary by factor levels)
results(dds, name = "genotypeKO.treatmenttreated")

tximport for Salmon/Kallisto → gene‑level counts:


# BiocManager::install(c("tximport","readr"))
library(tximport); library(readr)
samples <- read_csv("samples.csv") # columns: sample, condition, path_to_quant
files <- setNames(samples$path_to_quant, samples$sample)
tx2gene <- read_csv("tx2gene.csv") # columns: transcript, gene
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ batch + condition)
dds <- DESeq(dds)

edgeR user‑defined contrasts (limma-style):


library(limma); library(edgeR)
design <- model.matrix(~ 0 + condition)         # no intercept
colnames(design) <- levels(condition)
y <- estimateDisp(DGEList(counts), design)
fit <- glmQLFit(y, design)
contr <- makeContrasts(DrugA_vs_Ctrl = DrugA - Ctrl, levels = design)
qlf <- glmQLFTest(fit, contrast = contr[,"DrugA_vs_Ctrl"])
topTags(qlf)

11) A short checklist you can tape next to your monitor

Do I have ≥3 biological reps per group?
Did I record batch/covariates and include them in the design?
Did I filter lowly expressed genes before modeling?
Did I use raw counts for the DE fit (not TPM/FPKM)?
Did I check QC (library size, PCA, sample distances)?
Are my contrasts answering the scientific question?
Did I report FDR and effect sizes (shrunk log2FC)?
Did I perform pathway analysis for biological context?
Are my conclusions robust to batch and confounding?

Closing thoughts

Differential expression analysis is conceptually simple but statistically rich. If you (1) plan your design, (2) model with the correct likelihood (NB for counts), (3) control FDR, and (4) keep an eye on effect sizes and QC plots, you will produce results that are both statistically defensible and biologically interpretable. The code above is a practical starting point you can adapt to bulk and single‑cell projects alike.

If you’d like, tell me your organism, sample size, and specific question—I can tailor the design formula and code (including pseudobulk templates) to your dataset.