AboutBlogMediaTags

Differential Expression Analysis: Complete Guide with R and Python Code

Runcell Team,

Differential expression (DE) analysis asks a simple question: which genes change their expression between conditions? The devil, as usual, is in the details—experimental design, normalization, dispersion modeling, multiple testing, and interpretation. This post gives you a clear, practical path through bulk RNA‑seq (and what changes for single‑cell), with concise definitions and ready‑to‑run code in R (DESeq2/edgeR/limma‑voom) and Python (Scanpy).


TL;DR

  1. Plan the design (replicates, covariates, contrasts).
  2. Get counts (gene × sample matrix) and metadata (sample table).
  3. Filter low counts; normalize library size; model counts (NB).
  4. Test contrasts, control FDR, shrink log2FCs.
  5. QC and visualize (PCA, MA, volcano).
  6. Interpret with gene set enrichment.
  7. For single‑cell, prefer pseudobulk across biological replicates, or use cell‑level tests with care.

Key terms you’ll actually use


0) Experimental design first (don’t skip!)


1) From FASTQ to counts (what you need to know)


2) Bulk RNA‑seq with DESeq2 (R)

Below is a lean, reliable template using the Bioconductor airway dataset (glucocorticoid treatment). It shows filtering, modeling with a batch covariate, contrasts, LFC shrinkage, and common plots.

Install once (uncomment if needed):

# install.packages("BiocManager") # BiocManager::install(c("DESeq2", "airway", "apeglm")) # install.packages("ggplot2")
library(DESeq2) library(airway) library(ggplot2) # 1) Load example counts + metadata data(airway) se <- airway # Factors: cell line (~batch-like) and dex (treated vs untrt) se$dex <- relevel(se$dex, ref = "untrt") # 2) Build DESeqDataSet, filter very low counts dds <- DESeqDataSet(se, design = ~ cell + dex) keep <- rowSums(counts(dds) >= 10) >= 3 dds <- dds[keep,] # 3) Fit model (NB with empirical dispersion) dds <- DESeq(dds) # 4) Extract results for the contrast of interest res <- results(dds, contrast = c("dex", "trt", "untrt")) # treated vs control res <- lfcShrink(dds, coef = "dex_trt_vs_untrt", type = "apeglm") # effect-size shrinkage # 5) QC & visualization vsd <- vst(dds, blind = FALSE) # PCA colored by condition plotPCA(vsd, intgroup = "dex") # MA plot of shrunk effects plotMA(res, ylim = c(-4,4)) # Volcano with ggplot2 res_df <- as.data.frame(res) res_df$neglog10padj <- -log10(res_df$padj) res_df$hit <- with(res_df, !is.na(padj) & padj < 0.05 & abs(log2FoldChange) > 1) ggplot(res_df, aes(x = log2FoldChange, y = neglog10padj)) + geom_point(alpha = 0.4) + geom_hline(yintercept = -log10(0.05), linetype = 2) + geom_vline(xintercept = c(-1, 1), linetype = 2) + labs(x = "log2 fold change (shrunk)", y = "-log10(FDR)", title = "Volcano plot") # 6) Tidy significant table sig <- subset(res_df, padj < 0.05) sig <- sig[order(sig$padj), c("log2FoldChange","lfcSE","stat","pvalue","padj")] head(sig, 10)

Notes on this workflow


3) Alternative bulk frameworks

edgeR (R, quasi‑likelihood GLM; great for multifactor designs)

# install.packages("BiocManager"); BiocManager::install("edgeR") library(edgeR) counts_mat <- counts(dds) # reuse counts from above for illustration group <- colData(dds)$dex batch <- colData(dds)$cell y <- DGEList(counts = counts_mat) keep <- filterByExpr(y, group = group) # adaptive low-count filter y <- y[keep,, keep.lib.sizes=FALSE] y <- calcNormFactors(y, method = "TMM") design <- model.matrix(~ batch + group) y <- estimateDisp(y, design) fit <- glmQLFit(y, design) qlf <- glmQLFTest(fit, coef = "grouptrt") # treated vs control topTags(qlf, n = 10)

limma‑voom (R, fast and flexible; works exceptionally well with many samples)

# BiocManager::install(c("limma")) library(limma) v <- voom(counts_mat, design = design) # uses the same 'design' from edgeR block fit <- lmFit(v, design) fit <- eBayes(fit) topTable(fit, coef = "grouptrt", number = 10)

4) Going beyond a simple two‑group comparison


5) Common QC you should always look at

# Sample distance heatmap (VST) example library(pheatmap) d <- dist(t(assay(vsd))) pheatmap(as.matrix(d), clustering_distance_rows = d, clustering_distance_cols = d)

6) Single‑cell RNA‑seq: do it right

Two solid strategies

  1. Pseudobulk (recommended when you have replicates) Aggregate counts per sample × cell type (or cluster), then run bulk DE (DESeq2/edgeR/limma).

    • Captures between‑sample variability → valid p‑values.
    • Easy to add covariates (donor, batch).
    • Use the same steps as bulk, just do them per cell type.

    Sketch in R (using a SingleCellExperiment sce):

    # BiocManager::install(c("SingleCellExperiment","scuttle")) library(SingleCellExperiment); library(scuttle) # ids says how to group cells; here by sample and cell_type pb <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "cell_type")]) # pb assays now contain summed counts; build DE model per cell_type or with interaction
  2. Cell‑level tests (when pseudobulk is impossible) Use methods like Wilcoxon rank‑sum (Scanpy), or model‑based (MAST). Control for library size and latent factors.

    Scanpy example (Python, cluster markers):

    import scanpy as sc adata = sc.read_h5ad("my_data.h5ad") # adata.X should be counts or use adata.layers['counts'] sc.pp.normalize_total(adata, target_sum=1e4) # for ranking/visualization sc.pp.log1p(adata) sc.tl.leiden(adata, resolution=0.5) sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon") sc.pl.rank_genes_groups(adata, n_genes=20)

    Condition DE within a cluster:

    cluster = "2" ad = adata[adata.obs["leiden"] == cluster].copy() sc.tl.rank_genes_groups(ad, groupby="condition", groups=["treated"], reference="control", method="wilcoxon") sc.get.rank_genes_groups_df(ad, group="treated").head()

Tip: If you have multiple donors per condition, pseudobulk generally yields the most reliable inferences for scRNA‑seq DE because it respects biological replication.


7) Multiple testing, thresholds, and effect sizes


8) Gene set enrichment (quick start)

Once you have a ranked list, run GSEA/ORA against pathways (GO/KEGG/MSigDB). In R, fgsea works well:

# BiocManager::install("fgsea") library(fgsea) # ranks: named vector of gene scores (e.g., signed -log10 p; or log2FC) ranks <- res_df$stat; names(ranks) <- rownames(res_df) # pathways <- your list of gene sets as a named list fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize=500) fgseaRes[order(fgseaRes$pval),][1:10, c("pathway","NES","padj")]

9) Frequent pitfalls & how to dodge them


10) Handy snippets (copy/paste)

DESeq2 custom contrast (multi‑group):

# Suppose condition has levels: ctrl, drugA, drugB res_AB <- results(dds, contrast = c("condition", "drugA", "drugB"))

DESeq2 interaction test (does treatment effect differ by genotype?):

# design = ~ batch + genotype * treatment # Test the interaction term directly (naming may vary by factor levels) results(dds, name = "genotypeKO.treatmenttreated")

tximport for Salmon/Kallisto → gene‑level counts:

# BiocManager::install(c("tximport","readr")) library(tximport); library(readr) samples <- read_csv("samples.csv") # columns: sample, condition, path_to_quant files <- setNames(samples$path_to_quant, samples$sample) tx2gene <- read_csv("tx2gene.csv") # columns: transcript, gene txi <- tximport(files, type = "salmon", tx2gene = tx2gene) dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ batch + condition) dds <- DESeq(dds)

edgeR user‑defined contrasts (limma-style):

library(limma); library(edgeR) design <- model.matrix(~ 0 + condition) # no intercept colnames(design) <- levels(condition) y <- estimateDisp(DGEList(counts), design) fit <- glmQLFit(y, design) contr <- makeContrasts(DrugA_vs_Ctrl = DrugA - Ctrl, levels = design) qlf <- glmQLFTest(fit, contrast = contr[,"DrugA_vs_Ctrl"]) topTags(qlf)

11) A short checklist you can tape next to your monitor


Closing thoughts

Differential expression analysis is conceptually simple but statistically rich. If you (1) plan your design, (2) model with the correct likelihood (NB for counts), (3) control FDR, and (4) keep an eye on effect sizes and QC plots, you will produce results that are both statistically defensible and biologically interpretable. The code above is a practical starting point you can adapt to bulk and single‑cell projects alike.

If you’d like, tell me your organism, sample size, and specific question—I can tailor the design formula and code (including pseudobulk templates) to your dataset.

© Runcell.RSS