Preloader

Confronting false discoveries in single-cell differential expression

Literature review

To identify which statistical methods for DE analysis have been most commonly used within the field, we conducted an extensive literature review. We annotated the statistical method used to perform DE analysis across experimental conditions within cell types for each publication included in a large, curated database of scRNA-seq studies37. The database was accessed on November 4, 2020. Because the single-cell studies catalogued in this database span a long period of time, and we aimed to establish which methods for DE analysis are currently in wide use, we limited our analysis to the 500 most recently published studies. Accordingly, the inclusion criteria for our review were (i) studies present in the curated database as of November 4, 2020, and (i) studies within the 500 most recent entries in this database at the time it was accessed. Each of these 500 studies were then manually reviewed to determine the statistical methodology used to compare cells of the same type between experimental conditions. We did not annotate methods used to identify genes differentially expressed between cell types (i.e., marker gene identification), as this problem presents a distinct set of statistical challenges10,38. In total, 205 of the 500 studies conducted DE analysis between biological conditions. The complete list of all 500 studies is provided as Source Data.

Ground-truth datasets

Previous benchmarks of DE analysis methods for single-cell transcriptomics have relied heavily on simulated data, or else have compared the results of different methods in scenarios where no ground truth was available10,17. We reasoned that the best possible approximation to the biological ground truth in a scRNA-seq experiment would consist of a matched bulk RNA-seq dataset in the same purified cell type, exposed to the same perturbation under identical experimental conditions, and sequenced in the same laboratory. We surveyed the literature to identify such matching single-cell and bulk RNA-seq datasets, which led us to compile a resource of eighteen ground truth datasets from four publications12,13,14,15. Datasets of mouse, rat, pig, and rabbit bone marrow-derived mononuclear phagocytes stimulated with either lipopolysaccharide or poly-I:C for 4 h were obtained from Hagai et al.12 Datasets of naive or memory T cells stimulated for 5 d with anti-CD3/anti-CD28 coated beads in the presence or absence of various combinations of cytokines (Th0: anti-CD3/anti-CD28 alone; Th2: IL-4, anti-IFNγ; Th17: TGFβ, IL6, IL23, IL1β, anti-IFNγ, anti-IL4; iTreg: TGFβ, IL2) were obtained from Cano-Gamez et al.13 We additionally obtained label-free quantitative proteomics data for the same comparisons from this study. Datasets of alveolar macrophages and type II pneumocytes from young (3 m) and old (24 m) mice were obtained from Angelidis et al.14 Datasets of alveolar macrophages and type II pneumocytes from patients with pulmonary fibrosis and control individuals were obtained from Reyfman et al.15

Differential expression analysis methods

We compared fourteen statistical methods for DE analysis of single-cell transcriptomics data on their ability to recover ground-truth patterns of DE, as established through bulk RNA-seq analysis of matching cell populations. These fourteen methods comprised seven statistical tests that compared gene expression in individual cells (“single-cell methods”); six tests that aggregated cells within a biological replicate to form pseudobulks before performing statistical analysis (“pseudobulk methods”); and a linear mixed model.

The seven single-cell methods analyzed here included a t-test, a Wilcoxon rank-sum test, logistic regression39, negative binomial and Poisson generalized linear models, a likelihood ratio test40, and the two-part hurdle model implemented by MAST7. The implementation provided in the Seurat function ‘FindMarkers’ was used for all seven tests, with all filters (‘min.pct’, ‘min.cells.feature’, and ‘logfc.threshold’) disabled. In addition, we implemented a linear mixed model within Seurat, using the ‘lmerTest’ R package to optimize the restricted maximum likelihood and obtain p-values from the Satterthwaite approximation for degrees of freedom. We observed that some statistical tests returned a large number of p-values below the double precision limit in R (approximately 2 × 10–308), potentially confounding the calculation of the concordance metrics described below. To avoid this pitfall, we modified the Seurat implementation to also return the value of the test statistic from which the p-value was derived. The modified version of Seurat 3.1.5 used to perform all single-cell DE analyses reported in this study is available from http://github.com/jordansquair/Seurat.

The pseudobulk methods employed the DESeq241, edgeR29, and limma42 packages for analysis of aggregated read counts. Briefly, for cells of a given type, we first aggregated reads across biological replicates, transforming a genes-by-cells matrix to a genes-by-replicates matrix using matrix multiplication. For DESeq2, we used both a Wald test of the negative binomial model coefficients (DESeq2-Wald) as well as a likelihood ratio test compared to a reduced model (DESeq2-LRT) to compute the statistical significance. For edgeR, we used both the likelihood ratio test (edgeR-LRT)43 as well as the quasi-likelihood F-test approach (edgeR-QLF)44. For limma, we compared two modes: limma-trend, which incorporates the mean-variance trend into the empirical Bayes procedure at the gene level, and voom (limma-voom), which incorporates the mean-variance trend by assigning a weight to each individual observation45. Log-transformed counts per million values computed by edgeR were provided as input to limma-trend.

DE analysis of bulk RNA-seq datasets was performed with six methods (DESeq2-LRT, DESeq2-Wald, edgeR-LRT, edgeR-QLF, limma-trend, and limma-voom), except for the two pulmonary fibrosis datasets15; for these datasets, the raw bulk RNA-seq data from sorted cells could not be obtained, so only the results of the bulk DE analysis performed by the authors of the original publication were used. The AUCC and rank correlation were calculated for each bulk DE analysis method separately, and subsequently averaged over all six methods. DE analysis of normalized bulk proteomics data was performed using the moderated t-test implemented within limma, as in the original publication.

Measuring concordance between single-cell and bulk RNA-seq

To evaluate the concordance between DE analyses of matched single-cell and bulk RNA-seq data, we computed two metrics, designed to evaluate the concordance between only the most highly ranked subset of DE genes and across the entire transcriptome, respectively. To calculate the first of these metrics, the area under the concordance curve (AUCC)16,17, we ranked genes in both the single-cell and bulk datasets in descending order by the statistical significance of their differential expression. Then, we created lists of the top-ranked genes in each dataset of matching size, up to some maximum size k. For each of these lists (that is, for the top-1 genes, top-2 genes, top-3 genes, and so on), we computed the size of the intersection between the single-cell and bulk DE genes. This procedure yielded a curve relating the number of shared genes between datasets to the number of top-ranked genes considered. The area under this curve was computed by summing the size of all intersections, and normalized to the range [0, 1] by dividing it by its maximum possible value, k × (k + 1) / 2. To evaluate the concordance of DE analysis, we used k = 500 except where otherwise noted, but found our results were insensitive to the precise value of k. To compute the second metric, the transcriptome-wide rank correlation, we multiplied the absolute value of the test statistic for each gene by the sign of its log-fold change between conditions, and then computed the Spearman correlation over genes between the single-cell and bulk datasets.

In addition to evaluating the consistency of DE analyses at the gene level, we also asked whether each DE method yielded broader patterns of functional enrichment that were similar between the single-cell and bulk datasets, allowing for some divergence in the rankings of individual genes. To address this question, we performed gene set enrichment analysis46 using the ‘fgsea’ R package47. GO term annotations for human and mouse (2019-12-09 release) were obtained from the Gene Ontology Consortium website. GO terms annotated to less than 10 genes or more than 1,000 genes within each dataset were excluded in order to mitigate the influence of very specific or very broad terms. Genes were ranked in descending order by the absolute value of the test statistic, and 106 permutations were performed. To evaluate the concordance of GO term enrichment, we used k = 100, on the basis that fewer top-ranked GO terms are generally of interest than are top-ranked genes.

Impact of mean expression

We initially hypothesized that differences between single-cell DE analysis methods could be attributed to their differing sensitivities towards lowly expressed genes. To explore this hypothesis, we performed the following analyses. First, we divided genes from the eighteen gold standard datasets into three equally sized bins on the basis of their mean expression, then re-calculated the AUCC as described above within each bin separately. Second, we inspected the properties of genes falsely called as DE in the single-cell data (false positives) or incorrectly inferred to be unchanging in the single-cell data (false negatives). To identify false positive genes, we used the bulk DE analysis to exclude genes called as DE at a false discovery rate of 10% from the matched single-cell results, then retained the 100 top-ranked remaining genes in the single-cell data. To identify false negative genes, we used the bulk DE analysis to identify genes called as DE at a false discovery rate of 10%, but with a false discovery rate exceeding 10% in the matched single-cell results, again retaining the 100 top-ranked such genes. For each of these genes, we computed both the mean expression level and the proportion of zero gene expression measurements. Third, we analyzed a Smart-seq2 dataset of human dermal fibroblasts stimulated with interferon-β, in which a mixture of synthetic RNAs was spiked into each individual cell12. We performed DE analysis on the synthetic spike-ins, then calculated the Spearman correlation between the mean expression level of each spike-in and the statistical significance of differential expression, as assigned by each single-cell DE method. Fourth, we assembled a compendium of 46 published scRNA-seq datasets, and asked whether the genes called as DE by each method tended to be more or less highly expressed across the entire compendium. Complete details on the preprocessing of these 46 datasets are provided below. Because each of these datasets were sequenced to different depths, and captured different total numbers of genes (depending on both the sequencing depth and the biological system under study), mean expression values were not directly comparable across datasets. To enable such a comparison, we first calculated the mean expression for each gene, then converted this value into the quantile of mean expression using the empirical cumulative distribution function. We then calculated the mean expression quantile of the 200 top-ranked genes from each method in each of the 46 datasets.

Dissecting pseudobulk DE methods

To understand the principles underlying the improved performance of the six pseudobulk DE methods, we performed the following analyses. First, we disabled the aggregation procedure that led to the creation of pseudobulks (that is, we treated each individual cell as its own replicate), then performed an identical DE analysis of individual cells. For each DE method, we then re-calculated both the AUCC and the bias towards highly expressed genes, as quantified by (i) the rank correlation to mean-spike in expression, and (ii) the expression quantile across 46 scRNA-seq datasets. Second, we aggregated random groups of cells into ‘pseudo-replicates’ by randomizing the replicate associated with each cell. We then again re-calculated both the AUCC and the bias towards highly expressed genes.

These experiments led us to suspect that discarding information about the inherent variability of biological replicates caused both the bias towards highly expressed genes and the attendant decrease in performance. To test this hypothesis, we compared the variance of gene expression in pseudobulks and pseudo-replicates. For each gene, we calculated the difference in variance (∆-variance) between pseudobulks and pseudo-replicates. We initially visualized the ∆-variance in an exemplary dataset, consisting of mouse bone marrow mononuclear cells stimulated with poly-I:C12. Subsequently, we calculated the mean ∆-variance across all genes in each of the 46 datasets in our scRNA-seq compendium, observing a decrease in the variance in all 46 cases. To clarify the relationship between the ∆-variance and mean gene expression, we computed the correlation between ∆-variance and mean expression, first in the poly-I:C dataset and then across all 46 datasets in the compendium. We observed a significant negative correlation, confirming that the variance of highly expressed genes is disproportionately underestimated when discarding information about biological replicates. We performed a similar analysis correlating the original variance of gene expression to the ∆-variance, demonstrating that the variance of the most variable genes is disproportionately underestimated when discarding information about biological replicates. However, in partial correlation analyses, only gene expression variance remained correlated with ∆-variance, implying that failing to account for biological replicates induces a bias towards highly expressed genes because these genes are also more variably expressed. Supplementary Fig. 4h-i employ the signed pseudo-logarithm transformation from the ‘ggallin’ R package to visualize the ∆-variance.

Simulation studies

Our understanding of the importance of accounting for variability between biological replicates led us to ask whether failing to account for biological replication could lead to the appearance of false discoveries in the absence of a perturbation. To test this hypothesis, we simulated scRNA-seq data with no biological effect, in which we systematically varied the degree of heterogeneity between replicates. Simulations were performed using the ‘Splatter’ R package48, with simulation parameters estimated from the Kang et al. dataset5 using the ‘splatEstimate’ function. Populations of between 100 and 2,000 cells were simulated, with between 3 and 20 replicates per condition. DE of varying magnitudes was simulated between replicates by varying the location parameter of the DE factor log-normal distribution (‘de.facLoc’) between 0 and 1, treating each replicate as its own group, and the total proportion of DE genes (‘de.prob’) set to 0.5. Then, half of the replicates were randomly assigned to an artificial ‘treatment’ condition and the remaining half to a ‘control’ condition, and DE analysis was performed between the treatment and control groups. Except where otherwise noted, plots show results from a simulated population of 500 cells, with three replicates per condition.

Analysis of published scRNA-seq control groups

To confirm that the trends observed in simulation studies were reflective of experimental datasets, we performed a similar analysis using published scRNA-seq data. Within our compendium, we identified a total of fourteen studies with control groups that included six or more samples5,6,15,49,50,51,52,53,54,55,56,57,58,59. Details on the preprocessing of each of these datasets are provided below. For each of these studies, we split the control group randomly into artificial ‘control’ and ‘treatment’ groups, and performed DE analysis. In addition to computing the total number of DE genes, we identified GO terms enriched among DE genes using a hypergeometric test. We also performed a similar analysis for one spatial transcriptomics dataset24, identifying DE genes between random groups of control mice with barcodes grouped by spinal cord region rather than cell type. Spatial transcriptomics data was downloaded from the supporting website at https://als-st.nygenome.org. Only data from wild-type mice was retained for the analysis. Last, we hypothesized that scRNA-seq studies of human tissues would display more heterogeneity between replicates than studies of animal models, where factors such as genotype, environment, and perturbation can be precisely controlled. To test this hypothesis, we computed the mean ∆-variance across all genes in the 38 human or mouse scRNA-seq datasets in our compendium (n = 18 human datasets and 20 mouse datasets).

Application to spinal cord injury

To demonstrate the relevance of our findings to the discovery of new biological mechanisms, we collected scRNA-seq data of the mouse lumbar spinal cord after SCI, and performed DE analysis.

Animal model

Experiments were conducted on adult male or female C57BL/6 mice (15-35 g body weight, 12-30 weeks of age). Vglut2:Cre (Jackson Laboratory 016963) transgenic mice were used and maintained on a mixed genetic background (129/C57BL/6). Housing, surgery, behavioral experiments and euthanasia were performed in compliance with the Swiss Veterinary Law guidelines. Animal care, including manual bladder voiding, was performed twice daily for the first 3 weeks after injury and once daily for the remaining post-injury period. All procedures and surgeries were approved by the Veterinary Office of the Canton of Geneva (Switzerland; GE/57/20 A).

Surgical procedures and post-surgical care

Surgical procedures were performed as previously described25,60,61,62. Briefly, a laminectomy was made at the mid-thoracic level (T9 vertebra). We performed a contusion injury using a force-controlled spinal cord impactor (IH-0400 Impactor, Precision Systems and Instrumentation LLC, USA63), as previously described60,64. The applied force was set to 90 kdyn. Analgesia (buprenorphine, Essex Chemie AG, Switzerland, 0.01–0.05 mg per kg, s.c.) was provided for three days after surgery.

Kinematic recordings

Kinematic recordings were performed as previously described25,60,61,65,66,67. Bilateral leg kinematics were captured using a 12-camera infrared (200 Hz) Vicon high-speed motion capture system (Vicon Motion Systems, UK). We attached reflective markers bilaterally at the iliac crest, the greater trochanter (hip joint), the lateral condyle (knee joint), the lateral malleolus (ankle), and the distal end of the fifth metatarsophalangeal joint.

Kinematic analysis

For each leg, 15 step cycles were extracted for each mouse. A total of 75 parameters quantifying kinematic and kinetic features were computed for each gait cycle accordingly. To evaluate differences between conditions we implemented a multistep statistical procedure based on principal component analysis, as previously described25,60,61,65,66,67.

Electrophysiology

Mice were anaesthetised using a ketamine/xylazine anesthesia mixture. Stainless steel needle electrodes (30 G) were inserted through the posterior surface of the ankle for nerve stimulation and into the lateral, plantar surface of the foot for digital electromyographic recordings. Responses were recorded at a stimulation intensity of 2 x threshold for evoking an H-wave. Signals were amplified and filtered (1000x and 300 Hz–5 kHz, AM Systems differential amplifier) then digitised (PowerLab, AD instruments) for acquisition. Twenty recordings were made at each of 5 different stimulation frequencies (0.1, 0.5, 1, 2, and 5 Hz) with a one minute break between each frequency setting. Peak to peak amplitudes for at least three responses were measured for both M and H waves at each frequency, for each animal. Response amplitudes were first normalized to the amplitude of the M wave at each frequency, and then normalized to the H/M ratio at 0.1 Hz for comparisons across animals.

Single-nucleus RNA sequencing

Single-nucleus dissociation of the mouse spinal cord was performed as previously described27,51. Animals were first euthanized by isoflurane inhalation and cervical dislocation. The lumbar spinal cord site was rapidly dissected and frozen on dry ice. Spinal cords were dounced in 500 µl sucrose buffer (0.32 M sucrose, 10 mM HEPES [pH 8.0], 5 mM CaCl2, 3 mM Mg acetate, 0.1 mM EDTA, 1 mM DTT) and 0.1% Triton X-100 with the Kontes Dounce Tissue Grinder. 2 mL of sucrose buffer was added and filtered through a µm cell strainer. The lysate was centrifuged at 3200 g for 10 min at 4 °C. The supernatant was decanted, and 3 mL of sucrose buffer added to the pellet and incubated for 1 min. The pellet was homogenized using an Ultra-Turrax and 12.5 mL of density buffer (1 M sucrose, 10 mM HEPES [pH 8.0], 3 mM Mg acetate, 1 mM DTT) was added below the nuclei layer. The tube was centrifuged at 3200 g at 4 °C and supernatant poured off. Nuclei on the bottom half of the tube wall were collected with 100 µl PBS with 0.04% BSA and 0.2 U/µl RNase inhibitor. Resuspended nuclei were filtered through a 30 µm strainer, and adjusted to 1000 nuclei/µl.

Library preparation

Library preparation was carried out using the 10x Genomics Chromium Single Cell Kit Version 2. The nuclei suspension was added to the Chromium RT mix to achieve loading numbers of 5,000. For downstream cDNA synthesis (13 PCR cycles), library preparation and sequencing, the manufacturer’s instructions were followed.

Read alignment

Reads were aligned to the most recent Ensembl release (GRCm38.93) using Cell Ranger, and a matrix of unique molecular identifier (UMI) counts, including both intronic and exonic reads, was obtained using velocyto68. Seurat35 was then used to calculate quality control metrics for each cell barcode, including the number of genes detected, number of UMIs, and proportion of reads aligned to mitochondrial genes. Low-quality cells were filtered by removing cells expressing less than 200 genes or with more than 5% mitochondrial reads. Genes expressed in less than 3 cells were likewise removed, yielding a count matrix consisting of 22,806 genes and 19,237 cells.

Clustering and integration

Prior to clustering analysis, we first performed batch effect correction and data integration across the two different experimental conditions as previously described27. Gene expression data were normalized using regularized negative binomial models69, then integrated across batches using the data integration workflow within Seurat. The normalized and integrated gene expression matrices were then subjected to clustering to identify cell types in the integrated dataset, again using the default Seurat workflow. Cell types were manually annotated on the basis of marker gene expression, guided by previous studies of the mouse spinal cord27,51,70.

Viral tract tracing

All surgeries on mice were performed at EPFL under general anaesthesia with isoflurane in oxygen-enriched air using an operating microscope, and rodent stereotaxic apparatus (David Kopf). We identified plasticity of excitatory neurons in the lumbar spinal cord after SCI using AAV-DJ-hSyn Flex mGFP 2 A synaptophysin mRuby (Stanford Vector Core Facility, reference AAV DJ GVVC-AAV-100, titer 1.15E12 genome copies per ml71) injections on each side of the cord of Vglut2:Cre mice at the L6 spinal level, 0.25 μl 0.6 mm below the surface at 0.1 μl per minute using glass micropipettes (ground to 50 to 100 μm tips) connected via high-pressure tubing (Kopf) to 10-μl syringes under the control of microinfusion pumps.

Immunohistochemistry

After terminal anaesthesia by barbiturate overdose, mice were perfused transcardially with 4% paraformaldehyde and spinal cords processed for immunofluorescence as previously described60,72. Primary antibodies were: rat anti-Pecam1 (BD Biosciences 550274, 1:200). Secondary antibodies were: Alexa Fluor 555 Goat Anti Rat (1:200, Life Technologies, A21434). Immunofluorescence was imaged digitally on a slide scanner [Olympus VS-120 Slide scanner] or confocal microscope [Zeiss LSM880 + Airy fast module with ZEN 2 Black software (Zeiss, Oberkochen, Germany)]. Images were processed using ImageJ (NIH) or Imaris (Bitplane, version 9.0.0).

Tissue clearing

Mouse spinal cords were cleared using CLARITY73 four weeks after injection of AAV-DJ-hSyn-flex-mGFP-2A-Synaptophysin-mRuby71. Mice were perfused transcardially first with 0.1 M PBS followed by 4% PFA (in 0.1 M PBS, pH 7.4) at 4 °C. The spinal cords were dissected and post-fixed in 4% PFA (in 0.1 M PBS) for 24 h at 4 °C. The dura was removed from the samples prior to clearing. Samples were incubated in A4P0 hydrogel solution (4% acrylamide in 0.001 M PBS with 0.25% of the photoinitiator 2,2′-azobis[2-(2-imidazolin-2-yl)propane] dihydrochloride (Wako Pure Chemical, Osaka, Japan)) for 24 h at 4 °C with gentle nutation. Samples were degassed by bubbling nitrogen gas through the tube for 3 m. Hydrogel polymerization was initiated by incubating the sample in a 37 °C water bath for 2 h. Tissue was washed in 0.001 M PBS for 5 m at room temperature. Samples were then placed in the X-CLARITY Tissue Clearing System I (Logos Biosystems Inc., South Korea) set to 1.2 A, 100 RPM, 37 °C. Clearing solution was made in-house from 40 g of sodium dodecyl sulfate (SDS), 200 mM boric acid, and filled to a total volume of 1 L with dH2O (pH adjusted to 8.5). Samples cleared after ~10–15 h. The sample was then washed for at least 24 h at room temperature with shaking in 1x PBS and 0.1% Triton-X (with 0.02% sodium azide) to remove excess SDS. The sample was then incubated in RIMS (40 g of Histodenz dissolved in 30 mL of 0.02 M PB, pH 7.5, 0.01% sodium azide, refractive index 1.465) for at least 24 h at room temperature with gentle shaking prior to imaging. Imaging was performed using a custom-built MesoSPIM74 and CLARITY-optimized light-sheet microscope (COLM) as described73. A customized sample holder was used to secure the spinal cords in a chamber filled with RIMS. Samples were imaged using a 2.5× objective at the MesoSPIM and a 4x objective at the COLM with two lightsheets illuminating the sample from the left and the right sides. The pixel resolution was 2.6 μm × 2.6 μm × 3 μm for the 2,5x acquisition; and 1.4 μm by 1.4 μm by 5 μm for the 4x acquisition in the x-, y-, and z-directions. Images were acquired as 16-bit TIFF files. 3D reconstructions of the raw images were produced using Arivis (v3.4) and Imaris softwares (Bitplane, v.9.0.0).

RNAscope

To corroborate the results suggested by DE analysis of scRNA-seq data, we analyzed the in situ co-localization of putatively DE genes and cell type marker genes using RNAscope (Advanced Cell Diagnostics)30. Lists of putatively DE genes were obtained for representative single-cell and pseudobulk DE methods (the Wilcoxon rank-sum test and edgeR-LRT, respectively), and cross-referenced against a list of validated probes designed and provided by Advanced Cell Diagnostics, Inc. In total, probes were obtained for 13 genes identified as DE by the Wilcoxon rank-sum test (Sgms1, catalog no. 538561; Pcdh9, catalog no. 524921; Epas1, catalog no. 314371; Tcaf1, catalog no. 466921; Gspt1, catalog no. 530471; Prex2, catalog no. 432481; Sema7a, catalog no. 437261; Zfp366, catalog no. 443301; Cpe, catalog no. 454091; Afap1l2, catalog no. 556251; Nedd4l, catalog no. 491981; Adipor2, catalog no. 452861; Ptpn14, catalog no. 493181) and 7 genes identified by edgeR-LRT (Slc7a11, catalog no. 422511; Gjb2, catalog no. 518881; Pi16, catalog no. 451311; Rbp4, catalog no. 508501; Col1a1, catalog no. 319371; Igfbp6, catalog no. 425721). In addition, we obtained probes for Pecam1 (catalog no. 316721), a classic endothelial cell marker gene. We then obtained 16 μm cryosections from fixed-frozen spinal cords as previously described27 and performed staining for each probe according to the manufacturer’s instructions, using the RNAscope Fluorescent Multiplex Reagent Kit (cat. no. 323133). For each biological replicate (n = 4 per condition for both injured and uninjured mice), we analyzed ten cells positive for Pecam1 and tallied the number of speckles for each gene of interest. The mean expression of each gene was then calculated for each biological replicate, and a one-tailed t-test was conducted based on the directional change in the snRNA-seq data.

Mixed models

Having established that the performance of DE methods is contingent on their ability to account for biological replicates, we asked why mixed models failed to match the performance of pseudobulk methods. In addition to the linear mixed model described above, we implemented generalized linear mixed models (GLMMs) based on the negative binomial or Poisson distributions, adapting implementations provided in the ‘muscat’ R package10. For each of these models, we evaluated the impact of incorporating the library size factors as an offset term, and compared the Wald test of model coefficients to a likelihood ratio test against a reduced model, yielding a total of four GLMMs from each distribution. The enormous computational requirements of the GLMMs prevented us from evaluating these models in the full ground truth datasets; instead, we analyzed a series of downsampled datasets, each containing between 25 and 1,000 cells. To quantify the computational resources required by each DE method, we monitored peak memory usage using the ‘peakRAM’ R package, and the base R function ‘system.time’ to record wall time.

Preprocessing and analysis of published single-cell datasets

We assembled a compendium of 46 published single-cell or single-nucleus RNA-seq studies (Supplementary Fig. 3), and performed DE analyses across this compendium to establish the generality of our conclusions. For publications containing more than one comparison, only a single comparison was retained, as described in detail below. We retained the comparison involving the greatest number of cells, and used the most fine-grained cell type annotations provided by the authors of the original studies. When count matrices did not use gene symbols, the provided identifiers were mapped to gene symbols, and counts summed across genes mapping to identical symbols. Only cell types with at least three cells in each condition were subjected to DE analysis, and genes detected in less than three cells were removed.

Angelidis et al., 201914. scRNA-seq data from young and aged mouse lung (3 m and 24 m, respectively), as well as matching bulk data from two purified cell types, was obtained from GEO (GSE124872). Metadata was obtained from GitHub (https://github.com/theislab/2018_Angelidis). Cells with missing cell type annotations were removed from the single-cell data. DE analysis was performed by comparing cells from young and old mice.

Arneson et al., 201875. scRNA-seq data from the hippocampus of mice after a mild traumatic brain injury (mTBI), delivered using a mild fluid percussion injury model, and matched controls was obtained from GEO (GSE101901). Metadata, including cell type annotations, were provided by the authors. DE analysis was performed by comparing cells from mTBI and control mice.

Avey et al., 201876. scRNA-seq data from the nucleus accumbens of mice treated with morphine for 4 h and saline-treated controls was obtained from GEO (GSE118918). Cells identified as doublets and non-unique barcodes were removed. Metadata, including cell type annotations, were provided by the authors. DE analysis was performed by comparing cells from morphine- and saline-treated mice.

Aztekin et al., 201977. scRNA-seq data from regeneration-competent (NF stage 40-41) Xenopus laevis tadpoles was obtained from ArrayExpress (E-MTAB-7716). DE analysis was performed by comparing cells from tadpoles at 1 d post-amputation to control tadpoles.

Bhattacherjee et al., 201978. scRNA-seq data from the prefrontal cortex of mice exposed to a cocaine withdrawal paradigm was obtained from GEO (GSE124952). DE analysis was performed by comparing cells at the 15 d post-withdrawal timepoint from cocaine- or saline-treated mice.

Brenner et al., 202079. snRNA-seq data from the prefrontal cortex of alcoholic and control individuals was obtained from GEO (GSE141552). Metadata, including cell type annotations, were provided by the authors. DE analysis was performed by comparing nuclei from alcoholic and control individuals.

Cano-Gamez et al., 202013. scRNA-seq data from naive and memory T cells, stimulated with anti-CD3/anti-CD28 coated beads in the presence or absence of various combinations of cytokines, was obtained from the supporting website (https://www.opentargets.org/projects/effectorness). Matching bulk RNA-seq and proteomics data was obtained from the same source. For the analyses presented as part of the compendium of 46 datasets, DE analysis was performed by comparing iTreg and control cells.

Cheng et al., 201980. scRNA-seq data from intestinal crypt cells in wild-type and Hmgcs2 knockout mice was obtained directly from the authors of the original publication. DE analysis was performed by comparing wild type and KO mice.

Co et al., 202081. scRNA-seq data of sorted cells from Drd1a-tdTomato+ control and Foxp2 KO mice was obtained from GEO (GSE130653). Cell type annotations were provided by the authors. Cell types annotated as ‘Low quality’ were removed prior to further analysis. DE analysis was performed by comparing WT and Foxp2 KO mice.

Crowell et al., 202010. snRNA-seq data from the prefrontal cortex of mice peripherally stimulated with lipopolysaccharide (LPS) and control mice was obtained from the Bioconductor package ‘muscData’, using the ‘Crowell19_4vs4’ function. DE analysis was performed by comparing nuclei from LPS-treated and control mice.

Davie et al., 201882. scRNA-seq data from the brains of flies of varying ages, sexes, and genotypes was obtained from the supporting website (http://scope.aertslab.org, file ‘Aerts_Fly_AdultBrain_Filtered_57k.loom’). Cells marked as ‘Unannotated’ were removed. DE analysis was performed by comparing cells from DGRP-551 and W1118 flies.

Denisenko et al., 202083. scRNA-seq data from human kidneys subjected to varying dissociation methods and cell fixation techniques was obtained from GEO (GSE141115). Metadata, including cell type annotations, was obtained from the supporting information files accompanying the published manuscript. DE analysis was performed by comparing cells fixed with methanol and freshly dissociated cells, both at –20 °C.

Der et al., 201984. scRNA-seq data of skin samples from patients with lupus nephritis (LN) and healthy controls was obtained from ImmGen (SDY997, EXP15077). Cell type annotations were obtained from the authors of the original manuscript. Other metadata, including biological replicate and experimental condition annotations for each individual cell, was obtained from the supporting information files accompanying the published manuscript. DE analysis was performed by comparing cells from patients with LN and healthy controls.

Goldfarbmuren et al., 202056. scRNA-seq data of tracheal epithelial cells from smokers and non-smokers was obtained from GEO (GSE134174). Patients designated as ‘excluded’ were removed prior to downstream analysis. DE analysis was performed by comparing cells from smokers and non-smokers.

Grubman et al., 201952. snRNA-seq data from the entorhinal cortex of patients with Alzheimer’s disease and matched controls was obtained from the supporting website (http://adsn.ddnetbio.com). Nuclei annotated as ‘undetermined’ or ‘doublet’ were removed. DE analysis was performed by comparing nuclei from patients with Alzheimer’s disease and controls.

Gunner et al., 201985. scRNA-seq data from the mouse barrel cortex before or after whisker lesioning was obtained from GEO (GSE129150). Cell types not included in Supplementary Fig. 10 of the original publication were removed. DE analysis was performed by comparing cells from lesioned and control mice.

Haber et al., 201786. scRNA-seq data from epithelial cells of the mouse small intestine in healthy mice and after ten days of infection with the parasitic helminth Heligmosomoides polygyrus was obtained from GEO (GSE92332), using the Drop-seq data collected by the original publication. DE analysis was performed by comparing cells from infected and uninfected mice.

Hagai et al., 201812. scRNA-seq data of bone marrow-derived mononuclear phagocytes from four different species (mouse, rat, pig, and rabbit) exposed to lipopolysaccharide (LPS) or poly-I:C for two, four, or six h was obtained from ArrayExpress (E-MTAB-6754). Matching bulk RNA-seq data was also obtained from ArrayExpress (E-MTAB-6773). Finally, scRNA-seq data from human dermal fibroblasts stimulated with interferon-β for two or six h, in which the ERCC mixture of synthetic mRNAs was spiked in alongside every cell, was obtained from ArrayExpress (E-MTAB-7051). Counts were summed across technical replicates of the same biological samples. For the analyses presented as part of the compendium of 46 datasets, DE analysis was performed by comparing rabbit cells stimulated with LPS for 2 h and control cells. DE analysis of the spike-in dataset was performed by comparing cells stimulated for 2 h and 6 h.

Hashimoto et al., 201987. scRNA-seq data of peripheral blood mononuclear cells from human supercentenarians and younger controls was obtained from the supporting website (http://gerg.gsc.riken.jp/SC2018). Metadata, including cell type annotations, were provided by the authors. DE analysis was performed by comparing cells from supercentenarians and younger controls.

Hrvatin et al., 201850. scRNA-seq data from the visual cortex of mice housed in darkness, then exposed to light for 0 h, 1 h, or 4 h was obtained from GEO (GSE102827). Cell types labeled as ‘NA’ were removed from downstream analyses. DE analysis was performed by comparing cells from mice stimulated with light for 4 h to control mice.

Hu et al., 201788. snRNA-seq data from the cerebral cortex of mice after pentylenetetrazole (PTZ)-induced seizure and saline-treated controls was obtained from the Google Drive folder accompanying the original publication (https://github.com/wulabupenn/Hu_MolCell_2017). DE analysis was performed by comparing cells from PTZ- and saline-treated mice.

Huang et al., 201958. scRNA-seq data from the colon of pediatric patients with colitis and inflammatory bowel disease and matched controls was obtained from the supporting website (https://zhanglaboratory.com/research-data/). DE analysis was performed by comparing cells from patients with colitis and healthy controls.

Jaitin et al., 201989. scRNA-seq data from white adipose tissue of mice fed either a high-fat diet or normal chow for six weeks were obtained from the Bitbucket repository accompanying the original publication (https://bitbucket.org/account/user/amitlab/projects/ATIC). Metadata, including cell type annotations, were provided by the authors. DE analysis was performed by comparing cells from high-fat diet and normal chow-fed mice.

Jakel et al., 201990. snRNA-seq data of oligodendrocytes from patients with multiple sclerosis and matched controls was obtained from GEO (GSE118257). DE analysis was performed by comparing nuclei from individuals with multiple sclerosis versus matched controls.

Kang et al., 20185. scRNA-seq data from peripheral blood mononuclear cells (PBMCs) stimulated with recombinant IFN-β for 6 h and unstimulated PBMCs was obtained from GEO (GSE96583). Doublets and unclassified cells were removed. DE analysis was performed by comparing IFN-stimulated and unstimulated cells.

Kim et al., 201991. scRNA-seq data from the ventromedial hypothalamus of mice exposed to a range of behavioral stimuli and control mice was obtained from the Mendeley repository accompanying the original publication. Cell type annotations were provided directly by the authors. DE analysis was performed by comparing cells from animals engaging in aggressive behaviour to the common population of control animals.

Kotliarov et al., 202092. scRNA-seq data of peripheral blood mononuclear cells from subjects who were subsequently given an influenza vaccination were obtained from Figshare (https://doi.org/10.35092/yhjc.c.4753772). DE analysis was performed by comparing cells from high and low responders to the influenza vaccination, as categorized by the authors.

Madissoon et al., 202093. scRNA-seq data from esophagus, lung, and spleen samples after varying durations of cold storage was obtained from the study website (https://cellgeni.cog.sanger.ac.uk/tissue-stability/). DE analysis was performed by comparing cells from samples preserved for 12 h and fresh samples.

Mathys et al., 20196. snRNA-seq data from the prefrontal cortex of patients with Alzheimer’s disease and matched controls was obtained from Synapse (syn18681734). Patient data and additional metadata were also obtained from Synapse (syn3191087 and syn18642926, respectively). DE analysis was performed by comparing nuclei from patients with Alzheimer’s disease and controls.

Nagy et al., 202057. snRNA-seq data from the dorsolateral prefrontal cortex of patients with major depressive disorder (MDD) and matched controls was obtained from GEO (GSE144136). DE analysis was performed by comparing nuclei from patients with MDD and controls.

Nault et al., 202194. snRNA-seq data from the livers of mice gavaged with 2,3,7,8-tetrachlorodibenzo-p-dioxin or sesame oil vehicle was obtained from GEO (GSE148339). DE analysis was performed by comparing nuclei from treated and vehicle livers.

Ordovas-Montanes et al., 201895. scRNA-seq data from ethmoid sinus cells of patients with chronic rhinosinusitis (CRS), with and without nasal polyps, from Supplementary Table 2 of the original publication. DE analysis was performed by comparing cells from patients with polyposis and non-polyposis CRS.

Reyes et al., 202096. scRNA-seq data of peripheral blood mononuclear cells from patients with sepsis and healthy controls was obtained from the Broad Institute’s Single Cell Portal (SCP548). DE analysis was performed by comparing cells from individuals with bacterial sepsis and controls.

Reyfman et al., 201915. scRNA-seq data from the lungs of patients with pulmonary fibrosis and healthy controls was obtained from GEO (GSE122960). Metadata, including cell type annotations, was provided by the authors. One sample (“Cryobiopsy_01”) was removed as it was sequenced separately from the rest of the experiment. The results of DE analysis of bulk RNA-seq data, comparing purified AT2 cells or alveolar macrophages from patients with pulmonary fibrosis and healthy controls, were obtained from the supporting information accompanying the original publication. DE analysis was performed by comparing cells from patients with pulmonary fibrosis and controls.

Rossi et al., 201953. scRNA-seq data from the hypothalamus of mice fed either a high-fat diet or normal chow for between 9-16 weeks was obtained directly from the authors, in the form of a processed Seurat object. Cells annotated as ‘unclassified’ were removed. DE analysis was performed by comparing cells from high-fat diet and normal chow-fed mice.

Sathyamurthy et al., 201851. snRNA-seq data from the spinal cord parenchyma of adult mice exposed to formalin or matched controls was obtained from GEO (GSE103892). Cell types with blank annotations, or annotated as ‘discarded’, were removed. DE analysis was performed by comparing cells from mice exposed to formalin and control animals.

Schafflick et al., 202097. scRNA-seq data of peripheral blood mononuclear cells from individuals with multiple sclerosis and matched controls was obtained from GEO (GSE138266). Metadata, including cell type annotations, was obtained from Github (https://github.com/chenlingantelope/MSscRNAseq2019). DE analysis was performed by comparing cells from individuals with multiple sclerosis and controls.

Schirmer et al., 201998. snRNA-seq data from cortical and subcortical areas from the brains of patients with multiple sclerosis and control tissue from unaffected individuals was obtained from the web browser accompanying the original publication (https://cells.ucsc.edu/ms). DE analysis was performed by comparing cells from multiple sclerosis and control patients.

Skinnider et al., 202127. snRNA-seq data from the spinal cords of mice with a spinal cord injury, some of which were exposed to epidural electrical stimulation to restore locomotion after paralysis, was obtained from GEO (GSE142245). DE analysis was performed by comparing nuclei from paralyzed and walking mice.

Tran et al., 201955. scRNA-seq data from the retinal ganglion of mice at various timepoints after an optic nerve crush injury, as well as uninjured controls, was obtained from GEO “GSE137398 ”. Metadata, including cell type annotations, was obtained from the Broad Institute’s Single-Cell Portal (SCP509). DE analysis was performed by comparing cells from mice at 12 h post-injury and uninjured mice.

Wagner et al., 201899. scRNA-seq data from zebrafish embryos between 14-16 h post-fertilization, with either the chordin locus or a control locus (tyrosinase) disrupted by CRISPR-Cas9 knock- out, was obtained from GEO (GSE112294). DE analysis was performed by comparing cells from chordin- or tyrosinase-targeted embryos.

Wang et al., 2020100. scRNA-seq data from the ovaries of young and old cynomolgus monkeys was obtained from GEO (GSE130664). Metadata, including cell type annotations, was obtained from the supporting information accompanying the original publication. Spike-ins were removed. DE analysis was performed by comparing cells from young and old primates.

Wilk et al., 202059. scRNA-seq data of peripheral blood mononuclear cells from patients with COVID-19 and healthy controls was obtained from the supporting website (https://www.covid19cellatlas.org/). DE analysis was performed by comparing patients with COVID-19 and controls.

Wirka et al., 2019101. scRNA-seq data from the aortic root of mice fed a high-fat diet or normal chow for eight weeks from GEO (GSE131776). Metadata, including cell type annotations, was provided by the authors, and unannotated cells were removed. DE analysis was performed by comparing cells from high-fat diet and normal chow-fed mice.

Wu et al., 201749. scRNA-seq data from the amygdala of mice subjected to 45 min of immobilization stress and control mice was obtained from GEO (GSE103976). DE analysis was performed by comparing cells from stressed and control mice.

Ximerakis et al., 2019102. scRNA-seq data from whole brains of young (2–3 mo) and old (21–23 mo) mice from the Broad Institute’s Single Cell Portal (SCP263). DE analysis was performed by comparing cells from young and old mice.

Visualization

Throughout the manuscript, box plots show the median (horizontal line), interquartile range (hinges) and smallest and largest values no more than 1.5 times the interquartile range (whiskers), and the error bars show the standard deviation.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link