Preloader

Normalizing and denoising protein expression data from droplet-based single cell profiling

Analysis of unstained cells reveals ambient antibody capture as a major source of protein-specific noise

To assess protein count noise, we first utilized our previously reported dataset measuring more than 50,000 peripheral mononuclear cells (PBMCs) from 20 healthy human donors24 stained with an 87 CITE-seq antibody panel (including four isotype controls; Totalseq-A reagents, Biolegend). Consistent with the original CITE-seq report2, we noticed non-zero counts for most proteins in each cell, resulting in positive counts even of markers not expected to be expressed in certain cell types. We also noticed non-zero, “ambient” protein counts in tens of thousands of empty droplets containing capture beads without cells, which emerge naturally due to Poisson distributed cell loading, reminiscent of cell-free RNA observed in droplet-based single cell RNAseq25,26,27. We reasoned that background noise in CITE-seq data may partly reflect such unbound, ambient antibodies captured in droplets. To assess whether counts in empty droplets indeed reflect the ambient component in cell-containing droplets, we compared background protein levels in cell-free droplets with droplets capturing unstained control cells spiked into the cell mixture after cell staining and washing but prior to droplet generation (Fig. 1a). We found positive protein counts even for unstained control cells, and that the average log-transformed level per protein in empty droplets and unstained control cells were highly correlated (Fig. 1b). A similarly strong correlation was observed between the average protein counts in subpopulations of stained cells “negative” for a given protein and those in empty droplets (Supplementary Fig. 1a–c; negative cells correspond to those in the fraction with lower expression of the protein—see the “Methods” section), further suggesting that the noise component correlated across cells is dominated by ambient antibody capture. Thus, protein counts in empty droplets, which are available in all single-cell droplet experiments, provide a direct estimate of the ambient background due to free antibody capture for each protein. Consistent with our findings on ambient antibody capture as the major source of background noise in CITE-seq data, a recent study reporting CITE-seq antibody titration experiments across a` wide concentration range demonstrated that background noise increased with the antibody staining concentration, with some antibodies at or above 2.5 µg/mL having even more cumulative UMIs in the empty droplets compared to cells28. Our observation thus motivated the first step of our method to remove protein-specific technical noise: transforming counts of each protein in cell-containing droplets by subtracting the mean and dividing by the standard deviation of that same protein across empty droplets (see the “Methods” section). The resulting transformed protein expression values for each cell reflect the number of standard deviations above the expected ambient capture noise, thus centering the negative cell population for each protein around zero to help improve interpretability of the resulting protein expression values (Fig. 1c).

Fig. 1: Antibody-derived protein UMI count data noise source assessment.
figure 1

a 1 and 2: Experimental setup and potential noise sources in CITE-seq data. 3: protein-specific noise: if ambient antibody encapsulated in droplets constitutes a major source of protein-specific noise, values should be highly correlated with those in unstained control cells (top); if control cells contain information on noise not captured by empty drops, the correlation should be weak. 4: Cell-specific noise evaluated through the correlation between the background protein population mean and isotype controls across single cells. Created with BioRender.com. b Average protein log10(count + 1) of unstained control cells spiked into the stained cell pool prior to droplet generation (y-axis) versus that of droplets without a cell (x-axis). Pearson correlation coefficient and p value (two sided) are shown. c Density histograms of protein expression of lineage-defining proteins within major subsets in stained cells (black) and unstained controls (red) normalized together using dsb step I (ambient correction and rescaling based on levels in empty droplets). d A two-component Gaussian mixture model was fitted to the protein counts within each single cell; the distributions of the component means from all single cell fits (blue = ”negative” population; red = “positive” population) are shown, protein distributions from a randomly selected cell shown in the inset. e Comparison of Gaussian mixture models fit with between k = 1 and k = 6 subpopulations to dsb normalized protein values for n = 28,229 cells from batch 1 after dsb step I (ambient correction) but prior to step II, vs. the model fit Bayesian Information Criteria (BIC, using mclust R package definition of BIC where larger values correspond to a better fit) from the resulting 169,374 models. Boxplots show the median with hinges at the 25th and 75th percentile, whiskers extend plus or −1.5 times the inter quartile range. k = 2 component Gaussian mixtures have the best fit in more than 80% of cells (orange, right inset bar plot). f Pearson correlation coefficients among isotype controls and background component mean inferred by Gaussian mixture model (µ1 fitted per cell as in d); all corresponding p values (two sided) are <2e−16. g Scatter density plot between µ1, the mean of each cell’s negative subpopulation from the per-cell Gaussian mixture model (blue in c) versus the mean of the four isotype controls across single cells. Pearson correlation coefficient is shown (two-sided p value < 2e−16). h The distribution of the dsb technical component as calculated using a 2 component (x-axis) vs. 3 component (y-axis) mixture model to define the µ1 parameter, Pearson correlation coefficient, p value (two-sided) < 2e−16.

Shared variance between isotype controls and background protein counts in single cells provide cell-intrinsic normalization factors

In addition to ambient noise correlated across single cells as captured by average readouts from empty droplets, cell/droplet-intrinsic technical factors including but not limited to oligo tag capture, cell lysis, reverse transcriptase efficiency, sequencing depth and non-specific antibody binding, can contribute to cell-to-cell variations in protein counts that should ideally be normalized across single cells. Given that the differences in total protein UMI counts between individual cells could reflect biologically relevant variations, such as those due to the physical size of naïve vs. activated lymphocytes, library size normalization (dividing each cell by the total library size) could remove biological rather than technical cell to cell variations. In addition, since current CITE-seq antibody panels are a small subset of total surface proteins, the assumption that total UMI counts should be similar among cells may not be valid. Here we integrated two types of independently derived measures to reveal a more conservative (i.e., avoiding over-correction and removal of biological information), robust estimate of the factor associated with cell-intrinsic technical noise (Fig. 1a).

First, the four isotype control antibodies with non-human antigen specificities in our panel could in principle help capture contributions from non-specific binding and other technical factors discussed above. The counts of the isotype controls were only weakly (but significantly) correlated with each other across cells (Fig. 1f), and interestingly, the correlation between the mean of four isotype controls and the protein library size (which has both biological and technical components) across single cells was even higher (Pearson correlation 0.45) than that between the protein library size and the individual isotype controls (average Pearson correlation 0.25). This suggests that while each isotype control may be individually noisy, and their levels may still partially reflect biological contributions, collectively their shared component of variation (i.e., as reflected by their average) may better capture technical noise in the experiment. Second, to further boost the robustness of estimating cell-intrinsic technical noise, particularly given that the number of isotype controls available in practice can be limited, we sought an additional estimate of droplet-intrinsic technical variation. Since each cell in a sample of multiple distinct cell types (e.g., PBMCs) is expected to express only a subset of protein markers in staining panels, we reasoned that the distribution of each cell’s non-staining proteins (e.g., those specifically expressed in other cell types/lineages) could be differentiated from the cell’s “positively expressed” proteins by fitting a 2-component mixture model to each cell. If so, the average counts in the population of non-staining/negative proteins could reflect and therefore serve as another readout of the cell’s technical component that could then be integrated with the cell-intrinsic noise captured by isotype controls. To assess this hypothesis, we applied a Gaussian mixture model with two (k = 2) subpopulations to fit the protein counts within each single cell after correcting for the protein-specific ambient noise we identified above (see below and the “Methods” section; Fig. 1d). We found clear separation between the background (with mean = µ1) and positive (mean = µ2) protein population with substantial cell-to-cell heterogeneity of subpopulation means (Fig. 1d). We next assessed the robustness of using a two-component mixture to model the protein counts of individual cells by comparing k = 1–6 component models assessed using the Bayesian Information Criterion (BIC). While two-component models had the best fit in a majority (81%) of cells, indicating a bimodal protein distribution within single cells, k = 3 models had the best fit in nearly all remaining cells (Fig. 1e, Supplementary Figs. 2a, b; see also Supplementary Note). The BIC for these cells were very similar to the corresponding k = 2 models (Supplementary Fig. 2c), indicating that the two-component fits were identifying very similar positive and negative populations. Importantly, for the minority of cells with optimal k = 3 or 4 models, the resulting mean of the lowest expression population (µ1 estimate) was highly concordant when the same cells were fit with a k = 2 model (Supplementary Fig. 2d–f). These data suggest that a two-component Gaussian mixture fit of the protein population within single cells can robustly delineate the negative background protein count population for most cells.

Together µ1 and the isotype controls provide estimates of technical noise within each single cell. However, each variable may be individually noisy; we thus assessed information sharing among these variables. The correlations between µ1 and each individual isotype control (average correlation r = 0.33) or the average of all four isotype controls (r = 0.59) were higher than those between the isotype control themselves (average correlation r = 0.11), suggesting that the shared variation (i.e., average) between the independently inferred µ1 and isotype controls captured unobserved, latent factors contributing to technical noise (Fig. 1f, g). We thus reasoned that the first principal component score (λ) capturing the shared variation of µ1 and the isotype controls across single cells would be a robust measure of technical noise intrinsic to individual cells. λ was associated with the protein library size across single cells within cell clusters (Supplementary Fig. 3a clusters defined after dsb steps I and II, see the “Methods” section), supporting the notion that λ captures the technical component of the protein library size. Furthermore, consistent with the observation above regarding the Gaussian mixture model fit, λ was highly concordant regardless of whether the background (µ1 estimate) was defined using a k = 2 or k = 3-component Gaussian mixture (Fig. 1h).

Given the information sharing between µ1 and isotype controls, we recommend the inclusion of multiple isotype controls in CITE-seq experiments to serve as anchors for robust inference of technical normalization factors (see Supplementary Note). Together, our data indicate that while the signal from individual measures, such as isotype controls can be noisy and may reflect multiple yet often unknown sources of variation, their correlated component of variation can serve as a robust normalization factor for surface protein expression in single cells. Thus, in a second, optional but recommended step, our method computes λ for each cell as its “technical component”, which is then regressed out of the ambient noise corrected protein values (Fig. 1c) generated by step 1 above (see the “Methods” section). The underlying modeling assumptions of the dsb technical component also held well in seven independent datasets generated via different assay platforms and protein panels of diverse sizes (from 17 to more than 200 proteins; see below).

Comparison with other transformations and assessing dsb in independent datasets generated by different technology platforms

The unstained spike-in cells above should reflect the level of protein specific, “ground-truth” noise, we thus used these cells to visually compare dsb with other normalization transformations (Supplementary Fig 3e; see the “Methods” section). Unstained cells normalized using dsb centered around zero, while CLR or log transformation placed these cells at arbitrary locations. For example, CD4 has a trimodal distribution due to absence of expression in populations such as B lymphocytes, low expression in CD14+ monocytes and high expression in helper T cells; dsb normalized values centered the background population together with unstained control cells at zero and delineated low-level CD4 staining on monocytes. In contrast, these monocytes are closer to and partially overlapped with the unstained population when CLR or log normalization were used (Supplementary Fig 3e). We further compared dsb to CLR (the version that normalizes across cells) since CLR is the most commonly applied transformation for ADT data to date and normalization across cells should depend less on the protein staining panel than CLR across proteins. Using k-medoids clustering of single cells based on protein expression data only, the Gap-Statistic29, which reflects improvement in within-cluster coherence relative to that expected of random data drawn from a reference distribution, was consistently higher using dsb compared to CLR across different values of k. However, the trend as a function of k was similar between dsb and CLR, suggesting that the improvement could be partly due to scaling differences between these two transformations (Supplementary Fig. 3f). Finally, differential expression analysis comparing major immune cell populations with the rest of the cells revealed that key lineage and cell-type-specific proteins (e.g., CD56 on NK cells) tended to have larger fold changes when using dsb normalized protein values compared to CLR (Supplementary Fig. 3g).

We next tested the general applicability of dsb by using several independent, publicly available CITE-seq datasets. We first assessed whether the modeling assumptions developed using our own CITE-seq data would generalize to four other CITE-seq datasets that profiled ~5000 to 10,000 cells using 14–29 surface phenotyping proteins and three isotype controls, and were generated using different versions of the 10X Genomics droplet profiling kit than the one we used. Similar to our dataset, we detected a large number of empty droplets containing antibody reads (>50,000) inferred by the EmptyDrops25 algorithm used in the Cell Ranger barcode rank algorithm; the number of cell-containing droplets estimated by Cell Ranger and further filtered by quality control metrics (3000–8000 droplets) was also consistent with the number of loaded cells (Fig. 2a, Supplementary Fig. 4a, h, o). Thus, protein-specific ambient noise can be estimated as in our data set using these empty droplets. Applying dsb without any modification resulted in biologically interpretable protein-based clusters (Fig. 2b, c, Supplementary Fig. 4e, f, l, m, s, t) and canonical immune cell populations could be clearly delineated by conventional biaxial plots (Fig. 2d, Supplementary Fig. 4g, n, u). Importantly, the model-fitting behavior and correlations among isotype controls and background counts observed in our dataset were similarly observed in these independent datasets, including: (1) The k = 2 component Gaussian mixture model had the best fit according to BIC in most single cells (Fig. 2e, 89% average across four CITE-seq datasets); (2) the estimated µ1 (mean of background protein counts) for each cell correlated significantly with the mean of isotype controls across single cells and was higher than the correlation with individual isotype controls (Fig. 2f, g, Supplementary Fig. 4b, c, i, j, p, q); (3) the inferred technical component using isotype controls and µ1 was correlated with the protein library size (Fig. 2h, Supplementary Figs. 4d, k, r); finally, (4) even on the smallest panel (14 phenotyping antibodies, 3 isotype controls) the per cell technical component λ was highly concordant regardless of whether the background (µ1 estimate) was defined using a k = 2 or k = 3-component Gaussian mixture (Supplementary Fig. 2g).

Fig. 2: Assessment of dsb model assumptions and performance of dsb normalization on external datasets.
figure 2

Panels ah Application of dsb to a publicly available dataset generated using 10X genomics “NextGem” chemistry measuring 29 proteins across ~5K cells. a The protein library size distribution of empty and cell-containing droplets used for dsb normalization. b UMAP of single cells based on dsb normalized protein values with colors representing clusters obtained from clustering cells on dsb normalized protein values. c Heatmap of the average of dsb normalized values per protein-based cluster shown in (b). d The distribution of CD14 and CD4 dsb normalized values. e As in Fig. 1e, Gaussian mixture model parameters fit to the dsb normalized values of each single cell after step I (ambient noise/background droplet based correction). The Bayesian Information Criterion (BIC) of the model vs. number of components in the model fit for each cell (n = 3774 cells). Boxplots show the median with hinges at the 25th and 75th percentile and whiskers extending plus or minus 1.5 times the inter quartile range. f As in Fig. 1f, Pearson correlation coefficient matrix of variables used to define each cell’s technical component; each isotype control and µ1, the Gaussian mixture model background mean across proteins for each cell. g As in Fig. 1g, Pearson correlation coefficient between the inferred cell-specific background mean µ1 from the Gaussian mixture model vs. the mean of isotype controls in each cell. h The relationship between each cell’s technical component and the cell’s protein library size (Pearson correlation coefficient shown as in Supplementary Fig 3a with 95% confidence interval in gray). i Summary statistics for the eight independent datasets assessed in this study; Cor 1 and 2 correspond to the Pearson correlation coefficient for assessing the relationships between variables shown in (h) and (g) across cells for each dataset.

We next tested the applicability of dsb to several new types of multimodal single cell data generated by technologies that measure surface protein expression in droplet captured single cells using oligo-barcoded antibodies including (1) “proteogenomic” data (protein + DNA mutation assays from Mission Bio; 9 proteins plus an isotype control), (2) ATAC-seq with Select Antigen Profiling (ASAP-seq: protein and chromatin accessibility; 238 proteins plus isotype controls), and 3) Transcription, Epitopes, and Accessibility (TEA-seq: protein + chromatin accessibility and transcriptome assessment; 45 proteins plus one isotype control). All datasets had ADT reads in a large number of empty droplets (Supplementary Figs. 5a, d, e). Our method was compatible with the proteogenomic dataset, helping to identify markers for each cell cluster after correcting for protein-specific background levels estimated from >16,000 empty droplets (Supplementary Figs. 5a–c). In the ASAP-seq dataset that measured multiple isotype controls, µ1 again correlated significantly with the mean of isotype controls across single cells and this correlation was higher than that among the individual isotype controls (Supplementary Figs. 5f, g), and the inferred per-cell dsb technical component was correlated with the library size as observed above (Supplementary Fig. 5h). In TEA-seq and ASAP-seq data, the negative staining cells could often be identified by applying the same 3.5 threshold that we applied in our and other data sets (Supplementary Figs. 5i–k and see below). The compatibility and utility of dsb with large protein panels such as in the ASAP-seq dataset is consistent with our recent CITE-seq analysis of Covid-19 patients using a similarly large panel where dsb helped enable accurate cell population identification by both automated clustering and manual gating30. A summary of results from these datasets is shown in Fig. 2i.

Case study I: dsb improves interpretation of protein-based and joint protein–mRNA clustering results

We next further investigated the ways in which normalization with dsb could help improve cell type identification. By design, dsb zero-centers the background population for each protein and provides normalized expression interpretable as signal above expected background noise. These features are thus particularly helpful in manual gating across cell lineages (Supplementary Fig. 6a) and can improve the annotation of cell types derived from unbiased clustering. In contrast, distinguishing true biological expression from noise within individual cell clusters can be challenging when using transformations such as the CLR, partly because CLR protein values lie on a non-zero-centered scale (each protein also has a distinct noise floor); therefore, cells can appear to express markers known to be specific for other cell lineages. For example, in cluster 4 from our PBMC data (framed cluster in Fig. 3a), proteins such as IgA/IgM and CD57 could be mis-interpreted as showing signal above noise (Fig. 3b). In contrast, dsb normalized values for IgA, IgM, and CD57 are zero-centered (Fig. 3b), indicating that the level of these proteins in this cluster was statistically similar to the level in empty droplets and were therefore not expressed (Fig. 3c, d—red proteins). In contrast, CD16, CD244, and CD56 had dsb values above 8 (i.e., >8 standard deviations above the mean in empty droplets, +/− the correction from regressing out the technical component), suggesting these were CD57 negative CD16++ CD56+ NK cells, which are not known to express B-cell markers such as IgM or IgA. In general, cell clusters identified using dsb normalized protein values had cell type-defining proteins detected above the same threshold (3.5) applied within each cell cluster (Fig. 3e, Supplementary Fig. 6b, c).

Fig. 3: Case study I: dsb improves interpretation of cell clusters derived from protein-based and joint mRNA–protein clustering.
figure 3

a UMAP plot of single cells labeled by cluster number (clustering was performed using dsb normalized protein values). b The distribution of protein expression of cluster 4 (highlighted with a gray box in (a)) using CLR (across cells) or dsb for normalization. c Median log + 1 protein levels (left) and CLR transformed across cells (as in (b), right) in cells from cluster 4 versus the level in empty droplets; proteins highlighted in red are comparable in expression to “positive” proteins after log transformation (left) and CLR transformation across cells (right) but are similar to background levels in empty droplets (identity line y = x shown in black). All proteins with median log10 expression >1 but <3.5 after dsb normalization are labeled with the protein name. d Similar to (c), but the y-axis shows the median dsb normalized values; proteins in red (those near the diagonal in (c)) are now residing below our uniformly applied dsb positivity threshold of 3.5, reflective their proximity with mean counts in empty droplets; proteins above the red line have median dsb normalized expression within the highlighted cluster 4 (see (a) and (b)) above 3.5, i.e., 3.5 standard deviations above ambient noise, ±adjustment for the cell intrinsic technical component. e The dsb normalized value vs. the median value in empty droplets of proteins within a subset of protein-defined clusters. A subset of proteins informative for cluster identification from B cell and dendritic cell subsets with a dsb value above 3.5 (red line) are annotated with the protein name within each panel and are labeled in red when below 3.5 within each subset. Proteins labeled for B cell subsets (C13: Unswitched B cells, C5 Transitional B cells, C12 Switched B cells) include B cell proteins CD20, CD19, IgD, and IgM, proteins labeled for the dendritic cell subsets (C16: pDC, C14: mDC) include innate cell markers CD1c, CD1d, CD34, CD14, CD16, and CD303. f UMAP plot of the same cells shown in (a) but the UMAP embeddings and clusters here were derived using Seurat’s weighted nearest neighbor (WNN) mRNA-protein multimodal algorithm applied to dsb normalized values. g Similar to (d) but derived using cells from WNN cluster 3; Pearson correlation coefficient and p value (two sided) are shown between median dsb normalized values and the 98th percentile expression value (log10) of the same protein in empty droplets. h Similar to (g) but for CLR normalized values. i. Differentially expressed genes (ROC test; see the “Methods” section) for cell in cluster 3 vs. other clusters.

We also assessed compatibility of dsb with an unsupervised joint mRNA-protein clustering algorithm that constructs a weighted nearest-neighbor (WNN) joint embedding of CITE-seq mRNA and protein data31 (Fig. 3f). We ran WNN clustering using the same processed mRNA data together with ADT data normalized by either dsb or CLR (across cells). The clustering results were similar, suggesting dsb and CLR led to broadly concordant results. However, closer examination of individual clusters revealed that dsb could lead to more interpretable results. Notably, CD14-positive cells (presumably monocytes) were distributed across multiple dsb-derived clusters, including cluster 3 characterized by elevated CD86 (Fig. 3g). In contrast, the CLR value of these same cells was relatively low for CD86 but high for other markers (e.g., CD8 and IgM) that should not be expressed by monocytes (Fig. 3h). Furthermore, median CLR values in these cells (but not dsb—Fig. 3g) were correlated with the 98th percentile of expression in empty droplets across proteins (R = 0.67, p = 3.1e−10; Fig. 3h), suggesting that protein-specific ambient noise contributed substantially to the CLR values; this noise source was successfully accounted for by dsb via the use of empty droplets. Finally, relative to the rest of the cells, differentially expressed transcripts in dsb-derived cluster 3 include inflammatory and activation genes (Fig. 3i), consistent with the CD86-high phenotype revealed by dsb.

Case study II: dsb unmasks MAIT cell population in tri-modal TEA-seq data

As a second example, we further analyzed trimodal transcriptome, protein, and chromatin accessibility (TEA-seq) data32. Visual inspection suggested improvement in biaxial plots after dsb normalization as the same interpretable threshold of 3.5 applied to all datasets in this study delineated two cell populations based on CD4 and CD14 (Fig. 4a) compared to normalization with protein library size (as implemented in the original TEA-seq study), and CLR (Supplementary Fig. 9a, b). To assess unsupervised multimodal clustering, we carried out the same comparison of CLR and dsb normalization using WNN clustering (combining mRNA and protein) as above but on TEA-seq data. Similar to above, the clustering results overlapped significantly (Chi-squared test, p < 2e−16 Supplementary Fig. 9c, d). However, we noticed phenotypic marker differences within a specific T cell cluster that could substantially change the biological interpretation of the resulting cell population. During thymic development, human T cells rearrange variable, diversity and joining (VDJ) genes at the T cell receptor (TCR) locus. The resulting TCR gene rearrangements are distinct to functional categories of T cells with known specialized functions. This TEA-seq data included antibodies specific for alpha-beta (TCR a/b—conventional helper and cytotoxic T cells), gamma-delta (TCR g/d gamma-delta T cells), and Va7.2 (specific for mucosal associated invariant T (MAIT) cells). The MAIT TCR Va7.2 median dsb values were high (~15 standard deviations above background noise) in cell cluster 14 (with more than 700 cells); as expected, cells in this cluster expressed TCR Va7.2 exclusively with no other TCR proteins according to dsb normalization (Fig. 4c, Supplementary Fig. 9f). In contrast, the CLR normalized values of the cells in this cluster had higher median values for TCR a/b than TCR-va7.2; both TCRs were similarly distributed and it was thus unclear which was truly expressed given the uncertain noise floor of CLR normalized counts (Fig. 4d, Supplementary Fig. 9e). This was also the case for the gamma-delta T cell receptor protein, which was around zero after dsb normalization (Supplementary Fig. 9e, f). CD56, CD3, CD8, and KLRG1 in Cluster 14 were also positive based on dsb (more than 6 standard deviations above background noise) (Fig. 4e), thus broadly consistent with the known phenotype of CD8 + MAIT cells33. These cells have distinct biological functions from conventional T cells, partly due to their semi-invariant T cell receptor (TCR-Va7.2) specific for bacterial metabolic products presented via major histocompatibility complex-related protein MR134. Based on CLR normalized protein levels alone, cells in cluster 14 had a phenotype resembling conventional T cells with elevated cytotoxic capacity (TCR a/b, KLRG1 and CD56 positive)35,36. Since dsb corrects for protein-specific noise, we hypothesized that the apparent expression of both TCRs in cluster 14 after CLR normalization was likely due to ambient noise present in CLR transformed data. Supporting this notion, the median CLR values (but not the dsb-derived values) were correlated with the 98th percentile values from empty droplets (Pearson correlation 0.8, two-sided p = 2.4e−7, compared Pearson correlation 0.13, two-sided p = 0.39 for dsb), and both the alpha-beta and gamma-delta TCR proteins were among the highest ranked proteins based on expression in empty droplets (Fig. 4e, f). To further assess the identity of this cluster, we performed unbiased differential mRNA expression analysis of cluster 14 cells versus other clusters (Fig. 4g). Among the top discriminative markers for cluster 14 was the transcription factor ZBTB16 (Fig. 4g), which is known to be elevated during iNKT and MAIT cell differentiation37, expressed by mature MAIT cells38,39, but suppressed during conventional naïve T cell differentiation40. We next constructed a 165-transcript MAIT cell signature derived from the top differentially expressed genes reported in an independent study, which used bulk RNA-seq to compare FACS-sorted TCR-Va7.2+ human MAIT cells versus other T cells lacking this TCR41. This MAIT cell signature was significantly enriched (Fig. 4h) in differentially expressed genes from cluster 14 (normalized GSEA enrichment score 2.64, p value 1e−10). As this example demonstrates, dsb helped to avoid potential misannotation of a T cell subset and revealed biologically coherent mRNA and protein profiles of MAIT cells. Thus, dsb is compatible with and can improve downstream analysis outcomes of multimodal single cell data such as TEA-seq.

Fig. 4: Case study II: application of dsb to tri-modal TEA-seq data unmasks a MAIT cell population obscured by noise in CLR normalization.
figure 4

Analysis of TEA-seq (transcriptome, epitopes, and accessibility) tri-modal single cell assay data. a dsb normalization of protein data from TEA-seq showing the distribution of CD4 and CD14 with the same 3.5 threshold used throughout the study. b UMAP plot of single cells and clusters derived by WNN joint mRNA–protein clustering with protein data normalized using dsb. c Bi-axial distribution of the alpha beta and va7.2 T cell receptor (TCR) proteins in cluster 14 cells normalized by dsb and d the same cells’ CLR normalized values. e Similar to Fig. 3g but here for cluster 14 from (b) using dsb or f CLR normalized values (y-axis); in both plots Pearson correlation coefficients and p values (two sided) are shown between normalized values (y axis) and values in empty droplets (x axis). g Differential expression analysis (ROC test) of genes in cluster 14 vs. other clusters. h Gene set enrichment of a MAIT cell signature constructed from FACS-sorted TCR-va7.2+/MAIT cells compared to other T cells (RNA-seq data from Park et al. 2019) with genes ranked by log2 fold change in cluster 14 cells vs. other cells as in (g).

Source link