
A neural network-based method for exhaustive cell label assignment using single cell RNA-seq data

Method overview

NeuCA is a supervised cell label assignment method. It uses existing scRNA-seq data with known labels to train a neural network-based classifier and then predicts cell labels in new data of interest. Figure 1 provides a schematic overview of the proposed method. Based on the training data, NeuCA first obtains the mean gene expression profile for each cell type and calculates the correlation matrix across cell types. NeuCA then chooses one of two approaches. If the correlation matrix contains highly correlated cell types (defined as cell types with Pearson’s correlation coefficient ≥ τ), NeuCA constructs a tree structure by hierarchical clustering and trains a series of neural networks based on this tree structure. We used τ = 0.95 for the following experiments, but users can specify this value in the R package. Predicted labels are obtained by applying the trained hierarchical neural network model to the testing data. In constructing the hierarchical neural network, different sets of features (genes) are selected at each iteration to accommodate the differences between cell types. If the correlation matrix does not show the existence of highly correlated cell types, NeuCA trains a feed-forward neural network for cell type prediction. This simplified approach is motivated by the fact that a tree structure is unnecessarily expensive when cell types are not similar. In this case, a feed-forward neural network can achieve satisfactory prediction accuracy. In short, NeuCA calculates the correlation matrix first and uses it to choose between the two strategies described above.
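The routing decision described above can be sketched in a few lines. This is a minimal illustration in Python, not the NeuCA R package itself: the function names (`pearson`, `mean_profiles`, `choose_strategy`) are hypothetical, and the sketch assumes expression is given as one numeric vector per cell with no zero-variance profiles.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_profiles(expr, labels):
    """Average the per-cell expression vectors within each cell type."""
    groups = {}
    for row, lab in zip(expr, labels):
        groups.setdefault(lab, []).append(row)
    return {
        lab: [sum(col) / len(rows) for col in zip(*rows)]
        for lab, rows in groups.items()
    }


def choose_strategy(expr, labels, tau=0.95):
    """Return the route ('hierarchical' or 'feed-forward') and pairwise correlations."""
    profiles = mean_profiles(expr, labels)
    pairs = {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }
    # Any pair of cell types at or above tau triggers the tree-based route.
    hierarchical = any(r >= tau for r in pairs.values())
    return ("hierarchical" if hierarchical else "feed-forward"), pairs
```

Only the off-diagonal pairs are inspected, since every cell type correlates perfectly with itself. The default `tau=0.95` mirrors the value used in the experiments.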

Figure 1

A schematic overview of the proposed method. Based on the correlation matrix of the training data, NeuCA will detect if highly correlated cell types exist, and decide between the following two routes: (1) with the presence of highly correlated cell types, NeuCA will adopt a hierarchical model with neural networks for cell label identification (right, pink panel); (2) in the absence of highly correlated cell types, a feed-forward neural network will be adopted for cell label identification (lower-left, cyan panel).

The advantages of the proposed method are threefold. First, NeuCA can recognize whether the cell types are highly correlated and adjust its strategy accordingly. The presence of highly correlated cell types usually indicates a more complex prediction task. At this higher difficulty level, the combined approach of a tree structure and neural network model shows its merits. When high correlation is absent, the feed-forward neural network usually works well. Second, the adoption of a cell type hierarchical tree and stepwise feature selection in NeuCA helps further improve accuracy. The hierarchical structures of cell types have been identified and used in several previous works14,24,30 for their advantage in analyzing cell types with high similarities. When numerous cell types are available, the hierarchical approach to cell annotation breaks a complicated problem into a series of simple binary classification problems. For example, among peripheral blood mononuclear cells, it is easier to distinguish CD4T cells from CD8T cells given only the mixture of these two cell types than when they are mixed with all other peripheral blood cell types. In addition to hierarchical classification, we select the most distinguishable features at each step, which further improves the accuracy of separating similar cell types. Third, NeuCA assignments are exhaustive, in that they do not allow cells to stagnate at an intermediate node of the tree. This idea rests on the assumption that most or all cell types in the new testing data have been included and correctly labeled in the training data. It eliminates the unnecessarily high unassignment rate in well-studied tissues. In the following numerical and real data experiments, we show that this improves assignment accuracy.
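To make the idea of exhaustive, step-wise assignment concrete, here is a minimal sketch of descending a cell type tree until a leaf is reached. It is an illustration under stated assumptions, not NeuCA's implementation: the `Node` class and `assign` function are hypothetical, and a nearest-centroid rule stands in for the neural network that NeuCA trains at each split. Each node uses its own gene subset, mirroring the step-specific feature selection.

```python
class Node:
    """A tree node; leaves carry a label, internal nodes carry a split."""

    def __init__(self, label=None, children=None, genes=None, centroids=None):
        self.label = label                # set only on leaves
        self.children = children or []    # child nodes of an internal node
        self.genes = genes or []          # gene indices used at this split
        self.centroids = centroids or []  # one centroid per child, over self.genes


def assign(node, cell):
    """Descend from the root to a leaf; a cell never stops at an internal node."""
    while node.children:
        x = [cell[g] for g in node.genes]
        # Stand-in classifier (assumption): squared distance to each child's centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in node.centroids]
        node = node.children[dists.index(min(dists))]
    return node.label


# Toy tree with made-up centroids: the root separates T cells from monocytes
# on gene 0, then the T-cell node separates CD4 T from CD8 T on genes 1 and 2.
t_node = Node(
    children=[Node(label="CD4 T"), Node(label="CD8 T")],
    genes=[1, 2],
    centroids=[[3.0, 0.0], [0.0, 3.0]],
)
root = Node(
    children=[t_node, Node(label="CD14 monocyte")],
    genes=[0],
    centroids=[[0.0], [5.0]],
)
```

Because the loop only terminates at a leaf, every cell receives a final cell type label, which is the sense in which the assignment is exhaustive.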

Numerical experiments with cell-sorted PBMC data

We first applied our proposed method, NeuCA, and other existing methods on a series of peripheral blood mononuclear cell (PBMC) datasets. All the experimental datasets generated in this section are based on the 10X PBMC scRNA-seq data31. This dataset contains single cell transcriptome data of more than 60,000 cells from fluorescence-activated cell sorting (FACS). The FACS experiment provided the gold standard of cell labels for all the sequenced cells, making it a reliable resource for benchmarking. We designed a series of experiments with randomly drawn cells from this dataset to serve as Monte Carlo simulation studies. This allowed us to fully evaluate the proposed and existing methods under various scenarios.

The methods benchmarked here included three supervised cell annotation methods, NeuCA, scmap, and CHETAH, and two unsupervised clustering methods, Seurat and SC3. We tested NeuCA with three different numbers of nodes: relatively large, medium, and small (see the “Method” section for details), denoted as NeuCA-big, NeuCA-med, and NeuCA-small, respectively. We also compared two versions of scmap, scmap-cluster and scmap-cell, representing cluster-wise and cell-wise approaches for cell annotation. The two evaluation metrics were the accurately assigned rate, calculated as the proportion of correctly classified cells over total cells, and the adjusted Rand index (ARI), an indicator of the similarity between two clustering results. For unsupervised methods, we reported ARI only, since matched cell labels are not directly available. Additionally, we benchmarked NeuCA against other neural network-based methods (neural-network, scDeepSort32), and presented the results in Supplementary Fig. S16.
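For readers who want to reproduce the two metrics, the following is a small self-contained sketch. The function names are illustrative, and treating "unassigned" as its own cluster when computing ARI is an assumption of this sketch rather than something stated in the text. The ARI follows the standard pair-counting definition.

```python
from collections import Counter
from math import comb


def accurately_assigned_rate(true_labels, predicted_labels):
    """Proportion of cells whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(a)
    # Pairs of cells that share a group in both labelings, in a only, in b only.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. a single group in both labelings).
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The accurately assigned rate requires predicted labels that use the same names as the true labels, whereas ARI is invariant to renaming clusters. This is why ARI alone can be reported for unsupervised methods.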

Overall comparisons

We provided an overall comparison between NeuCA and existing methods across various scenarios. These scenarios combined training sets using 10%, 20%, 50%, or 80% of all PBMC data with testing sets of 800, 1600, or 4000 randomly selected PBMC cells. Figure 2A,B present the overall accurately assigned rate and ARI of the evaluated methods. All three versions of NeuCA ranked at the top, indicating high overall accuracy and high consistency with the truth. With NeuCA-med, we achieved a 10% improvement in accuracy compared with the best existing approach, scmap-cluster. Even with NeuCA-small, we achieved at least a 5% accuracy gain compared with scmap-cluster. The ARI results indicate the highest concordance between our predicted cell clustering labels and the true labels. These results show that NeuCA outperforms both the unsupervised and the supervised methods that are currently available.

We then explored in detail how NeuCA outperforms existing methods, and in which cell types the advantage is prominent. Figure 2C provides an illustration of the correlation and hierarchical structure of the cell types and Fig. 2D presents a detailed breakdown of the frequency counts of the predicted labels from all of the methods. The true label is shown on the left-most bar in each sub-panel of Fig. 2D. Combining Fig. 2C,D, we found that CD56 natural killer (NK) cells and CD14 monocytes are two distinct cell types that are less correlated with other cell types. This makes them easy to predict correctly. As expected, all of the methods work well in annotating CD56 NK cells and CD14 monocytes (the last two sub-panels in Fig. 2D). For closely correlated cell types, for example, naive cytotoxic T and memory T, existing methods annotate part of the cells as “unassigned” or assign them incorrect cell labels. NeuCA demonstrates exceptional accuracy in annotating both naive cytotoxic T and memory T with few mistakes. Lastly, naive T, CD4 T helper, and regulatory T are the three most challenging cell types to predict, because the correlation between them is higher than 0.95. For these three cell types, existing methods produced more than 50% “unassigned” labels owing to uncertainty. In contrast, NeuCA can accurately annotate the majority of cells, with only a small portion of incorrect labels for these three cell types. These findings show that adopting a hierarchical structure and step-specific feature selection improves the classification, especially in closely correlated cell types.

Figure 2

Cell classification accuracy results in the numerical study of the 10X PBMC dataset. The presented results were summarized over 160 Monte Carlo simulations. (A) and (B) show the accurately assigned rate and the adjusted Rand index (ARI) of the proposed and existing methods. (C) shows the hierarchical clustering results and correlations of the cell types. (D) is a detailed breakdown of the frequency counts of predicted labels using the proposed and existing methods, for each of the eight cell types, respectively.

Effect of training and testing sample sizes

One potential concern of adopting NeuCA is whether the model needs a large training dataset to achieve satisfactory performance. Therefore, we evaluated NeuCA and existing methods with different amounts of training data. Note that the unsupervised methods Seurat and SC3 do not use training sets at all, and thus their performance stays similar over various settings. Supplementary Fig. S1 shows the accuracy and ARI with training sizes of 10%, 20%, 50%, and 80% of all cells. A 10% training set corresponds to ~6000 cells (~750 cells per cell type), which is feasible in many real-world experiments. For experiments conducted on sequencing platforms that generate a moderate number of cells, such as non-droplet-based platforms, it is advisable to let NeuCA take training datasets from alternative sequencing platforms that have larger cell/sample numbers. NeuCA can still have robust performance, as shown later in the cross-platform experiments on human pancreas datasets. Here, the testing dataset stays at 4000 randomly selected cells. These results show the high accuracy of NeuCA with only 10% of cells used as the training data. NeuCA also consistently outperforms existing methods using 20% (and more) of cells for training. The accurately assigned rate increases from 0.86 to 0.90, which indicates the advantage of using a large training set of around 50,000 cells, although the improvement is marginal. From these experiments, we found that a training set with around 1000 cells per cell type is sufficient to provide a reasonably high annotation accuracy. In general, this is attainable in real-world single-cell RNA-seq experiments.
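The training-set sizes quoted above follow from simple arithmetic, sketched here under two assumptions: the dataset is rounded down to 60,000 cells (the text says "more than 60,000"), and cells are spread evenly over the eight cell types. The constant and function names are illustrative.

```python
TOTAL_CELLS = 60_000  # assumption: rounded from "more than 60,000 cells"
N_CELL_TYPES = 8      # the eight PBMC cell types in the benchmark


def training_size(percent):
    """Return (total training cells, cells per cell type) for a training percentage."""
    total = TOTAL_CELLS * percent // 100
    # Assumption: cells are split evenly across cell types.
    return total, total // N_CELL_TYPES


for pct in (10, 20, 50, 80):
    total, per_type = training_size(pct)
    print(f"{pct}% training: ~{total} cells, ~{per_type} per cell type")
```

Under these assumptions, the 10% setting gives about 750 cells per type and the 20% setting about 1500, which brackets the "around 1000 cells per cell type" guideline.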

We also evaluated the impact of testing sample size on the performance of all methods. Supplementary Fig. S2 reports the accurately assigned rate and ARI under different testing sample sizes (i.e., 800, 1600, and 4000 cells). As expected, changing the test sample size has little impact on the relative accuracy of supervised methods. The unsupervised methods show some performance improvements, with ARI increasing from 0.45 to 0.50 for Seurat and from 0.60 to 0.65 for SC3. This is also expected, because an increased sample size provides more information for clustering. Nevertheless, NeuCA demonstrated higher accuracy and ARI than the existing methods it was compared with.

Effect of highly and lowly correlated cell types

Last, we evaluated all of the methods under different prediction difficulty levels. As discussed in the “Introduction” section, cell annotation becomes challenging when cell types are highly correlated. Using the PBMC data, we specifically designed two settings: one with highly correlated cell types only, and another with lowly correlated cell types only. These two scenarios correspond to difficult and easy prediction tasks, respectively.

For the difficult prediction task, we retained only the five T-cell types from the PBMC data. We tested under different training and testing proportions as described in the “Effect of training and testing sample sizes” section. Supplementary Fig. S3 shows an overall summary of the performance. All methods have decreased performance under this difficult setting. With the adoption of a hierarchical structure and step-specific feature selection, NeuCA still outperforms all existing methods in both accuracy and ARI. Interestingly, we find that SC3 achieves the highest ARI among the existing methods. The high unassigned rate caused by the high correlations between cell types lowers the accuracy and ARI of existing supervised methods. Supplementary Figs. S4 and S5 demonstrate the desirable performance of NeuCA under different training and testing sizes. Similar conclusions can be drawn from the experiments in the easy prediction setting (Supplementary Figs. S6–S8). In this relatively easy cell labeling problem, all methods yield reasonable performance. Our method still leads the competing ones, with almost perfect accuracy on average even when using only 10% of the data to train the model.

Applications on cell sorted human PBMC datasets

Next, we evaluated our proposed method, NeuCA, along with other existing methods, on three additional PBMC-related real datasets: PBMC_8ct_random500 is a subset of the aforementioned FACS-sorted PBMC data with 500 cells randomly drawn for each cell type for a total of 4000 cells31; Zhengmix_8ct_PBMC is also from the 10X platform31 and has been used in several benchmarking studies33; FACmix_NK_Mono is a two-component mixture of FACS-sorted NK cells34 and FACS-sorted monocytes35, with a total of 12,700 cells. Compared with the first two datasets that include eight distinct PBMC cell types, the third one only has two cell types. However, the third dataset is obtained from two independent studies, and thus valuable in evaluating the inter-study performance of all of the methods.

To evaluate the robustness of the proposed method, we trained NeuCA using one dataset and tested on another, with various combinations of the datasets described above. The third dataset was not used as a training set because it contains fewer cell types. The same training and testing strategy was adopted for the other existing supervised methods (i.e., scmap-cluster, scmap-cell, and CHETAH) for benchmarking purposes. The results for all supervised methods are presented in Fig. 3. Consistent with our findings in the previous numerical studies, NeuCA achieves more accurate label assignment than existing methods in intra-study predictions (the first and third panels in Fig. 3A). NeuCA also has outstanding performance in correctly annotating the inter-study dataset, obtaining more than 98% accuracy (second and fourth panels in Fig. 3A). The misclassification rates for all methods on the PBMC datasets are reported in Supplementary Fig. S9. In comparison, the performance of existing supervised methods is less stable across the inter- and intra-study experiments. We found that scmap-cluster and CHETAH perform well in the intra-study experiments but fare worse in the inter-study settings, while scmap-cell shows the reverse pattern.

Figure 3

Results from applying NeuCA and existing methods on three PBMC datasets. (A) shows the accurately assigned cell number for different methods. (B) shows the t-SNE clustering plot of the cells, using PBMC_8ct_random500 as the training data and Zhengmix_8ct_PBMC as the testing data. Note that NeuCA has high concordance with the ground truth, with very few unassigned cells (grey color cells).

We presented the low-dimensional t-distributed stochastic neighbor embedding (t-SNE) visualization of the true and predicted cell types for NeuCA, scmap-cluster, and CHETAH in Fig. 3B to provide additional insights into the performance differences. The results are based on a specific scenario where PBMC_8ct_random500 was used as the training data and Zhengmix_8ct_PBMC as the testing data. Both scmap-cluster and CHETAH have a mixture of correct and incorrect predictions for the cloud of T cells, with a considerable proportion of unassigned cells. In comparison, NeuCA shows patterns more similar to the true labels in all major cell lineages. Additional visualizations for scmap-cell and the two unsupervised methods, Seurat and SC3, are presented in Supplementary Fig. S10.

Applications on human pancreas datasets

We applied NeuCA and other methods on four human pancreas datasets to specifically evaluate the inter-study classification performance. The four real datasets are: Baron26 data, Muraro36 data, Seg37 data, and Xin38 data. These four studies contained different numbers of cells, were obtained from different numbers of subjects, and were sequenced using different sequencing protocols. The number of subjects ranges from 4 to 18, and the number of cells ranges from 700 to > 8000. Baron, Muraro, Seg, and Xin used inDrop, CEL-Seq2, Smart-Seq2, and SMARTer, respectively. To comprehensively evaluate the inter-study annotation accuracy, we used each of the four datasets in turn as the training set and examined the performance on the remaining three datasets. We used the published cell labels as the gold standard annotations. We processed the four datasets by first removing rare cell types, so that the true labels are consistent across all major cell types in these four studies. This allows for inter-study comparisons among all available shared cell types. After processing, the Xin, Seg, Baron, and Muraro datasets had 1600, 2038, 8569, and 2126 cells, respectively.
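The cross-study design above yields 12 ordered train/test pairs (each of four datasets used for training, each of the other three for testing). A short sketch makes the enumeration explicit. The variable names are illustrative, and the cell counts are the processed sizes reported in the text.

```python
from itertools import permutations

# Processed dataset sizes as reported in the text.
cell_counts = {"Xin": 1600, "Seg": 2038, "Baron": 8569, "Muraro": 2126}

# Every ordered (training, testing) pair of distinct datasets.
scenarios = list(permutations(cell_counts, 2))

for train, test in scenarios:
    print(f"train on {train} ({cell_counts[train]} cells), "
          f"test on {test} ({cell_counts[test]} cells)")
```

Ordered pairs are used because training on one dataset and testing on another is a different scenario from the reverse, for example when a small dataset is used to annotate a much larger one.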

Figure 4A–D shows that NeuCA achieves the highest accuracy in 9 of the 12 scenarios, and comparable accuracy in the remainder. The corresponding misclassification rates are reported in Supplementary Fig. S11. In several settings, the advantage of NeuCA over other methods is substantial. For example, when using Baron for training and Seg for testing, NeuCA accurately assigned labels for more than 90% of the cells, while the competing methods annotated 80% or fewer of the cells correctly. The scmap-cell method was the second-best performer in general and gave stable performance across the scenarios.

Figure 4

Cell label prediction results from applying the proposed method on four human pancreas datasets. The total cell numbers for the four datasets are 1600, 2038, 8569, and 2126. (A)–(D) show the accurately assigned cell numbers from proposed and existing methods, alternating training and testing dataset selections. (E) and (F) show the comparison of true labels (left column) and the estimated labels (right column) from NeuCA and scmap-cell, using Sankey diagrams, with Seg data as the training and Baron data as the testing dataset.

Figure 4E,F provide detailed examinations of the classification results of NeuCA and scmap-cell using Sankey diagrams. In each diagram, the width of a flow reflects the frequency of cells in each cell type. For both diagrams, we trained the model using the Seg data and tested the model using the Baron data. The box column on the left shows the true frequencies of cells, while the box column on the right shows the predicted frequencies of cells. The flows in between show the mapping from true to predicted labels. NeuCA has higher overall accuracy than scmap-cell. Although the relative proportions from NeuCA and scmap-cell are similar, scmap-cell suffers from a high proportion of unassigned cells. scmap-cluster and CHETAH also suffer from high proportions of unassigned cells, leading to lower numbers of accurately assigned cells (Supplementary Fig. S12). Similar conclusions can be drawn from Sankey diagrams of the unsupervised learning methods Seurat and SC3 (Supplementary Fig. S13).
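The quantities drawn in such a Sankey diagram can be derived directly from paired true and predicted labels. The sketch below is illustrative only: the function names and the "unassigned" label string are assumptions, and no plotting library is involved.

```python
from collections import Counter


def sankey_flows(true_labels, predicted_labels):
    """Count cells for every (true label, predicted label) pair; one flow each."""
    return Counter(zip(true_labels, predicted_labels))


def box_heights(flows):
    """Left boxes are true-label totals, right boxes are predicted-label totals."""
    left, right = Counter(), Counter()
    for (true_label, predicted_label), n in flows.items():
        left[true_label] += n
        right[predicted_label] += n
    return left, right


def unassigned_fraction(flows, unassigned="unassigned"):
    """Share of all cells that flow into the 'unassigned' box."""
    total = sum(flows.values())
    lost = sum(n for (_, pred), n in flows.items() if pred == unassigned)
    return lost / total
```

Flows where the true and predicted labels match correspond to correctly assigned cells, so the accurately assigned rate is the summed width of those flows divided by the total.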

Applications on ASD data

Lastly, we benchmarked all methods on a cross-condition annotation problem. We obtained a set of single nucleus RNA-seq (snRNA-seq) data from a study of autism spectrum disorder (ASD)39, a group of cognitive developmental disabilities that cause significant social, communication, and behavioral challenges40 for patients. This study contains snRNA-seq data from 15 ASD patients and 16 controls. Compared with the previous datasets, this is a much larger dataset, with 52,003 cells in the ASD group and 52,556 cells in the control group. We were interested in evaluating the methods across the disease conditions, i.e., accurately predicting cell types in the ASD group using the control data as the training set, or vice versa. This evaluation is motivated by the pragmatic consideration that existing single cell data often contain only normal samples, while researchers are interested in using that information to annotate diseased subjects.

Figure 5 shows the accuracy and ARI for two scenarios: ASD samples as the training set and control samples as the testing set (Fig. 5A); control samples as the training set and ASD samples as the testing set (Fig. 5B). NeuCA achieves the highest accuracy and ARI among all methods. The corresponding misclassification rates, for both scenarios, are reported in Supplementary Fig. S14. Interestingly, NeuCA has better performance in predicting control samples than predicting ASD samples. We suspect that ASD-related molecular changes might be associated with such performance change, e.g., the differentially expressed genes make it harder to accurately annotate cells in ASD samples. scmap-cell is the second-best method in the supervised category with an accurately assigned rate around 0.7. Unsupervised methods Seurat and SC3 also have good performance, due to the large number of cells. This experiment shows that NeuCA stably outperforms existing methods in cross-condition annotations.

Figure 5

Accurately assigned rate and ARI of the proposed and existing methods on the ASD data, containing both ASD disease samples and control samples. (A) shows the accuracies using ASD disease samples as the training set and control samples as the testing set. (B) shows the results using control samples as the training set and ASD disease samples as the testing set. NeuCA stably outperforms existing methods in cross-condition annotation tasks.
