We define novel features and a method to estimate parameters and build a classifier using pan-cancer data to predict TSGs and OGs. The classifier is further used to predict labels for unlabelled genes at pan-cancer and tissue-specific levels, which are analysed for functional enrichment.
Novel features used for classification of TSGs and OGs
We trained multiple random forest models using a subset (80%) of 136 TSGs and 76 OGs for each fold of the cross-validation. We performed fivefold cross-validation while estimating hyper-parameters for the model, followed by multiple random iterations to estimate stable hyper-parameters and avoid overfitting (as defined in “Methods”). The final model was built using the hyper-parameters so identified (Supplementary Table S1). Overfitting must be considered carefully because the initial training set is not very large. Accuracy on the test set is lower than on the training set, but the difference is not substantial. We note that TSGs can be predicted with higher accuracy than OGs; it is probable that the features are biased towards capturing information about TSGs better than about OGs. Across the multiple models, an average accuracy of 0.76 ± 0.03 was achieved (Table 3). These models were further used for the identification of new genes as well as tissue-specific analyses. Our model (cTaG) presents a significant improvement in recall for TSGs. For OGs, the recall is similar to that observed for other tools. Nevertheless, the average recall of driver genes (comprising both classes) shows an improvement over the tools reported earlier12.
To identify features important for the classification of TSGs and OGs, we calculated the average rank of each feature across all models. We observe that the top-ranking features contain LOF and missense mutations (Supplementary Table S2). The new features that replace old features in the top 18 ranks are Nonsense entropy, High missense frequency, Compound/benign, High Frameshift Frequency, Damaging/kb, Compound/kb, Damaging/LoFI and HiFI/benign. Further, we used the training set genes to compare the distribution of feature values in TSGs and OGs, and observed that our top-ranking features show the largest differences between the two distributions (Fig. 2). The model built on only the old features performs marginally worse (accuracy 0.75) than the model using all features, but the difference is not statistically significant. While it is common knowledge that LOF mutations accumulate in TSGs and recurrent missense mutations in OGs, we formally show that the feature distributions differ between these two functional classes.
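The average-rank calculation described above can be sketched as follows. This is a minimal illustration, assuming each trained model exposes a mapping from feature name to an importance score; the helper name `average_feature_rank` and the importance values are hypothetical and are not taken from the cTaG code or from Supplementary Table S2.

```python
from statistics import mean


def average_feature_rank(importances_per_model):
    """Return (feature, mean rank) pairs sorted from best to worst.

    importances_per_model: list of dicts {feature_name: importance}.
    Rank 1 is the most important feature within a model.
    Hypothetical helper; ties are broken alphabetically by feature name.
    """
    ranks = {}
    for importances in importances_per_model:
        # Rank features within one model by descending importance.
        ordered = sorted(importances, key=lambda f: (-importances[f], f))
        for rank, feature in enumerate(ordered, start=1):
            ranks.setdefault(feature, []).append(rank)
    averaged = {feature: mean(r) for feature, r in ranks.items()}
    return sorted(averaged.items(), key=lambda item: (item[1], item[0]))


# Toy importances for three models (illustrative values only).
models = [
    {"Nonsense entropy": 0.30, "High missense frequency": 0.25, "Damaging/kb": 0.10},
    {"Nonsense entropy": 0.20, "High missense frequency": 0.28, "Damaging/kb": 0.12},
    {"Nonsense entropy": 0.35, "High missense frequency": 0.22, "Damaging/kb": 0.05},
]
ranking = average_feature_rank(models)
for feature, avg_rank in ranking:
    print(f"{feature}: {avg_rank:.2f}")
```

Averaging ranks rather than raw importances keeps a single model with unusually large importance values from dominating the ordering.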


Distribution of top features identified by the classifier for TSGs and OGs. Training genes were used to study the differences between the distributions of features (kernel density) in TSGs and OGs. The Kolmogorov–Smirnov (KS) statistic and the p-value are given for each feature. A higher KS statistic indicates a larger difference between the two distributions.
Iterative hyper-parameter estimation avoids overfitting
Initial analysis using support vector machines (SVM), logistic regression, and random forest showed high accuracy for random forests (Supplementary Table S3). With many trees, random forest gave a high accuracy on the training sets (95.3%), comparable to the 91.9% achieved by Davoli et al.17. However, these models showed very low accuracy on the test set (Supplementary Table S3), indicating overfitting. Additionally, we observed that changing the random seed caused substantial variation in results. This variation is unexpected and could perhaps stem from non-optimal parameters used for classification or from the small size of the data. To avoid this variation, we selected random forest for its best performance and re-estimated the parameters n_estimator, max_features, max_depth and criterion. Changing n_estimator had a major effect on classification, and a simple grid search with cross-validation did not help in removing overfitting, as seen in our results for balanced bagging (Table 3). Comparison of the metrics of our final model with those of balanced bagging, a similar algorithm that uses decision trees and handles unbalanced data, showed that our procedure helps avoid overfitting.
We overcame this by running multiple iterations of hyper-parameter estimation with different random seeds, which helps us identify more stable hyper-parameters. This gave lower accuracy on the training sets but improved accuracy on the test set considerably. When sets of random seeds of varying size (10, 20, 40, 80, 160, 320) were used, the results were consistent across all cross-validation folds (test set accuracy 0.76, standard deviation 0.03), implying that increasing the number of random seed iterations neither decreases nor improves accuracy (Supplementary Table S4). We observe that for a given data fold, the hyper-parameters selected are more stable across varying sets of random seeds. While different parameter sets dominate as the data is changed, the overall results on the test set do not vary.
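The idea of repeating hyper-parameter estimation over several random seeds can be sketched as below. This is only an illustration under stated assumptions: "stable" is taken here to mean the setting selected most often across seeds (the exact rule is defined in “Methods” and may differ), the grid covers just two of the four parameters, and `toy_cv_score` is a hypothetical stand-in for a real cross-validated accuracy.

```python
import random
from collections import Counter
from itertools import product

# Hypothetical grid of (n_estimator, max_depth) pairs.
GRID = list(product([10, 20, 40], [2, 3]))


def toy_cv_score(params, seed):
    """Stand-in for cross-validated accuracy (simulated, not real data).

    A fixed base score per setting plus seed-dependent noise mimics the
    seed-to-seed variation described in the text.
    """
    n_estimator, max_depth = params
    base = 0.70
    base += 0.04 if n_estimator == 20 else 0.0
    base += 0.02 if max_depth == 3 else 0.0
    rng = random.Random(seed * 100 + GRID.index(params))
    return base + rng.uniform(-0.03, 0.03)


def select_for_seed(seed, grid=GRID, score_fn=toy_cv_score):
    """Plain grid search for a single seed: return the best-scoring setting."""
    return max(grid, key=lambda params: score_fn(params, seed))


def stable_hyperparameters(search_fn, seeds):
    """Run the search once per seed; return the most frequent choice and all counts."""
    counts = Counter(search_fn(seed) for seed in seeds)
    best, _ = counts.most_common(1)[0]
    return best, counts


best, counts = stable_hyperparameters(select_for_seed, range(40))
print("most frequently selected:", best)
print("selection counts:", dict(counts))
```

A single-seed grid search can pick a setting that merely benefited from favourable noise; tallying selections over many seeds makes that less likely, which is the motivation given in the text.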
cTaG identified new TSGs and OGs along with known driver genes
All genes that were not used for training the models were classified into TSGs and OGs. This list also contained known driver genes that are present in CGC but were not used for training. The labels were predicted for the unlabelled genes, of which 126 genes or transcripts showed consensus across all models (Supplementary Table S5). Known CGC driver genes contributed 40.5% of these predictions, which included genes such as ARID1A, ATRX, NF1, TP53, RB1, and STAG1 and their transcripts. Some new genes consistently predicted as TSGs are SIN3A, ZNF750, IWS1, CD36, ARHGAP35, MGA, and RASA1. The model tends to be biased towards TSGs; out of the 699 genes with consistent predictions across three or more models, only nine are predicted as OGs. The top OGs predicted are U2AF1, BCL2L10, KRAS, MAP1LC3B, C11orf68, TAB3, MED12, MAX, and BRAF. Further, we show that not all transcripts of a gene behave like a driver gene; e.g. ATRX transcript ENST00000373344 is labelled as TSG, whereas ENST00000400866 and ENST00000373341 are not. The presence of known driver genes among the top TSGs and OGs supports the validity of cTaG and suggests that the other genes in the list are potential driver genes.
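The consensus step can be illustrated with a short sketch. It assumes each model returns a dictionary of genes it calls as drivers with a TSG or OG label, and that a gene receiving conflicting labels from different models is excluded; that tie-handling rule, the helper name `consensus_labels`, and the placeholder `GENE_X` are assumptions made for the example.

```python
from collections import Counter


def consensus_labels(predictions_per_model, min_models):
    """Keep genes given one consistent label by at least min_models models.

    predictions_per_model: list of dicts {gene: "TSG" or "OG"}; genes a model
    does not call as drivers are absent from its dict.
    """
    votes = {}
    for predictions in predictions_per_model:
        for gene, label in predictions.items():
            votes.setdefault(gene, Counter())[label] += 1

    consensus = {}
    for gene, counter in votes.items():
        label, count = counter.most_common(1)[0]
        # Require enough votes and no conflicting label (assumed rule).
        if count >= min_models and len(counter) == 1:
            consensus[gene] = label
    return consensus


# Toy predictions from three models; GENE_X is a made-up placeholder.
toy_models = [
    {"TP53": "TSG", "KRAS": "OG", "GENE_X": "TSG"},
    {"TP53": "TSG", "KRAS": "OG", "GENE_X": "OG"},
    {"TP53": "TSG"},
]
print(consensus_labels(toy_models, min_models=3))
print(consensus_labels(toy_models, min_models=2))
```

Lowering `min_models` trades stringency for coverage, which mirrors the difference between the all-model consensus list and the three-or-more-model list reported above.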
Enrichment analysis of genes for various KEGG and BIOCARTA pathways revealed genes involved in different cancer pathways such as myeloid leukaemia and pancreatic cancer. Genes are also enriched for various signalling pathways associated with cell growth, such as the EGF and PDGF signalling pathways. To validate these results, a similar analysis was conducted on the genes used for training the model. We find GO terms related to cell cycle, regulation of transcription, signalling and cell cycle arrest to be common to both results. These keywords were further clustered, with the top clusters associated with genes involved in zinc-finger proteins, helicases, ATP-binding, ARID binding and cancer pathways. The analysis shows that known driver genes and predicted driver genes are enriched for similar pathways.
Our approach identifies genes with low mutation frequency
We analysed the mutation frequencies of the predicted genes. Mutation rates were calculated using MutSigCV, a well-known driver gene predictor that uses mutation rates to identify driver genes. MutSigCV ranks all genes, of which 602 were identified as driver genes at the chosen threshold (p ≤ 0.005, q ≤ 0.01). Training data labels were used to compare the two methods. MutSigCV identified 40% of our training gene set (85 genes predicted as drivers), while cTaG did better, predicting 85% of these genes. The mutation rates of the genes predicted by the two models were then compared. Since MutSigCV ranks all genes, we picked a set of its top genes equal in size to the cTaG predictions (≥ 5 model consensus), calculated the KS statistic against the training set, and plotted the fraction of genes below the mutation rate of each gene. We observe that for our predicted genes the distribution of mutation rates is similar to that of the genes used for model building, while MutSigCV tends to be biased towards genes with higher mutation rates (Fig. 3). The minimum mutation rate among predicted genes was 0.35 for cTaG, while for MutSigCV it was 0.90. The Kolmogorov–Smirnov (KS) statistic for both models, when compared to the training set, shows that the difference is far smaller for cTaG (KS statistic = 0.193, p = 0.054) than for MutSigCV (KS statistic = 0.774, p = 0.0), which shows that the distribution of mutation rates for cTaG predictions is similar to what is expected.
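For readers unfamiliar with the comparison, the two-sample KS statistic is the largest vertical gap between two empirical cumulative distributions. The sketch below computes only the statistic, not the p-value, in plain Python; the mutation-rate values are invented for illustration and are not the study's data. In practice a library routine such as SciPy's `ks_2samp` would typically be used, since it also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Invented mutation rates (illustrative only).
training = [0.4, 0.8, 1.2, 2.0]
similar_method = [0.5, 0.9, 1.1, 1.9]   # overlaps the training range
shifted_method = [2.5, 3.0, 3.5, 4.0]   # biased towards high rates

print(ks_statistic(training, similar_method))   # 0.25
print(ks_statistic(training, shifted_method))   # 1.0
```

A statistic near 0 means the predicted genes span the same range of mutation rates as the training genes; a value near 1 means the two sets barely overlap.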


Fraction of genes predicted plotted against log transformed mutation rates. Genes predicted by a given method were sorted based on their mutation rate and plotted against the fraction of genes predicted below the given mutation rate.
Further, we compared the precision of predicted driver genes from cTaG, TUSON, 20/20+ and DriverNet against the pan-cancer genes listed by Bailey et al. and the CGC driver gene list (Supplementary Table S6). We compared with the feature-based methods TUSON and 20/20+, as well as with the network-based method DriverNet. For each method, we considered the top-ranking genes and compared the overlap with the pan-cancer gene list. Based on the driver genes listed by Bailey et al., cTaG performs best, followed by 20/20+, TUSON and DriverNet. For the driver genes listed in CGC, 20/20+ performs best, followed by TUSON, cTaG and DriverNet (Supplementary Fig. 2). Some pan-cancer genes are identified as “rescued” as they were excluded as outliers from the initial list before being included in the final list. None of the rescued genes were identified by cTaG, while the other three methods identified 4 (TUSON), 3 (20/20+), and 5 (DriverNet) genes. We do not expect a large overlap with rescued genes as they are manually curated and included by experts. We also observe overlap between cTaG and the other methods, the largest being with TUSON (43 genes), followed by 20/20+ (31 genes) and DriverNet (9 genes). Since the number of genes predicted varies between methods (DriverNet, 473; TUSON, 269; 20/20+, 137; cTaG, 94), precision was used to normalize for the number of predicted genes.
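Precision here is the fraction of a method's predicted genes that appear in a reference list, which makes methods with very different output sizes comparable. A minimal sketch follows; the gene sets are toy examples with made-up placeholders (`GENE_A` to `GENE_D`), not the Bailey et al. or CGC lists.

```python
def precision(predicted, reference):
    """Fraction of predicted genes that are present in the reference list."""
    predicted, reference = set(predicted), set(reference)
    if not predicted:
        return 0.0  # convention chosen here for an empty prediction list
    return len(predicted & reference) / len(predicted)


# Toy reference list and two hypothetical methods of different size.
reference = {"TP53", "KRAS", "PTEN", "BRAF"}
large_method = ["TP53", "KRAS", "PTEN", "GENE_A", "GENE_B", "GENE_C"]
small_method = ["TP53", "KRAS", "GENE_D"]

print(precision(large_method, reference))  # 3 of 6 predictions are in the reference
print(precision(small_method, reference))  # 2 of 3 predictions are in the reference
```

In this toy case the larger method recovers more reference genes in absolute terms (3 versus 2) but has lower precision, which is why raw overlap counts alone would favour methods that simply predict more genes.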
Driver genes are tissue-specific
Cohort studies tend to be specific to a cancer type. The usefulness of a pan-cancer model is further demonstrated when it can be used to identify tissue-specific driver genes (Supplementary Table S5). The objective of predicting genes using a subset of data specific to the primary tissue source of the tumour was to identify genes specific to a cancer type. This helped in identifying genes that might otherwise be lost in biological noise (Table 4). We observe TP53 predicted as a TSG across the different tissues. Other known driver genes that were not identified by the pan-cancer analysis were identified, such as CBFB, CDH1 and PTEN in breast cancer and APOB in the liver. Genes such as FAM182A, SOX9, AHNAK2, ENSG00000121031, FLT3LG, PMEPA1 and ZFP36L2 in the large intestine, and ALB, KRTAP19-1, APOB, CD200, CRYGD, KRTAP24-1 and OR6N2 in the liver, are novel predictions, and their functions in these cancers can be studied further. We used the pan-cancer model (cTaG) to predict tissue-specific driver genes and identified new genes not reported by the pan-cancer analysis.
Genes identified for breast cancer were validated by supporting literature. CBFB30 and PTEN31,32 are known TSGs in breast cancer. PTEN is found to be under-expressed in breast cancer33,34. While CDH1 mutations are found mostly in stomach cancer, they have also been shown to occur frequently in lobular breast cancer35,36. Pathway analysis of breast cancer genes shows enrichment of pathways involved in gene expression regulation governed by TP53, RUNX1 and PTEN, which includes pathways that regulate estrogen-mediated transcription. CBFB deletion leads to loss of RUNX1 expression30, so that RUNX1 can no longer regulate NOTCH signalling by repression, which is confirmed by pathway analysis. Some enriched apoptosis pathways include the CDH1 and TP53 genes. Thus cTaG (the pan-cancer model), applied to breast cancer samples, identifies genes that are functionally important in breast tumour cells.
Predictions made for liver cancer were mainly novel, which made literature validation difficult. RNA expression levels of the genes APOB, ALB and CD200 were higher in liver than in all other tissues (as reported by The Human Protein Atlas). Higher albumin levels are known to decrease the risk of hepatocellular carcinoma (HCC)37. APOB mutational signatures have been shown computationally to be significant for predicting prognosis through loss of regulation of genes such as TP53, PTEN and HGF38. While the role of the other genes is difficult to elucidate, our method helps identify research gaps that can be filled by studying these potential driver genes.

