Simulations
We present an overview of CpelNano in the “Methods” section and an illustration in Fig. 1a, while providing a more detailed description in the Supplementary Methods. Unlike existing methods for DNA methylation analysis of bisulfite sequencing data, which only address the inverse problem of inferring statistical properties of DNA methylation from available data, CpelNano also considers the forward problem of predicting the probability distribution of nanopore current signals from a given methylation state. This additional step allows CpelNano to account for nanopore noise and is carried out via a data-generative model expressed in terms of an Ising model for the methylation landscape and emission probabilities computed by Nanopolish4.
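To make the forward and inverse problems concrete, the following minimal sketch enumerates a tiny Ising-type distribution over methylation states and combines it with an emission model to form a joint model of the kind r(y, x) = q(y | x) p(x). It is an illustration only: the function names, the ±1 state encoding, and the independent Gaussian emission are assumptions made here, standing in for the emission probabilities that CpelNano obtains from Nanopolish, and the parameters `a` and `b` are generic site and coupling terms rather than the CPEL parameterization.

```python
import itertools
import math


def ising_distribution(a, b):
    """Return {state: probability} for a chain of N sites with states in {-1, +1}.

    a: N site parameters; b: N-1 coupling parameters between consecutive sites.
    Exhaustive enumeration, so only suitable for very small N.
    """
    n = len(a)
    weights = {}
    for x in itertools.product((-1, 1), repeat=n):
        energy = sum(a[i] * x[i] for i in range(n))
        energy += sum(b[i] * x[i] * x[i + 1] for i in range(n - 1))
        weights[x] = math.exp(energy)
    z = sum(weights.values())  # partition function
    return {x: w / z for x, w in weights.items()}


def gaussian_emission(y, x, mu_unmeth=-1.0, mu_meth=1.0, sd=3.0):
    """Hypothetical stand-in for q(y | x): one independent Gaussian per site."""
    q = 1.0
    for yi, xi in zip(y, x):
        mu = mu_meth if xi == 1 else mu_unmeth
        q *= math.exp(-0.5 * ((yi - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return q


def posterior(y, p_x, **emission_kwargs):
    """Posterior over hidden states from the joint r(y, x) = q(y | x) p(x)."""
    joint = {x: gaussian_emission(y, x, **emission_kwargs) * p for x, p in p_x.items()}
    evidence = sum(joint.values())  # marginal likelihood of the observation y
    return {x: r / evidence for x, r in joint.items()}
```

For example, `posterior((5.0, 5.0, 5.0), ising_distribution([0.0] * 3, [0.0] * 2), sd=1.0)` concentrates its mass on the fully methylated state, whereas a larger `sd` spreads the posterior over several states, which is the effect of noise that the data-generative model is meant to capture.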


The CpelNano method and simulated performance evaluation results. (a) To consider nanopore noise, CpelNano employs a hidden Markov model (HMM) approach, which treats the true methylation state \(\pmb{x}\) over an estimation region of the genome as a hidden state that is observed indirectly through a state \(\pmb{y}\) of nanopore current signals. It then models the hidden state using a parametric correlated potential landscape model (CPEL) \(p(\pmb{x};\alpha,\beta,\gamma)\) and addresses the forward problem of modeling the relationship between the observable and hidden methylation states using a data-generative model \(r(\pmb{y},\pmb{x};\alpha,\beta,\gamma) = q(\pmb{y}\mid\pmb{x})\,p(\pmb{x};\alpha,\beta,\gamma)\), which is expressed in terms of the CPEL model \(p(\pmb{x};\alpha,\beta,\gamma)\) and emission probabilities \(q(\pmb{y}\mid\pmb{x})\) computed using Nanopolish4. Finally, it solves the inverse problem of estimating values \(\widehat{\alpha}\), \(\widehat{\beta}\), and \(\widehat{\gamma}\) for the unknown parameters of the CPEL model of the hidden methylation state from available nanopore data using an expectation-maximization based maximum-likelihood (EM-ML) approach. (b) Binned joint probability distributions and associated Pearson correlation coefficient (PCC) values between estimated and true means and pairwise correlations at individual CpG sites, obtained by using a simulation-based approach (Fig. S4). Results are shown for nanopore noise with standard deviation \(\text{sd}=3\) and data coverages of 10× and 20×. A lighter region indicates a higher probability of association between estimated and true values. (c) Boxplots depicting distributions of absolute errors over analysis regions between estimated and true mean methylation level (MML) and normalized methylation entropy (NME) values, as well as distributions of coefficient of methylation divergence (CMD) values between the estimated and the true probability distributions of methylation. These quantities were computed by the EM-based maximum-likelihood (EM-ML) approach of CpelNano (green), as well as by fitting the CPEL model directly to the methylation calls made by Nanopolish4 using maximum-likelihood (ML; blue). Results are shown for nanopore noise with standard deviation \(\text{sd}=3\) and data coverages of 5×, 10×, 15×, 20×, and 25×. Center line of box: median value; box bounds: 25th and 75th percentiles; lower whisker: larger of minimum value and 25th percentile minus 1.5× interquartile range; upper whisker: smaller of maximum value and 75th percentile plus 1.5× interquartile range.
Since CpelNano relies on Nanopolish4, we first evaluated its detection performance by employing a simulation-based benchmarking procedure which we designed using human WGBS and nanopore sequencing data (Supplementary Methods). Notably, the performance of Nanopolish4 was previously investigated by using a small number of CpG sites in the Escherichia coli reference genome and datasets comprising fully unmethylated or fully methylated CpG sites4,21. However, our benchmarking procedure allowed us to provide a comprehensive evaluation of Nanopolish4 with more realistic input, including simulated DNA fragments that were hemi-methylated, and assess Nanopolish4 over an entire human chromosome (Chr. 22) using four nanopore noise levels. We used different noise levels for two main reasons: first, to demonstrate how methylation calling performance depends on noise level and, second, to identify the actual level of nanopore noise in the data, which is not known.
Our results were similar to those previously achieved when using real data (Figs. S1 and S2), providing additional evidence of deficient detection performance at higher levels of nanopore noise and further showing a trade-off between true positive and false positive rates as well as between precision (probability that a CpG site predicted to be methylated is truly methylated) and true positive rate (also known as recall). This demonstrates the legitimacy of our benchmarking approach as a convenient and inexpensive computational tool for evaluating the performance of Nanopolish4, which can be easily adapted to other nanopore methylation callers if desired. Notably, the receiver operating characteristic (ROC) and precision-recall (PR) curves we obtained for nanopore noise with standard deviation \(\text{sd}=3\) (Fig. S2) were similar to those reported by Simpson et al.4 (Fig. 2 corresponding to nanopore chemistry R9 in that paper) and Yuen et al.21 (Fig. 3a,b in that paper), suggesting that this level of nanopore noise is close to reality. Importantly, however, our benchmarking results presented evidence (see below) that the statistical properties of DNA methylation cannot be reliably inferred directly from the methylation calls produced by Nanopolish4 and clearly demonstrated the effectiveness of CpelNano in dealing with this problem.
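The trade-offs mentioned above can be traced by sweeping a detection threshold over per-site scores and recording one ROC point and one PR point per threshold. The sketch below shows how a single point is computed; the function name, the use of a log-likelihood ratio as the score (positive values favoring methylation), and the convention of returning a precision of 1 when no site is called methylated are assumptions made for illustration.

```python
def confusion_rates(scores, truth, threshold):
    """Return (true positive rate, false positive rate, precision).

    scores: per-site scores, assumed here to be log-likelihood ratios in which
            larger values favor the methylated state.
    truth:  per-site booleans, True when the site is truly methylated.
    A site is predicted methylated when its score is >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, is_methylated in zip(scores, truth):
        predicted = score >= threshold
        if predicted and is_methylated:
            tp += 1
        elif predicted and not is_methylated:
            fp += 1
        elif not predicted and is_methylated:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Convention assumed here: precision is 1 when nothing is predicted methylated.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return tpr, fpr, precision
```

Evaluating `confusion_rates` over a grid of thresholds yields the (FPR, TPR) pairs of a ROC curve and the (recall, precision) pairs of a PR curve; raising the threshold lowers the false positive rate at the cost of recall.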


Distributions of methylation levels and entropies in the Utah/Ceph lymphoblastoid cell line. (a) Boxplots depicting distributions of mean methylation level (MML) and normalized methylation entropy (NME) values over selected genomic features of the human genome (Chr. 22), estimated from nanopore (brown) and WGBS (blue) data associated with the human Utah/Ceph lymphoblastoid cell line. Center line of box: median value; box bounds: 25th and 75th percentiles; lower whisker: larger of minimum value and 25th percentile minus 1.5× interquartile range; upper whisker: smaller of maximum value and 75th percentile plus 1.5× interquartile range. (b) Densities of MML values; (c) Densities of NME values. (d) Aggregate (average) MML and NME values as a function of distance from the transcription start sites (TSSs) of genes.


Modeling the DNA methylation landscape over repetitive elements. (a) DNA methylation over the L1PA1 and L1PA5 subfamilies of the LINE-1 family of TEs is only partially modeled using WGBS data (GSM2308632) associated with the human Utah/Ceph lymphoblastoid cell line. (b) Methylation over the L1PA2 and L1PA3 subfamilies is not modeled using the WGBS data. However, DNA methylation is successfully modeled by CpelNano using the corresponding nanopore data (NA12878).
We first investigated whether we could directly use the methylation calls produced by Nanopolish4 to perform downstream statistical analysis that takes into account methylation means at individual CpG sites, as well as pairwise correlations at consecutive CpG sites. As previously argued for the case of WGBS data, this necessitates the use of a stochastic model for the methylation state, such as the CPEL model employed by CpelNano, whose parameters must be estimated from nanopore data with acceptable accuracy. However, accurate parameter estimation requires reliable computation of the sufficient statistics associated with the parameters of the CPEL model (Supplementary Methods) from the methylation calls made by Nanopolish4. This depends on faithfully identifying the true methylation state at each CpG site, as well as the true methylation co-occurrence, which identifies pairs of consecutive CpG sites that are both methylated or unmethylated. When the detection threshold used by Nanopolish4 was set to zero, our simulations showed an error rate (probability that a CpG site is not correctly predicted to be methylated or unmethylated) in calling the true methylation state at individual CpG sites ranging between 11% and 16% when \(3 \le \text{sd} \le 3.5\) (Fig. S3a). Notably, this rate monotonically decreased to zero with increasing threshold values, but this was achieved by substantially reducing the number of methylation calls made by Nanopolish4. For example, to obtain an error rate of 5% (typical of WGBS) for \(\text{sd}=3\), our simulations indicated that Nanopolish4 must produce methylation calls at only 73% of the CpG sites considered, which is in agreement with Simpson et al.4 who reported a 6% error rate using a log-likelihood ratio detection threshold of 2.5 that produced calls at 77% of the targeted CpG sites. Importantly, however, our results (Fig. S3b) showed that, with a zero detection threshold, the error rate in calling the true methylation co-occurrence at pairs of consecutive CpG sites was between 19% and 27% when \(3 \le \text{sd} \le 3.5\) and that this rate remained significant even at high threshold values. This provided evidence that accurate downstream analysis of methylation calls made by Nanopolish4 comparable to that of WGBS will require the use of a high detection threshold, which will result in a substantial loss of methylation calls (more than 27% must be discarded) and have significant implications for the quality of downstream methylation analysis, an issue we expect to occur when using other existing nanopore callers, since they have been shown to perform similarly to Nanopolish21.
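The interplay between detection threshold, fraction of sites called, and error rate can be sketched as follows. The calling convention (no call when the absolute log-likelihood ratio is below the threshold, otherwise the sign decides), the ±1 encoding, and the definition of a co-occurrence error as a mismatch in the product of consecutive states are assumptions made for this illustration, not a description of Nanopolish internals.

```python
def call_sites(llrs, threshold):
    """Turn log-likelihood ratios into calls: +1 methylated, -1 unmethylated, None no call."""
    calls = []
    for value in llrs:
        if abs(value) < threshold:
            calls.append(None)  # ambiguous site, discarded at this threshold
        else:
            calls.append(1 if value >= 0 else -1)
    return calls


def site_error_rate(calls, truth):
    """Return (error rate among called sites, fraction of sites called)."""
    made = [(c, t) for c, t in zip(calls, truth) if c is not None]
    if not made:
        return 0.0, 0.0
    errors = sum(1 for c, t in made if c != t)
    return errors / len(made), len(made) / len(calls)


def cooccurrence_error_rate(calls, truth):
    """Error rate of the product x_n * x_{n+1} over consecutive pairs with both sites called."""
    errors = pairs = 0
    for n in range(len(calls) - 1):
        if calls[n] is None or calls[n + 1] is None:
            continue
        pairs += 1
        if calls[n] * calls[n + 1] != truth[n] * truth[n + 1]:
            errors += 1
    return errors / pairs if pairs else 0.0
```

Under the additional assumption that calling errors at neighboring sites are independent with per-site rate e, the product is wrong with probability 2e(1 − e), which gives roughly 0.20 for e = 0.11 and 0.27 for e = 0.16. This simple calculation shows why pairwise statistics degrade faster than single-site statistics.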
We subsequently carried out simulations to evaluate the performance of the EM-based maximum-likelihood module of CpelNano for estimating the parameters of the CPEL model from nanopore data by modifying the previous benchmarking scheme (“Methods” and Fig. S4). By using cosine similarity distributions, we appraised the closeness of estimated model parameter values to their true values and demonstrated the reliability of this module, even at low coverage (Fig. S5). Remarkably, the median cosine similarity values were close to 1 in all cases considered, implying that parameter estimation performed exceptionally well at least 50% of the time. Moreover, the estimated CPEL models predicted methylation means and pairwise correlations that were mostly associated with small absolute errors (median < 5% at all noise levels and coverages considered; Figs. S6 and S7, green boxes), considering also the fact that these errors cannot be larger than 1 (“Methods”). On the other hand, estimation of methylation means and pairwise correlations by fitting the CPEL model directly to the methylation calls made by Nanopolish4 consistently produced higher errors regardless of the underlying coverage, due to the effect of nanopore noise (Figs. S6 and S7, blue boxes). Notably, and in agreement with previous observations13, empirical estimation of methylation means and correlations using the methylation calls made by Nanopolish4 led to substantial errors at low coverage (Figs. S6 and S7, red boxes). This was expected since, in addition to not taking into account nanopore noise, empirical methods require substantial amounts of methylation data for reliable estimation, which are not available at low coverage.
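Cosine similarity, used above to compare estimated and true parameter vectors, can be computed as in the following minimal sketch. The function name and the example vectors are hypothetical; the measure equals 1 when the two vectors point in the same direction and is insensitive to their overall scale.

```python
import math
import statistics


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical (estimated, true) parameter triplets for three regions.
pairs = [
    ([0.9, 0.1, 2.1], [1.0, 0.0, 2.0]),
    ([-1.2, 0.4, 1.0], [-1.0, 0.5, 1.0]),
    ([0.2, -0.3, 0.5], [0.2, -0.3, 0.5]),
]
median_similarity = statistics.median(cosine_similarity(e, t) for e, t in pairs)
```

A median close to 1 over many regions means that at least half of the estimated parameter vectors are nearly parallel to the true ones.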
Although our results demonstrated diminished estimation performance of the EM-based maximum-likelihood module of CpelNano at increasing levels of nanopore noise, the estimated CPEL models produced reliable estimates for methylation means at individual CpG sites and pairwise correlations, especially at higher coverages (Figs. S6 and S7). These results were also corroborated by plots of binned joint probability distributions between estimated and true values for nanopore noise with standard deviation \(\text{sd}=3\) and coverages 10× and 20× (Fig. S8), which showed high probabilities for most pairs of estimated vs. true parameter values to be clustered around each plot’s diagonal. However, estimation of the interaction parameter of the CPEL model exhibited a skew towards higher values. We attributed this behavior to a needed assumption that the probability of finding a CG-group (a well-defined genomic region containing a cluster of CpG sites; see Supplementary Methods) with variable methylation in an estimation region is negligible. This is required in order to accommodate the fact that the current version of Nanopolish4 assigns the same methylation state at all CpG sites in a CG-group, thus introducing artificially higher pairwise correlation. As a consequence, estimation regions with a high proportion of CpG sites in a few CG-groups would be problematic. Nevertheless, given that almost 85% of the CG-groups in the human genome contain only one CpG site and that more than 95% of CG-groups contain at most 2 CpG sites (Fig. S9), very few estimation regions fall into this category. Consequently, our estimation method introduces only a slight bias in the values of the estimated pairwise correlations (Fig. 1b), which can be reduced or even eliminated by better training Nanopolish4 to accommodate heterogeneous methylation over estimation regions.
CpelNano partitions each estimation region into the minimum number of equally-sized non-overlapping analysis regions, whose size is set by default to be no more than 350 bp (“Methods”), and performs methylation analysis at a resolution of one analysis region. It does so by quantifying the average amount of DNA methylation in each analysis region using the mean methylation level (MML), the amount of methylation stochasticity (variability) using the normalized methylation entropy (NME), and discordance in methylation stochasticity between two methylation landscapes by computing the coefficient of methylation divergence (CMD), an information-theoretic measure of dissimilarity between probability distributions of methylation (“Methods”). By using our simulated nanopore data with the standard deviation of the nanopore noise set to \(\text{sd}=3\) and coverages 5×, 10×, 15×, 20×, and 25×, we sought to evaluate the performance of CpelNano for reliably estimating MMLs, NMEs, and probability distributions of methylation in Chr. 22, and compared the results to those obtained by fitting the CPEL model directly to the methylation calls made by Nanopolish4. As expected, CpelNano produced small MML and NME differences, as well as low CMD values, when comparing estimated to true values, especially at higher coverages (Fig. 1c), thus providing strong evidence about its capability of producing reliable estimates of methylation statistics. Notably, fitting the CPEL model directly to the methylation calls made by Nanopolish4 produced larger differences in methylation statistics, even at higher coverages. Moreover, Fig. 1c shows that, as coverage increases, CpelNano can reduce the absolute error in estimating statistical properties of the hidden methylation landscape more effectively than when performing methylation analysis directly at the output of Nanopolish4. In that sense, CpelNano is capable of efficiently leveraging additional information provided at higher coverages to better estimate the hidden methylation landscape at those coverages.
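The partitioning rule and two of the summary statistics described above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the half-open coordinate convention, the handling of lengths that do not divide evenly (piece sizes may differ by one base pair), and in particular the normalization of the entropy by the number of CpG sites, which is one common choice; the definitions actually used by CpelNano, including that of the CMD, are given in “Methods”.

```python
import math


def partition_region(start, end, max_size=350):
    """Split the half-open interval [start, end) into the fewest near-equal
    pieces whose length does not exceed max_size."""
    length = end - start
    if length <= 0:
        raise ValueError("end must be greater than start")
    n = math.ceil(length / max_size)  # minimum number of pieces
    bounds = [start + (i * length) // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def mml(p_x):
    """Mean methylation level: expected fraction of methylated (+1) sites.

    p_x maps each state (a tuple of -1/+1 values) to its probability.
    """
    return sum(p * sum(1 for s in x if s == 1) / len(x) for x, p in p_x.items())


def nme(p_x):
    """Entropy of the state distribution in bits, divided by the number of
    sites (assumed normalization), so that values lie between 0 and 1."""
    n_sites = len(next(iter(p_x)))
    entropy = -sum(p * math.log2(p) for p in p_x.values() if p > 0)
    return entropy / n_sites
```

For instance, a 1000-bp estimation region yields three analysis regions of 333, 333, and 334 bp, and a uniform distribution over the four states of two CpG sites gives an MML of 0.5 and an NME of 1.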
Concordance between nanopore and WGBS based estimation of methylation statistics
To further scrutinize CpelNano, we investigated agreement of results obtained from 9112 estimation regions in Chr. 22 by using the publicly available NA12878 (nanopore) and GSM2308632 (WGBS) data identified with the Utah/Ceph lymphoblastoid cell line (“Methods”). MML and NME distributions (Fig. 2a) and densities (Fig. 2b,c) were estimated by CpelNano over selected genomic features and close to transcription start sites of genes (Fig. 2d). The results from the nanopore data were like those obtained from the WGBS data using informME13,14, a previously developed powerful approach to methylation analysis. Notably, informME is a special case of CpelNano in the absence of noise, which is approximately the case with WGBS data. Moreover, the results demonstrated known properties of DNA methylation, such as hypomethylation associated with high methylation entropy, an overall reduction in methylation level and entropy over CpG islands (CGIs) when comparing to other genomic features, a bimodal behavior of the methylation level over CGIs towards low and high values, and a progressive reduction of methylation level and entropy closer to transcription start sites.
Although observed dissimilarities, including differences between probability distributions of methylation that were computed from the nanopore and WGBS data using the CMD (Fig. S10), can be attributed to biological, technical, and statistical variability associated with the two methodologies and data used, our results consistently showed a shift of low and high MML values estimated from the WGBS data towards intermediate values when using the nanopore data (Fig. 2b), in agreement with a previous observation10. Notably, this behavior can be explained by pointing to recent results obtained by comparing WGBS and methylation array data, which show that, on average, WGBS underestimates methylation levels below 0.5 while it overestimates levels above 0.5 when compared to those measured by more accurate and highly reproducible 450K and EPIC methylation arrays22. Markedly, this issue can introduce considerable differences between NME values estimated from nanopore and WGBS data, with the most prominent ones appearing over CGIs, shores, and promoters when using the Utah/Ceph lymphoblastoid cell line (Fig. 2a,c), which are associated with noticeable differences between the probability distributions of methylation observed over these genomic features (Fig. S10). Taken together, these results provide evidence that methylation analysis of nanopore data using CpelNano can produce similar results to those obtained from WGBS data but with the potential of effectively addressing known limitations of whole-genome bisulfite sequencing.
CpelNano leads to superior methylation analysis of repetitive DNA
An important feature of nanopore sequencing is its potential for detecting base modifications inside long repetitive elements of the genome, known as transposable elements (TEs)3,23, which cannot be reliably identified by short-read sequencing technologies12. TEs make up a large fraction of the human genome (about 45%), and their activities can seriously affect cellular function by altering the expression of protein-coding genes and by leading to genomic instability. It is therefore not surprising that aberrant TE transcription has been increasingly linked to many human diseases, including cancer24,25,26,27.
DNA methylation, along with other epigenetic mechanisms, is known to provide a critical process for silencing TE transcription28. This motivated us to investigate the possibility of employing CpelNano and nanopore data to model DNA methylation over TEs and contrast our results to those obtained from WGBS data. To that end, we used the nanopore and WGBS Utah/Ceph lymphoblastoid cell line data, NA12878 and GSM2308632, and compared the results over long interspersed nuclear elements 1 (LINE-1 or simply L1), a family of non-long terminal repeat retrotransposons that constitute about 17% of the human genome24,25,26. We found several examples of L1 subfamilies in Chr. 22, such as L1PA1 (a.k.a. L1HS), L1PA2, L1PA3, and L1PA5, for which modeling the DNA methylation landscape was not successful when using the WGBS data due to ambiguous alignment, despite their high coverage (~100×). Nevertheless, many regions were successfully analyzed by CpelNano using nanopore data. For instance, although DNA methylation over the L1PA1 and L1PA5 subfamilies was only partially modeled using the WGBS data, it was fully modeled by CpelNano using nanopore data (Fig. 3a,b). Moreover, we were not able to model DNA methylation over the L1PA2 and L1PA3 subfamilies using the WGBS data, a problem that was again successfully addressed by CpelNano using the nanopore data (Fig. 3c,d). Notably, the results obtained with CpelNano showed low MMLs over the corresponding retrotransposons and their proximal regions, which were associated with high levels of NME, demonstrating a highly variable DNA methylation landscape.
The previous examples are representative of what one would find when performing genome-wide analysis. Indeed, repetitive DNA sequences are known to frequently result in ambiguous alignments of second-generation sequencing data, which can introduce biases that can affect downstream analysis12, and explains our inability to reliably estimate the DNA methylation landscape over long TEs using WGBS. However, nanopore sequencing does not suffer from such issues, given the significantly larger read size produced by this technology. We therefore expect that, by using nanopore sequencing data, we can reliably model and analyze DNA methylation over repetitive regions of the human genome, provided that we use a method, such as CpelNano, which successfully accounts for the effect of noise introduced by the nanopore chemistry on the data.
Differential methylation analysis of real nanopore data
We further tested and validated CpelNano by performing targeted differential DNA methylation analysis (“Methods”) using real nanopore data and by comparing our results to previously reported findings. Targeted differential analysis is a commonly used approach for evaluating DNA methylation discordance at specific genomic regions of interest that allows for a high depth of coverage, increased statistical power, and reduced sequencing costs. Here, we used publicly available methylation data (“Methods”) recently obtained via nanopore Cas9-targeted sequencing29 using the non-tumorigenic epithelial cell line MCF-10A as “normal” and the epithelial human breast cancer cell line MDA-MB-231 (metastatic mammary adenocarcinoma) as “cancer”. These data correspond to genomic regions that fully or partially overlap with the following cancer-associated genes: BRAF, CA9, GPX1, GSTP1, KRAS, KRT15, KRT19, RHOA, SLC12A4, TP53, and TPM2.
Meaningful statistical evaluation of DNA methylation requires the availability of a sufficient number of replicates, which are currently not available for the previous cell lines. We addressed this issue by randomly partitioning the normal nanopore reads (271× median average coverage over 10 CpG sites) into two groups of 5 normal samples, each with an average coverage of ~25×, and did similarly with the cancer nanopore reads (249× median average coverage over 10 CpG sites) to generate a group of 5 cancer samples (“Methods”). For each analysis region and each sample, we employed CpelNano to compute the MMLs, NMEs, and CMDs from two CPEL models estimated from the nanopore reads using the EM-based maximum-likelihood module. CpelNano compared two groups of methylation summaries by performing (two-tailed) permutation-based hypothesis testing using three differential test statistics. These statistics summarize the differences between the average MML and average NME values in the two groups, as well as the average of all differences between the probability distributions of methylation (quantified by the CMD) observed between the groups (“Methods” and Supplementary Methods).
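A two-tailed permutation test between two groups of five samples can be sketched by enumerating every way of relabeling the ten pooled values. The statistic used below (absolute difference of group means) and the function name are assumptions made for illustration; the three test statistics actually used by CpelNano are defined in “Methods” and the Supplementary Methods.

```python
from itertools import combinations
from statistics import mean


def permutation_pvalue(group_a, group_b):
    """Exact two-tailed permutation P-value for a difference in group means.

    Returns (p_value, number_of_relabelings). Every split of the pooled
    values into groups of the original sizes is enumerated.
    """
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, k):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total, total
```

With five samples per group there are 252 relabelings, so attainable P-values are multiples of 1/252, and the smallest possible two-tailed value is 2/252 (about 0.008). The largest attainable value not exceeding 0.05 is 12/252 (about 0.0476); whether this accounts for any specific error rate reported for CpelNano is an assumption that would need to be checked against the Supplementary Methods.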
Computed values of the differential methylation statistics at 480 analysis regions comprising 3086 CpG sites showed considerably larger MML, NME, and CMD values when comparing one of the two normal groups to the cancer group than when comparing the two normal groups to each other (Fig. S11a), presenting the possibility of statistically significant dysregulation of DNA methylation in the cancer samples. Indeed, the computed empirical cumulative distribution functions (eCDFs) of the P-values obtained for each differential test statistic in the normal/cancer comparison were heavily skewed to the left (Fig. S11b), with many eCDF values being smaller than the significance level used (0.05), and the same was true for the computed Q-values (Fig. S11c) obtained by the Benjamini-Hochberg procedure for FDR control, showing that many analysis regions exhibited statistically significant differences in MML, NME, and in the probability distribution of methylation. By comparison, the eCDFs of the Q-values obtained in the normal/normal comparison were heavily skewed to the right (Fig. S11c), showing that none of the analysis regions exhibited statistically significant differences, which is expected to be true when using a hypothesis testing procedure that effectively accounts for biological, statistical, and technical variability present in the normal data. Notably, the computed eCDFs for the P-values were almost linear (Fig. S11b), implying that the P-values were (approximately) uniformly distributed under the null hypothesis, as theoretically expected. Therefore, the probability of observing a P-value that is no larger than a given significance level \(\alpha\) equals \(\alpha\), confirming the theoretical result that the permutation-based hypothesis testing method used by CpelNano properly controls the Type I error, resulting in an error rate that is no more than 5% in a normal/normal comparison (4.76% to be exact; see Supplementary Methods).
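The Benjamini-Hochberg adjustment that turns P-values into Q-values can be written in a few lines. The sketch below is a generic textbook implementation with a hypothetical function name, not code taken from CpelNano.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted values (Q-values) in the input order.

    For the P-value of rank r (1 = smallest) among m tests, the raw adjusted
    value is p * m / r; a running minimum taken from the largest rank down
    makes the adjusted values monotone in the P-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

An analysis region would then be declared significant when its Q-value does not exceed the chosen FDR level, for example 0.05.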


Methylation discordance and analysis regions in the targeted breast normal/cancer comparison. (a) Venn diagram showing the number of analysis regions overlapping all genomic regions examined that exhibited significant differences in mean methylation level (MML) and normalized methylation entropy (NME), as well as significant discordance in the probability distribution of methylation quantified by the coefficient of methylation divergence (CMD). (b) Venn diagram of significantly dysregulated analysis regions that overlap gene bodies. (c) Venn diagram of significantly dysregulated analysis regions that overlap promoter regions. (d) Venn diagram of significantly dysregulated analysis regions that overlap known repetitive elements.
Overall, we found 240 analysis regions exhibiting significant (\(q \le 0.05\)) dysregulation in DNA methylation, which were associated with significant differential MML (77%), differential NME (67%), and CMD (95%) values (Fig. 4a). Interestingly, 22% of the significantly dysregulated analysis regions did not exhibit significant MML differences, whereas 17% of the significantly dysregulated analysis regions exhibited only significant CMD values and 3% demonstrated only significant differences in NME. This demonstrates the need to use all three test statistics when evaluating DNA methylation discordance between groups. However, our results indicate that the CMD is the most comprehensive quantity for evaluating methylation discordance, since it is associated with 95% of the significantly dysregulated analysis regions. We also obtained similar results over gene bodies and promoter regions (Fig. 4b–d) and acquired detailed associations of types, numbers, and locations of significantly dysregulated analysis regions (Tables S1 and S2). Moreover, we investigated DNA methylation discordance over known repetitive elements along the targeted regions and found many types of repetitive sequences exhibiting significant DNA methylation discordance in breast cancer (Table S3), with 46% of significantly dysregulated analysis regions overlapping Alu elements and 12% overlapping L1 repeats.
Among the genes that were fully covered by the nanopore data, β-tropomyosin (TPM2), a gene that has been implicated in cell proliferation, migration, and apoptosis, exhibited significant dysregulation of the DNA methylation landscape over its promoter region. This was associated with significant hypermethylation over the gene’s CGI, which was found to be fully unmethylated in the normal group, and a significant increase in methylation entropy, implying increased variability of DNA methylation in breast cancer (Fig. 5a). Interestingly, TPM2 was recently found to be a tumor suppressor gene whose expression is down-regulated in breast cancer30. We also discovered profound changes in the DNA methylation landscape over the promoter region of cytokeratin-19 (KRT19), a coding gene whose CGI was almost fully methylated in normal but exhibited minimal methylation in cancer (Fig. 5b). Notably, DNA hypomethylation and overexpression of KRT19 have been recently linked to adenocarcinoma31, a form of cancer that starts in the epithelial cells that line organs and tissues throughout the body and leads to breast and lung tumors, as well as other types of tumors. Moreover, KRT19 has been found to be highly upregulated in breast cancer with expression that significantly correlates with cell proliferation, migration, invasion, and prognosis32,33,34.


Methylation discordance, genes, and repetitive elements in the targeted breast normal/cancer comparison. (a) Averages of mean methylation levels (MMLs) and normalized methylation entropies (NMEs), observed in two groups of five “normal” (green lines) and five “cancer” (red lines) samples used for differential analysis, over genomic regions overlapping TPM2 and CA9. The average of all differences in the probability distributions of methylation between the two groups, quantified by the coefficient of methylation divergence (CMD), is also depicted (blue line). Dots indicate individual MML and NME values for each group and sample, whereas boxes delineate genomic regions of significant (\(q \le 0.05\)) DNA methylation discordance. CGIs track: CpG islands; REs track: L1 (blue) and Alu (purple) repetitive elements. (b) Results of methylation discordance associated with KRT19 and KRT15. (c) Results of methylation discordance associated with GPX1 and RHOA. (d) Results of methylation discordance associated with GSTP1.
The breast nanopore data provide full coverage for two additional genes, glutathione peroxidase 1 (GPX1) and glutathione S-transferase P1 (GSTP1). Despite the fact that both genes have been implicated in certain forms and stages of breast cancer35,36, they did not exhibit significant MML or NME discordance over their CGIs, and they were fully unmethylated in both normal and cancer (Fig. 5c,d). Notably, by using bisulfite sequencing, the GPX1 promoter was also found to be unmethylated in the MDA-MB-453 and BT-474 breast cancer cell lines36. Nonetheless, our analysis revealed profound dysregulation of the DNA methylation landscape over a region near the CGI associated with the GPX1 promoter, linked with significant hypermethylation and loss of entropy (Fig. 5c). Moreover, GSTP1 exhibited significant hypomethylation over a 4-kb region near its CGI and significant hypermethylation over a portion of its body, which were both associated with a noticeable reduction in methylation entropy (Fig. 5d). Interestingly, aberrant GSTP1 methylation has been found to be significantly associated with the risk of breast cancer35.
Our results also pointed to methylation discordance associated with two additional genes, carbonic anhydrase IX (CA9) and cytokeratin-15 (KRT15), although it was not possible to provide a complete picture of their methylation status due to incomplete nanopore data covering these genes (Fig. 5a,b). However, KRT15 exhibited significant dysregulation of the methylation landscape, which was associated with considerable hypermethylation and loss of methylation entropy over a portion of its body. Interestingly, CA9 has been related to breast cancer and other tumors37,38, whereas KRT15 was recently found to be hypermethylated and underexpressed in gastric cancer, as well as underexpressed in breast invasive carcinomas, with its expression being significantly associated with overall patient survival in both types of cancer39,40. Finally, our analysis produced similar results for the BRAF, KRAS, SLC12A4, and TP53 genes (Fig. S12), although a full assessment of their methylation status was not possible due to their incomplete nanopore data coverage.
With respect to repetitive elements, CpelNano found four Alu repeats, AluY (314 bp), AluJb (83 bp), Aluz (310 bp), and AluSz (296 bp), at Chr. 17: 41,525,959–41,529,006 near the promoter CGI of KRT19 exhibiting profound loss of methylation in breast cancer (Fig. 5b and Table S3). This is in agreement with recent results demonstrating early loss of DNA methylation over a small subset of Alu elements in breast cancer41. CpelNano also identified three nearby L1 elements, HAL1 (219 bp), L1ME3G (417 bp), and L1ME3G (249 bp), at Chr. 17: 41,533,458–41,534,634, which exhibited high but variable methylation in both normal and cancer, a methylation state that is common to most L1 retrotransposons24. Interestingly, we found a cluster of five Alu elements, AluY (297 bp), AluSx1 (301 bp), AluYb8 (318 bp), AluSx1 (307 bp), and AluSx1 (276 bp), at Chr. 3: 49,353,360–49,355,409 near GPX1, exhibiting hypermethylation and loss of methylation entropy in breast cancer (Fig. 5c and Table S3). Finally, CpelNano identified a cluster of seven L1 elements, L1MEh (159 bp), L1MEh (258 bp), L1MEh (267 bp), L1MEh (296 bp), L1PA14 (357 bp), L1M5 (182 bp), and L1PA11 (354 bp), separated by three Alu repeats, AluSq (293 bp), AluJb (139 bp), and AluSx (277 bp), at Chr. 11: 67,579,281–67,583,297 near the CGI associated with GSTP1 showing considerable hypomethylation and noticeable entropy reduction in breast cancer (Fig. 5d and Table S3). This concurs with emerging evidence that hypomethylation of L1 elements is an early event in carcinogenesis that leads to aberrant transcription activation and chromosomal instability in many types of cancer42, including breast cancer43,44.
Taken together, the previous results show remarkable consistency with known biological evidence and demonstrate the effectiveness of CpelNano for generating a comprehensive description of DNA methylation discordance at high resolution using nanopore data. Evidently, this is also true at regions of the genome rich in repetitive elements, which are difficult to map and study using short-read sequencing technologies.

