Preloader

DiMeLo-seq: a long-read, single-molecule method for mapping protein–DNA interactions genome wide

DiMeLo-seq workflow

DiMeLo-seq combines elements of antibody-directed protein–DNA mapping approaches6,15,16 to deposit methylation marks near a specific target protein, then uses long-read sequencing to read out these exogenous methylation marks directly10,11,12,13,14. Taking advantage of the low abundance of N6-methyl-deoxyadenosine (hereafter mA) in human DNA17, we fused the antibody-binding protein A to the nonspecific deoxyadenosine methyltransferase Hia5 (pA–Hia5)11,18 to catalyze the formation of mA in the DNA proximal to targeted chromatin-associated proteins (Fig. 1a). First, nuclei are permeabilized, primary antibodies are bound to the protein of interest, and any unbound antibody is washed away. Next, pA–Hia5 is bound to the antibody, and any unbound pA–Hia5 is washed away. The nuclei are then incubated in a buffer containing the methyl donor S-adenosylmethionine (SAM) to activate adenine methylation in the vicinity of the protein of interest16. Finally, genomic DNA is isolated and sequenced using modification-sensitive, long-read sequencing with mA base calls providing a readout of the sites of protein–DNA interactions (Fig. 1a and Supplementary Fig. 1). This approach provides a distinct advantage in the ability to detect multiple binding events by the target protein on each long, single DNA molecule, which would not be possible with short-read sequencing (Fig. 1b). This protocol also avoids amplification biases, enabling improved estimation of absolute protein–DNA interaction frequencies at each site in the genome across a population of cells (Fig. 1c). Modification-sensitive readout allows for the simultaneous detection of both exogenous antibody-directed adenine methylation and endogenous CpG methylation on single molecules (Fig. 1d). Additionally, DiMeLo-seq’s long sequencing reads often overlap multiple heterozygous sites, enabling phasing and measurement of haplotype-specific protein–DNA interactions (Fig. 1e). Finally, long reads enable mapping of protein–DNA interactions within highly repetitive regions of the genome (Fig. 1f).

Fig. 1: Genome-wide mapping of protein–DNA interactions with DiMeLo-seq.
figure 1

a, Schematic of the DiMeLo-seq workflow for mapping protein–DNA interactions. bf, DiMeLo-seq can be used to map multiple interaction sites for a protein of interest on each chromatin fiber (b), estimate protein–DNA interaction frequencies (c), study the joint relationship between endogenous DNA methylation and protein binding (d), study genetic or epigenetic effects on protein binding between parental haplotypes (e) and map protein–DNA interactions across repetitive regions (f).

Antibody-directed histone-specific DNA adenine methylation of reconstituted chromatin in vitro

We expressed and purified recombinant pA–Hia5 and tested its methylation activity on purified DNA using the methylation-sensitive restriction enzyme DpnI, which only cuts GATC sites when adenine is methylated. DNA incubated with Hia5, pA–Hia5 or protein A/G Hia5 (pAG–Hia5) in the presence of SAM became sensitive to DpnI digestion, confirming the methyltransferase activity of the purified fusion proteins (Supplementary Note 1 and Extended Data Fig. 1a,b). To test the ability of pA–Hia5 to target chromatin and methylate accessible DNA in vitro, we reconstituted chromatin containing the histone variant centromere protein A (CENP-A) using the nucleosome-positioning DNA sequence referred to as ‘601’ (ref. 19; Extended Data Fig. 1c,d and Supplementary Note 2). Incubating mononucleosomes together with free-floating pA–Hia5 and SAM, followed by long-read sequencing and methylation-sensitive base calling, showed methylation on 97.1% ± 0.8% of reads (mean ± s.e.m., n = 3; Supplementary Notes 3 and 4, Fig. 2c,d and Extended Data Fig. 1e–k). Moreover, we observed almost no methylation at the expected nucleosome-protected region (Fig. 2c,d and Extended Data Fig. 1j).

Fig. 2: Application of DiMeLo-seq in artificial chromatin.
figure 2

a, Schematic of antibody-directed methylation of artificial chromatin. b, Heat map of 5,000 individual 1 × 601 reads from chromatin containing CENP-A mononucleosomes methylated with CENP-A-directed pA–Hia5 (red dashed line indicates 601 dyad position). c, Plots of A or T density (top) and average mA/A on base position of 1 × 601 containing DNA (bottom; red dashed line indicates 601 dyad position). d, Plot of percentage of reads with methylation as a function of the distance from dyad axis. e, Schematic of directed methylation of 18 × 601 chromatin array. f,g, Heat map of 2,000 individual reads from CENP-A chromatin methylation with CENP-A-directed pA–Hia5, hierarchically clustered by Jaccard distances of inferred nucleosome positions over the entire 18 × 601 array (f) or a subset 4 × 601 region (g) along with cartoons depicting predicted nucleosome positions (red circles). Insets (below) show average mA/A on every base position of 18 × 601 array or 4 × 601 portion (red dashed line indicates the 601 dyad position).

We reconstituted CENP-A chromatin on biotinylated DNA, bound it to streptavidin-coated magnetic beads, incubated it with CENP-A antibody and pA–Hia5, and washed away any unbound antibody and pA–Hia5 before activating methylation with SAM (Fig. 2a and Extended Data Fig. 1c). We observed methylation on 65.0% ± 10.0% of CENP-A DiMeLo-seq reads (mean ± s.e.m.; n = 3; Fig. 2b–d and Extended Data Fig. 1e–h,k), with methylation levels decaying with distance from the nucleosome footprint (Fig. 2c). We observed only background levels of methylation on IgG control DiMeLo-seq reads (5.1% ± 0.6% of IgG reads; mean ± s.e.m.; n = 2), compared to 4.1% ± 0.5% of untreated reads (mean ± s.e.m.; n = 3; Fig. 2d and Extended Data Fig. 1e,k). While reads from either free-floating pA–Hia5 or antibody-tethered pA–Hia5 conditions showed nucleosome-sized protection from methylation (~150–180 bp centered at the dyad; Fig. 2c,d and Extended Data Fig. 1j), ~70% of all methylation on reads from antibody-tethered pA–Hia5 fell within 250 bp on either side of the dyad. This result demonstrates that antibody-tethered pA–Hia5 can methylate accessible DNA close to target nucleosomes in vitro.

To test the specificity of DiMeLo-seq to identify target nucleosomes on chromatin fibers, we first assessed the ability of pA–Hia5 to methylate accessible regions of DNA on in vitro reconstituted chromatin assembled on an 18× array of the 601 nucleosome-positioning sequence (Extended Data Fig. 2a–c). Co-incubation of chromatin together with free-floating pA–Hia5 and SAM resulted in structured patterns of oligonucleosome footprinting (Extended Data Fig. 2b,g,h), as reported previously for reconstituted chromatin incubated with another exogenous methyltransferase, EcoGII10.

We then tested antibody-directed methylation of chromatin arrays reconstituted with nucleosomes containing either CENP-A or histone H3. We incubated chromatin with CENP-A antibody and pA–Hia5, washed away unbound antibody, and activated methylation with SAM (Fig. 2e). Following activation, we immunostained chromatin-conjugated beads with an anti-mA antibody, demonstrating an increase in mA signal when CENP-A chromatin, but not H3 chromatin, was incubated with pA–Hia5 and CENP-A antibody (Extended Data Fig. 2d,e and Supplementary Note 5), indicating antibody-directed methylation. Long-read sequencing detected mA on DNA after CENP-A-directed methylation of CENP-A chromatin (but not H3 chromatin) (Extended Data Fig. 2f). On average, CENP-A-directed methylation of CENP-A chromatin was depleted at the central axis of the nucleosome where the 601 sequence positions the nucleosome dyad (Fig. 2f,g). On individual reads, we observed protection from methylation centered at 601 dyad positions, consistent with nucleosome occupancy protecting the DNA from antibody-directed methylation (Fig. 2f,g) and similar to the free pA–Hia5 condition (Extended Data Fig. 2g,h). In contrast to the free pA–Hia5 condition, for which we observed a high prevalence of methylation on any region not protected by nucleosomes, in the antibody-directed pA–Hia5 condition, we observed approximately fourfold lower average probability of methylation (Fig. 2f and Extended Data Fig. 2g), consistent with the expectation that tethering of pA–Hia5 produces preferential methylation of deoxyadenosines closest to the antibody-bound nucleosome. Despite this reduction in total methylation of accessible DNA in CENP-A DiMeLo-seq reads compared to free pA–Hia5-treated reads, we detect a similar distribution of nucleosome densities in our chromatin array population (Extended Data Fig. 2i). We observed similar results for H3-antibody-directed methylation of H3 chromatin using pAG–Hia5 (Extended Data Fig. 2j–l). We conclude that directing pA–Hia5 activity using a histone-specific antibody targets specific methylation in proximity to the nucleosome of interest in vitro.

Optimization of lamin B1 mapping in situ

We next optimized DiMeLo-seq for mapping protein–DNA interactions in situ in permeabilized nuclei from a human cell line (HEK293T). To do this, we mapped the interaction sites of lamin B1 (LMNB1), which is often targeted in DamID studies to profile LADs20. Large regions of the genome that are almost always in contact with the nuclear lamina across cell types are called constitutive lamina-associated domains (cLADs). Regions that are rarely in contact with the nuclear lamina across cell types and instead reside in the nuclear interior are called constitutive inter-LADs (ciLADs; Fig. 3a). Other regions can vary in their lamina contact frequency between cell types and/or between cells of the same type. We chose LMNB1 as an initial target because (1) cLADs and ciLADs provide well-characterized on-target and off-target control regions, respectively; (2) LMNB1 has a very large binding footprint (LADs have a median size of 500 kb and cover roughly 30% of the genome21), so DNA–LMNB1 interactions can be detected even with very low sequencing coverage; (3) LMNB1 localization at the nuclear lamina can be easily visualized by immunofluorescence, allowing for intermediate quality control using microscopy during each step of the protocol (Extended Data Fig. 3c,d); and (4) we have previously generated LMNB1 DamID data from HEK293T cells using bulk and single-cell protocols, providing ample reference materials22.

Fig. 3: Optimization of DiMeLo-seq targeting lamin B1 in situ.
figure 3

a, Schematic of interactions between LMNB1 and LADs, and the use of mA levels in cLADs and ciLADs to estimate on-target and off-target mA. b, Scatterplot of each protocol condition tested the proportion (y axis) of all A bases (base calling q > 10, n = min 1.4 M, max 28 M A bases per condition) called as methylated (stringent threshold P > 0.9; mA/A) across all reads in on-target (cLAD) regions and the ratio (x axis) of these mA levels compared to off-target (ciLAD) regions. Circles are colored by the methyltransferase condition used. Error bars provide a measure of uncertainty due to each condition’s sequencing coverage (described below). Complete data are in Supplementary Table 1. c, A browser image across chromosome 7 comparing in situ LMNB1-targeted DiMeLo-seq (protocol v1) to in vivo LMNB1-tethered DamID data (blue)22. The coverage of each region by simulated DpnI digestion fragments (splitting reference at GATC sites) between 150 and 750 bp (sequenceable range) is indicated by a teal heat map track (range, 0 to 0.7). The presence of intervals longer than 10 kb between unique 51-mers in the reference, a measure of mappability, is indicated with an orange heat map track. d, A zoom-in view of the centromere on chromosome 7, with added tracks at the bottom illustrating LMNB1 interaction frequencies from scDamID data22, as well as from DiMeLo-seq data (protocol v1). e, For a quality-filtered set of 100-kb genomic bins (gray points; Supplementary Note 7; n = 11,292 total bins), a comparison of LMNB1 interaction frequency estimates from DiMeLo-seq (protocol v1; black circles indicate mean across n = 94 to 663 genomic bins, computed for each genomic bin as the proportion of n = 61 to 335 overlapping reads with at least 1 mA call with P > 0.9) versus scDamID (proportion of n = 32 cells with detected interactions in each genomic bin). A linear regression line computed across all bins is overlaid (blue). Error bars in b and e represent 95% credible intervals determined for each proportion, and the mean of proportions or ratio of proportions by sampling from posterior beta distributions was computed using uninformative priors.

To assess the performance of the LMNB1-targeted DiMeLo-seq protocol, we quantified the proportion of adenines that were called as methylated (mA/A) across all reads mapping to cLADs (on-target regions), and across all reads mapping to ciLADs (off-target regions). We evaluated the performance of each iteration of the protocol using both the on-target methylation rate (as a proxy for sensitivity) and the on-target:off-target ratio (as a proxy for signal-to-background ratio), aiming to increase both. We developed a rapid pipeline for testing variations of many components of the protocol, allowing us to go from harvested cells to fully analyzed data in under 60 h (Methods and Supplementary Notes 6–8). With this optimization pipeline, we tested over 100 different conditions (Fig. 3b), varying the following: methyltransferase type (Hia5 versus EcoGII), input cell numbers, detergents, primary antibody concentrations, the use of secondary antibodies, enzyme concentrations, incubation temperatures, methylation incubation times, methylation buffers and SAM concentrations (Supplementary Note 8 and Supplementary Table 1). We validated an initial version of the protocol (v1; https://doi.org/10.17504/protocols.io.bv8tn9wn/), and then further optimized the methyltransferase activation conditions to increase the amount of on-target methylation by 50–60% without sacrificing specificity (v2; https://doi.org/10.17504/protocols.io.b2u8qezw/); Extended Data Figs. 3 and 4, Supplementary Note 8 and Fig. 3b). To confirm that this optimization would apply to other types of proteins, we also examined the results of different protocol variations targeting the protein CTCF and found them to be concordant (Extended Data Fig. 5a).

We also verified that there is very little loss of performance when using cells that were cryopreserved in dimethylsulfoxide-containing medium or lightly fixed in paraformaldehyde (PFA), when using between 1 and 5 million cells per replicate, or when using concanavalin-A-coated magnetic beads to carry out cell-washing steps by magnetic separation instead of centrifugation (Methods, Supplementary Notes 9 and 10 and Supplementary Table 1). To confirm antibody specificity, we performed experiments using IgG isotype controls and free-floating Hia5 controls to measure nonspecific methylation and DNA accessibility, respectively (Methods and Supplementary Table 2). We also generated a stably transduced line expressing a direct fusion between EcoGII and LMNB1 in vivo, as in MadID23, and then we detected mAs with nanopore sequencing (Extended Data Fig. 4a and Supplementary Note 10). This in vivo approach produced threefold more on-target methylation compared to in situ DiMeLo-seq with pAG–EcoGII (Fig. 3b), although this performance is expected to vary with different fusion proteins and their expression levels (Supplementary Note 10).

We found that DiMeLo-seq and conventional bulk DamID are highly concordant in the non-repetitive parts of the genome (Spearman correlation = 0.71 in 1-Mb bins), but conventional DamID achieves little-to-no coverage across pericentromeric regions (Fig. 3c). This is due in part to the low availability of unique sequence markers to map short reads to in the pericentromere, but also to the low frequency of GATC (the binding motif for Dam and DpnI in the DamID protocol) within centromeric repeats (Fig. 3c)23. DiMeLo-seq, unlike DamID, produces long reads that can be uniquely mapped across the centromeric region of chromosome 7, revealing that this region has an intermediate level of contact with the nuclear lamina (Fig. 3c,d).

Because DiMeLo-seq directly probes unamplified genomic DNA, each sequencing read represents a single, native DNA molecule from a single cell, sampled independently and with near-uniform probability from the population of cells. This allows for estimation of absolute protein–DNA interaction frequencies, that is, the proportion of cells in which a site is bound by the target protein, without needing to account for the amplification bias inherent to other protein–DNA mapping methods. We leveraged single-cell Dam-LMNB1 DamID data from the same cell line22 to assess the relationship between DiMeLo-seq methylation and an orthogonal estimate of protein–DNA interaction frequencies. This revealed a nearly linear relationship between the two interaction frequency estimates, with a simple linear model achieving an R2 of 0.71, compared to an R2 of 0.31 when single-cell DamID (scDamID)-based interaction frequencies were compared to bulk conventional DamID coverage (Fig. 3e and Extended Data Fig. 4f). We note that scDamID tends to slightly overestimate intermediate interaction frequencies compared to DiMeLo-seq, attributable to the in vivo versus in situ nature of the two protocols16, as well as to the fact that homolog-specific information is collapsed within each hypotriploid HEK293T cell22,24. This analysis demonstrates that DiMeLo-seq is capable of estimating absolute protein–DNA interaction frequencies without needing to account for amplification bias, while capturing heterogeneity in protein–DNA interactions at the single-cell level.

Joint analysis of CTCF binding and CpG methylation on single molecules

DiMeLo-seq measures protein–DNA interactions in the context of the local chromatin environment by simultaneously detecting endogenous CpG methylation, nucleosome occupancy and protein binding. To highlight this feature of DiMeLo-seq, we targeted CTCF, a protein that strongly positions surrounding nucleosomes and whose binding is inhibited by CpG methylation25. We first validated that targeted methylation is specific to CTCF in GM12878 cells by calculating the fraction of adenines that are methylated within GM12878 CTCF ChIP–seq peaks relative to the fraction of adenines methylated outside these peaks. We chose to target CTCF in GM12878 cells because GM12878 is an ENCODE tier 1 cell line with abundant ChIP–seq reference datasets. We measured a 16-fold increase in targeted methylation over background in our CTCF-targeted sample (Extended Data Fig. 5b). We also measured a sixfold mA/A enrichment in the free pA–Hia5 control in CTCF ChIP–seq peaks, reflecting that many CTCF-binding sites overlap with accessible regions of the genome where pA–Hia5 can methylate more easily26. However, both the free pA–Hia5 and the IgG controls produced significantly less on-target methylation than the CTCF-targeted sample (Extended Data Fig. 5b). We confirmed that signal enrichment is caused by CTCF-targeted methylation and not accessibility of CTCF sites by measuring a 1.8 times greater proportion of mA in ChIP–seq peaks compared to regions of open chromatin measured by the assay for transposase-accessible chromatin with sequencing (ATAC–seq; Extended Data Fig. 5c).

As further validation of DiMeLo-seq’s concordance with ChIP–seq data and to visualize protein binding on single molecules, we analyzed mA and mCpG across individual molecules spanning CTCF motifs within ChIP–seq peaks of various strengths (Fig. 4a). DiMeLo-seq signal tracks with ChIP–seq signal strength, with mA density decreasing from the top to bottom quartiles of ChIP–seq peak signal. We observed an increase in local mA surrounding the binding motif, with a periodic decay in methylation from the peak center, indicating methylation of neighboring linker DNA between strongly positioned nucleosomes (Extended Data Fig. 5d). The 88-bp dip at the center of the binding peak reflects CTCF’s binding footprint27,28,29 and is evident even on single molecules. CTCF binds to ~50 bp of DNA as determined by DNase I footprinting and ChIP-exo27,30,31. The larger footprint observed with DiMeLo-seq is likely due to steric hindrance with Hia5 unable to methylate DNA within ~20 bp of the physical contact between CTCF and DNA as efficiently. We also observed an asymmetric methylation profile, with stronger methylation 5′ of the CTCF motif. This increased methylation relative to 3′ of the motif extends beyond the central peak to the neighboring linker DNA. We hypothesized that this asymmetry was a result of the antibody binding the C terminus of CTCF, thereby positioning pA–Hia5 closer to the 5′ end of the binding motif. To test this hypothesis, we compared DiMeLo-seq binding profiles in top quartile ChIP–seq peaks when using an antibody targeting the C terminus of CTCF, as is used in Fig. 4, and an antibody targeting the N terminus of CTCF. We observed methylation enrichment 5′ to the binding motif with C-terminal targeting and 3′ to the motif with N-terminal targeting (P value = 0.00010; Supplementary Note 11 and Extended Data Fig. 5e). The free pA–Hia5 control profile supports this finding that the antibody-binding site is causing the peak asymmetry, as there is no notable asymmetry in this untargeted case (Extended Data Fig. 6).

Fig. 4: Single-molecule CTCF-binding and CpG methylation profiles.
figure 4

a, Single molecules spanning CTCF ChIP–seq peaks are shown across quartiles of ChIP–seq peak strength. Quartile 4 (q4) comprises peaks with the strongest ChIP–seq peak signal, while quartile 1 (q1) comprises peaks with the weakest ChIP–seq peak signal. Blue shows mA called with a probability ≥ 0.75, while orange indicates mCpG called with a probability ≥ 0.75. Aggregate curves for each quartile were created with a 50-bp rolling window. Base density across the 2-kb region for each quartile is indicated in the one-dimensional heat maps; the scale bar indicates the number of adenine bases and CG dinucleotides sequenced at each position relative to the motif center. b, Joint mA and mCpG calls on the same individual molecules spanning CTCF ChIP–seq peaks. Molecules displayed have at least one mA called and one mCpG called with a probability ≥ 0.75. Aggregate curves were created with a 50-bp rolling window. Base density is indicated as in a. c, CTCF site protein occupancy is measured on single molecules spanning two neighboring CTCF motifs within 2–10 kb of one another. CTCF motifs are selected from all ChIP–seq peaks, and molecules are shown that have a peak in at least one of the two motifs. Each row is a single molecule, and the molecules are anchored on the peaks that they span, with a variable distance between the peaks indicated by the gray block. ChIP–seq peak signal for each of the motif sites is indicated by the purple bars. The graphic on the side illustrates the CTCF-binding pattern for each cluster. d, Phased reads across the IGF2/H19 imprinting control region with CTCF sites indicated in gray. Blue dots represent mA calls and orange dots represent mCpG calls. Heterozygous sites used for phasing are indicated in turquoise.

To evaluate the use of DiMeLo-seq for de novo peak detection, we called CTCF peaks using DiMeLo-seq data alone and created receiver operating characteristic curves at increasing sequencing depths using ChIP–seq peaks as ground truth (Supplementary Note 11 and Extended Data Fig. 5f). At ~25× coverage, we detected 60% of ChIP–seq peaks (false positive rate of 1.6%) and measured an area under the curve value of 0.92 (Supplementary Note 11). Among the peaks detected with DiMeLo-seq that were not annotated ChIP–seq peaks, 10% overlapped 1-kb marker deserts and gaps in the hg38 reference and were undetectable by ChIP–seq. Another 12% of these peaks fell within 500 bp of a known CTCF motif.

We next probed the relationship between CTCF binding and endogenous CpG methylation. Single molecules spanning CTCF-binding sites in stronger ChIP–seq peaks exhibited a larger dip in mCpG around the motif compared to the shallower dip in weaker ChIP–seq peaks (Fig. 4a). This inverse relationship between CpG methylation and CTCF-targeted methylation reflects previous findings that mCpG inhibits CTCF binding25. We measured both mA and mCpG on the same single molecules and also observed that both A and CpG were preferentially methylated in linker DNA (Fig. 4b). The increased methylation of CpG in linker DNA relative to nucleosome-bound DNA surrounding CTCF sites is supported by previous studies that have similarly reported higher levels of mCpG in linker DNA than nucleosomal DNA around CTCF sites32.

CTCF’s known binding motif and abundance genome wide make it a good target for characterizing the resolution of DiMeLo-seq. To characterize resolution, we estimated the peak center on single molecules spanning the top decile of CTCF ChIP–seq peaks (Supplementary Note 11). The mean single-molecule peak center was 6 bp 5′ of the CTCF motif center, and the peak center on approximately 70% of the reads fell within ±200 bp of the motif center (Extended Data Fig. 5g). This systematic bias toward predicting the peak center 5′ of the motif can be explained by the observed asymmetry in methylation when targeting the C terminus of CTCF. Another factor that impacts the resolution of DiMeLo-seq is the reach of the methyltransferase, which can be characterized by measuring the decay rate of methylation density from the peak center. To do this, we fit the average adenine methylation density with respect to the motif center to an exponential function and calculated a half-life of 169 bp (Extended Data Fig. 5d). Together, this analysis suggests that DiMeLo-seq can resolve binding events to within about 200 bp; however, this metric is likely dependent on the protein target and influenced by the local chromatin environment.

To characterize the sensitivity of DiMeLo-seq for detecting CTCF-binding events on single molecules, we performed a binary classification of individual CTCF-targeted DiMeLo-seq reads based on each read’s proportion of methylated adenines within CTCF peak regions, defined as ±150 bp around the CTCF-binding motif center. For top-decile ChIP–seq peaks, which are regions that are most likely to contain CTCF binding, we classified reads containing CTCF-binding events with 54% sensitivity (5.7% false positive rate; Extended Data Fig. 5h,i and Supplementary Note 11).

We next investigated the ability of DiMeLo-seq to measure protein binding at adjacent sites on single molecules. We first characterized CTCF occupancy across two binding sites that were spanned by a single molecule. We were able to detect neighboring CTCF motifs that are bound by CTCF at both sites or just one of the two sites, and the detected binding appears to track with ChIP–seq peak strength (Fig. 4c). This analysis demonstrates the potential of DiMeLo-seq to analyze coordinated binding patterns on long single molecules, which is not possible with short-read methods. We further investigated this potential within a specific HLA locus on chromosome 6 where haplotype-specific single nucleotide polymorphisms (SNPs) within the CTCF-binding motif prevent CTCF binding at one of the two neighboring sites (Extended Data Fig. 7a). DiMeLo-seq can map haplotype-specific interactions because long reads often span multiple heterozygous sites, allowing reads to be phased. Importantly, at 25× coverage, we were able to detect the binding patterns of both sites on the same single molecule and could attribute the lack of detected binding at one of the two sites to a mutation within the binding motif. The ability to map haplotype-specific interactions is also useful in studying imprinted genomic regions such as the IGF2/H19 imprinting control region, where CpG methylation on the paternal allele prevents CTCF binding, while on the maternal allele, CTCF is able to bind (Fig. 4d). We also reported haplotype-specific CTCF-binding profiles at specific sites and broadly across the active and inactive X chromosomes (Extended Data Fig. 7b–d). These results demonstrate that DiMeLo-seq can measure the effect of haplotype-specific genetic or epigenetic variation on protein binding.

To test the compatibility of DiMeLo-seq with other long-read sequencing platforms capable of modification calling, we performed Pacific Biosciences (PacBio) sequencing of DNA from a CTCF-targeted DiMeLo-seq sample and from an unmethylated control (Supplementary Note 12). We found similar enrichment profiles using both methods (Extended Data Fig. 8), indicating that DiMeLo-seq is compatible with PacBio’s circular consensus sequencing technique. However, while PacBio sequencing has reported improved base-calling accuracy33, this approach detected more methylation in the unmethylated control than nanopore sequencing, slightly reducing the signal-to-noise ratio of the measurement (Extended Data Fig. 8).

Mapping protein–DNA interactions in centromeric regions

Mapping histone modifications in heterochromatin with DiMeLo-seq

To test DiMeLo-seq’s ability to measure protein occupancy in heterochromatic, repetitive regions of the genome, we targeted trimethylated histone H3 lysine 9 (H3K9me3), which is abundant in pericentric heterochromatin. We chose to target H3K9me3 in HG002 cells because the chromosome X centromere has been completely assembled for this male-derived lymphoblast line9, and many different sequencing data types are available for it34. To validate the specificity of targeted methylation, we calculated the fraction of adenines methylated within HG002 CUT&RUN H3K9me3 peaks35 compared to the fraction of adenines methylated outside broadly defined peaks (Supplementary Note 13). For H3K9me3 targeting in HG002 cells, the enrichment of mA/A in CUT&RUN peaks was 3.6-fold over background (Fig. 5a), indicating enrichment of methylation within expected H3K9me3-containing regions of the genome.

Fig. 5: Detecting H3K9me3 in centromeres.
figure 5

a, The proportion of adenines methylated within CUT&RUN peaks relative to the proportion of adenines methylated outside CUT&RUN broad peak regions is reported for the H3K9me3-targeted sample as well as IgG and free pA–Hia5 controls. Error bars represent 95% credible intervals determined for each ratio by sampling from posterior beta distributions computed with uninformative priors. b, The fraction of adenines methylated within centromeres relative to non-centromeric regions, and similarly the fractions of adenines methylated within active HOR arrays relative to non-centromeric regions are displayed for the H3K9me3-targeted sample as well as the IgG and free pA–Hia5 controls. Error bars are defined as in a. c, The decline in mA/A for the H3K9me3-targeted sample in a rolling 100-kb window from −300 kb within the HOR array to 300 kb outside the HOR array. HOR array boundaries that transition quickly into non-repetitive sequences were considered: 1p, 2pq, 6p, 9p, 13q, 14q, 15q, 16p, 17pq, 18pq, 20p, 21q and 22q. d, Single molecules are displayed across the centromere of chromosome 7 for the H3K9me3-targeted sample and the IgG control. Reads mapping to the same position are displayed vertically, and modified bases are colored by the probability of methylation at that base for probabilities ≥ 0.6. Aggregate tracks show mA/A and mCpG/CpG in the H3K9me3-targeted sample in 10-kb bins. Gray bars below the centromere annotation indicate regions with >20-kb marker deserts.

Human centromeres are located within highly repetitive alpha-satellite sequences, which are organized into higher-order repeats (HORs)35,36,37,38. To validate enrichment of H3K9me3-directed mA signal in centromeres, and in particular in HOR arrays, we similarly calculated the fold increase in mA/A and found 1.9-fold enrichment in centromeres and 3-fold enrichment in active (kinetochore-binding) HOR arrays35 over non-centromeric regions (Fig. 5b). We next looked at HOR array boundaries and observed a decrease in H3K9me3 across the boundary moving from within to outside HOR arrays (Fig. 5c). In contrast, for the free pA–Hia5 control, mA/A increased moving from within to outside the HOR array, as chromatin becomes more accessible (Extended Data Fig. 9a)34.

We mapped heterochromatin not only in aggregate across HOR array boundaries, but also in single molecules across the centromere. H3K9me3-targeted DiMeLo-seq reads map across the centromere of chromosome 7, even in regions with over 20 kb between unique markers (Fig. 5d). An IgG isotype control confirmed that adenine methylation in the H3K9me3-targeted sample was not caused by background methylation (Fig. 5d and Extended Data Fig. 9b). Unlike methods that rely on amplifying short DNA fragments, such as ChIP–seq and CUT&RUN, we are able to detect single-molecule heterogeneity in chromatin boundaries, as highlighted in the transition from 65.5 to 68 Mb, where H3K9me3 signal drops as CpG methylation increases (Fig. 5d). However, lower methylation efficiency in heterochromatin and the challenges of mapping even moderately long reads in repetitive regions can still lead to uneven and low coverage in these regions (Extended Data Fig. 9c). To improve sensitivity for targeted DiMeLo-seq applications in the centromere, we developed a centromere enrichment method to enhance coverage in active HOR arrays and applied this method to study CENP-A.

Restriction-based enrichment strategy improves centromere coverage

Within alpha-satellite HOR arrays, the centromere-specific histone variant CENP-A delineates the site where the functional centromere and kinetochore will form. Population-level studies demonstrate that CENP-A nucleosomes are found at the core of these repeat units where the repeats are the most homogeneous35,39,40,41. However, it has not been possible to resolve the positions of CENP-A nucleosomes on single chromatin fibers to determine the one-dimensional organization and density of CENP-A at centromeres. To map the positions of CENP-A nucleosomes at centromeres using DiMeLo-seq, we developed a strategy to enrich specifically for human centromeric DNA to avoid sequencing the entire genome.

Our enrichment strategy, called AlphaHOR-RES (alpha HOR restriction and enrichment by size; from ‘alfajores), is based on classic centromere enrichment strategies42 that involve digesting the genome with restriction enzymes that cut frequently outside centromeric regions but rarely inside them, then removing short DNA fragments (Methods and Extended Data Fig. 10a). We added AlphaHOR-RES to our DiMeLo-seq workflow and observed at least 20-fold enrichment of sequencing coverage at centromeres while preserving relatively long read lengths (mean ~8 kb; Fig. 6a,b, Extended Data Fig. 10b–d and Methods). Thus, this enrichment strategy drastically increases the proportion of molecules sequenced that are useful for investigating CENP-A distribution, saving substantial sequencing time and costs. Furthermore, because AlphaHOR-RES targets the DNA and not the protein in the protein–DNA interaction, and because it is performed after directed methylation is complete, it is unlikely to bias our inferences of protein–DNA interaction frequencies in these regions.

Fig. 6: CENP-A-directed methylation within chromosome X centromeric higher-order repeats.
figure 6

a, Schematic of DiMeLo-seq with AlphaHOR-RES centromere enrichment. b, Genome browser plot on HG002 chromosome X of read coverage from DiMeLo-seq libraries with centromere enrichment (top) or without (middle). Bottom track depicts the region of the alpha-satellite array. c, Bar plot of the percentage mA/A (using a stringent Guppy probability threshold of 0.95) for reads from each library that contain or do not contain CENP-A-enriched k-mers. Fold enrichment of methylation percentage on CENP-A reads over non-CENP-A reads reported on top. Error bars represent 95% confidence intervals. d, View of a 250-kb region spanning the CDR within the active chromosome X HOR array. Top, single-molecule view, with individual reads as gray lines and mCpG positions as orange dots shaded by Guppy’s methylation probability. Bottom, CpG methylation frequency from nanopore sequencing reported in ref. 34. e, Single-molecule view of reads in d. mA positions are depicted as blue dots shaded by Guppy’s methylation probability. f, Aggregate view of mA. mA/A plot indicates the fraction of reads with a Guppy methylation probability above 0.6 at each adenine position (averaged over a 250-bp rolling window for visualization). Marker deserts (regions of at least 10 kb without unique 51-bp k-mers) are shown in orange to illustrate difficult-to-map locations for short-read sequencing. g, For CENP-A or IgG control DiMeLo-seq, read coverage (top plots) and average fraction of nucleosomes detected as CENP-A (bottom plot) per read in sliding 5-kb windows (step size, 1 bp), providing a measure of the density of CENP-A nucleosomes within single DNA molecules across the region. Thick lines indicate a 25-kb rolling average. Cartoon below shows representations of detected CENP-A nucleosomes within a 5-kb region corresponding to the CDR or CDR-adjacent region. RE, restriction enzyme.

DiMeLo-seq reveals variable CENP-A nucleosome density across centromeres

We performed CENP-A-directed DiMeLo-seq on HG002 cells. After extraction of total genomic DNA, we used AlphaHOR-RES to enrich centromeric sequences before sequencing (Fig. 6a,b). In an alignment-independent manner43, we classified DiMeLo-seq reads based on the presence or absence of CENP-A-enriched k-mers from an available short-read sequencing dataset41. CENP-A-directed DiMeLo-seq reads with CENP-A-enriched k-mers had roughly sevenfold more adenine methylation when compared to reads without CENP-A-enriched k-mers (Fig. 6c). We observed similar absolute methylation levels in DiMeLo-seq reads containing CENP-A k-mers when comparing CENP-A-targeted samples to free pA–Hia5 samples. However, the free pA–Hia5 samples also had a higher percentage of mA/A in reads that did not contain CENP-A k-mers, indicating a lack of CENP-A specificity in the absence of targeting.

To examine the positions of CENP-A nucleosomes within centromeric repeat arrays, we aligned our reads to a hybrid complete human assembly containing a fully assembled chromosome X from the HG002 cell line (Supplementary Note 14)9,34. We investigated the recently described chromosome X centromere dip region (CDR), a hypomethylated region in the centromeric alpha HOR array where short-read CENP-A datasets align34,35,41,44. We confirmed low endogenous CpG methylation within the CDR as expected (Fig. 6d). CENP-A-directed mA was found to be higher within both large and small CDRs compared to their adjacent CpG methylated regions, consistent with short-read data for this cell line (Fig. 6e,f)34,35. We found that the density of detected CENP-A nucleosomes increased fivefold within chromosome X CDRs compared to neighboring regions (Fig. 6g). We estimate that 26% ± 5% of nucleosomes contain CENP-A within the chromosome X CDR, whereas only 5% ± 2% of nucleosomes contain CENP-A within a neighboring region (mean ± s.d.; Supplementary Note 14 and Fig. 6g) confirming what ensemble short-read methods cannot: the density of CENP-A nucleosomes on single DNA molecules increases in CDRs. IgG isotype controls confirm that this signal is not due to background methylation (2% ± 1% (mean ± s.d.) of nucleosomes detected on IgG control reads within chromosome X CDR (Fig. 6g and Extended Data Fig. 10e)). A previous study estimated the average CENP-A density across endogenous human centromeres to be 1 in 25 nucleosomes, assuming a mean centromere size of ~1 Mb45. In contrast, we estimated that at least 1 in 4 nucleosomes contain CENP-A within the smaller ~100-kb CDR on chromosome X. This demonstrates that CENP-A nucleosome occupancy varies considerably across a human centromere. Further, we showed that the region with the highest CENP-A density coincides with the CDR. We observed a similar distribution of CENP-A-directed methylation on chromosome 3, where only one of the two HOR arrays was observed to have clear CENP-A-directed methylation (Extended Data Fig. 10f,g). These observations support the finding of one active HOR array for each chromosome35,46. These findings illuminate the density and positioning of CENP-A nucleosomes within HOR sequences on individual chromatin fibers, which was not previously attainable with existing techniques.

Source link