Preloader

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

Initial evaluation of CHM13v0.9

The T2T consortium has collected a comprehensive and diverse set of publicly available WGS sequencing and genomic map data (Illumina PCR-free, PacBio HiFi, PacBio CLR, ONT and Bionano optical maps) for the nearly completely homozygous CHM13 cell line (https://github.com/marbl/CHM13/). As part of the consortium, we drew upon these sequencing data to generate a custom pipeline (Fig. 1) to evaluate, identify and correct lingering errors in CHM13v0.9.

Fig. 1: An overview of the evaluation and polishing strategy developed to achieve a complete human genome assembly.
figure 1

a, The evaluation strategies used to assess genome assembly accuracy before (CHM13v0.9) and after (CHM13v1.0 and CHM13v1.1) polishing. b, The ‘do no harm’ polishing strategy developed and implemented to generate CHM13v1.0 and CHM13v1.1.

We first derived k-mer-based quality estimations (k = 21 bp) of CHM13v0.9 using Merqury37 with both Illumina and HiFi reads. The k-mer size was chosen to limit the collision rate to 0.5% given the estimated genome size of 3.05 Gb of CHM13 (ref. 38). While estimating the Illumina reads QVs, we found 15,723 k-mers present in the assembly and not the reads (erroneous k-mers), leading to an estimated base quality of Q66.09. Using HiFi reads, we found 6,881 error k-mers (Q69.68; Fig. 2a). To test how technical sequencing bias may have influenced this QV estimation, we examined the k-mer multiplicity and sequence content of assembly k-mers absent from one technology but present in the other. Here, our results indicated that k-mers missing from Illumina reads were present with expected frequency in HiFi and were enriched for G/C bases. Conversely, k-mers missing in HiFi were present with higher frequency in Illumina reads with A/T base enrichment (Fig. 2b). However, we identified no particular enrichment pattern in the number of GAs or CTs within the k-mers, possibly due to the short k-mer size chosen (Extended Data Fig. 1a). Most of the k-mers absent from HiFi reads were located in patches derived from a previous ONT-based assembly (CHM13v0.7)33, which were included to overcome regions of HiFi coverage dropout1 (Extended Data Fig. 1b,c). These findings highlighted that platform-specific sequencing biases were underestimating the QV when measured from a single sequencing platform. To overcome this, we created a hybrid k-mer database that combined these platforms to be used for QV estimation (Extended Data Fig. 1d). Unlike the default QV estimation in Merqury, we removed low-frequency k-mers to avoid overestimated QVs caused by excessive noise accumulated from both platforms. We estimated base-level accuracy as Q70.22 with 6,073 missing k-mers (Table 1). We note that this estimate does not account for the rarer case of k-mers present in the reads but misplaced or falsely duplicated in the assembly.

Fig. 2: Sequencing biases in PacBio HiFi and Illumina reads.
figure 2

a, Venn diagram of the ‘missing’ k-mers found in the assembly but not in the HiFi reads (green) or Illumina reads (blue). Except for the 1,094 k-mers that were absent from both HiFi and Illumina reads, error k-mers were found in the other sequencing platform with expected frequency, matching the average sequencing coverage (lower). b, Missing k-mers from a with their GC content, colored by the frequency observed. Low-frequency erroneous k-mers did not have a clear GC bias. k-mers found only in HiFi had a higher GC percentage, while higher-frequency k-mers tended to have more AT-rich sequences in Illumina. c, Homopolymer length distribution observed in the assembly and in HiFi reads (upper) or Illumina reads (lower) aligned to that position. Longer homopolymers in the consensus are associated with length variability in HiFi reads especially in the GC homopolymers. The majority of the Illumina reads were concordant with the consensus.

Table 1 k-mer-based consensus quality evaluation

Despite the high accuracy of CHM13v0.9 (Q70.22), we expected to find consensus sequence errors related to the systematic presence of homopolymer-specific or repeat-specific issues in HiFi reads9,39. To detect these, we generated self-alignments by aligning CHM13 reads to CHM13v0.9 for each WGS sequencing technology. Although each data type required technology-specific alignment methods (Methods), we highlight our use of Winnowmap2 that enabled robust alignment of long reads to both repetitive and non-repetitive regions of CHM13v0.9 (refs. 34,35). To understand the homopolymer length differences between the assembly and the reads, we derived a confusion matrix from Illumina read alignments showing discordant representation of long homopolymers between the Illumina reads and the assembly (Fig. 2c). Altogether, the QV and homopolymer analysis suggested that CHM13v0.9 required polishing to maximize the accuracy of a complete human genome.

Identification and correction of assembly errors

To address assembly flaws identified during evaluation, we aimed to establish a customized polishing pipeline that would avoid false-positive polishing edits (especially in repeats) and maintain local haplotype consistency (Fig. 1b and Extended Data Fig. 2). We identified and corrected small errors (≤50 bp) using several small-variant calling tools from self-alignments of Illumina, HiFi and ONT reads to CHM13v0.9. To call both single-nucleotide polymorphisms (SNPs) and small insertions and deletions (INDELs), we applied a hybrid mode of DeepVariant27 that exploited both HiFi and Illumina read alignments40. Simultaneously, we used PEPPER-DeepVariant28 to generate additional SNP calls with ONT reads as it can yield high-quality SNP variants in difficult regions of the genome28,40 (see Supplementary Fig. 1 in ref. 28 for more details). We rigorously filtered all calls using a genotype quality (GQ) score (GQ < 30 for the hybrid calls and GQ < 25 for ONT SNP calls) and variant allele frequency (VAF < 0.5) to exclude any low-frequency false-positive calls (Extended Data Fig. 2). We chose VAF < 0.5 to avoid including heterozygous variants, and the GQ threshold was chosen based on the previously reported calibration plot of DeepVariant, which shows that calls that have quality scores above 25 or 30 are highly unlikely to result in false positives27,28. We then filtered all of the suggested alternate corrections with Merfin41, a tool concurrently developed by members of the T2T consortium, to avoid introducing error k-mers (Figs. 1b and 3c). Finally, we ignored variants near the distal or proximal rDNA junctions on the short arms of the acrocentric chromosomes to avoid homogenizing the alleles from the unassembled rDNAs. After merging all variant calls, we identified 993 small variants (≤50 bp) that represented potential assembly errors and heterozygous sites. From these 993 assembly edits, about two-thirds were homopolymer corrections (512) or low-complexity microsatellite repeats composed of two distinct bases in homopolymer-compressed space (hereby noted as ‘2-mer’) consistent with previous observations of HiFi sequence errors or bias17. Across all 617 loci, we evaluated the edit distribution using both Illumina and HiFi reads and found that the majority of Illumina reads supported the longer homopolymer or 2-mer repeat lengths compared to HiFi reads, thereby uncovering systemic biases in both homopolymer and 2-mer length in HiFi reads17 that caused the propagation of these errors into the consensus assembly sequence (Fig. 3d).

Fig. 3: Errors corrected after polishing.
figure 3

a, Three corrected SV-like errors. b, Bionano optical maps indicating the missing telomeric sequence on the p arm of chromosome 18 (left) with a higher-than-average mapping coverage. This excessive coverage was removed after adding the missing telomeric sequence (right) and most of the Bionano molecules end at the end of the sequence. c, VAF of each variant called by DeepVariant hybrid (HiFi + Illumina) mode, before and after polishing. Most of the high-frequency variants (errors) were removed after polishing, which were called ‘homozygous’ variants. d, Total number of reads in each observed length difference (bp) between the assembly and the aligned reads at each edit position. Positive numbers indicate that more bases are found in the reads, while negative numbers indicate fewer bases in the reads. Both the homopolymer and microsatellite (2-mers in homopolymer-compressed space) length difference became 0 after polishing.

We used Parliament2 (ref. 42) and Sniffles43 to identify medium-sized (>50 bp) assembly errors and heterozygous structural variants (SVs). Parliament2 runs six SV callers42 using short-read data, while the Sniffles detects SVs using one of the long-read technologies (HiFi, ONT and CLR). To improve specificity, we only considered Parliament2 calls supported by at least two SV callers and Sniffles calls supported by at least two long-read technologies. Similarly to small-variant detection, we excluded SVs called in the partial rDNA arrays and the HSat3 satellite repeat on chromosome 9. This pipeline identified a relatively small number of SV calls (66; Extended Data Fig. 2) that we were able to manually curate via genome browsing. In total, we corrected three medium-sized assembly errors (replacing 1,998 bp of CHM13v0.9 sequence with 151 bp of new sequence), and we identified 44 heterozygous SVs (Fig. 3a and Extended Data Fig. 3b). We also identified a missing telomere sequence on the p arm of chromosome 18—a potential result of the string graph simplification process and confirmed through Bionano mapping (Figs. 1b and 3b). To correct this omission, we used the CHM13v0.9 graph to identify a set of HiFi reads expected to cover this locus1 and found ONT reads that mapped to the corresponding subtelomere and contained telomeric repeats. We used the ONT reads to derive a consensus chromosome 18 extension that was subsequently polished with the associated HiFi reads. After patching this telomere extension, we used Bionano alignments to confirm the accuracy of this locus (Fig. 3b). Altogether, the small and medium-sized variant calls along with the chromosome 18 telomere patch were combined into two distinct VCF files: a polishing edits file (homozygous ALT variants and the telomere patch) and a file for heterozygous variants (all other variants). We created the polished CHM13v1.0 assembly by incorporating these edits into the CHM13v0.9 with bcftools44.

We ensured polishing accuracy by extensive manual validation through visual inspection of the repeat-aware alignments, error k-mers, marker k-mers and marker-assisted alignments. Here, we defined ‘marker’ k-mers as k-mers that occur only once in the assembly and in the expected single-copy coverage range of the read k-mer database and are highly likely to represent unique regions of the assembly (Extended Data Fig. 3b–d)33. To generate marker-assisted alignments, we filtered Winnowmap2 (ref. 34) alignments to exclude any alignments that did not span marker k-mers (https://github.com/arangrhie/T2T-Polish/tree/master/marker_assisted/). Our findings supported that most genomic loci contained a deep coverage of marker k-mers to facilitate marker-assisted alignment, except for a few highly repetitive regions (11.3 Mb in total) that lacked markers (termed ‘marker deserts’; Fig. 1a and Extended Data Fig. 3c,d). In parallel, we used TandemMapper36 to detect structural errors in all centromeric regions, including identified marker deserts. TandemMapper36 used locally unique markers for the detection of marker order and orientation discrepancies between the assembly and associated long reads. We manually validated all large polishing edits and heterozygous SVs, and many small loci were validated ad hoc.

Validation of CHM13v1.0

Given the high completeness and accuracy standards of the T2T consortium, and knowing that polishing may introduce additional errors41, we took extra precautions to validate polishing edits and to ensure that edits did not degrade the quality of CHM13v0.9. First, we repeated self-alignment variant calling methods on CHM13v1.0, confirming that all edits made were correct (Fig. 3a). Through Bionano optical map alignments, we validated the structural accuracy of the chromosome 18 telomere patch and confirmed that all 46 telomeres were represented in CHM13v1.0 (Fig. 3b). Notably, our polishing led to a marked improvement in the distribution of GQ and VAF of small-variant calls (Fig. 3c and Extended Data Fig. 4a). Our approach also increased the base-level consensus accuracy from Q70.22 in CHM13v0.9 to Q72.62 in CHM13v1.0. Further, we found that error k-mers were uniformly distributed along each chromosome, suggesting that remaining errors were not clustered within certain genomic regions (Extended Data Fig. 4b,c). Upon reevaluation of the homopolymers and 2-mers, most of the biases that we found in CHM13v0.9 from HiFi reads had been accurately removed, achieving an improved concordance with Illumina reads (Fig. 3d). Polishing did not induce invalid open reading frames (ORFs) in CHM13v0.9 transcripts with valid ORFs, and polishing corrected 16 invalid CHM13v0.9 ORFs (Supplementary Table 1).

Overall, we made a total of 112 polishing edits (impacting 267 bp) in centromeric regions45, with 15 (35 bp) of these edits occurring specifically in centromeric alpha-satellite higher-order repeat arrays. We made 134 edits (4,975 bp) in non-satellite segmental duplications2. Moreover, the polishing edits were neither enriched nor depleted in satellite repeats and segmental duplications (P = 0.85, permutation test), suggesting that non-masked repeats were not overcorrected or undercorrected compared to the rest of the genome (Extended Data Fig. 4c). Finally, through extensive manual inspection, we confirmed the reliability of the alignments for the three SV-associated edits incorporated into CHM13v1.0 (Extended Data Fig. 5), and these efforts uncovered some heterozygous loci in the centromeres. These regions are under active investigation by the T2T consortium to both ensure their structure and understand their evolution45.

As an additional validation, we investigated potential rare or false collapses as well as rare or false duplications in CHM13v1.0. Here, based on k-mer estimates from both GRCh38 and CHM13v1.0 and from Illumina reads for 268 Simon’s Genome Diversity Project (SGDP) samples, we identified regions in CHM13v1.0 with a lower or higher copy number than both GRCh38 and 99% of the SGDP samples2. We found six regions of rare collapses in CHM13v1.0 that were not in GRCh38 (covering 205 kb, four from one single segmental duplication family). Both our HiFi read-depth and Illumina k-mer-based copy number estimates suggest these six regions are likely rare copy number variants in CHM13 (for example, CHM13v1.0 had only a single copy of the 72-kb tandem duplication in GRCh38; Fig. 4a). Additionally, we found that CHM13v1.0 had 33× fewer false or rare collapses than GRCh38 (~185 loci covering 6.84 Mb)6. We identified five regions (160 kb) with rare duplications in CHM13v1.0. This included a single 142-kb region that appeared to be a true, rare tandem duplication based on HiFi read-depth and Illumina k-mer-based copy number estimates (Fig. 4b). Two of the smaller regions appeared to be true, rare tandem duplications, and two other small regions were identified during polishing as heterozygous or mosaic deletions, revealing potential tandem duplications arising during cell line division or immortalization. In summary, we found 7.5× fewer rare or falsely duplicated bases in CHM13v1.0 relative to the 12 likely falsely duplicated regions affecting 1.2 Mb and 74 genes in GRCh38 (ref. 6), including the medically relevant CBS, CRYAA and KCNE1 genes46.

Fig. 4: Examples of the largest CHM13 regions with a copy number in the reference that differs from GRCh38 and most individuals.
figure 4

a, One of the two largest examples of rare collapses in CHM13, where one copy of a common 72-kb tandem duplication is absent in CHM13. b, The largest rare duplication in CHM13, a 142-kb tandem duplication in GRCh38 that is rare in the population. CHM13 and HG002 PacBio HiFi coverage tracks are displayed for both references, GRCh38 (top) and CHM13v1.0 (bottom), to demonstrate that CHM13 reads support the CHM13 copy number, but HG002 reads are consistent with the GRCh38 copy number. Read-depth copy number estimates in CHM13 are shown at the bottom for ‘k-merized’ versions of GRCh38 and CHM13v1.0 references, CHM13 Illumina reads and Illumina reads from a diverse subset (n = 34) of SGDP individuals.

Toward a completely polished sequence of a human genome

While evaluating CHM13v1.0, the T2T consortium successfully completed the construction of the rDNA models and their surrounding sequences on the p arms of the five acrocentric chromosomes1. In parallel, we determined that all telomeric sequences remained unpolished. Specifically, in canonical [TTAGGG]n repeats, we found both HiFi read coverage dropouts and ONT strand bias impeded high-quality variant calling (Extended Data Fig. 6a). For ONT, we observed only negative strands on the p arm and only positive strands on the q arm across all telomeric repeats at chromosomal ends; we suspect the ONT ultra-long transposon-based library preparation prevents reads from starting at chromosome ends, causing reads to only read into the telomere10,33. We tailored our PEPPER-based polishing approach and performed targeted telomere polishing to remove these errors remaining in telomeric sequences (Methods). Finally, automated polishing (described below), indicated that the FAM156B gene was heterozygous in CHM13v0.9, and CHM13v1.0 represented the rare minor allele (encoding a premature stop codon) at this locus. We replaced this minor allele with the other CHM13 allele encoding a full-length protein sequence. Overall, we made 454 telomere edits, producing longer stretches of maximum perfect matches to the canonical k-mer at each position across these telomeres compared to CHM13v1.0 (Extended Data Fig. 6b). Combined with the parallel completion of the five rDNA arrays, our final round of polishing led to an improved QV of Q73.94 for CHM13v1.1.

Again, to ensure updates did not compromise the high accuracy of the assembly and to identify any remaining issues, we carried out an additional round of SV detection and manual curation using HiFi and ONT with an updated Winnowmap2 alignment (Methods and Extended Data Fig. 7), classifying seven loci as remaining issues in CHM13v1.1 (Supplementary Table 2). We excluded CLRs because the lower base accuracy compared to HiFi and ONT and shorter read length compared to ONT were adding no information. Bionano was also excluded as the molecules were lacking coverage in centromeric regions (Extended Data Fig. 8) and did not detect any structural issues beyond the missing telomere and a few heterozygous SVs already identified by HiFi and ONT. Two loci located in the rDNA sequences appear to be a potential discrepancy between the model consensus sequence and actual reads or an artifact of mapping or sequencing bias. Lower consensus quality is indicated at two other loci, one detected with read alignments that were both low in coverage and identity, and one of which contained error k-mers detected by the hybrid dataset. One locus consisted of multiple insertions (<1 kb) with breakpoints detected in low-complexity sequences associated with heterozygous variants and indicated a possible collapsed repeat (Extended Data Fig. 9) and an additional two loci joined and created an artificial chimeric haplotype (Extended Data Fig. 10). Additionally, we found 218 low-coverage loci using HiFi (Supplementary Table 3), with 81.2% associated with GA-rich (78.0%) regions. The remaining 41 loci had signatures of lower consensus quality and alignment identity, and 30 had error k-mers detected from the hybrid k-mer dataset. In contrast, we detected one low-coverage locus using ONT that overlapped with the GA-rich model rDNA sequence. We associated most remaining loci, totaling only 544.8 kb or <0.02% of assembled sequence, with lower consensus quality in regions lacking unique markers. Overall, we found 394 heterozygous regions, including regions with clusters of heterozygous variants (https://github.com/mrvollger/nucfreq/), totaling 317 sites (~1.1 Mb).

We manually curated both the breakpoints and alternate sequences associated with 47 heterozygous SVs, including sites previously inspected (CHM13v1.0) for SV-like error detection. We then investigated HiFi read alignment clippings and confirmed an association with clipping to both true heterozygous variant and spurious low-frequency alignments. Additionally, we detected a further heterozygous inversion that went previously undetected.

A comparison to automated assembly polishing

To demonstrate the efficacy of the customized DeepVariant-based approach, we compared our semiautomated polishing approach used to create CHM13v1.0 (Q72.62) to a popular state-of-the-art automated polishing tool, Racon30. We iteratively polished CHM13v0.9 (three rounds) using Racon with PacBio HiFi alignments. While the QV improved from Q70.22 to Q70.48 after the first round of Racon polishing, it degraded with the subsequent second (Q70.26) and third (Q70.15) rounds, ultimately diminishing assembly accuracy as a result of overcorrection. We also found that Racon incorporated 7,268 alternate alleles from heterozygous variants identified by DeepVariant, thus potentially causing undesirable haplotype switching in originally haplotype-consistent blocks. To examine how Racon polished large, highly similar repetitive elements, we counted the number of corrections in nonoverlapping 1-Mb windows of the CHM13v0.9 assembly and measured local polishing rates. Unlike CHM13v1.0, Racon polishing showed a clear right tail in the distribution of polishing rates, indicating the presence of polishing ‘hotspots’, defined here as loci with >60 corrections/Mb (Fig. 5a). The proximal and distal junctions of the rDNA units (masked from CHM13v1.0 polishing) were prevalent among these loci, a finding that reinforced the importance of masking known collapsed but resolved loci to avoid overcorrection. We also found non-rDNA loci that were preferentially polished by Racon, including satellite repeats such as the highly repetitive HSat3 region in chromosome 9. Finally, CHM13v1.0 made two corrections, recovering two protein-coding transcript’s ORFs, but Racon did not make these corrections (Supplementary Table 1). While CHM13v1.0 did not induce invalid ORFs in any transcripts, Racon made ten corrections that caused invalid ORFs in 22 transcripts (from nine genes) (Fig. 5b). Most of these corrections occurred at homopolymer repeats, consistent with our previous findings that homopolymer bias in HiFi reads could lead to false expansion or contraction of homopolymers during polishing.

Fig. 5: Errors made by automated polishing.
figure 5

a, The distribution of the number of polishing edits made in nonoverlapping 1 Mb windows of the CHM13v0.9 assembly. b, Two Racon polishing edits causing false frameshift errors in the FAM156B gene. Light blue indicates the untranslated region, and dark blue indicates the single coding sequence exon. Highlighted sequence indicates GC-rich homopolymers.

To overcome these relative shortcomings of Racon polishing, we tested polishing the CHM13v0.9 assembly with three iterative rounds of Racon followed by filtering with Merfin (Racon + Merfin). After each round of polishing, Merfin removed proposed Racon edits that incorporated false assembly k-mers. As expected, the Racon + Merfin assembly QV monotonically increased from Q70.22 to Q77.34, Q77.99 and Q78.12. However, Racon + Merfin still incorporated 2,274 alternate alleles from heterozygous variants and polishing hotspots were still evident, suggesting that some repeats were overcorrected (Fig. 5a). These overcorrections are not reflected in the QV measurements as k-mers from true heterozygous variants are considered ‘valid’ sequences. Merfin mitigated the ten ORF-invalidating Racon corrections; however, Merfin also failed to correct the two reading frame corrections made in CHM13v1.0 but not Racon (Supplementary Table 1). Overall, when considering only automated polishing, we suggest that Racon and Merfin can be used together as a highly effective strategy for building reference assemblies with minimum false-positive corrections. However, we would like to emphasize that a custom polishing pipeline with manual interventions is still required for preserving haplotype consistency and avoiding repeat overcorrection.

Source link