
Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

Variant-call filtering for higher precision

A reference genome with its sequence replaced at all alternate variant calls can be considered a ‘consensus’ sequence and evaluated with k-mers. Unlike using a de novo assembled genome of the same individual as a reference, natural biological differences between the sequenced individual and the reference genome, or an incomplete reference (e.g., a missing segmental duplication), impose challenges to reliably calling variants. Nevertheless, it is possible to construct consensus paths from a variant or series of variants within k base pairs (bp) and confirm their validity. We score each path by the number of k-mers never found in the reads (error k-mers) and choose the path containing the fewest error k-mers (Fig. 1a and Extended Data Fig. 1a; ‘-filter’ mode).
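The path-scoring idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual Merfin implementation; the `best_path` helper and the toy sequences are invented for this example:

```python
from itertools import product

def kmers(seq, k):
    # all k-length substrings of seq
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def best_path(ref_window, variants, read_kmers, k=5):
    """variants: list of (pos, ref_allele, [alt_alleles]); each site may also
    keep its reference allele (no change)."""
    best = None
    for choice in product(*[[ref] + alts for _, ref, alts in variants]):
        seq = ref_window
        # apply the chosen alleles right-to-left so earlier positions stay valid
        for (pos, ref, _), allele in sorted(zip(variants, choice),
                                            key=lambda pair: -pair[0][0]):
            seq = seq[:pos] + allele + seq[pos + len(ref):]
        errors = sum(1 for km in kmers(seq, k) if km not in read_kmers)
        if best is None or errors < best[0]:
            best = (errors, seq)
    return best  # (number of error k-mers, chosen consensus path)

# toy data: the reads support the A > C substitution at position 5
read_kmers = set(kmers("GGGTACGTACGGG", 5))
errors, consensus = best_path("GGGTAAGTACGGG", [(5, "A", ["C"])], read_kmers)
```

Here the A > C substitution removes all error k-mers, so that path is selected, mirroring the example in Fig. 1a.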

Fig. 1: Algorithms and results used in Merfin.

a, Two variant calls and their potential consensus paths. The bases and k-mers in red are errors not found in the reads. The path with A > C has no error k-mers and is chosen for genotyping. For polishing, the average K* is computed in addition to the missing k-mers, using the predicted absent k-mers (*). b, Precision, recall and F1 score from a benchmark on HG002 genotyping. Merfin always achieves higher precision and F1 scores than the hard-filtered approach, with almost no loss in recall. Default, no filtering; red, hard-filtering on default. c, K-mer frequency found in the consensus sequence (KC) and in the reads (KR) with average coverage of four, the expected copy number based on the corrected k-mer frequency (Kr = KR / c) and K*. Positive and negative K* values are colored green and red, respectively. The highlighted region (gray) shows the same k-mers and values used to compute K*, as affected by the A base in the reference. If two alternatives bear the same number of missing k-mers, the alternative with the K* closest to zero is chosen. d, K* distribution. K* values deviating from 0 indicate collapsed (+) or expanded (−) k-mers in the assembly. e, Genomescope2 k-mer frequency histogram with theoretical k-multiplicity curves (top) and probabilities (bottom) for zero-, one-, two-, three- and four-copy k-mers, generated using the --fitted_hist option. Note that the three-copy peak is fully contained in the two- and four-copy peaks. f, Diagram for estimating QV* and completeness from k-mers. Each k-mer is a block colored by its state of presence. In the block tower, each column represents the identical k-mer, with its state colored by its presence in the assembly, in the reads or in both. Note that QV* and K* completeness use all k-mers, including their frequency.

To test the validity of this filtering approach, we benchmarked against unfiltered (default) and hard-filtered variant calls submitted to precisionFDA challenge II, HG002 (ref. 1). The variants were called from Illumina reads or from multiple platforms (Illumina, PacBio HiFi and ONT) using GRCh38 as the reference with GATK HaplotypeCaller. Hard-filtering was performed using the variant caller’s internal scores, such as PASS, QD, MQ and QUAL. When comparing precision, recall and F1 score (the harmonic mean of precision and recall) on a truth set for Chr. 20 (ref. 27), Merfin always achieved higher precision with minimal loss in recall on both the default and hard-filtered sets (Fig. 1b). The hard-filtered set had higher precision, at the price of losing more true positives, resulting in a lower F1 score than the default set. Merfin was able to remove additional false positives from the hard-filtered set. True-positive variants used in this analysis, ranging from 48-bp deletions to 47-bp insertions, were all recovered by Merfin.

Assembly evaluation

When a reference genome is constructed from the same individual, the k-mer multiplicity seen in the reads is expected to match that of the reference. This property can be used to evaluate de novo assembled genomes. Here, we introduce our revised K*, which identifies potentially collapsed and expanded regions in an assembly, together with quantitative metrics representing assembly copy-number concordance and completeness.

Identifying collapsed and expanded regions

The K* metric was defined previously to detect identical collapsed repeats at each k-mer in the assembly21. The method proposed K* = KR / KC, where KR is the frequency of a k-mer found in the reads and KC is the frequency of the same k-mer across the entire consensus sequence of the assembly. In regions with no collapsed repeats, K* will equal c, the average coverage of the sequencing reads. Here we revised K* such that it evaluates both collapses and expansions. We propose K* = (Kr − KC) / min(Kr, KC), where Kr is the expected copy number inferred from the reads (Fig. 1c). For a perfect genome assembly and an unbiased read set, K* is normally distributed with mean 0, and deviations from the mean reflect natural variation in the Poisson sampling process (Fig. 1d). Conversely, any large deviation from the normal distribution can be interpreted either as a bias in the assembly (an assembly error) or a bias in the read set. Specifically, a positive K* implies that the assembly contains fewer copies of a k-mer than suggested by the read set (collapsed), whereas a negative K* implies more copies in the assembly than suggested by the read set (expanded).
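As a small worked sketch (assuming KR, KC and the haploid-peak coverage c are already computed; this is an illustration, not Merfin's code), the revised K* can be written as:

```python
# Revised K* sketch: positive -> collapsed, negative -> expanded, 0 -> agreement.
def k_star(KR, KC, c):
    Kr = round(KR / c)            # expected copy number inferred from the reads
    if Kr == KC:
        return 0.0                # assembly agrees with the read evidence
    if min(Kr, KC) == 0:
        # k-mer entirely missing from the reads or from the assembly
        return float("inf") if Kr > KC else float("-inf")
    return (Kr - KC) / min(Kr, KC)

# with haploid coverage c = 20:
# k_star(40, 2, 20) -> 0.0  (two copies supported, two assembled)
# k_star(60, 2, 20) -> 0.5  (three copies supported, two assembled: collapsed)
# k_star(20, 2, 20) -> -1.0 (one copy supported, two assembled: expanded)
```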

Kr can be obtained by rounding KR / c to the nearest integer, where c is the haploid (one-copy) peak of the k-mer distribution of the reads. Here we assume that rounding is sufficient to account for the s.d. associated with the Poisson process underlying read generation. While this is true for a perfectly sampled sequencing set, the validity of this generalization is challenged in the presence of sampling bias, systematic error in the reads and variable degrees of heterozygosity that result in different likelihoods of specific copy numbers. To account for this uncertainty and improve the accuracy of the results, we modified Genomescope2 (ref. 28) to probabilistically infer Kr for each KR, using the observed k-mer count distribution in the read set. If supplied, Merfin will use these probabilities for Kr ≤ 4 (Fig. 1e).
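The probabilistic lookup can be sketched as follows. The `infer_kr` helper and the probability table are hypothetical; real probabilities would come from the modified Genomescope2 --fitted_hist output:

```python
# Sketch: pick the expected copy number Kr for an observed read k-mer count KR
# from per-copy-number probabilities (invented values for illustration).
def infer_kr(KR, prob_table):
    """Return the most probable copy number for KR, or None if no
    probabilities are available (the caller then falls back to round(KR / c))."""
    probs = prob_table.get(KR)
    if probs is None:
        return None
    return max(probs, key=probs.get)

# hypothetical probabilities for KR = 25 under ~20x haploid coverage
table = {25: {0: 0.01, 1: 0.80, 2: 0.18, 3: 0.01}}
kr = infer_kr(25, table)
```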

QV* estimation

An average genome-wide QV accounting for excessive copy numbers (hereafter defined as QV*) can be obtained by counting ∑(KC − Kr) as errors wherever KC > Kr, over all positions in the assembly (Fig. 1f and Extended Data Fig. 1b; ‘-hist’ mode). These excessive and error k-mers can be generalized as ‘errors’ in the Phred-scaled QV, as in Merqury16 or YAK29.
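Under these definitions, QV* can be sketched as follows. This is a simplified illustration, not Merfin's implementation; it assumes per-k-mer (Kr, KC) pairs and converts the k-mer error rate to a per-base Phred score in the manner of Merqury:

```python
import math

def qv_star(pairs, k=21):
    """pairs: (Kr, KC) for every k-mer position in the assembly."""
    errors = sum(KC - Kr for Kr, KC in pairs if KC > Kr)   # excess copies
    total = len(pairs)
    if errors == 0:
        return float("inf")
    # k-mer error rate -> per-base error probability -> Phred scale
    p_base = 1 - (1 - errors / total) ** (1 / k)
    return -10 * math.log10(p_base)

# one excess k-mer copy among a million 21-mers gives a QV* of ~73
qv = qv_star([(1, 2)] + [(1, 1)] * 999_999)
```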

Assembly completeness

The sum of Kr − KC (over all positions where Kr > KC) expresses absent k-mers that should be present in the assembly and can be directly translated into a measure of assembly completeness as 1 − ∑(Kr − KC) / ∑Kr (Fig. 1f). Notably, contrary to other measures of assembly completeness based on a subset of the k-mers (such as relying only on the occurrence of distinct k-mers, as in Merqury16), Merfin uses all k-mers, including their frequency, and computes the fraction of the expected total number of k-mers (Extended Data Fig. 1b; ‘-completeness’ mode).
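The completeness formula translates directly into code (a simplified per-k-mer view assuming hypothetical (Kr, KC) pairs, not Merfin's implementation):

```python
def k_completeness(pairs):
    """pairs: (Kr, KC) per k-mer; Kr = expected copies, KC = assembled copies."""
    missing = sum(Kr - KC for Kr, KC in pairs if Kr > KC)  # copies absent
    expected = sum(Kr for Kr, _ in pairs)                  # all expected copies
    return 1 - missing / expected

# one of five expected k-mer copies is missing -> 80% complete
comp = k_completeness([(2, 2), (2, 1), (1, 1)])
```

Because every expected copy contributes to the denominator, a collapsed two-copy repeat counts as one missing copy rather than being invisible, unlike distinct-k-mer completeness.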

Sequence polishing

The K* becomes particularly useful in polishing. An increased QV is achievable through a dedicated polishing tool or via corrections identified by a standard variant-calling method. Even when using polishing tools, generating a set of potential corrections in variant-call format (VCF) allows finer control over the outcome and can be assessed with Merfin. In Merfin, the impact of each correction, or combination of corrections, is assessed from the given candidates by comparing the change in K* metrics (Fig. 1a,c and Extended Data Fig. 1a; ‘-polish’ mode). In addition to the error k-mers collected along each predicted consensus path, we compute the consequent k-mer frequency change and accept a correction only when it improves the assembly-read agreement. For example, when a suggested correction (replacing AT with A, as shown in Fig. 1a) introduces more error k-mers, it should not be used for polishing. Even when no error k-mers are introduced, K* theoretically informs whether a path improves the assembly-read agreement. The current implementation evaluates each path independently, so only a local optimum is guaranteed. Variants within distance k of one another are considered in all combinations, allowing Merfin to filter variant calls close to each other. This approach is fully independent of the raw dataset employed: for instance, the assembly could be generated from long reads and the calls evaluated using short reads, long reads or both, taking advantage of the strengths of each sequencing platform. This makes accurate orthogonal validation possible and ultimately maximizes the assembly-read agreement.
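The accept/reject decision for a single candidate correction can be sketched as follows. This is deliberately simplified — Merfin also weighs the K* change and evaluates combinations of nearby variants — and the helper names and toy sequences are invented:

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def accept_edit(window, edited, read_kmers, k=5):
    """Accept the edit only if it reduces the number of error k-mers
    (k-mers never observed in the reads) in the local window."""
    err_before = sum(km not in read_kmers for km in kmers(window, k))
    err_after = sum(km not in read_kmers for km in kmers(edited, k))
    return err_after < err_before

# toy data: the reads support C at the edited position
read_kmers = set(kmers("GGGTACGTACGGG", 5))
good = accept_edit("GGGTAAGTACGGG", "GGGTACGTACGGG", read_kmers)  # fixes errors
bad = accept_edit("GGGTACGTACGGG", "GGGTAAGTACGGG", read_kmers)   # adds errors
```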

Evaluating a complete human genome: T2T-CHM13

The CHM13hTERT (CHM13) cell line originates from a complete hydatidiform mole (46, XX), whose two haplotypes are nearly identical30. This cell line was used to generate the most complete high-quality human reference to date, resolving all centromeric and telomeric repeats, segmental duplications and satellite arrays22,23. Notably, T2T-CHM13v0.9 was polished using a variety of variant calls filtered with an earlier version of Merfin, which improved the consensus accuracy of the final assembly25. We further evaluated candidate assemblies with Merfin to identify collapses and expansions, using k-mers from HiFi and Illumina reads. We found that the T2T-CHM13v1.0 assembly shows remarkable agreement with the raw data, with only a few regions having K* values far from 0, coinciding with satellite repeats (Fig. 2a). Rather than being assembly errors, these disagreements were associated with context-dependent augmentation or depletion in HiFi and with GC bias in Illumina22,25. In HiFi, sequencing coverage depends on sequence content22. Illumina has a similar dependence on GC content, which introduces biases during library preparation, but not necessarily in the same direction as in HiFi. Indeed, K* values derived from HiFi and Illumina k-mers showed opposite behavior in some regions, e.g., the HSat3 array on chromosome 9 (Fig. 2b). These effects were observed only in the highly repetitive regions of the genome.

Fig. 2: CHM13 evaluation and polishing.

a, Genome-wide K* for the T2T-CHM13 assembly v1.0. Satellites are associated with repeat- and technology-specific biases. Yet-to-be-resolved rDNA arrays (red) are highlighted by positive K*. b, Highlight of the centromeric satellite repeats45 and segmental duplications (SDs)23 (orange indicates most similar, yellow less and gray least similar) on chromosome 9. hor, higher-order repeat. c, Genome-wide density distribution of K* using HiFi k-mers. When the assembly agrees with the raw data, K* is normally distributed with mean 0; the smaller the s.d., the higher the agreement. T2T-CHM13v1.0 shows a less-dispersed K* distribution than the less-complete v0.7 assembly. d, Genome-wide comparison of K* computed using HiFi versus Illumina k-mers on the v1.0 assembly. Agreement between the assembly and the raw reads from both technologies is found around (0, 0). The top-right quadrant highlights where both technologies suggest the presence of underrepresented k-mers, largely contributed by the un-assembled rDNAs later resolved in v1.1 (ref. 22); the bottom-left quadrant highlights where both technologies suggest the presence of overrepresented k-mers (with perfect agreement found on the diagonal). The axes correspond to regions of substantial disagreement between the two technologies. Diamonds indicate k-mers missing from one (x or y axis) or both (0, 0) technologies.

Compared to a less-complete and less-accurate preliminary assembly, T2T-CHM13v0.7 (ref. 31), T2T-CHM13v1.0 showed higher agreement with the k-mers derived from HiFi (Fig. 2c) and Illumina reads (Extended Data Fig. 2). We found a general agreement in K* between HiFi and Illumina PCR-free k-mers, including in regions with sequencing bias common to the two technologies (Extended Data Fig. 3). In other cases, direct comparison of the K* computed from the two technologies highlighted technology-specific sequencing biases (Fig. 2a,d). In particular, a genome-wide comparison of the K* computed using HiFi versus Illumina k-mers on the v1.0 assembly shows substantial agreement between the assembly and the raw reads (Fig. 2d; coordinates 0, 0). The only k-mers consistently seen as underrepresented in both technologies (Fig. 2d; top-right quadrant) were mostly contributed by the un-assembled ribosomal DNAs (rDNAs) later resolved in v1.1 (ref. 22). At base resolution, the K* could distinguish regions with accurate consensus from bp errors, small and large indels, heterozygous sites and collapsed/expanded regions (Extended Data Fig. 4a,b).

Both QV and QV*, measured with Merqury and Merfin, improved from v0.7 to v0.9 (ref. 25), which involved a complete reassembly of the genome using HiFi reads and patches from v0.7 at GA-rich sequence dropouts in the HiFi reads (Supplementary Table 1). Merqury QV improved from v0.7 to v0.9 owing to the dramatic decrease in error k-mers; however, the Merfin QV* increased only marginally, as the number of error k-mers is small compared to the number of overly represented k-mers, likely due to sequencing biases. We argue that QV* may still be a more reliable metric because it accounts for all expected k-mer copy numbers, reflecting the full extent of genome representation. QV* is also only marginally influenced by coverage, as shown by a titration experiment (Extended Data Fig. 5 and Supplementary Table 2).

Polishing a completely phased assembly: HG002

The need for polishing is particularly evident in genome assemblies generated using noisy long reads. Therefore, we tested Merfin’s variant-calling filtering algorithm on an Oxford Nanopore-based assembly of human HG002 trio data generated by the HPRC using Flye32,33 (ref. 26). We benchmarked Merfin against Medaka by comparing polishing outcomes from Medaka with or without Merfin filtering. In a trio setting, the optimal approach is to polish each parental assembly separately, by aligning the binned reads and performing variant calling5,34. This reduces the introduction of haplotype switches. However, our k-mer-based evaluation of the corrections is best performed on a combined assembly, so that it faithfully represents the expected copy number of each k-mer given the read set.

We first called variants separately from the binned reads used in the assembly with Medaka, and then combined the variant calls and the parental assemblies for the Merfin variant-filtering step. The k-mers used in Merfin were computed from Illumina sequencing reads. We conducted five experiments using read sets that differ in coverage, Guppy basecaller version and read-length cutoff (Supplementary Table 3). Two rounds of polishing were conducted in all experiments, with the second round performed on the consensus from the first round generated with the additional Merfin step. Overall, in all experiments we observed comparable improvements in base-calling accuracy, as measured by Merqury QV, when Merfin filtering was applied (Supplementary Table 3). This increase reflected a dramatic positive shift in the QV distribution of individual contigs, with most low-quality contigs rescued by Merfin and a sharp increase in the number of contigs found without errors, leading to a final Q43.2 and Q42.8 for the maternal and paternal haplotypes, respectively (Fig. 3a). In the second round of polishing, the QV ceased to improve, or even decreased, when Merfin was not applied (Fig. 3a and Supplementary Table 3), suggesting that the best trade-off between errors corrected and errors introduced was already reached in the first round. In contrast, QV continued to increase relative to the first round with Merfin. Haplotype blocks as defined by Merqury increased comparably, if not better, when using Merfin (Fig. 3b), while haplotypes remained fully phased (Extended Data Fig. 6). Notably, the results with Merfin were achieved by introducing only a fraction of the variants proposed by Medaka, making this approach more conservative than regular polishing (Fig. 3c).

Fig. 3: HG002 human trio polishing and evaluation.

a, Distribution of QV scores as measured by Merqury for maternal and paternal contigs polished only with Medaka, or with variants generated by Medaka and filtered with Merfin, from the experiment (test 4, Supplementary Table 3) using the latest basecaller (Guppy 4.2.2) and the highest coverage (~50×). The first panel shows the unpolished contigs, the middle panel the first round of Medaka polishing and filtering, and the last panel the second round applied to the Merfin results from the previous round. The number of contigs without evidence of errors, as judged by Merqury QV, is reported (right). b, Size of the haplotype blocks before and after polishing, with or without Merfin, for both the maternal and paternal assemblies. The first round of polishing is represented by the dotted lines. NG, size of the largest contig (or block) Y such that contigs (blocks) ≥ Y cover X% of the estimated genome size. c, Number of variants generated by Medaka for polishing and variants remaining after Merfin filtering for both the maternal and paternal assemblies. d, Assembly-based HG002 small-variant-calling performance of Merfin versus regular Medaka against the GIAB truth set. Variants from the assembly are derived against GRCh38 using dipcall. SNP, single-nucleotide polymorphism.

We further validated the unpolished and polished HG002 assemblies by aligning each haplotype assembly to GRCh38 and deriving small variants. When benchmarked against the Genome in a Bottle (GIAB) v4.2.1 truth set27, Merfin yielded a better F1 score, particularly for indels (Fig. 3d and Supplementary Table 4)27,35,36.

Evaluation, polishing and annotation of pseudo-haploid assemblies

We next applied Merfin to the polishing steps of the VGP assembly pipeline5 (Extended Data Fig. 7) on pseudo-haploid assemblies from three species (flier cichlid, Archocentrus centrarchus, fArcCen1; Goode’s desert tortoise, Gopherus evgoodei, rGopEvg1; and zebra finch, Taeniopygia guttata, bTaeGut1). Using PacBio CLR and 10X Genomics linked reads for polishing, we observed a general improvement in QV as measured by Merqury (Fig. 4a and Supplementary Table 5). The largest improvement was observed in the first round of the Arrow polishing step using CLR reads. Arrow can replace low-quality sequences with patch sequences generated de novo from the reads that align to the region, independent of the original reference quality. We observed low-coverage sequencing biases (homopolymer shortening) and mosaic haplotypes in the generated patches, leading to cases of lower QV in the polished assembly (Fig. 4a; rGopEvg1). Merfin rescued the QV decrease or improved the QV in all cases. The variant length range (−453:2,242) was not compromised after Merfin (−453:1,618), and many of the variants well above 50 bp were retained by Merfin (Supplementary Table 6), supporting the notion that if the quality of the consensus sequence is sufficient, large calls will not be negatively impacted.

Fig. 4: Polishing and evaluation of VGP pseudo-haploid assemblies.

a–c, Polishing results of primary and alternate assemblies for the flier cichlid (fArcCen1), the Goode’s desert tortoise (rGopEvg1) and the zebra finch (bTaeGut1) using the VGP pipeline. Graphed are the unpolished QV values and the Merqury QV, which accounts only for missing k-mers (a); the Merqury QV corrected using Merfin models for zero-copy k-mers (b); and QV*, which also accounts for overrepresented k-mers (c). d–f, The general QV increase was reflected in the quality of the gene annotation, with consistent reductions in the number of genes affected by premature stop codons (d), frameshift errors (e) and low-quality protein-coding gene predictions (f).

In the subsequent polishing steps performed with Freebayes, the benefit of running Merfin to filter the variant set was less pronounced but still present (Fig. 4a; dashed lines). This was true in all cases except the zebra finch, where the default pipeline performed marginally better. However, when low-frequency k-mers were considered errors under the probability model in Merfin, both the QV and QV* increased in all cases (adjusted QV and QV*; Fig. 4b,c and Supplementary Table 5). Merqury QV counts as errors all k-mers never seen in the reads, whereas the adjusted QV additionally counts low-frequency k-mers as errors based on the k-mer frequency spectrum. The QV* further includes overrepresented k-mers as errors, thereby capturing not only base-accuracy errors but also false duplications, expressing the uncertainty associated with any particular base given the support from the raw reads.

Most long-read assemblers generate locally phased haplotypes (such as Falcon-Unzip37), and it is therefore important that polishing does not introduce haplotype switches. To test whether the QV increase from Merfin was due to the introduction of haplotype switches, we examined a zebra finch (T. guttata, bTaeGut2) pseudo-haploid assembly for which parental sequence information is available, allowing us to evaluate the size of haplotype blocks and the number of haplotype switches using parent-specific k-mers5. When Merfin was applied to filter variants generated by Freebayes on Longranger alignments of the 10X reads in this pseudo-haploid setting, we noticed an increase in the number of haplotype switches as measured with Merqury (Supplementary Table 7). We realized that this was due to many heterozygous variants being called by Freebayes when individual reads were mapped to collapsed regions or preferentially to the more-accurate primary assembly5. The true heterozygous k-mers missing from the collapsed or lower-quality regions were recovered by the heterozygous variant calls and thus preferred by Merfin. Further, even in almost-complete pseudo-haploid assemblies, short reads can be easily mismapped, leading to spurious heterozygous calls. To overcome this issue, we removed all heterozygous variants before applying Merfin. This largely prevented haplotype switches (Extended Data Fig. 8) without affecting the QV increase (Supplementary Table 7). In conclusion, we suggest removing all heterozygous variants before using Merfin as the best practice for polishing pseudo-haploid and haploid assemblies.
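The suggested pre-filter amounts to a genotype check on each VCF record, sketched below with a hypothetical single-sample helper (a real pipeline would typically use bcftools or similar rather than hand-rolled parsing):

```python
# Drop heterozygous calls (GT with two different alleles) before Merfin.
def is_homozygous(vcf_line):
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")           # FORMAT column, e.g. "GT:DP"
    sample = fields[9].split(":")        # first sample column
    gt = sample[fmt.index("GT")].replace("|", "/").split("/")
    return len(set(gt)) == 1             # e.g. 1/1 kept, 0/1 dropped

def filter_het(lines):
    for line in lines:
        if line.startswith("#") or is_homozygous(line):
            yield line

hom = "chr1\t100\t.\tA\tC\t50\tPASS\t.\tGT:DP\t1/1:30"
het = "chr1\t200\t.\tA\tT\t50\tPASS\t.\tGT:DP\t0/1:28"
kept = list(filter_het(["##fileformat=VCFv4.2", hom, het]))
```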

In addition, we validated our results using gene annotations, which are sensitive to consensus accuracy errors and particularly to frameshifts caused by indel errors. We performed de novo gene annotation using the RefSeq38 gene annotation pipeline (Gnomon)39 on the VGP assemblies polished with the conventional VGP pipeline (v1.6) and compared against assemblies where Merfin was applied at every polishing step. In Gnomon, if a protein alignment supports a predicted model with an indel introducing a frameshift or a premature stop codon, the model is labeled ‘low quality’ and a base is added or removed from the predicted model to compensate for the indel in the genome. If more than one in ten coding genes in an assembly require such corrections, the assembly is excluded from RefSeq. Based on information provided by the submitters, almost all rejected assemblies used ONT or PacBio CLR reads.

Again, Merfin substantially reduced the number of genes affected by frameshifts, validating the QV and QV* results (Fig. 4d–f, Supplementary Table 8 and an example in Extended Data Fig. 9). Premature stop codons were significantly reduced with respect to the default polishing in all cases (Fig. 4d), with 42.9%, 42% and 21.7% reductions in fArcCen1, rGopEvg1 and bTaeGut1, respectively. Ultimately, 1% or less of genes had code breaks in all cases when using Merfin. Frameshifts were also positively affected (Fig. 4e), with 38%, 49.6% and 19.5% reductions in fArcCen1, rGopEvg1 and bTaeGut1, respectively. Less than 3% of genes had frameshifts in all cases when using Merfin. Similarly, the number of protein-coding gene predictions labeled as ‘low quality’ was reduced (Fig. 4f). Based on these results, Merfin has been included in the VGP pipeline (v1.7).

Consistent with the variant filtering for genotyping, the QV improvements with Merfin superseded any hard-filtering attempt using variant-call quality score (QUAL) cutoffs at the Arrow polishing step (Fig. 5a–c and Supplementary Table 9). For the primary assembly, QV* estimates were consistently higher than the best results attainable by hard filtering (fArcCen1, Q32.5 versus Q31.9 at QUAL ≥ 18; rGopEvg1, Q38.7 versus Q36.7 at QUAL ≥ 21; bTaeGut1, Q44.4 versus Q42.4 at QUAL ≥ 21). The best QUAL cutoff was not consistent between species, indicating that a single cutoff cannot produce the best outcome in all cases. The alternate assembly (alternate haplotype) behaved similarly to the primary assembly, again with Merfin always performing best (fArcCen1, Q31.6 versus Q31.1 at QUAL ≥ 23; rGopEvg1, Q35.2 versus Q34.2 at QUAL ≥ 26; bTaeGut1, Q42.0 versus Q40.6 at QUAL ≥ 23), although the QUAL cutoffs that maximized QV differed notably. At increased QUAL cutoffs, both genuine and erroneous corrections are filtered out; thus, a hard-filtering cutoff performs best when it maximizes the excess of errors corrected over errors introduced. In contrast, variants selected by Merfin spanned a wide range of quality scores; the majority had higher quality scores, yet many fell below 25 (Fig. 5d–f). Notably, a significant fraction of the variants assigned the highest quality score introduced error k-mers and were thus rejected by Merfin. Accumulated sequencing biases in long reads can potentially lead to erroneous variant calls, but these can be filtered with more-accurate k-mers from short reads. No hard-filtering method was able to achieve the QV improvements in polishing observed with Merfin.

Fig. 5: Merfin results against quality scores.

a–c, QV after polishing as a function of the hard-filtered quality-score cutoff in the primary (black) and alternate (gray) assemblies. Results achieved with Merfin are represented by the horizontal lines for comparison. d–f, Number and proportion of variants by quality score selected by Merfin.

Effect of k-size and computational requirements

The minimum size of k can be determined by a given genome size and a tolerable k-mer collision rate40. This has been adapted in Merqury16 and used for k-mer-based assembly evaluation. In brief, under a maximum allowed collision rate of 0.5%, k = 21 is suggested as the minimum k for genome sizes typical of vertebrate species (1.2–4 Gb), including human, and is used throughout our benchmarks. In theory, a larger k could yield more-accurate variant filtering at the cost of reduced k-mer coverage and increased computational burden. We tested whether k = 31 provides a better F1 score than k = 21 on the variant-filtering GIAB benchmark and found only a marginal improvement (0.04%; Supplementary Table 10) at the cost of 1.5-times more memory and 1.6- to 2.6-times more computation. As a large fraction of read k-mers occur exactly once in the reads (72–89% of all distinct k-mers), we tested how excluding these affects Merfin's performance. Excluding unique k-mers during filtering slightly increased the F1 score (0.01% to 0.03%) compared to using the entire k-mer set, by removing additional false-positive calls. The memory requirement was substantially reduced, from 122.6 GB to 49.2 GB for loading k-mers obtained from ~60× Illumina reads and from 68 GB to 24.3 GB for ~25× Illumina reads, along with reduced CPU hours (Supplementary Tables 10–12). As the filtering, evaluation and polishing results at different k sizes or filtering settings were nearly identical, we recommend avoiding larger k and excluding all unique k-mers before applying Merfin to maximize results with minimal computational burden.
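The minimum-k rule of thumb40 can be sketched as below. The rounding here is illustrative; Merqury's best_k script applies its own defaults, so exact values may differ by one from the k = 21 used in the benchmarks:

```python
import math

def min_k(genome_size, collision_rate=0.005):
    """Smallest k keeping the expected random k-mer collision rate below
    the given threshold (after Fofanov et al.; illustrative rounding)."""
    p = collision_rate
    return math.ceil(math.log(genome_size * (1 - p) / p, 4))

# a ~3.1 Gb human genome at a 0.5% collision rate needs k of about 20-21
k = min_k(3.1e9)
```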
