
Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

Variant-call filtering for higher precision

A reference genome with its sequence replaced at all alternate variant calls can be considered a ‘consensus’ sequence and evaluated with k-mers. Unlike using a de novo assembled genome of the same individual as a reference, natural biological differences between the sequenced individual and the reference genome, or an incomplete reference (e.g., a missing segmental duplication), impose challenges to reliably calling variants. Nevertheless, it is possible to construct consensus paths from a variant or series of variants within k base pairs (bp) and confirm their validity. We score each path by the number of k-mers never found in the reads (error k-mers) and choose the path containing the fewest error k-mers (Fig. 1a and Extended Data Fig. 1a; ‘-filter’ mode).
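The path-scoring idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual Merfin implementation; the `best_path` helper and the toy sequences are invented for this example:

```python
from itertools import product

def kmers(seq, k):
    # all k-length substrings of seq
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def best_path(ref_window, variants, read_kmers, k=5):
    """variants: list of (pos, ref_allele, [alt_alleles]); each site may also
    keep its reference allele (no change)."""
    best = None
    for choice in product(*[[ref] + alts for _, ref, alts in variants]):
        seq = ref_window
        # apply the chosen alleles right-to-left so earlier positions stay valid
        for (pos, ref, _), allele in sorted(zip(variants, choice),
                                            key=lambda pair: -pair[0][0]):
            seq = seq[:pos] + allele + seq[pos + len(ref):]
        errors = sum(1 for km in kmers(seq, k) if km not in read_kmers)
        if best is None or errors < best[0]:
            best = (errors, seq)
    return best  # (number of error k-mers, chosen consensus path)

# toy data: the reads support the A > C substitution at position 5
read_kmers = set(kmers("GGGTACGTACGGG", 5))
errors, consensus = best_path("GGGTAAGTACGGG", [(5, "A", ["C"])], read_kmers)
```

Here the A > C substitution removes all error k-mers, so that path is selected, mirroring the example in Fig. 1a.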

Fig. 1: Algorithms and results used in Merfin.

a, Two variant calls and their potential consensus paths. The bases and k-mers in red are errors not found in the reads. The path with A > C has no error k-mers and is chosen for genotyping. For polishing, the average K* is computed in addition to the missing k-mers, using the predicted absent k-mers (*). b, Precision, recall and F1 score from a benchmark on HG002 genotyping. Merfin always achieves higher precision and F1 scores than the hard-filtered approach, with almost no loss in recall. Default, no filtering; red, hard-filtering on default. c, K-mer frequency found in the consensus sequence (KC) and in the reads (KR) with average coverage of four, the expected copy number based on the corrected k-mer frequency (Kr = KR / c) and K*. Positive and negative K* values are colored green and red, respectively. The highlighted region (gray) shows the same k-mers and values used to compute K*, as affected by the A base in the reference. If two alternatives bear the same number of missing k-mers, the alternative with the K* closest to zero is chosen. d, K* distribution. K* values deviating from 0 indicate collapsed (+) or expanded (−) k-mers in the assembly. e, Genomescope2 k-mer frequency histogram with theoretical k-multiplicity curves (top) and probabilities (bottom) for zero-, one-, two-, three- and four-copy k-mers, generated using the --fitted_hist option. Note that the three-copy peak is fully contained in the two- and four-copy peaks. f, Diagram for estimating QV* and completeness from k-mers. Each k-mer is a block colored by its state of presence. In the block tower, each column represents the identical k-mer, with its state colored by its presence in the assembly, in the reads or in both. Note that QV* and K* completeness use all k-mers, including their frequency.

To test the validity of this filtering approach, we benchmarked against unfiltered (default) and hard-filtered variant calls submitted to precisionFDA challenge II, HG002 (ref. 1). The variants were called from Illumina reads or from multiple platforms (Illumina, PacBio HiFi and ONT) using GRCh38 as the reference with GATK HaplotypeCaller. Hard-filtering was performed using the variant caller’s internal scores, such as PASS, QD, MQ and QUAL. When comparing precision, recall and F1 score (the harmonic mean of precision and recall) on a truth set for Chr. 20 (ref. 27), Merfin always achieved higher precision with minimal loss in recall on both the default and hard-filtered sets (Fig. 1b). The hard-filtered set had higher precision, at the price of losing more true positives, resulting in a lower F1 score than the default set. Merfin was able to remove additional false positives from the hard-filtered set. True-positive variants used in this analysis, ranging from 48-bp deletions to 47-bp insertions, were all recovered by Merfin.

Assembly evaluation

When a reference genome is constructed from the same individual, the k-mer multiplicity seen in the reads is expected to match that of the reference. This property can be used to evaluate de novo assembled genomes. Here, we introduce our revised K*, which identifies potentially collapsed and expanded regions in an assembly, together with quantitative metrics representing assembly copy-number concordance and completeness.

Identifying collapsed and expanded regions

The K* metric was defined previously to detect identical collapsed repeats at each k-mer in the assembly21. The method proposed K* = KR / KC, where KR is the frequency of a k-mer found in the reads and KC is the frequency of the same k-mer across the entire consensus sequence of the assembly. In regions with no collapsed repeats, K* will equal c, the average coverage of the sequencing reads. Here we revised K* such that it evaluates both collapses and expansions. We propose K* = (Kr − KC) / min(Kr, KC), where Kr is the expected copy number inferred from the reads (Fig. 1c). For a perfect genome assembly and an unbiased read set, K* is normally distributed with mean 0, and deviations from the mean reflect natural variation in the Poisson sampling process (Fig. 1d). Conversely, any large deviation from the normal distribution can be interpreted either as a bias in the assembly (an assembly error) or a bias in the read set. Specifically, a positive K* implies that the assembly contains fewer copies of a k-mer than suggested by the read set (collapsed), whereas a negative K* implies more copies in the assembly than suggested by the read set (expanded).
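As a small worked sketch (assuming KR, KC and the haploid-peak coverage c are already computed; this is an illustration, not Merfin's code), the revised K* can be written as:

```python
# Revised K* sketch: positive -> collapsed, negative -> expanded, 0 -> agreement.
def k_star(KR, KC, c):
    Kr = round(KR / c)            # expected copy number inferred from the reads
    if Kr == KC:
        return 0.0                # assembly agrees with the read evidence
    if min(Kr, KC) == 0:
        # k-mer entirely missing from the reads or from the assembly
        return float("inf") if Kr > KC else float("-inf")
    return (Kr - KC) / min(Kr, KC)

# with haploid coverage c = 20:
# k_star(40, 2, 20) -> 0.0  (two copies supported, two assembled)
# k_star(60, 2, 20) -> 0.5  (three copies supported, two assembled: collapsed)
# k_star(20, 2, 20) -> -1.0 (one copy supported, two assembled: expanded)
```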

Kr can be obtained by rounding KR / c to the nearest integer, where c is the haploid (one-copy) peak of the k-mer distribution of the reads. Here we assume that rounding is sufficient to account for the s.d. associated with the Poisson process underlying read generation. While this is true for a perfectly sampled sequencing set, the validity of this generalization is challenged in the presence of sampling bias, systematic error in the reads and variable degrees of heterozygosity that result in different likelihoods of specific copy numbers. To account for this uncertainty and improve the accuracy of the results, we modified Genomescope2 (ref. 28) to probabilistically infer Kr for each KR, using the observed k-mer count distribution in the read set. If supplied, Merfin will use these probabilities for Kr ≤ 4 (Fig. 1e).
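The probabilistic lookup can be sketched as follows. The `infer_kr` helper and the probability table are hypothetical; real probabilities would come from the modified Genomescope2 --fitted_hist output:

```python
# Sketch: pick the expected copy number Kr for an observed read k-mer count KR
# from per-copy-number probabilities (invented values for illustration).
def infer_kr(KR, prob_table):
    """Return the most probable copy number for KR, or None if no
    probabilities are available (the caller then falls back to round(KR / c))."""
    probs = prob_table.get(KR)
    if probs is None:
        return None
    return max(probs, key=probs.get)

# hypothetical probabilities for KR = 25 under ~20x haploid coverage
table = {25: {0: 0.01, 1: 0.80, 2: 0.18, 3: 0.01}}
kr = infer_kr(25, table)
```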

QV* estimation

An average genome-wide QV accounting for excessive copy numbers (hereafter defined as QV*) can be obtained by counting ∑(KC − Kr) as errors wherever KC > Kr, over all positions in the assembly (Fig. 1f and Extended Data Fig. 1b; ‘-hist’ mode). These excessive and error k-mers can be generalized as ‘errors’ in the Phred-scaled QV, as in Merqury16 or YAK29.
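Under these definitions, QV* can be sketched as follows. This is a simplified illustration, not Merfin's implementation; it assumes per-k-mer (Kr, KC) pairs and converts the k-mer error rate to a per-base Phred score in the manner of Merqury:

```python
import math

def qv_star(pairs, k=21):
    """pairs: (Kr, KC) for every k-mer position in the assembly."""
    errors = sum(KC - Kr for Kr, KC in pairs if KC > Kr)   # excess copies
    total = len(pairs)
    if errors == 0:
        return float("inf")
    # k-mer error rate -> per-base error probability -> Phred scale
    p_base = 1 - (1 - errors / total) ** (1 / k)
    return -10 * math.log10(p_base)

# one excess k-mer copy among a million 21-mers gives a QV* of ~73
qv = qv_star([(1, 2)] + [(1, 1)] * 999_999)
```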

Assembly completeness

The sum of Kr − KC (over all positions where Kr > KC) expresses absent k-mers that should be present in the assembly and can be directly translated into a measure of assembly completeness as 1 − ∑(Kr − KC) / ∑Kr (Fig. 1f). Notably, contrary to other measures of assembly completeness based on a subset of the k-mers (such as relying only on the occurrence of distinct k-mers, as in Merqury16), Merfin uses all k-mers, including their frequency, and computes the fraction of the expected total number of k-mers (Extended Data Fig. 1b; ‘-completeness’ mode).
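The completeness formula translates directly into code (a simplified per-k-mer view assuming hypothetical (Kr, KC) pairs, not Merfin's implementation):

```python
def k_completeness(pairs):
    """pairs: (Kr, KC) per k-mer; Kr = expected copies, KC = assembled copies."""
    missing = sum(Kr - KC for Kr, KC in pairs if Kr > KC)  # copies absent
    expected = sum(Kr for Kr, _ in pairs)                  # all expected copies
    return 1 - missing / expected

# one of five expected k-mer copies is missing -> 80% complete
comp = k_completeness([(2, 2), (2, 1), (1, 1)])
```

Because every expected copy contributes to the denominator, a collapsed two-copy repeat counts as one missing copy rather than being invisible, unlike distinct-k-mer completeness.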

Sequence polishing

The K* becomes particularly useful in polishing. An increased QV is achievable through a dedicated polishing tool or via corrections identified by a standard variant-calling method. Even when using polishing tools, generating a set of potential corrections in variant-call format (VCF) allows finer control over the outcome and can be assessed with Merfin. In Merfin, the impact of each correction, or combination of corrections, is assessed from the given candidates by comparing the change in K* metrics (Fig. 1a,c and Extended Data Fig. 1a; ‘-polish’ mode). In addition to the error k-mers collected along each predicted consensus path, we compute the consequent k-mer frequency change and accept a correction only when it improves the assembly-read agreement. For example, when a suggested correction (replacing AT with A, as shown in Fig. 1a) introduces more error k-mers, it should not be used for polishing. Even when no error k-mers are introduced, K* theoretically informs whether a path improves the assembly-read agreement. The current implementation evaluates each path independently, so only a local optimum is guaranteed. Variants within distance k of one another are considered in all combinations, allowing Merfin to filter variant calls close to each other. This approach is fully independent of the raw dataset employed: for instance, the assembly could be generated from long reads and the calls evaluated using short reads, long reads or both, taking advantage of the strengths of each sequencing platform. This makes accurate orthogonal validation possible and ultimately maximizes the assembly-read agreement.
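The accept/reject decision for a single candidate correction can be sketched as follows. This is deliberately simplified — Merfin also weighs the K* change and evaluates combinations of nearby variants — and the helper names and toy sequences are invented:

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def accept_edit(window, edited, read_kmers, k=5):
    """Accept the edit only if it reduces the number of error k-mers
    (k-mers never observed in the reads) in the local window."""
    err_before = sum(km not in read_kmers for km in kmers(window, k))
    err_after = sum(km not in read_kmers for km in kmers(edited, k))
    return err_after < err_before

# toy data: the reads support C at the edited position
read_kmers = set(kmers("GGGTACGTACGGG", 5))
good = accept_edit("GGGTAAGTACGGG", "GGGTACGTACGGG", read_kmers)  # fixes errors
bad = accept_edit("GGGTACGTACGGG", "GGGTAAGTACGGG", read_kmers)   # adds errors
```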

Evaluating a complete human genome: T2T-CHM13

The CHM13hTERT (CHM13) cell line originates from a complete hydatidiform mole (46, XX), whose two haplotypes are nearly identical30. This cell line was used to generate the most complete high-quality human reference to date, resolving all centromeric and telomeric repeats, segmental duplications and satellite arrays22,23. Notably, T2T-CHM13v0.9 was polished using a variety of variant calls filtered with an earlier version of Merfin, which improved the consensus accuracy of the final assembly25. We further evaluated candidate assemblies with Merfin to identify collapses and expansions, using k-mers from HiFi and Illumina reads. We found that the T2T-CHM13v1.0 assembly shows remarkable agreement with the raw data, with only a few regions having K* values far from 0, coinciding with satellite repeats (Fig. 2a). Rather than being assembly errors, these disagreements were associated with context-dependent augmentation or depletion in HiFi and with GC bias in Illumina22,25. In HiFi, sequencing coverage depends on sequence content22. Illumina has a similar dependence on GC content, which introduces biases during library preparation, but not necessarily in the same direction as in HiFi. Indeed, K* values derived from HiFi and Illumina k-mers showed opposite behavior in some regions, e.g., the HSat3 array on chromosome 9 (Fig. 2b). These effects were observed only in the highly repetitive regions of the genome.

Fig. 2: CHM13 evaluation and polishing.

a, Genome-wide K* for the T2T-CHM13 assembly v1.0. Satellites are associated with repeat- and technology-specific biases. Yet-to-be-resolved rDNA arrays (red) are highlighted by positive K*. b, Highlight of the centromeric satellite repeats45 and segmental duplications (SDs)23 (orange indicates most similar, yellow less and gray least similar) on chromosome 9. hor, higher-order repeat. c, Genome-wide density distribution of K* using HiFi k-mers. When the assembly agrees with the raw data, K* is normally distributed with mean 0; the smaller the s.d., the higher the agreement. T2T-CHM13v1.0 shows a less-dispersed K* distribution than the less-complete v0.7 assembly. d, Genome-wide comparison of K* computed using HiFi versus Illumina k-mers on the v1.0 assembly. Agreement between the assembly and the raw reads from both technologies is found around (0, 0). The top-right quadrant highlights where both technologies suggest the presence of underrepresented k-mers, largely contributed by the un-assembled rDNAs later resolved in v1.1 (ref. 22); the bottom-left quadrant highlights where both technologies suggest the presence of overrepresented k-mers (with perfect agreement found on the diagonal). The axes correspond to regions of substantial disagreement between the two technologies. Diamonds indicate k-mers missing from one (x or y axis) or both (0, 0) technologies.

Compared to a less-complete and less-accurate preliminary assembly, T2T-CHM13v0.7 (ref. 31), T2T-CHM13v1.0 showed higher agreement with the k-mers derived from HiFi (Fig. 2c) and Illumina reads (Extended Data Fig. 2). We found a general agreement in K* between HiFi and Illumina PCR-free k-mers, including in regions with sequencing bias common to the two technologies (Extended Data Fig. 3). In other cases, direct comparison of the K* computed from the two technologies highlighted technology-specific sequencing biases (Fig. 2a,d). In particular, a genome-wide comparison of the K* computed using HiFi versus Illumina k-mers on the v1.0 assembly shows substantial agreement between the assembly and the raw reads (Fig. 2d; coordinates 0, 0). The only k-mers consistently seen as underrepresented in both technologies (Fig. 2d; top-right quadrant) were mostly contributed by the un-assembled ribosomal DNAs (rDNAs) later resolved in v1.1 (ref. 22). At base resolution, the K* could distinguish regions with accurate consensus from bp errors, small and large indels, heterozygous sites and collapsed/expanded regions (Extended Data Fig. 4a,b).

Both QV and QV*, measured with Merqury and Merfin, improved from v0.7 to v0.9 (ref. 25), which involved a complete reassembly of the genome using HiFi reads and patches from v0.7 at GA-rich sequence dropouts in the HiFi reads (Supplementary Table 1). Merqury QV improved from v0.7 to v0.9 owing to the dramatic decrease in error k-mers; however, the Merfin QV* increased only marginally, as the number of error k-mers is small compared to the number of overly represented k-mers, likely due to sequencing biases. We argue that QV* may still be a more reliable metric because it accounts for all expected k-mer copy numbers, reflecting the full extent of genome representation. QV* is also only marginally influenced by coverage, as shown by a titration experiment (Extended Data Fig. 5 and Supplementary Table 2).

Polishing a completely phased assembly: HG002

The need for polishing is particularly evident in genome assemblies generated using noisy long reads. Therefore, we tested Merfin’s variant-calling filtering algorithm on an Oxford Nanopore-based assembly of human HG002 trio data generated by the HPRC using Flye32,33 (ref. 26). We benchmarked Merfin against Medaka by comparing polishing outcomes from Medaka with or without Merfin filtering. In a trio setting, the optimal approach is to polish each parental assembly separately, by aligning the binned reads and performing variant calling5,34. This reduces the introduction of haplotype switches. However, our k-mer-based evaluation of the corrections is best performed on a combined assembly, so that it faithfully represents the expected copy number of each k-mer given the read set.

We first called variants separately from the binned reads used in the assembly with Medaka, and then combined the variant calls and the parental assemblies for the Merfin variant-filtering step. The k-mers used in Merfin were computed from Illumina sequencing reads. We conducted five experiments using read sets that differ in coverage, Guppy basecaller version and read-length cutoff (Supplementary Table 3). Two rounds of polishing were conducted in all experiments, with the second round performed on the consensus from the first round generated with the additional Merfin step. Overall, in all experiments we observed comparable improvements in base-calling accuracy, as measured by Merqury QV, when Merfin filtering was applied (Supplementary Table 3). This increase reflected a dramatic positive shift in the QV distribution of individual contigs, with most low-quality contigs rescued by Merfin and a sharp increase in the number of contigs found without errors, leading to a final Q43.2 and Q42.8 for the maternal and paternal haplotypes, respectively (Fig. 3a). In the second round of polishing, the QV ceased to improve, or even decreased, when Merfin was not applied (Fig. 3a and Supplementary Table 3), suggesting that the best trade-off between errors corrected and errors introduced was already reached in the first round. In contrast, QV continued to increase relative to the first round with Merfin. Haplotype blocks as defined by Merqury increased comparably, if not better, when using Merfin (Fig. 3b), while haplotypes remained fully phased (Extended Data Fig. 6). Notably, the results with Merfin were achieved by introducing only a fraction of the variants proposed by Medaka, making this approach more conservative than regular polishing (Fig. 3c).

Fig. 3: HG002 human trio polishing and evaluation.

a, Distribution of QV scores as measured by Merqury for maternal and paternal contigs polished only with Medaka, or with variants generated by Medaka and filtered with Merfin, from the experiment (test 4, Supplementary Table 3) using the latest basecaller (Guppy 4.2.2) and the highest coverage (~50×). The first panel shows the unpolished contigs, the middle panel the first round of Medaka polishing and filtering, and the last panel the second round applied to the Merfin results from the previous round. The number of contigs without evidence of errors, as judged by Merqury QV, is reported (right). b, Size of the haplotype blocks before and after polishing, with or without Merfin, for both the maternal and paternal assemblies. The first round of polishing is represented by the dotted lines. NG, size of the largest contig (or block) Y such that contigs (blocks) ≥ Y cover X% of the estimated genome size. c, Number of variants generated by Medaka for polishing and variants remaining after Merfin filtering for both the maternal and paternal assemblies. d, Assembly-based HG002 small-variant-calling performance of Merfin versus regular Medaka against the GIAB truth set. Variants from the assembly are derived against GRCh38 using dipcall. SNP, single-nucleotide polymorphism.

We further validated the unpolished and polished HG002 assemblies by aligning each haplotype assembly to GRCh38 and deriving small variants. When benchmarked against the Genome in a Bottle (GIAB) v4.2.1 truth set27, Merfin yielded a better F1 score, particularly for indels (Fig. 3d and Supplementary Table 4)27,35,36.

Evaluation, polishing and annotation of pseudo-haploid assemblies

We next applied Merfin to the polishing steps of the VGP assembly pipeline5 (Extended Data Fig. 7) on pseudo-haploid assemblies from three species (flier cichlid, Archocentrus centrarchus, fArcCen1; Goode’s desert tortoise, Gopherus evgoodei, rGopEvg1; and zebra finch, Taeniopygia guttata, bTaeGut1). Using PacBio CLR and 10X Genomics linked reads for polishing, we observed a general improvement in QV as measured by Merqury (Fig. 4a and Supplementary Table 5). The largest improvement was observed in the first round of the Arrow polishing step using CLR reads. Arrow can replace low-quality sequences with patch sequences generated de novo from the reads that align to the region, independent of the original reference quality. We observed low-coverage sequencing biases (homopolymer shortening) and mosaic haplotypes in the generated patches, leading to cases of lower QV in the polished assembly (Fig. 4a; rGopEvg1). Merfin rescued the QV decrease or improved the QV in all cases. The variant length range (−453:2,242) was not compromised after Merfin (−453:1,618), and many of the variants well above 50 bp were retained by Merfin (Supplementary Table 6), supporting the notion that if the quality of the consensus sequence is sufficient, large calls will not be negatively impacted.

Fig. 4: Polishing and evaluation of VGP pseudo-haploid assemblies.

a–c, Polishing results of primary and alternate assemblies for the flier cichlid (fArcCen1), the Goode’s desert tortoise (rGopEvg1) and the zebra finch (bTaeGut1) using the VGP pipeline. Graphed are the unpolished QV values and the Merqury QV, which accounts only for missing k-mers (a); the Merqury QV corrected using Merfin models for zero-copy k-mers (b); and QV*, which also accounts for overrepresented k-mers (c). d–f, The general QV increase was reflected in the quality of the gene annotation, with consistent reductions in the number of genes affected by premature stop codons (d), frameshift errors (e) and low-quality protein-coding gene predictions (f).

In the subsequent polishing steps performed with Freebayes, the benefit of running Merfin to filter the variant set was less pronounced but still present (Fig. 4a; dashed lines). This was true in all cases except the zebra finch, where the default pipeline performed marginally better. However, when low-frequency k-mers were considered errors under the probability model in Merfin, both the QV and QV* increased in all cases (adjusted QV and QV*; Fig. 4b,c and Supplementary Table 5). Merqury QV counts as errors all k-mers never seen in the reads, whereas the adjusted QV additionally counts low-frequency k-mers as errors based on the k-mer frequency spectrum. The QV* further includes overrepresented k-mers as errors, thereby capturing not only base-accuracy errors but also false duplications, expressing the uncertainty associated with any particular base given the support from the raw reads.

Most long-read assemblers generate locally phased haplotypes (such as Falcon-Unzip37), and it is therefore important that polishing does not introduce haplotype switches. To test whether the QV increase from Merfin was due to the introduction of haplotype switches, we examined a zebra finch (T. guttata, bTaeGut2) pseudo-haploid assembly for which parental sequence information is available, allowing us to evaluate the size of haplotype blocks and the number of haplotype switches using parent-specific k-mers5. When Merfin was applied to filter variants generated by Freebayes on Longranger alignments of the 10X reads in this pseudo-haploid setting, we noticed an increase in the number of haplotype switches as measured with Merqury (Supplementary Table 7). We realized that this was due to many heterozygous variants being called by Freebayes when individual reads were mapped to collapsed regions or preferentially to the more-accurate primary assembly5. The true heterozygous k-mers missing from the collapsed or lower-quality regions were recovered by the heterozygous variant calls and thus preferred by Merfin. Further, even in almost-complete pseudo-haploid assemblies, short reads can be easily mismapped, leading to spurious heterozygous calls. To overcome this issue, we removed all heterozygous variants before applying Merfin. This largely prevented haplotype switches (Extended Data Fig. 8) without affecting the QV increase (Supplementary Table 7). In conclusion, we suggest removing all heterozygous variants before using Merfin as the best practice for polishing pseudo-haploid and haploid assemblies.
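The suggested pre-filter amounts to a genotype check on each VCF record, sketched below with a hypothetical single-sample helper (a real pipeline would typically use bcftools or similar rather than hand-rolled parsing):

```python
# Drop heterozygous calls (GT with two different alleles) before Merfin.
def is_homozygous(vcf_line):
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")           # FORMAT column, e.g. "GT:DP"
    sample = fields[9].split(":")        # first sample column
    gt = sample[fmt.index("GT")].replace("|", "/").split("/")
    return len(set(gt)) == 1             # e.g. 1/1 kept, 0/1 dropped

def filter_het(lines):
    for line in lines:
        if line.startswith("#") or is_homozygous(line):
            yield line

hom = "chr1\t100\t.\tA\tC\t50\tPASS\t.\tGT:DP\t1/1:30"
het = "chr1\t200\t.\tA\tT\t50\tPASS\t.\tGT:DP\t0/1:28"
kept = list(filter_het(["##fileformat=VCFv4.2", hom, het]))
```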

In addition, we validated our results using gene annotations, which are sensitive to consensus accuracy errors and particularly to frameshifts caused by indel errors. We performed de novo gene annotation using the RefSeq38 gene annotation pipeline (Gnomon)39 on the VGP assemblies polished with the conventional VGP pipeline (v1.6) and compared against assemblies where Merfin was applied at every polishing step. In Gnomon, if a protein alignment supports a predicted model with an indel introducing a frameshift or a premature stop codon, the model is labeled ‘low quality’ and a base is added or removed from the predicted model to compensate for the indel in the genome. If more than one in ten coding genes in an assembly require such corrections, the assembly is excluded from RefSeq. Based on information provided by the submitters, almost all rejected assemblies used ONT or PacBio CLR reads.

Again, Merfin substantially reduced the number of genes affected by frameshifts, validating the QV and QV* results (Fig. 4d–f, Supplementary Table 8 and an example in Extended Data Fig. 9). Premature stop codons were significantly reduced with respect to the default polishing in all cases (Fig. 4d), with 42.9%, 42% and 21.7% reductions in fArcCen1, rGopEvg1 and bTaeGut1, respectively. Ultimately, 1% or less of genes had code breaks in all cases when using Merfin. Frameshifts were also positively affected (Fig. 4e), with 38%, 49.6% and 19.5% reductions in fArcCen1, rGopEvg1 and bTaeGut1, respectively. Less than 3% of genes had frameshifts in all cases when using Merfin. Similarly, the number of protein-coding gene predictions labeled as ‘low quality’ was reduced (Fig. 4f). Based on these results, Merfin has been included in the VGP pipeline (v1.7).

Consistent with the variant filtering for genotyping, the QV improvements with Merfin superseded any hard-filtering attempt using variant-call quality score (QUAL) cutoffs at the Arrow polishing step (Fig. 5a–c and Supplementary Table 9). For the primary assembly, QV* estimates were consistently higher than the best results attainable by hard filtering (fArcCen1, Q32.5 versus Q31.9 at QUAL ≥ 18; rGopEvg1, Q38.7 versus Q36.7 at QUAL ≥ 21; bTaeGut1, Q44.4 versus Q42.4 at QUAL ≥ 21). The best QUAL cutoff was not consistent between species, indicating that a single cutoff cannot produce the best outcome in all cases. The alternate assembly (alternate haplotype) behaved similarly to the primary assembly, again with Merfin always performing best (fArcCen1, Q31.6 versus Q31.1 at QUAL ≥ 23; rGopEvg1, Q35.2 versus Q34.2 at QUAL ≥ 26; bTaeGut1, Q42.0 versus Q40.6 at QUAL ≥ 23), although the QUAL cutoffs that maximized QV differed notably. At increased QUAL cutoffs, both genuine and erroneous corrections are filtered out; thus, a hard-filtering cutoff performs best when it maximizes the excess of errors corrected over errors introduced. In contrast, variants selected by Merfin spanned a wide range of quality scores; the majority had higher quality scores, yet many fell below 25 (Fig. 5d–f). Notably, a significant fraction of the variants assigned the highest quality score introduced error k-mers and were thus rejected by Merfin. Accumulated sequencing biases in long reads can potentially lead to erroneous variant calls, but these can be filtered with more-accurate k-mers from short reads. No hard-filtering method was able to achieve the QV improvements in polishing observed with Merfin.

Fig. 5: Merfin results against quality scores.

a–c, QV after polishing as a function of the hard-filtered quality-score cutoff in the primary (black) and alternate (gray) assemblies. Results achieved with Merfin are represented by the horizontal lines for comparison. d–f, Number and proportion of variants by quality score selected by Merfin.

Effect of k-size and computational requirements

The minimum size of k can be determined by a given genome size and a tolerable k-mer collision rate40. This has been adapted in Merqury16 and used for k-mer-based assembly evaluation. In brief, under a maximum allowed collision rate of 0.5%, k = 21 is suggested as the minimum k for genome sizes typical of vertebrate species (1.2–4 Gb), including human, and is used throughout our benchmarks. In theory, a larger k could yield more-accurate variant filtering at the cost of reduced k-mer coverage and increased computational burden. We tested whether k = 31 provides a better F1 score than k = 21 on the variant-filtering GIAB benchmark and found only a marginal improvement (0.04%; Supplementary Table 10) at the cost of 1.5-times more memory and 1.6- to 2.6-times more computation. As a large fraction of read k-mers occur exactly once in the reads (72–89% of all distinct k-mers), we tested how excluding these affects Merfin's performance. Excluding unique k-mers during filtering slightly increased the F1 score (0.01% to 0.03%) compared to using the entire k-mer set, by removing additional false-positive calls. The memory requirement was substantially reduced, from 122.6 GB to 49.2 GB for loading k-mers obtained from ~60× Illumina reads and from 68 GB to 24.3 GB for ~25× Illumina reads, along with reduced CPU hours (Supplementary Tables 10–12). As the filtering, evaluation and polishing results at different k sizes or filtering settings were nearly identical, we recommend avoiding larger k and excluding all unique k-mers before applying Merfin to maximize results with minimal computational burden.
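The minimum-k rule of thumb40 can be sketched as below. The rounding here is illustrative; Merqury's best_k script applies its own defaults, so exact values may differ by one from the k = 21 used in the benchmarks:

```python
import math

def min_k(genome_size, collision_rate=0.005):
    """Smallest k keeping the expected random k-mer collision rate below
    the given threshold (after Fofanov et al.; illustrative rounding)."""
    p = collision_rate
    return math.ceil(math.log(genome_size * (1 - p) / p, 4))

# a ~3.1 Gb human genome at a 0.5% collision rate needs k of about 20-21
k = min_k(3.1e9)
```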
