Chromosome-scale and haplotype-resolved genome assembly of a tetraploid potato cultivar

Plant material

Plant material was grown at Max Planck Institute for Plant Breeding Research (MPIPZ, Germany). The three potato cultivars ‘Otava’, ‘Hera’ and ‘Stieglitz’ (Extended Data Fig. 1) were clonally propagated and grown on Murashige-Skoog medium for 3–4 weeks at MPIPZ. Seedlings were transferred to soil in 7 × 7 cm² pots and grown in a Percival growth chamber for 2–3 weeks. Afterwards, potato plants were transferred to 1-l pots and grown until flowering. Potatoes were grown in long day conditions (16 h light, 8 h night cycle) at 22 °C. Details of DNA/RNA library preparation and sequencing are given in the Supplementary Information.

Genome size estimation

After trimming off 10× genomics barcodes and hexamers from the 370.3-Gb reads combined from the 10× single-cell CNV libraries and single-molecule libraries, k-mer counting (k = 21) was performed with Jellyfish (v.2.2.10)³⁸. The k-mer histogram was provided to findGSE (v.1.0)³⁹ to estimate the haploid/tetraploid genome size of ‘Otava’ under the heterozygous mode (with ‘exp_hom = 200’; Extended Data Fig. 1).

Initial tetraploid genome assembly, polishing and purging

The initial assembly of the tetraploid genome was performed using hifiasm (v.0.7)¹⁹ with default settings with the 102.2-Gb raw PacBio HiFi reads of ‘Otava’, where the output consisting of unitigs (locally haplotype-resolved contigs) was selected for further processing. Then the short reads from the two 10× single-cell CNV libraries were aligned to the assembly using bowtie2 (v.2.2.8)⁴⁰. These alignments were used to polish the assembly with pilon (v.1.22)⁴¹ with options of –fix bases –changes –diploid –mindepth 0.8. Further, short reads of the two 10× single-cell CNV libraries, additional 10× single-molecule libraries and PacBio HiFi reads were all aligned to the polished assembly (using bowtie2 and minimap2 (v.2.17-r491)⁴² respectively). Among 17,153 raw contigs, 9,041 contigs which were longer than 50 kb and with an average sequencing depth over 80× were kept. HiFi reads were re-aligned to the purged assembly and contigs covered less than 3× were removed. The HiFi reads-based purging process was repeated for five rounds to get an initial assembly of 6,366 contigs for subsequent analysis (Supplementary Fig. 1).

Coverage marker definition by sequencing depth analysis

All Illumina paired-end short reads from 10× single-cell/molecule libraries (and PacBio HiFi long reads) were aligned to the initial assembly of 6,366 contigs respectively using bowtie2 and minimap2. Potentially duplicated short reads were removed using picard MarkDuplicates function (http://broadinstitute.github.io/picard/). The depth along each contig was calculated with samtools depth function (v.1.9)⁴³ for each type of (short and long) reads and the depth at each position of each contig was taken as the sum of all types. Each contig was dissected into 10-kb windows and the average sequencing depth per base was calculated within each window. According to the average genome-wide depth of 113× per haplotype (denoted by H), windows with depths in [0, 170×], [171×, 283×], [284×, 396×], [397×, 509×] and [≥510×] were respectively determined as contig types of haplotig, diplotig, triplotig, tetraplotig and replotig, where the upper bound on depth for each type (if any) was determined by H × (i + 1/2), with i being 1, 2, 3, 4. Neighboring 10-kb windows (along the same contig) were further merged as larger 50-kb coverage markers if they were classified as the same contig type (to ensure sufficient read signal for genotyping within each single pollen genome).

Linkage-based grouping of contigs

The read count (denoted by r) at each coverage marker of size W (in bp) for each of the selected 717 pollen genomes (Supplementary Fig. 2), which were collected using bedtools (v.2.29.0)⁴⁴, was firstly normalized using n_r = r × (10⁴/W) × (10⁶/N), where N was the total number of reads aligned to the assembly of 6,366 contigs. Meanwhile, the average read count, m_r, for all coverage markers with more than 7 × (N/10⁶) reads was calculated for each pollen genome. The coverage (genotype) at a marker for each pollen was set to n_r/m_r (which would be rounded to 0, 1 or 2). In general, coverages across 717 pollen read sets at the same window marker were used to build up a PAP: X = x₁×₂…x₇₁₇, where x_i was in {0, 1}, {0, 1, 2}, {1, 2} or {2}, depending on whether the marker type was haplotig, diplotig, triplotig or tetraplotig. The correlation between two PAPs X and Y was calculated as cor_XY = sum[(x_i–m_x) × (y_i–m_y)] / [sqrt(sum(x_i–m_x)²) × sqrt(sum(y_i–m_y)²)], where m_x and m_y were the respective average values. Initially, any pair of haplotigs/vertices (with sizes ≥100 kb) was connected by an edge, if the highest correlation value between the PAPs of the markers at two ends of the two haplotigs was >0.55 (Supplementary Fig. 3). This graph-based clustering⁴⁵ led to 48 groups representing the 48 haplotypes (Extended Data Fig. 2). If the lowest correlation value between any pair of the markers of any two groups was less than –0.25, they could be determined as homologous linkage groups (same chromosome, different haplotypes). With this, the 48 groups were clustered into 12 chromosomes, each with four different haplotypes.

Each of the remaining haplotigs h (<100 kb) was integrated into the group with the marker showing the highest correlation with h. For a diplotig marker (with PAP Z) which represent the collapsed haplotypes A₁ and A₂, it can be expected that Z ≈ X + Y, where X and Y are PAPs of two markers which closely linked to the diploid marker (Fig. 1c). We can therefore expect Z&X ≈ X and Z&Y ≈ Y, where ‘&’ refers to the bit-wise AND-operation. As a result, the correlations of ‘Z&X with X’ and ‘Z&Y with Y’ should give the two highest values. Therefore, the two coverage markers from two of the 48 groups that show the highest correlations to a diplotig marker reveal the two groups that the diplotig will be assigned to. Similarly, triplotig markers can be associated with three groups. If all coverage markers of a contig are tetraplotig-type, the contig cannot be associated with any group because the information for linking a chromosome is missing. Only if at least one nontetraplotig coverage marker can be grouped, the tetraplotig marker of the contigs can be grouped.

Note, meiotic recombination can influence the linkage grouping. In the most simple case, that is, when no pollen genome carries a single meiotic recombination event, all PAPs at coverage markers from the same haplotype would be identical while PAPs at coverage markers from other haplotypes would be different (because haplotypes randomly occur in pollen genomes in a pairwise manner) and thus they can be easily grouped into haplotypes. In the presence of meiotic recombination, a few crossovers along each chromosome change the PAP values of the coverage markers but, as recombination is rare, the PAP values change only marginally. Thus, linkage grouping based on PAPs works even in the presence of meiotic recombination.

Haplotype-specific PacBio HiFi read separation and haplotype assembly

HiFi reads were classified into 48 groups on the basis of alignments to the 50-kb coverage markers using customized code⁴⁵. Specifically, to assign a read to a marker, at least 500 bp of the read had to be aligned to the marker (reads aligning two neighboring markers were assigned to the marker with a larger overlapping size). Reads overlapping non-haplotig marker were randomly assigned to one of the marker-associated groups.

Each set of HiFi reads was independently assembled using hifiasm (v.0.7) with default settings. The resulting contigs were first polished with short reads using pilon with –fix bases –changes –diploid –mindepth 0.8 and then with HiFi reads using racon (v.1.4.10)⁴⁶ with -u –no-trimming.

Evaluation of haplotyping accuracy

For each haplotype assembly and the sequencing data of the parental genomes, k-mers (k = 21) were counted using KMC⁴⁷. Specifically, k-mers found in ‘Hera’ but not in ‘Stieglitz’ (with a coverage of 6–12), as well as k-mers found in ‘Stieglitz’ but not in ‘Hera’ (with a coverage of 5–11) were selected using kmc_tools simple. For each haplotype, the sets of assembled k-mers were intersected with the two sets of parental-specific k-mers (using kmc_tools simple with subfunction intersect), which revealed k-mers common with either of the parental genomes. As a haplotype can only be inherited from one of the parents, it is expected to find parental-specific k-mers only of one parent. The overall haplotyping precision was determined as the total number of correctly phased k-mers divided by the total number of k-mers investigated in the 48 haplotype assemblies. Note, this was done before and after contig polishing, where we observed the same haplotyping accuracy.

Haplotype-specific contig scaffolding using group-specific Hi-C reads

Each haplotype-specific contig-level assembly was indexed with bwa index (with -a bwtsw) (v.0.7.15-r1140)⁴⁸ and samtools faidx. The haplotype-specific Hi-C read pairs were aligned using bwa aln and bwa sampe. Aligned reads (in pairs) were converted into BAM files using samtools view with options of -b -F12. The BAM files were filtered with filterBAM_forHiC.pl (from ALLHiC package, v.0.9.13)⁴⁹ to remove nonuniquely mapped reads. Then BAM files were converted to bed files using bamToBed (from bedtools package) and sorted by read name. The bed files were provided to SALSA2 (run with -s 100000000 -m yes -i 10 -e DNASE)⁵⁰. Potential chimeric contigs were broken at the chimeric sites given by SALSA2 output file of input_breaks, leading to a new set of contigs for each of the 48 original groups.

For each new group of contigs, the above process of contig indexing, Hi-C read alignment and BAM filtering was repeated. Then, for each haplotype, ALLHiC_partition was run with -e GATC -k 1 -m 25; allhic extract was run with –RE GATC; allhic optimize and ALLHiC_build were run with default settings; the chromosome contact map was visualized with ALLHiC_plot at 1-Mb resolution, where obvious mis-placement/orientation of large contigs were visually identified and manually corrected (Extended Data Fig. 3).

Genome annotation and assessment

Protein-coding genes for each haplotype chromosome were annotated with three types of evidences, including ab initio gene predictions (considering outputs by Augustus⁵¹, GlimmerHMM⁵² and SNAP⁵³), transcripts assembled from Illumina short RNA-seq reads and alignments of homologous protein sequences. Specifically, protein sequences from Solanum tuberosum L.^5,7, Arabidopsis thaliana and other plant proteins from UnitProtKB were aligned to each haplotype assembly independently using Exonerate⁵⁴ with options of –percent 60 –minintron 10 –maxintron 60000. RNA-seq reads from two recent potato genome assembly work^5,7 were downloaded. All reads were first aligned to the full haplotype-resolved genome assembly using HISAT2 (v.2.2.0)⁵⁵. Then for each of the 12 linkage groups, reads aligned to the respective four haplotypes were extracted and combined as one set with samtools view -L and bedtools bamtofastq. Within each linkage group, each set of reads from the group was re-aligned to each haplotype sequence independently using HISAT2 and transcripts were assembled using StringTie⁵⁶. Finally, all the above evidences were integrated with EvidenceModeler⁵⁷ to generate high-quality gene models for each haplotype assembly. Transposon elements (TE) were annotated using RepeatModeler and RepeatMasker (http://www.repeatmasker.org). TE-related genes were filtered by investigating their overlapping with TEs (overlapping percentage >30%), sequence alignment with TE-related protein sequences and A. thaliana TE-related gene sequences (requiring blastn alignment identity and coverage >30%).

To rescue potentially mis-annotated genes, all gene models were further improved. Specifically, against each of the four haplotypes, we first aligned the gene sequences of the other three haplotypes using blastn and, similarly, aligned the protein sequences of the other three haplotypes against those from the target haplotype using blastp⁵⁸ and Scipio⁵⁹. If there were counterparts in the target haplotype to the other three (based on the alignments), the potential missing genes were added according to gene models given by ab initio prediction. Besides, gene models were split into smaller genes or merged as larger genes if all the alignments of genes from the other three haplotypes indicated a mis-merged or mis-split gene model.

The final assembly and annotation completeness were evaluated by BUSCO (v.4.1.4)⁶⁰ with 2,326 single-copy genes from the lineage database eudicots_odb10. The functional annotation of genes was performed with InterProScan (v.5.48)⁶¹ with default parameters except for option -goterms. The GO terms were extracted for GO functional enrichment analysis by the R package ClusterProfiler⁶². The noncoding RNA was annotated with the tool Infernal (v.1.1)⁶³ by searching the database Rfam (v.14.3)⁶⁴. The adjacent rDNAs with distance <5 kb were clustered as the potential rDNA clusters.

Comparison of chromosome sequences

Within each of the 12 homologous linkage groups (LGs), the chromosome-level sequences of the four haplotypes were aligned to each other as well as to the recently assembled DM genome using minimap2 with -ax asm20 –eqx. For each pair of haplotypes, the alignments were provided to SyRI⁶⁵, which searched for synteny, single-nucleotide level differences as well as large-scale structural variations (with -k -F S).

Allelic expression analysis

Quality control

Short reads from RNA sequencing were trimmed with Trimmomatic (v.0.39)⁶⁶ under paired-end mode, with options ILLUMINA:adapters.fa:2:30:10:8:true (adapters provided by the tool itself) SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36.

Read separation

All reads were aligned to the final haplotype-resolved assembly using HISAT2 (v.2.2.0) with option -k 1 and reads were thus separated into 48 haplotype-specific groups with samtools view -L and bedtools bamtofastq.

Expression analysis was performed following the literature⁶⁷, using Stieglitz-1 genome as reference. Within each of the 12 linkage groups, the four sets of haplotype-wise RNA-seq reads were independently aligned to the respective Stieglitz-1 chromosome as reference, using HISAT2. The number of fragments at each gene from each chromosome was quantified using HTSeq (v.0.13.5)⁶⁸ with options –mode=union –nonunique=none –secondary-alignments=ignore –stranded=no (given the gff file for that chromosome). For testing dominance in allelic expression and effect of allelic copy number on expression, the fragment counts were normalized as fragments per kilobase per million reads (FPKM) and log₂-scaled. The genes with four allelic copies were selected in analysing differential allele expression, where the log₂-transformed CPM (count per million reads) at each gene was used for measuring the allele expression level. After excluding genes with less than three replicates showing CPM > 1.0, 11,154 expressed genes were kept. Paired comparisons on expression of the four alleles were performed and 1,219 were found to be differentially expressed in at least one pair of haplotypes with P < 0.05 adjusted using Benjamini–Hochberg method (Supplementary Table 11).

Note that the well clustering of the three biological replicates regarding four haplotypes based on the haplotype-specific expression of all genes from Stieglitz-1 genome (using hclust in R) showed that there was a high consistency between the replicates (Supplementary Fig. 10), thus the final FPKM value at each gene was taken as the average of the values of the three replicates in all related analysis (except for the procedure of differential analysis where replicates were not merged). At a gene, for four alleles with FPKM values of x = {x1,x2,x3, x4}, the z-score (in heatmaps) for each allele i (in {1,2,3,4}) was calculated as (xi – mean(x))/ s.d.(x), with mean(x) giving the average of x and s.d.(x) giving the standard deviation of x.

Methylation analysis

Quality control

Short reads from methylation sequencing were trimmed with Trimmomatic (v.0.39) under paired-end mode, with options ILLUMINA:adapters.fa:2:30:10:8:true (adapters provided by the tool itself) SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36.

Read separation

All reads were first aligned to the initial assembly (of 6,366 contigs) using bismark (v.0.23.0)⁶⁹ (with options –hisat2 –score_min L,0,-0.6), to avoid the default masking of reads from IBD regions. If a read pair could be aligned to coverage markers linking a single haplotype group, then it was assigned to that group. If a read pair could be aligned to coverage markers linking multiple groups, it was randomly assigned to one of the groups.

Methylation calling at 48 haplotype-specific chromosomes

The reads assigned to each haplotype-specific group were independently re-aligned to the respective haplotype-resolved chromosome using bismark under the same options plus —non_directional (which gave alignments to all four bisulfite strands). Each bam file was deduplicated using deduplicate_bismark, after which methylation at CG, CHG and CHH contexts were simultaneously called with bismark_methylation_extractor with options –comprehensive –cytosine_report –CX –bedGraph. Given a specific genomic region/window, the methylation level at CG(/CHG/CHH) context was determined as C_count/(C_count + T_count), where C_count was the number of reads carrying the methylated cytosine in CG(/CHG/CHH) context and T_count was the number of reads carrying un-methylated cytosine in CG(/CHG/CHH) context. Following this, a sliding window-based quantification of the methylation level was performed along the 48 chromosomes (window size, 2 Mb; step, 50 kb; Fig. 2b).

Methylation calling at 12 Stieglitz-1 chromosomes as references

Within each of the linkage groups, the four haplotype-wise reads were independently aligned to the respective Stieglitz-1 chromosome as reference using bismark under the same options (as above) plus –non_directional (which gave alignments to all four bisulfite strands). The methylation calling was done in a similar way as given above. The results were used to perform correlation analysis with allele-specific expression (Fig. 4g).

Comparison of the level of methylation between IBD blocks and the corresponding regions in synteny in homologous haplotypes

IBD blocks (at 50-kb resolution) that were shared by two or three haplotypes were investigated. According to the synteny between haplotypes (from SyRI-based analysis), the counterparts of such IBDs were located in the haplotypes being homologous to the haplotypes showing the IBDs. The counterparts were re-assigned to IBD groups if they were harbored by any known IBD blocks. This led to two sets, one with IBD blocks and the other with regions from other haplotypes but in synteny with the IBD blocks. Methylation level at CG(/CHG/CHH) context was calculated for each block within the two sets and t-test was used to investigate the difference between the two sets (Supplementary Fig. 9).

Note that, using the 12 Stieglitz-1 chromosomes as reference, the well clustering of the three biological replicates regarding four haplotypes based on CG, CHG or CHH methylation levels in 50-kb windows at a step of 25-kb (using hclust in R) showed that there was a high consistency between the replicates (Supplementary Fig. 10), thus the final methylation level (and the related z-score, similarly calculated as given in gene expression analysis) at each window was taken as the average of the levels at three replicates in all related analysis.

Other methods used in analysis have been provided in the Supplementary Information.

Statistics

All presented P values correspond to two-sided P values. Correlation test between the levels of allele-specific expression and methylation was done using cor.test function in R.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link

Vasiprak Blog

Plant material

Genome size estimation

Initial tetraploid genome assembly, polishing and purging

Coverage marker definition by sequencing depth analysis

Linkage-based grouping of contigs

Haplotype-specific PacBio HiFi read separation and haplotype assembly

Evaluation of haplotyping accuracy

Haplotype-specific contig scaffolding using group-specific Hi-C reads

Genome annotation and assessment

Comparison of chromosome sequences

Allelic expression analysis

Quality control

Read separation

Methylation analysis

Quality control

Read separation

Methylation calling at 48 haplotype-specific chromosomes

Methylation calling at 12 Stieglitz-1 chromosomes as references

Comparison of the level of methylation between IBD blocks and the corresponding regions in synteny in homologous haplotypes

Statistics

Reporting Summary

You might also like

Here’s How To Watch Live + 9 Things To Know

Four commonly used medications reverse Alzheimer’s disease in mice — ScienceDaily

Magnesium Is Essential for the Immune System – Important in the Fight Against Cancer

Stay tuned!