Preloader

Whole-genome resequencing of Sorghum bicolor and S. bicolor × S. halepense lines provides new insights for improving plant agroecological characteristics

Genetic diversity and sequencing assessment

One hundred and seventy-two sorghum lines evaluated in this study clustered in two different populations of 153 S. bicolor genotypes and 19 S. bicolor × S. halepense recombinant inbred lines (Fig. 1). The Fstatistic (Fst) measuring the genetic structure and the level of genetic differentiation28,29 was significantly and moderately high (Fst = 0.31, p = 0.01); Fst values range from 0 in case of panmixis to 1 in case the populations do not share any genetic diversity. The first two dimensions of the principal coordinates explained 70.9% of the total genetic diversity existing in the studied populations. The one hundred and seventy-two sorghum lines were whole-genome resequenced and a corresponding number (172) of paired-end sequencing libraries constructed each with the insert size around 300 base pairs. Resequencing yielded 22.88 billion paired-end reads resulting in 3.43 trillion bases (nucleotides) and 2.6 TB of high-quality raw data. Ultimately, a total of 21.70 billion and 3.25 trillion of clean paired-end reads and bases were produced, respectively. Overall, 94.54% of total clean reads showed quality value Q20 ≥ 94.54%, which indicates a high data quality. Sorghum bicolor (Sb) and S. bicolor × S. halepense (SbxSh) showed comparable clean reads quality (Q20 = 96.17–96.38% and Q30 = 87.91–88.22%), number of clean reads (125.7–126.19 × 106) and bases (18.82–18.89 × 109), and the ratio clean bases: raw bases (94.64–94.89%). However, the guanine-cytosine (GC) rate was higher in SbxSh relative to Sb (43.42% a vs. 43.26% b) in the clean data.

Figure 1
figure 1

Genetic differentiation analysis in Sb and SbxSh populations. The ellipses are drawn accounting for the 95% confidence interval and the Euclidean distance from the center “o”. Genotypes outside the ellipses are outliers74.

Sequence alignment

Alignment assessment

Sequence reads were aligned to the Sorghum bicolor (BTx623) reference genome whose size was 732,200,000 bp, while the effective size was 675,973,270 bp (N base excluded), and GC content of 41.82%. The mapping rate i.e., the percentage coverage of the reference by reads of the sorghum samples varied from 89.15 to 95.18% with a mean of 92%. The percentage of mapped reads and that of mapped bases were identical and ranged from 82.65 to 99.92% with an average of 99.40%. The effective mapping depth i.e., LN/G, where L is the read length, N is the number of reads and G is the haploid genome length, was between 23.17X to 34.38X. In this work, SbxSh and Sb showed comparable mapped reads (99.04–99.45%), mapped bases (99.04–99.45%), sequencing depth after mapping (26.54X–26.64X), while SbxSh showed statistically significant higher percentage coverage rate (94.87a vs. 91.62b), yet lower percentage unique hit bases (81.96a vs. 77.58b) and unique hit reads (82.39a vs. 78.31b).

SNP calling and annotation

The sequence alignment of the target sorghum lines to the reference genome BTx623, the gene models, and the information derived from the reference genome allowed to identify large numbers of SNPs, InDels, CNVs, and SVs (Table 1; Fig. 2). A total of 567,046,841 SNPs was uncovered from these sorghum genomes. On average, 10,515,367.4 SNPs were observed per individual, of which 1,855,062 were located in the genic regions. The statistical analysis showed that SbxSh had more total and heterozygous SNPs, synonymous_CDS, nonsyn_CDS, exonic, genic, intronic, mRNA, pseudogenic, transcript, and tRNA SNPs, whereas Sb showed more homozygous SNPs. Among the synonymous and nonsynonymous SNPs mapped in the coding regions in either population, the synonymous SNPs represented 49% and 54%, respectively, in Sb and SbxSh populations. The SbxSh population contained 81% and 78%, respectively, of all synonymous and nonsynonymous SNPs. Contrary to synonymous mutations, nonsynonymous mutations cause variation in coding amino acids and are considered to play a significant role in changing the phenotype of organisms. Besides, nonsynonymous mutations are also strong candidates to explain the phenotypic diversity between different individuals in a population.

Table 1 Tukey HSD for SNPs, InDels, SVs, and CNVs in S. bicolor vs. S. bicolor × S. halepense results.
Figure 2
figure 2

Chromosomal distribution of WGRS information in Sb and BC1 SbxSh lines, and GWAS data from the evaluated populations74. The x-axis corresponds to the genomic coordinate. Tracks 1–4: visualize the genomic density of regions (defined as the fraction of a genomic window that is covered by genomic regions). Tracks 5, 8: significant SNPs and candidate genes are displayed according to their genomic coordinates (x-axis), while y-values were set for the sole purpose improving the resolution (legibility) of the corresponding SNPs and candidate genes.

We further analyzed the distribution of the large-effect SNPs i.e., those with a potential to disable gene functions26,30. In this work, large-effect SNPS included premature stop codon, stop codon to non-stop codon, start codon to non-start codon, and splice sites. It was found that within the 10,140 SNPs participating in codon premature termination, 2,970 SNPs disrupt splicing donor or acceptor sites of genome, 13,976 SNPs are related to alteration of initiation methionine residues, and 1,144 SNPs replace terminators with certain amino acid residues that leads to longer ORFs. The statistics are depicted in Fig. 3 where SbxSh showed higher numbers of large-effect SNPs than Sb. In SbxSh population, it was found an average of 1967, 500, 370, and 700 SNPs expected to induce premature stop codon, stop codon to non-stop codon, start codon to non-start codon, and splice sites, respectively, while in Sb it was found a maximum of 1183, 340, 100, and 340, respectively. In both populations, SNPs inducing premature stop codon were most represented with respect to the other large-effect SNPs.

Figure 3
figure 3

Statistics of different types of large-effect SNPs. Re-sequenced entries are represented on x-axis, numbers on top of each bar represent the number of SNPs74.

InDel calling and annotation

A total of 91,825,474 indels was identified of which 24% and 76% resided in Sb and SbxSh, respectively; mean individual insertions (211,283.62 vs. 649,218.63) and deletions (221,602.78 vs. 697,826.37) were statistically higher in SbxSh relative to Sb population. The genome-wide distribution of short InDels (1–10 bp) showed a lower number of these variants in genes and coding regions compared with Pseudogenes and mRNA, for instance (Table 1). Our result show that indels that are not multiples of 3 bp and produce frameshift mutations are particularly uncommon in coding regions. Frame shift mutation in CDS region, 3X-shift mutation in CDS region, 3X-shift mutation in CDS region phase 0, and 3X-shift mutation in CDS region phase No 0 were statistically higher in SbxSh than in Sb i.e., 16,544, 19,780.74, 6301.53, 13,479.21 vs. 5954.58, 6684.15, 2114.42, 4569.73, respectively. A frameshift mutation results from an insertion or deletion of a number of nucleotides that is not a multiple of three. The change in reading frame alters every amino acid after the point of the mutation and results in a nonfunctional protein. The comparative effects of frame-shifting (e.g., 1-, 2-, 4-, 5-, 7-, 8-, 10-bp.) non-frame-shift (e.g., 3-, 6-, 9-bp) shows that the former short InDels provide a much powerful explanation of the difference of traits between individuals30.

Structure and copy number variations

In this study, a total of 1,532,171 SVs was identified and found statistically comparably distributed between the two populations. Of the observed SVs, Sb and SbxSh showed statistically comparable individual mean numbers of deletions (4202 vs 4119), other SVs (4699 vs 4594), but SbxSh displayed more insertions than Sb, i.e., 72.47 vs 21.01 individual mean SV number. A total of 4,973,961 CNVs was generated from the entire population, with SbxSh producing statistically higher number of CNVs than Sb (41,296.21 vs. 27,381.26), higher CNV up-regulation (16,650.26 vs. 10,567.82) and down-regulation (24,214.95 vs 16,345.06).

Genetic variation in S. bicolor and S. bicolor × S. halepense hybrids

One of our working hypotheses was that some of the identified genetic variation might contribute to the phenotypic differentiation between S. bicolor and S. bicolor × S. halepense which pushed us to focus our analysis on SNPs in genic regions. Sorghum breeders working on the introgression of perenniality in otherwise annual S. bicolor background select for overall sorghum bicolor plant aspect in addition to the overwintering trait. We also observed from our experience in recent years14 that backcrosses are the most attractive as they show traits closely comparable to domesticated sorghum (panicle shape, seed size, etc..) than single, double, or three-way crosses. We therefore used the data produced in this and previous studies from our laboratory to investigate the contribution of Sorghum halepense in SbxSh controlled hybridizations. As Fig. 2 shows, SbxSh9 backcross line has more genes and more SNPs, CNVs, and indels than S. bicolor Sb1; nonetheless, the two lines displayed comparable number of SVs. The same pattern was observed across populations. The density of genes, SNPs, indels, and SVs increases from the pericentromeric region towards the telomeres, with SNPs/genes and short indels showing similar distribution pattern. In both populations, the distribution of CNVs was homogeneous from the centromeres to telomeres in all chromosomes. In addition, sorghum biomass related significant SNPs and candidate genes recently uncovered8 in these populations i.e., Dw genes (Dw1, Dw2, Dw3, Dw4), Ma genes (Ma1, Ma2, Ma3, Ma5, Ma6), gibberellin (GA) associated genes (SbGA2ox1, SbGA3ox1, SbGA2ox7), genes involved in controlling heading date (SbZCN8)1,31 and GA signaling and plant height regulation (SbSLR11) were mostly localized towards the distal and proximal ends of the chromosomes of interest (Fig. 2).

The analysis of genes harboring SNPs showed more genes (18,785) were shared between Sb and BC1 crosses, with relatively fewer genes i.e., 109, 230, and 291 being private to the respective three lines Sb1, SbxSh50, and SbxSh9 (Fig. 4). The analysis of the biological processes (BP) GO enrichment associated to SbxSh genes showed 33,693, and 5548 genes in the Sorghum bicolor reference genome dataset and in SbxSh9/SbxSh5 hybrids gene set that mapped to the GO terms (either directly or through inheritance), respectively (Table 2). The most granular terms included enrichment in genes associated to the plant-type cell wall organization (GO:0009664), root development (GO:004836), cell wall polysaccharide metabolic process (GO:0010383), glutathione metabolic process (GO:0006749), hydrogen peroxide catabolic process (GO:0042744), anatomical structure morphogenesis (GO:0009653), response to oxidative stress (GO:0006979), proteasomal protein catabolic process (GO:0010498), response to oxygen-containing compound (GO:1901700), generation of precursor metabolites and energy (GO:0006091), translation (GO:0006412), hormone-mediated signaling pathway (GO:0009755), response to abiotic stimulus (GO:0009628), vesicle-mediated transport (GO:0,016,192), ribonucleoprotein complex biogenesis (GO:0022613), protein-containing complex subunit organization (GO:0043933), intracellular transport (GO:0046907), cellular component assembly (GO:0022607), protein localization (GO:0008104), organelle organization (GO:0006996), nitrogen compound transport (GO:0071705), organic substance transport (GO:0071702), small molecule metabolic process (GO:0044281), regulation of transcription, DNA-templated (GO:0006355), nucleobase-containing compound metabolic process (GO:0006139), positive regulation of translation (GO:0045727).

Figure 4
figure 4

Number of shared and private genes among Sb1 (IESV 99,091 DL) and two sister perennial RILs (SbxSh9 and SbxSh50) derived from SbxSh cross backcrossed (2 recurrent parent doses: Tx623*2/Gypsum 9; BC1) to S. bicolor74.

Table 2 Hierarchical relations between over-represented (enriched) functional classes using GO biological process annotation in private genes of S. bicolor*2/S. halepense relative to S. bicolor lines.

To root development GO term (GO:004836) mapped 34 genes: GeneID:8080001, GeneID:8064680, GeneID:8054879, GeneID:8075742, GeneID:8059541, GeneID:8078975, GeneID:8081609, GeneID:8055737, GeneID:8055874, GeneID:8063307, GeneID:8058361, GeneID:8084663, GeneID:8064471, GeneID:8079326, GeneID:8079141, GeneID:8060669, GeneID:8063006, GeneID:8060622, GeneID:8080905, GeneID:8077286, GeneID:8060000, GeneID:8058075, GeneID:8074440, GeneID:8082391, GeneID:8065800, GeneID:8080849, GeneID:8085583, GeneID:8071472, GeneID:8084890, GeneID:8072111, GeneID:8063311, GeneID:8054193, GeneID:8059201, GeneID:8082281.

Source link