De novo assembly of chloroplast and mitochondrial genomes
A total of 10.5 Gb of reads per sample was obtained from three CMS tomato lines (‘CMS[MSA1]’, ‘CMS[O]’, and ‘CMS[P]’), three nuclear donors (‘Sekai-ichi’, ‘O’, and ‘P’), and one cytoplasmic donor (S. acaule). Of these, 374 Mb (3.6%) and 566 Mb (5.4%) of reads per sample aligned to publicly available mitochondrial and chloroplast genome sequences, respectively. The reads mapped to each of the two reference sets were assembled separately into contigs.
Mitochondrial genome sequences were constructed from the reads mapped to the mitochondrial reference sequences (Table 1). The mitochondrial genomes of the nuclear donors ‘Sekai-ichi’, ‘O’, and ‘P’ were each assembled into two contigs, with assembly sizes of 562.6 kb (n = 2; n denotes the number of contigs), 536.9 kb (n = 2), and 553.3 kb (n = 2), respectively. In S. acaule, the mitochondrial genome was assembled into 728.4 kb of contigs (n = 7). The assemblies were larger in the CMS lines than in the nuclear and cytoplasmic donors: 995.2 kb (n = 7) in ‘CMS[MSA1]’, 968.4 kb (n = 7) in ‘CMS[O]’, and 829.3 kb (n = 5) in ‘CMS[P]’. For the chloroplast genomes, total sequence lengths of 389.2 kb (n = 2), 349.5 kb (n = 2), and 346.9 kb (n = 2) were constructed for ‘Sekai-ichi’, ‘O’, and ‘P’, respectively (Table 1). The assemblies were smaller in ‘CMS[MSA1]’ (296.6 kb, n = 1) and ‘CMS[O]’ (307.1 kb, n = 1) than in the nuclear donors, but larger in ‘CMS[P]’ (454.1 kb, n = 3).
Comparative genome analysis revealed that the mitochondrial genomes of the CMS lines consisted of highly fragmented, repeated, and duplicated sequences derived from both donors throughout the genome (Fig. 2). On the other hand, the structures of the chloroplast genomes of the CMS lines were moderately conserved across the nuclear and cytoplasmic donors (Fig. 2).


Mitochondrial (A) and chloroplast (B) genomes of the three CMS lines, nuclear donors, and cytoplasmic donor. Dots indicate sequence similarity between the genome sequences
In parallel, we determined the mitochondrial and chloroplast genome sequences of Solanum pimpinellifolium LA1670 and S. lycopersicum var. cerasiforme LA1673 (Table 1). Sequence reads were obtained from a public DNA database and processed as described above. Assembly sizes of the mitochondrial and chloroplast genomes were 620.6 kb (n = 3) and 299.4 kb (n = 1) for S. pimpinellifolium LA1670, respectively, and 569.9 kb (n = 2) and 337.7 kb (n = 2) for S. lycopersicum var. cerasiforme LA1673, respectively.
Gene prediction from the organelle genomes
ORFs encoding ≥25 amino acids were extracted from the assembled sequences to predict potential genes. The number of potential genes predicted from the chloroplast genome assemblies ranged from 5,130 (S. acaule) to 8,165 (‘CMS[P]’) and the number of potential genes predicted from the mitochondrial sequences ranged from 10,326 (‘O’) to 19,170 (‘CMS[MSA1]’) (Table 1).
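The ORF extraction step can be illustrated with a minimal sketch. The text does not say how ORFs were delimited, so the version below assumes ATG-to-stop ORFs under the standard genetic code, scans all six reading frames, and counts the initiating methionine but not the stop codon toward the length. The function names are hypothetical and are not taken from the pipeline actually used.

```python
# Minimal six-frame ORF finder (illustrative sketch, not the authors' tool).
# Assumptions: ORFs run from the first in-frame ATG to the next in-frame stop;
# the standard genetic code applies; input contains only A, C, G, T.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_aa: int = 25):
    """Yield (strand, frame, start, end, aa_length) for each ORF.

    start and end are 0-based, end-exclusive offsets on the scanned strand.
    The span includes the stop codon; aa_length excludes it.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the ATG that opened the current ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    aa_length = (i - start) // 3
                    if aa_length >= min_aa:
                        yield strand, frame, start, i + 3, aa_length
                    start = None  # look for the next ORF in this frame
```

A stop-to-stop definition of an ORF would return more and longer candidates, so the counts reported in Table 1 depend on which definition was applied.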
The ORFs were clustered to identify genes unique to and shared among the CMS lines, nuclear donors, and cytoplasmic donor (Fig. 3). The ORFs in the CMS mitochondrial genomes consisted of four types of genes, namely, those unique to the CMS lines (Type 1: 9.4–11.9%), those shared with the nuclear donors only (Type 2: 14.1–17.0%), those shared with the cytoplasmic donor only (Type 3: 8.9–13.2%), and those shared with both the nuclear and cytoplasmic donors (Type 4: 61.8–64.1%). By contrast, the ORFs in the CMS chloroplast genomes mostly consisted of three types of genes, namely, those unique to the CMS lines (Type 1: 1.2–5.9%), those shared with the nuclear donors only (Type 2: 31.2–33.1%), and those shared with both the nuclear and cytoplasmic donors (Type 4: 62.9–65.7%). Few genes shared with the cytoplasmic donor only were found (Type 3: up to 0.1%).
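The four gene types follow directly from two yes/no questions about each CMS ORF cluster: is it also found in a nuclear donor, and is it also found in the cytoplasmic donor? The sketch below shows that logic, assuming each cluster is identified by an ID and that the sets of cluster IDs present in each group are already available. The clustering method is not described here, and all names and IDs are hypothetical.

```python
from collections import Counter


def classify_orf(in_nuclear_donor: bool, in_cytoplasmic_donor: bool) -> int:
    """Map presence in the two donor groups to gene Types 1 to 4."""
    if in_nuclear_donor and in_cytoplasmic_donor:
        return 4  # shared with both donors
    if in_cytoplasmic_donor:
        return 3  # shared with the cytoplasmic donor only
    if in_nuclear_donor:
        return 2  # shared with the nuclear donors only
    return 1      # unique to the CMS line


def type_percentages(cms_clusters, nuclear_clusters, cytoplasmic_clusters):
    """Return {type: percentage of CMS clusters} for Types 1 to 4."""
    total = len(cms_clusters)
    if total == 0:
        raise ValueError("no CMS clusters to classify")
    counts = Counter(
        classify_orf(c in nuclear_clusters, c in cytoplasmic_clusters)
        for c in cms_clusters
    )
    return {t: 100.0 * counts[t] / total for t in (1, 2, 3, 4)}


# Hypothetical cluster IDs, for illustration only.
example = type_percentages(
    cms_clusters={"c1", "c2", "c3", "c4", "c5"},
    nuclear_clusters={"c2", "c4", "c5"},
    cytoplasmic_clusters={"c3", "c4", "c5"},
)
```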


Numbers of genes unique to the CMS lines, nuclear donors, and cytoplasmic donor are indicated in bold, standard, and italic fonts, respectively. Percentages of genes are shown in parentheses
The genome positions of the genes differed according to gene type and organelle (Fig. 4). Type 4 genes in mitochondria were distributed across the genome with some gaps. The positions of Type 2 genes largely coincided with those of Type 4 genes, whereas Type 3 genes were located in the gaps between Type 4 genes. Type 1 genes were also located in the gaps and at the ends of contig sequences. In the chloroplast genomes, by contrast, the positions of Type 2 and Type 4 genes overlapped, and Type 1 genes were located at the ends of contigs.


Dots indicate gene positions on contig sequences of the organelle genomes. Genes are grouped into the following four types: Type 1, genes unique to the CMS lines; Type 2, genes shared with the nuclear donors; Type 3, genes shared with the cytoplasmic donor; and Type 4, genes shared with both the nuclear and cytoplasmic donors
Screening of CMS-associated gene candidates
To identify candidate CMS-associated genes in the mitochondrial genomes, we set four criteria: (1) encoded length ≥70 amino acids, (2) absence from male-fertile lines, (3) presence in all three CMS lines, and (4) expression in anthers of the CMS lines. Among the predicted genes in the ‘CMS[P]’, ‘CMS[MSA1]’, and ‘CMS[O]’ mitochondrial genomes, 831, 1,025, and 969 genes, respectively, encoded ≥70 amino acids. The gene sequences from the CMS lines were compared with the mitochondrial genomes of the nuclear donors (‘Sekai-ichi’, ‘P’, and ‘O’) and of S. pimpinellifolium LA1670, S. lycopersicum var. cerasiforme LA1673, S. pennellii, and Nicotiana tabacum. In total, 183, 272, and 140 genes were selected because they were absent from the nuclear donors and the Solanaceae relatives, all of which are male fertile. Of these, we selected 36, 41, and 33 genes that were present in all three CMS lines. The copy numbers of these genes varied. Finally, RNA-Seq reads were mapped to the mitochondrial genomes of the CMS lines. This analysis narrowed the CMS-associated gene candidates to four sequences, two of which were identical. The resulting three genes were named orf137 (two copies in the genome of each CMS line: CMS-PMt002g07240 and CMS-PMt005g13392), orf193 (one copy: CMS-PMt002g06465), and orf265 (one copy: CMS-PMt010g15739). Among these genes, four RNA editing sites, at which C was substituted with U, were found only in orf265, at positions 60,019, 60,030, 60,038, and 60,047 of contig CMS-PMt010, corresponding to the 58th, 47th, 39th, and 30th nucleotides counted from the ATG initiation codon, respectively. The edits at the 39th and 30th positions were silent, whereas those at the 58th and 47th positions caused non-synonymous substitutions, leucine to phenylalanine (L20F) and serine to leucine (S16L), respectively.
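The relationship between the contig coordinates and the positions within orf265 can be checked with a short sketch. In all four reported pairs, the contig position and the coding-sequence position sum to 60,077. That is what one would expect if orf265 lies on the reverse strand of the contig with the A of its ATG at contig position 60,076. This start coordinate is inferred from the reported numbers and is an assumption, not a value given in the text. Function names are hypothetical.

```python
# Assumption: orf265 is on the minus strand of contig CMS-PMt010 and the A of
# its ATG sits at contig position 60,076 (inferred from the reported pairs).
START_A = 60_076


def cds_position(contig_pos: int, start_a: int = START_A) -> int:
    """1-based position within the coding sequence of a minus-strand gene."""
    return start_a - contig_pos + 1


def codon_index(cds_pos: int) -> tuple[int, int]:
    """Return (codon number, position within codon), both 1-based."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


# Contig position -> coding-sequence position, as reported in the text.
REPORTED = {60_019: 58, 60_030: 47, 60_038: 39, 60_047: 30}

for contig_pos, reported_cds in REPORTED.items():
    cds = cds_position(contig_pos)
    codon, offset = codon_index(cds)
    print(contig_pos, cds, codon, offset, cds == reported_cds)
```

Under this assumption the four sites fall in codons 20, 16, 13, and 10, at codon positions 1, 2, 3, and 3. This agrees with the reported outcome: the two third-position edits are silent, while the first- and second-position edits change the amino acid (L20F and S16L).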
De novo transcriptome assembly was performed in parallel. RNA-Seq data were obtained from the anthers of ‘P’ and ‘CMS[P]’, and assembled into 62 and 43 transcript sequences, respectively, of which 37 ‘P’ and 18 ‘CMS[P]’ transcripts were predicted to have transmembrane domains. Of these sequences, eight were uniquely detected in ‘CMS[P]’. Two genes (STRG.32.1.p1 and STRG.39.1.p1) were identical to orf137 and orf265.
Because two genes were identified in both analyses, nine genes in total were selected as candidate CMS-associated genes (three from the genome-based screening plus eight from the transcriptome assembly, minus the two shared; Table 2). Sequence similarity searches against the mitochondrial and chloroplast genomes indicated that two copies of the STRG.32.1.p1 (orf137) sequence (CMS-PMt002g07240 and CMS-PMt005g13392) were present in the mitochondrial genomes of the three CMS lines. A single-copy sequence of orf193 (CMS-PMt002g06465) and a single-copy sequence of STRG.39.1.p1 (orf265, CMS-PMt010g15739) were found in the mitochondrial genomes of the three CMS lines and in that of S. acaule. The presence of the three genes in the CMS lines was validated by a PCR assay of the three CMS lines and six fertile lines. The remaining six genes were found in both the CMS and fertile lines. We therefore selected three genes, orf137, orf193, and orf265, as the most promising candidate CMS-associated genes because they were present specifically in the CMS mitochondrial genomes and were expressed in anthers.
Sequence similarity analysis of the candidate genes
We investigated the sequence similarity of the candidate genes, including their flanking genomic regions, in the mitochondrial genome of ‘CMS[P]’. A 3,045 bp genomic sequence around orf193 showed high sequence similarity to a 4,682 bp region of the tomato chloroplast genome. The 3,045 bp sequence was split into three segments of 1,590, 488, and 1,007 bp (Fig. 5A) with highly conserved boundary sequences (Fig. 5B). The 1,590 bp chloroplast sequence contained a gene encoding cytochrome f; however, the corresponding mitochondrial sequence carried a single-base insertion that caused a frame-shift mutation (Fig. 5C). This mutation disrupted the ORF of the cytochrome f gene and generated two small ORFs, orf116 and orf193.


A Genome structure of the orf193 region. Homologous sequences between the two genomes are indicated by gray boxes. Highly conserved sequences at the borders are shown in red and blue. B Sequence alignments of the borders. C Details of the genome structure of the orf193 region. A single nucleotide insertion causing a frame-shift mutation is indicated with a red arrow. D Genome structure of the orf265 region
A portion of orf265 and its upstream sequences (177 bp in total) showed high similarity to the ATP synthase subunit 8 (atp8) gene encoded in the tomato mitochondrial genome (Fig. 5D). The remaining sequences of orf265 lacked similarity to reported sequences. orf265 was located upstream of the nad3 and rps12 genes in the mitochondrial genome. No sequence similarity was observed for orf137 and the flanking sequence.
Expression analysis of the candidate genes
The expression patterns of the candidate genes, orf137, orf193, and orf265, were investigated by RT-PCR. First, we validated the results of the transcriptome analysis by detecting the expression of the three genes in anthers of ‘CMS[P]’ and ‘CMS[MSA1]’ (Fig. 6A). Because orf265 was tandemly arrayed with nad3 and rps12, we hypothesized that these three genes were co-transcribed as an operon. As expected, transcripts spanning the three genes were also detected (Fig. 6A). Next, we analyzed gene expression in leaves, stems, roots, ovaries, and pollen, in addition to anthers, of Dwarf ‘CMS[P]’, a BC3 generation of ‘CMS[P]’ backcrossed with the tomato dwarf cultivar ‘Micro-Tom’. Expression of orf137 and orf265 was detected in all tested tissues, whereas that of orf193 was detected in leaves, stems, roots, ovaries, and anthers but not in pollen (Fig. 6B).


Gene expression patterns in anthers of two CMS lines (A) and in seven samples of Dwarf ‘CMS[P]’ (B). cox2 is a positive control
In parallel, to quantify the rate of RNA editing in orf265, we sequenced cDNA from three sterile plants of ‘CMS[P]’ and three fertile F4 progeny lines derived from a cross between ‘CMS[P]’ and a fertility-restoring line, S. lycopersicum var. cerasiforme LA1673. Each data point was covered by 32,850 RNA reads on average. RNA edits were observed at positions 60,019 (L20F), 60,030 (S16L), 60,038 (F13F), and 60,047 (F10F) in the cDNA samples of both fertile and sterile lines, with genomic DNA serving as a negative control. The editing rates at three positions, 60,019 (L20F), 60,030 (S16L), and 60,047 (F10F), were significantly higher in the fertile lines (59.8%, 65.4%, and 64.7%) than in the sterile lines (39.5%, 50.7%, and 43.9%) (Fig. 7A). The amino acid sequence translated from the edited RNA differed from the sequence conserved among Solanum species, which the unedited sequence shared (Fig. 7B).
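As a minimal sketch of how a per-site editing rate and its standard error might be derived from read counts, the example below assumes that, for each replicate, the reads covering a site have already been counted as edited (U) or unedited (C). The counts used here are hypothetical and are not data from this study. The statistical test behind the reported significance is not named in the text, so none is implemented.

```python
from math import sqrt
from statistics import mean, stdev


def editing_rate(edited_reads: int, unedited_reads: int) -> float:
    """Percentage of reads at a site that carry the edited base."""
    total = edited_reads + unedited_reads
    if total == 0:
        raise ValueError("site has no read coverage")
    return 100.0 * edited_reads / total


def mean_and_se(values: list[float]) -> tuple[float, float]:
    """Mean and standard error of the mean across replicates."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical (edited, unedited) read counts for three replicates of one site.
replicate_counts = [(16_000, 16_000), (19_200, 12_800), (22_400, 9_600)]
rates = [editing_rate(e, u) for e, u in replicate_counts]
site_mean, site_se = mean_and_se(rates)
```

With three replicates per group, the standard error is the sample standard deviation divided by the square root of 3.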


A Rate of RNA editing observed at positions 60,019 (L20F), 60,030 (S16L), 60,038 (F13F), and 60,047 (F10F) in the cDNA samples of sterile (gray) and fertile (white) lines. Error bars indicate standard errors (n = 3). Asterisks indicate statistically significant differences (P < 0.05). B Partial amino acid sequence alignment of orf265 (CMS-PMt010g15739; unedited and edited) and its relatives. The 16th and 20th positions are the edited sites. Amino acid substitutions caused by RNA editing are indicated in bold

