Principle and process of G4-miner
In G4-miner, DNAs are whole genome sequenced only once using a standard Illumina sequencing protocol and standard buffers. The standard sequencing buffers have been optimized and do not cause strong G4 stabilization24,27. However, destabilized G4 structures still influence the sequencing reaction. Instead of altering the sequencing readout, a slight and inconstant drop in sequencing quality was recorded from the beginning of the G4 structure (Fig. 1). In a standard sequencing experiment, DNA fragments in clusters were sequenced, and the Phred quality scores were calculated for each read (cluster). An unstably formed G4 causes unexpected fluctuations in sequencing quality, but usually does not cause mismatches.
G4-miner inspects sequencing quality in the whole genome, seizes unexpected fluctuations locally, and ascribes some of the quality fluctuations to the unstable formation of G4. Median quality scores of bases mapped to each locus in the genome were calculated after read mapping. For each site, we seize the quality scores of the next 75 nt and recorded the number of loci whose median quality score significantly dropped in comparison with that of the specific site. G4 sites were mined after setting the thresholds of the number of loci, and the quality score dropped.
Validation by known sequences
We validated our approach by inspecting the sequencing quality scores of four known control sequences: two containing stable G4 structures (src and myc), one conforming requirement of canonical G4, but that has been demonstrated to prevent G4 formation (a-repeat)28, and a guanine-rich strand that cannot form a G4 (G-rich)29 (Supplementary Table 1). Phred quality scores30 were used to quantify the sequencing quality. We calculated the median quality scores for each locus in the template/genome to seize and represent inconstant sequencing quality fluctuations. Although the readouts were not altered in all control sequences, the median quality score fluctuated slightly from or before the beginning of the G4 structure in the positive controls (Fig. 2a, b). For src and myc, the median quality score dropped ≥ 4 at 55 and 25 loci among the 75 bases (half of the read length) from the beginning of the G4 structure. In contrast, sequencing of the negative controls observed fewer than 6 loci (Fig. 2c, d).


a Src, a sequence containing a stable G4 structure; b myc, a sequence containing a stable G4 structure; c a-repeat, negative control; d g-rich, negative control. The green shadows present the regions of the sequences.
The inspection of quality scores and readouts along the positive control revealed that sequencing accuracy was slightly reduced after the G4 start sites, but the readouts were not altered (Fig. 1). At the same time, the reduction of quality scores was not observed in all reads that covered the G4 structure and the sequence downstream. Some of the reads had constant high-quality scores. The observations suggested that using standard Illumina sequencing buffers, the formation of G4s is not stable, and polymerase stalling does not occur in most template molecules. The sequencing quality score of a read was generated by comprehensively evaluating the accuracy of synthesis reactions in tens of thousands of molecules in a cluster. Under G4-destabilizing conditions (standard Illumina sequencing buffers), quadruplex structures were unstably formed in only a small portion of template molecules in a cluster. The pause of polymerase stalling that occurs in these molecules causes phase shift signals, which interfere with the real signals and induce a drop in the quality score.
Whole genome seizing the quality fluctuation
Next, we explored whether quality fluctuation could be specifically seized genome-wide. A set of 75 nt (half of the read length and longer than 99% predicted G-quadruplexes (PG4s); Supplementary Fig. 1) segments that contained ~178,606 PG4s and downstream sequences at the 3′ end were cut from the plus strand of the human genome. The remaining regions of the plus stand that did not contain PG4s were cut into 75 nt segments. The analysis based on approximately one million reads (~45× of the plus strand) generated by the standard pipeline showed more significant quality fluctuations in the segments containing PG4s than the remaining segments. Over 40% of PG4-positive segments contained over 15 loci whose quality scores decreased no <4 in comparison with the loci before PG4s, and only 4.05% of non-PG4 segments passed the criteria (Supplementary Table 2). Considering that the guanine content is >28% in over 99% PG4-positive segments (Supplementary Fig. 2), we filtered the non-PG4 segments with low guanine content (G% <28%), and <1% of the non-PG4 segments remained under the same criteria (Supplementary Table 3). Although the sequence quality for most non-PG4s was stable with no observable decrease, the remaining 1% of non-PG4 segments were found to have significant quality fluctuation (~370,000 segments). The number of these non-PG4 fragments showed a considerable decline in a considerable number of loci, much greater than the predicted number of PG4 (178,606), which is consistent with the previous approach in G4-stabilizing conditions24, indicating that the human genome contains more G4s than predicted11.
We applied G4-miner to profile DNA G4 structures in the human genome by resequencing primary B lymphocytes (GM12878, Coriell Institute) using the Illumina HiSeq4000 platform under standard conditions. To identify reliable G4 sequences, we inspected the quality scores of 75 bases at the 3′ end of a specific site and screened median quality scores of each site in both strands of the human genome. We set individual thresholds for the two strands of the parallel runs to ensure that the false-positive rate was similar and smaller than 1% (Supplementary Table 4). The influence evaluation of sequencing depth revealed that the PG4 detection rate plateaued when the sequencing depth was greater than 20-fold for one stand (Supplementary Fig. 3). Therefore, two duplicates of the experiment were performed, and at least 1.88 billion reads were generated with an average coverage of 45× for each strand of the human genome (Supplementary Table 5). We set the threshold of 15 loci quality score decreased 4 (15, 4) for the plus strand of Experiment 1, (15, 3) for the minus strand of Experiment 1, (12, 4) for the minus strand of Experiment 2, and (11, 4) for the minus strand of Experiment 2, respectively. Any segment of the genome that passed these thresholds was termed the mined G4 (MG4) sequence. Within both strands of the human genome, we identified 1,054,941 MG4s and 936,545 MG4s in the two parallel experiments (Supplementary Table 6). Approximately 52% and 49% of all 356,298 predicted canonical quadruplexes were also detected by G4-miner and present in MG4s (Supplementary Table 6). Moreover, 736,689 of the total number of MG4s (70% for Experiment 1 and 78% for Experiment 2) were detected in both experiments (Fig. 3a). The high overlap between parallel experiments indicated that G4-miner is stable in quadruplex assigning.


a Overlap between PG4s that were detected in MG4s in two parallel runs and overlap between MG4s in two parallel runs. b Prevalence of MG4s by different G4 structural families. c Detection rate of canonical G4 sequences categorized by loop length. d Genome browser view of MG4 and PG4 across two oncogenes, MEGF6 and SRC. The median sequencing quality score dropped in each locus was tracked along the whole gene sequence. The regions above the threshold are shown as red bars (MG4s). The PG4s are shown as black bars.
Structural analysis and genome browser of MG4 sequences
We noticed that only 17% of the MG4s were predicted to be classically described G4s. Recently, noncanonical G4 structures such as long loops, bulges, and two quartets have been identified by structural biological and biophysical studies31,32,33. Therefore, we hierarchically assigned MG4 to five categories to elucidate the structural features (Fig. 3b). Canonical G4s, long loops, bulges, and two quartets accounted for 17%, 6%, 23%, and 34% of total MG4s based on the threshold of sequencing quality. Canonical G4 sequences with short loops were more easily identified than sequences with longer loops (Fig. 3c), which is consistent with their relative thermodynamic stability32,33. The MG4s in the “other” category does not contain known G4 structures, but also cause a slight and inconstant drop in sequencing quality. Sequence analysis revealed that most of them contained other DNA secondary structures, such as hairpin structures, polyadenine/thymine, and i-motifs (Supplementary Table 7). To ensure fidelity, this category was removed for the following analyses. Our results revealed that G4-miner is able to mine and identify G4 sequences in the whole genome from ordinary sequencing data.
We quantified MG4s in exons, introns, promoters, untranslated regions (UTRs), and splicing junctions (Table 1). MG4s were found to be densest in 5′-UTRs and splicing junctions, where G4s functioned in posttranscriptional regulation34,35. The density of G4s in genomic regions is consistent with the result obtained by adding K+ or PDS to sequencing buffers24. Visual inspection of genes rich in PG4s revealed that G4-miner is effective in mining G4 regions (Fig. 3d).
Comparison with previously published data
Compared to the result of the optimized G4-seq, which combined polymerase stop assays with next-generation sequencing26, over 89% and 99% of PG4s detected by G4-miner were also detected by adding K+ and PDS to sequencing buffers, respectively (Fig. 4a). Meanwhile, 73% of PG4s detected by adding K+ to sequencing buffers were detected by G4-miner. The detected PG4s highly overlapped between adding K+ to the sequencing buffer and utilizing G4-miner. More PG4s are detected under G4-stabilizing conditions (PDS) than under standard sequencing conditions, which confirms the existing knowledge that G4 structures are thermodynamically stable in the presence of PDS. Over 40% of all MG4s were detected by adding PDS to sequencing buffers (Fig. 4b). In addition, 44% of the G4 sequences observed by adding K+ to sequencing buffers were detected in our method. We divided the G4 sequences into three parts and utilized structural analysis for each part (observed by both methods, only observed by G4-miner, and only observed under K+ or PDS conditions) (Fig. 4c, d). Unexpectedly, the most abundant structural category in the overlapping parts was canonical G4s, especially in the overlap under standard sequencing conditions and under K+ conditions (60%). Among noncanonical G4 structures, the bulge is the dominant category in the G4 sequences, which is only observed under the PDS condition (50%) and is the most abundant category under the K+ condition (36%). In fact, the bulge was the most abundant category in all G4 sequences detected under PDS conditions (45%) and was the second most abundant category in all G4 sequences under K+ conditions (30%; Supplementary Fig. 4). In contrast, two quartets were dominant in the G4 sequences only observed by G4-miner (52% for K+ and 58% for PDS), which was also the most abundant category in all G4 sequences detected by G4-miner (40%, Fig. 3b). The results show that both optimized G4-seq and G4-miner are reliable in detecting canonical G4s. However, the prevalence of detected G4 sequences by category varies between methods and detection conditions. Bulges are more likely to be detected under PDS conditions using G4-seq, and two quartets are prone to be seized using G4-miner.


a Overlap between PG4s that were detected in MG4s in standard sequencing buffers and K+ or PDS conditions26. b Overlap between MG4s in standard sequencing buffers and K+ or PDS conditions26. c Prevalence of G4 sequences detected both under K+ conditions and by G4-miner, only detected under K+ conditions and only detected by G4-miner. d Prevalence of G4 sequences detected both under PDS conditions and by G4-miner, only detected under PDS conditions and only detected by G4-miner.
Notably, the distribution of MG4 categories was highly consistent with the result of rG4-seq25, which exploited reverse transcriptase stalling reactions containing K+ with G4-stabilizing PDS prior to sequencing (Supplementary Fig. 5). Two quartets are most abundant in the result of rG4-seq (K+–PDS), ~39% of detected G4 sequences, which is 40% in our method.
Applicability evaluation of G4-miner by public datasets
To evaluate the applicability of G4-miner, we tested the method by using public resequencing datasets of six species whose reference genomes are known. The sequencing depth of the selected species was at least ×50 (Supplementary Table 8). The density of PG4s varies among species, from 1.56 PG4s per million bases to 89.42 PG4s per million bases (Table 2). The threshold guanine content calculated for each genome ranged from 24 to 29% based on the criterion that 99% of PG4-positive segments were greater than the threshold. As a result, the detection rates of predicted canonical quadruplexes range from 32 to 58%. The results of Homo sapiens using public datasets are highly consistent with the results of local datasets, not only the number of detected PG4 but also the distribution of MG4 categories. All kinds of G4 sequences were detected in the datasets of each species, and the distribution of MG4 categories was diverse among species (Supplementary Table 8). All the results demonstrated that G4-miner is reliable and widely applicable in quadruplex detection.
Evaluation of the influence of SNVs on G4 formation
SNVs are the differences at specific nucleotide locations in the genome and are one of the most important genetic variations among individuals. The alteration of single nucleotides might have different consequences at the phenotypic level36. Some of them might have different consequences in G4s, including controlling the formation, altering the structure, and influencing the stability. Therefore, we inspected SNVs in the whole human genome of GM12878, both homozygous SNVs and heterozygous SNVs. For homozygous SNVs, we PG4s again using the modified genome, which altered nucleotides in specific locations. Due to the alternation of homozygous SNVs, 1502 new PG4s were predicted in this time, where 48% of them (728 PG4s) were identified as MG4s and 1077 existing PG4s were not predicted, where 60% of them (643 PG4s) were not recognized as MG4s (Experiment 1, Fig. 5a, b).


a The alteration of a homozygous SNV (rs11264880) causes the formation of a G4 structure and was detected. b The alteration of a homozygous SNV (rs10915961) prevents the formation of the G4 structure and was not detected. c Two alleles of the heterozygous SNV (rs369859441, G/A) have different influences on the formation of the G4 structure and were demonstrated by G4-miner (formed: allele G; not formed: allele A). The green shadows present the regions of the sequences.
For heterozygous SNVs, we classified the sequencing reads into two groups based on allele types to identify MG4s separately. Among all heterozygous SNVs, 7.17% (186,904) showed different influences on the formation of G4s in the plus strand and 6.28% (163,666) in the minus strand (Fig. 5c and Table 3). All homozygous and heterozygous SNVs associated with G4s are individual-specific G4s that might contain individual differences in chromatin architecture and many fundamental biological processes, such as DNA replication and gene regulation. Our results show that G4-miner can characterize G4s for specific individuals.

