Plasmid vectors
Sequences encoding human RPA70-C and Rad51DBD were synthesized by GenScript, amplified by PCR, and cloned into the pCMV-PE2 plasmid (Addgene, no. 132775) to generate the ssDBD-PE2-encoding plasmids. These plasmids were named PE2-mid_RPA70, hyPE2, PE2-N_Rad51, and PE2-C_Rad51 (Fig. 1a). Linker variants were derived from the hyPE2 plasmid and cloned using Gibson assembly16. The sequences of the primers and plasmids used in this study are shown in Supplementary Table 4 and the Supplementary Note, respectively.
Plasmid-library preparation and cell-library generation
We previously generated a plasmid library of 54,836 pairs of pegRNA-encoding and target sequences2. We randomly selected 107 plasmids from this library by colony picking and mixed the selected plasmids at an equimolar ratio (library A).
Next, we additionally designed another library named library B that includes 665 pairs of pegRNAs and target sequences for a more extensive evaluation of various types of editing at a larger number of target sequences. To design library B, we selected 100 deletion-, 100 insertion-, and 200 substitution-inducing pegRNAs from the previously published library of 54,836 pairs of pegRNA-encoding and target sequences2. For this selection, we divided the editing efficiencies from the previous study2 into eight strata (<1%, 1–3%, 3–6%, 6–10%, 10–20%, 20–30%, 30–40%, and >40%) and randomly selected a similar number of pegRNAs from each stratum, so that pegRNAs associated with all levels of efficiency would be included. To these 400 pegRNAs, we added 107 pegRNAs used in library A, resulting in a total of 507 pegRNAs. Among the 507 pegRNAs, 158 could be modified to induce a silent mutation in the NGG PAM sequence; we added the 158 modified pegRNAs, which can induce a silent mutation in the PAM sequence in addition to the initially designed edit, to library B. Thus, the total number of pegRNAs in library B was 507 + 158 = 665. Each pegRNA was associated with three barcodes. Thus, the number of oligonucleotides used to generate library B was 665 × 3 = 1995.
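The stratified selection and the library B bookkeeping described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names, the input dictionary of previously measured efficiencies, the even split across strata, and the handling of values that fall exactly on a stratum boundary are all assumptions.

```python
import bisect
import random
from collections import defaultdict

# Upper edges (in %) of the first seven strata; the eighth stratum is >40%.
# Assumption: a value exactly on an edge is placed in the higher stratum.
STRATUM_EDGES = [1, 3, 6, 10, 20, 30, 40]


def stratum_of(efficiency_pct):
    """Return the stratum index (0-7) for an editing efficiency in percent."""
    return bisect.bisect_right(STRATUM_EDGES, efficiency_pct)


def stratified_sample(efficiencies, n_total, seed=0):
    """Pick up to n_total pegRNA ids, spread evenly over the occupied strata.

    efficiencies: dict mapping a (hypothetical) pegRNA id to its previously
    measured editing efficiency in percent.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for peg_id, eff in efficiencies.items():
        by_stratum[stratum_of(eff)].append(peg_id)
    if not by_stratum:
        return []
    per_stratum = n_total // len(by_stratum)
    chosen = []
    for idx in sorted(by_stratum):
        pool = sorted(by_stratum[idx])
        chosen.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return chosen


# Library B bookkeeping, using the counts stated in the text.
n_selected = 100 + 100 + 200           # deletion + insertion + substitution
n_with_library_a = n_selected + 107    # plus the library A pegRNAs
n_library_b = n_with_library_a + 158   # plus the PAM-silencing variants
n_oligos = n_library_b * 3             # three barcodes per pegRNA
```

Here `n_library_b` and `n_oligos` reproduce the totals of 665 pegRNAs and 1995 oligonucleotides given above.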
In preparation for generating lentivirus from the library of 107 plasmids, HEK293T cells were seeded at a density of 4.0 × 10⁶ cells per plate on 100-mm dishes that contained Dulbecco’s Modified Eagle Medium (DMEM). After 15 h, the culture medium was replaced with DMEM containing 25 μM chloroquine diphosphate (Sigma) and the cells were incubated for another 5 h. The plasmid library was mixed with psPAX2 (Addgene no. 12260) and pMD2.G (Addgene no. 12259) at a molar ratio of 1.3:0.72:1.64; the plasmids were then cotransfected into HEK293T cells using polyethylenimine (PEI MAX, Polysciences). The next day, the culture medium was replaced with fresh medium. At 48 h after the transfection, the medium, which contained the lentivirus, was collected and filtered using a Millex-HV 0.45-μm low protein-binding membrane (Millipore). The filtrate was then aliquoted and stored at −80 °C. For titration of the lentivirus, serial dilutions of a viral aliquot were transduced, in the presence of 8 μg/ml polybrene (Sigma), into HEK293T cells that had been cultured in DMEM supplemented with 10% fetal bovine serum (FBS). Untransduced and transduced cells were then both cultured in DMEM supplemented with 10% FBS and 2 μg/ml of puromycin (Invitrogen). After essentially all of the untransduced cells had died, we counted the number of living cells in the transduced population to estimate the viral titer17.
For lentivirus transduction, HEK293T or HCT116 cells were seeded on 100-mm dishes at a density of 1.0 × 10⁶ cells per dish and incubated overnight. The lentiviral library was transduced at an MOI of 0.3 to achieve a coverage greater than 3000 × relative to the number of selected pegRNA-encoding plasmids. The next day, the culture medium was replaced with DMEM supplemented with 10% FBS and 2 μg/ml puromycin (InvivoGen). Cultures were maintained under these conditions for the next five days to remove untransduced cells.
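One common way to relate MOI, coverage, and cell number is a Poisson model of transduction. The sketch below uses that model as an assumption; the text does not state how the number of dishes was chosen, and the function names are hypothetical.

```python
import math


def fraction_transduced(moi):
    """Poisson estimate of the fraction of cells that receive >= 1 virion."""
    return 1.0 - math.exp(-moi)


def cells_to_seed(library_size, coverage, moi):
    """Cells needed so that transduced cells number coverage x library_size."""
    return math.ceil(library_size * coverage / fraction_transduced(moi))


# Library of 107 plasmids, 3000x coverage, MOI of 0.3 (values from the text).
needed = cells_to_seed(library_size=107, coverage=3000, moi=0.3)
```

Under this assumption, roughly a quarter of the cells are transduced at an MOI of 0.3, so a little over 1.2 million cells would be needed to reach 3000 × coverage of 107 plasmids.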
Delivery of PE2 or PE2 variants into the cell library
To deliver each PE2 variant to cell library A or B, PE2 variant-, pcDNA-BSD-, and puro-eGFP-encoding plasmids were mixed at a weight ratio of 10:1:1 to yield a total of 12 μg (for experiments using library A) or 24 μg (for library B) of plasmid mixture, which was then transfected into a total of 1 × 10⁶ cells from cell library A or a total of 6 × 10⁶ cells from cell library B using Lipofectamine 2000 (Invitrogen), following the manufacturer’s protocol. After incubation overnight, the culture medium was exchanged with DMEM containing 10% FBS and 40 μg/ml blasticidin S (InvivoGen). Five days later, the transfected cells were harvested with 0.25% trypsin for genomic DNA extraction and deep sequencing.
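As a quick check of the plasmid amounts implied by the 10:1:1 weight ratio, the mixture can be split as in this short sketch (the helper name is hypothetical):

```python
def split_by_weight_ratio(total_ug, ratio):
    """Split a total plasmid mass (in ug) according to a weight ratio."""
    parts = sum(ratio)
    return [total_ug * r / parts for r in ratio]


# Order: PE2 variant, pcDNA-BSD, puro-eGFP.
library_a_mix = split_by_weight_ratio(12, (10, 1, 1))
library_b_mix = split_by_weight_ratio(24, (10, 1, 1))
```

This gives 10, 1, and 1 μg for the 12-μg mixture and 20, 2, and 2 μg for the 24-μg mixture.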
Measurement of prime-editing activities at endogenous sites
To evaluate hyPE2 and PE2 activities at endogenous sites, HEK293T or HCT116 cells were seeded into 24-well plates and transfected at 70–80% confluency. In all, 750 ng of PE2-, 250 ng of pegRNA-, and 100 ng of eGFP-Puro- (Addgene no. 45561) encoding plasmids were mixed and co-transfected into the cells using Lipofectamine 2000, following the manufacturer’s protocol. The next day, the culture medium was replaced with DMEM supplemented with 10% FBS and 2 μg/ml puromycin (InvivoGen). Five days later, the transfected cells were harvested with 0.25% trypsin for genomic DNA extraction and deep sequencing.
After written informed consent was obtained from a healthy study participant, a dermatology specialist performed a skin-punch biopsy on the participant. The Institutional Review Board of Severance Hospital, Yonsei University Health System approved the consent procedure and the study (No. 4-2012-0028). The fibroblasts derived from the skin biopsy were cultured in DMEM containing 10% FBS and penicillin/streptomycin. A total of 1 × 10⁶ human skin fibroblasts were mixed with 3 μg of PE2-, 1 μg of pegRNA-, and 1 μg of eGFP-Puro-encoding plasmids and electroporated using a Neon electroporation kit, following the manufacturer’s protocol. Five days after the transfection, the cells were harvested with 0.25% trypsin for genomic DNA extraction and deep sequencing.
Deep sequencing
The protocol used for deep sequencing has been previously described2,18,19,20. Briefly, genomic DNA was extracted from pelleted cells using a Wizard Genomic DNA purification kit (Promega), following the manufacturer’s protocol. To measure prime editing efficiencies for the library experiments, a total of 16 μg (greater than 16,000 × coverage) of genomic DNA was PCR-amplified using a 2× pfu PCR Smart mix (Solgent). The resulting PCR products were combined and purified with a MEGAquick-spin total fragment DNA purification kit (iNtRON Biotechnology). Next, 20 ng of purified product was PCR-amplified using primers containing Illumina adapter and barcode sequences. To determine prime-editing efficiencies at endogenous sites, ~200 ng of individual genomic DNA samples were PCR-amplified in 20-μl reaction volumes. The resulting PCR products were combined and purified. Next, 100 ng of purified product was PCR-amplified in a 20 μl reaction volume using primers containing Illumina adapter sequences. The resulting products were purified and sequenced with MiniSeq (Illumina). The primers used for PCRs are listed in Supplementary Table 4.
Analysis of prime-editing activities
The prime-editing efficiencies (i.e., the frequencies of intended edits) in the library experiments were calculated using previously published Python scripts2 as follows:
$$\frac{\text{Read counts with intended edit and specified barcode}}{\text{Total read counts with specified barcode}} \times 100$$
(1)
To identify individual pegRNA and target–sequence pairs, a 22-nt sequence, consisting of an 18-nt barcode and a 4-nt sequence upstream of the barcode, was used. To improve the accuracy of our analysis, pegRNA and target-sequence pairs with deep-sequencing read counts below 100 were excluded2,19,21. The reads that contained the desired edit but lacked unintended mutations in the wide target sequence containing the PAM were classified as PE2-induced mutations.
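The per-pair calculation in Eq. (1), together with the read-count filter, can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the previously published script: the input is assumed to be already reduced to (identifier, edit flag) tuples, where the identifier stands for the 22-nt sequence and the flag is true only for reads with the intended edit and no unintended mutations.

```python
from collections import Counter

MIN_READS = 100  # pairs with read counts below 100 are excluded (see text)


def library_efficiencies(reads, min_reads=MIN_READS):
    """Compute Eq. (1) for each pegRNA and target-sequence pair.

    reads: iterable of (identifier, has_intended_edit) tuples.
    Returns {identifier: efficiency in %} for pairs with enough reads.
    """
    total = Counter()
    edited = Counter()
    for identifier, has_intended_edit in reads:
        total[identifier] += 1
        if has_intended_edit:
            edited[identifier] += 1
    return {
        ident: 100.0 * edited[ident] / n
        for ident, n in total.items()
        if n >= min_reads
    }


# Toy input: pair "A" has 200 reads (30 edited); pair "B" has only 50 reads.
toy_reads = [("A", True)] * 30 + [("A", False)] * 170 + [("B", True)] * 50
toy_result = library_efficiencies(toy_reads)
```

In this toy input, pair "A" is reported at 15% and pair "B" is dropped because it has fewer than 100 reads.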
To evaluate the frequencies of intended edits, unintended edits, and indels at endogenous sites, Cas-analyzer was used22 and the values were calculated as described below. For analysis of unintended substitutions near the target position, a 40-nt region spanning from -10 nucleotides (nts) to +25 nts from the nick site was evaluated for substitutions and the average values were considered as read counts for subsequent calculations.
$$\text{Intended editing frequency} = \frac{\text{Read counts with intended edit}}{\text{Total read counts}} \times 100$$
(2)
$$\text{Unintended editing frequency} = \frac{\text{Read counts with unintended edit}}{\text{Total read counts}} \times 100$$
(3)
$$\text{Indel frequency} = \frac{\text{Read counts with indel}}{\text{Total read counts}} \times 100$$
(4)
In some cases, we calculated an adjusted fold increase in which 0.1% was added to both the hyPE2 and PE2 efficiencies, both to avoid division by zero when the PE2 efficiency is 0% and to attenuate insignificant fold increases, as shown below.
$$\text{Adjusted fold change} = \frac{\text{hyPE2 efficiency}\ (\%) + 0.1\%}{\text{PE2 efficiency}\ (\%) + 0.1\%}$$
(5)
For example, the adjusted fold increase from 0.015% to 0.15% can be calculated as (0.15% + 0.1%)/(0.015% + 0.1%) = 2.2-fold instead of 10-fold; however, the increase from 1.5% to 15% can be calculated as (15% + 0.1%)/(1.5% + 0.1%) = 9.4-fold, which is close to 10-fold. When we used an adjusted fold increase instead of the fold increase, we mention this point in the legends to the relevant figures.
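The adjusted fold calculation of Eq. (5) is simple enough to express as a short function. The sketch below uses hypothetical names and takes efficiencies in percent; it reproduces the two worked examples above.

```python
PSEUDOCOUNT_PCT = 0.1  # the 0.1% added to both efficiencies in Eq. (5)


def adjusted_fold_change(hype2_pct, pe2_pct, pseudocount=PSEUDOCOUNT_PCT):
    """Adjusted fold change of hyPE2 over PE2 efficiency (both in percent)."""
    return (hype2_pct + pseudocount) / (pe2_pct + pseudocount)


low_example = round(adjusted_fold_change(0.15, 0.015), 1)   # 2.2
high_example = round(adjusted_fold_change(15.0, 1.5), 1)    # 9.4
```

Because of the added 0.1%, the function also returns a finite value when the PE2 efficiency is 0%.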
Measurement of prime-editing activities at potential PE2 off-target sites
Potential PE2 off-target sites that have up to two nucleotide mismatches or a one-nucleotide RNA or DNA bulge were identified by Cas-OFFinder23. Information about the potential off-target sites is shown in Supplementary Table 2. To evaluate prime-editing efficiencies at the potential off-target sites, the genomic DNA samples that were used for the measurement of prime-editing activities at endogenous sites described above were used as templates for PCR amplification. The resulting products were purified and sequenced with MiSeq.
Conventional machine learning-based model training
The data of hyPE2- and PE2-induced prime-editing efficiencies obtained using library B were split into training and test datasets by random sampling, such that neither pegRNAs nor target sequences are shared between the two datasets (Supplementary Table 3). Each of seven conventional machine learning algorithms was used to train a model: extreme-gradient boosting (XGBoost), gradient-boosted regression tree (Boosted RT), random forest, L1-regularized linear regression (Lasso), L2-regularized linear regression (Ridge), L1L2-regularized linear regression (ElasticNet), and support-vector machine (SVM). We used the XGBoost Python package (version 1.3.3)24 and scikit-learn (version 0.23.2)25. A set of 1820 features, including position-independent and position-dependent nucleotides and dinucleotides, melting temperature, GC counts, the minimum self-folding free energy26,27, and the DeepSpCas9 score27, were extracted from the wide target sequences and the PBS and RT-template sequences2. The MeltingTemp module (https://biopython.org/docs/1.74/api/Bio.SeqUtils.MeltingTemp.html) was used to calculate the melting temperature using the default settings. To select a model from the regularization parameters and hyperparameter configurations in each algorithm, we performed fivefold cross-validation. Details for each of the machine-learning algorithms follow. XGBoost and gradient-boosted regression tree: we searched over 16 models that had been chosen from various hyperparameter configurations {the number of base estimators (chosen from [50, 100]), the maximum depth of the individual regression estimators (chosen from [5, 10]), the minimum number of samples to be at a leaf node (chosen from [1, 2]), and learning rate (chosen from [0.1, 0.2])}.
Random forest: we searched over 16 models chosen from the same hyperparameter configurations used for XGBoost, except that the learning rate was not used; we searched over the maximum number of features to consider when looking for the best split (chosen from [all features, the square root of all features, the binary logarithm of all features]). L1-, L2-, and L1L2-regularized linear regression: we searched over 16 points that were evenly spaced between 10⁻⁶ and 10⁶ in log space to optimize the regularization parameter. SVM: we searched over 16 models from the following hyperparameters: penalty parameter C and kernel parameter γ, each chosen from four points that were evenly spaced between 10⁻³ and 10³.
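The search spaces described above can be enumerated with the standard library alone, as in the following sketch. The parameter names follow scikit-learn conventions and are assumptions (the XGBoost package uses different names for some of them), as is the log-scale spacing of the SVM grid, which the text specifies only for the linear models.

```python
from itertools import product

# Grid for XGBoost and gradient-boosted regression trees (values from text).
boosting_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 2],
    "learning_rate": [0.1, 0.2],
}


def enumerate_configs(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


def log_spaced(low_exp, high_exp, n_points):
    """Return n_points values evenly spaced in log10 between the exponents."""
    step = (high_exp - low_exp) / (n_points - 1)
    return [10 ** (low_exp + i * step) for i in range(n_points)]


boosting_configs = enumerate_configs(boosting_grid)     # 2 x 2 x 2 x 2 = 16
regularization_strengths = log_spaced(-6, 6, 16)        # Lasso/Ridge/ElasticNet
svm_configs = enumerate_configs({
    "C": log_spaced(-3, 3, 4),
    "gamma": log_spaced(-3, 3, 4),
})                                                      # 4 x 4 = 16
```

Each configuration would then be scored by fivefold cross-validation on the training set, and the best-scoring one retained for that algorithm.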
Three-dimensional structural modeling
The structural model for hyPE2 shown in Supplementary Fig. 5b was built with the Coot program (version WinCoot 0.9.6.1)28. The three-dimensional model of Cas9 in complex with a guide RNA and a target DNA fragment was obtained from the structure of a SpCas9 DNA adenine-base editor (PDB code: 6VPC)29. To model the 3′ extension of a 121-nt pegRNA (residues 83–121) manually, we used RNAfold WebServer30 to predict the secondary structure of this region and adopted a hairpin structure for residues 83–97. The pegRNA RT-template region hybridized with the 16-nt DNA primer region was manually modeled based on the structure of XMRV RT in complex with an RNA:DNA hybrid (PDB code: 4HKQ)31. The three-dimensional model of the Rad51 ssDBD (residues 16–85) was obtained from the structure of the N-terminal domain of Rad51 (PDB code: 1B22)32. To find putative α-helices in flexible N- and C-terminal regions of the Rad51 ssDBD, Linker A, and Linker B, we predicted the secondary structures using RaptorX33. Figures showing the three-dimensional structures (Supplementary Fig. 5b) were produced using the UCSF Chimera program34, and two linkers were represented on the three-dimensional structures, taking into account their lengths and secondary structures.
For the schematic structural model of hyPE2 shown in Supplementary Fig. 5a, we used the coordinates of Cas9 (PDB 4OO8), RT (PDB 5DMQ), and Rad51 (PDB 1B22). The structural image was prepared using the program CueMol (version 2.2.3.443; http://www.cuemol.org).
Statistics and reproducibility
Data are presented as means ± S.D. from independent experiments. P-values were calculated by two-tailed, unpaired Student’s t-test or one-way ANOVA with post hoc analysis by Tukey’s multiple comparisons, depending on the number of independent variables. The high-throughput experiments were independently repeated three times for library A, and two times for library B and linker variants. All replications showed similar results. The individual evaluation experiments of HEK293T cells, HCT116 cells, and human fibroblasts were independently repeated three times, with comparable results.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

