Systematic decomposition of sequence determinants governing CRISPR/Cas9 specificity

Oligonucleotide library design

In this study, we designed 11 high-throughput libraries for systematically decomposing sequence determinants that govern CRISPR/Cas9 specificity. Each library consists of ~12,000 oligonucleotides. These 11 libraries are summarized as follows and more details are provided in Supplementary Data 6.

Library T1, T2, and T3: paired gRNA and dual-target design (Fig. 1b) for method evaluation and rule mining. The T1 library includes control gRNAs, benchmark gRNAs and random gRNAs for method evaluation. The T2 and T3 libraries, together with the fraction of random gRNAs in T1, are used to uncover the rules underlying off-targeting.
Library S1: paired gRNA and single-target design (Fig. 1a) for method evaluation. We designed this library following a previously described high-throughput library design protocol³⁴. For a systematic comparison, Library S1 contains the same sets of gRNA and corresponding target combination as Library T1.
Library Allele: paired gRNA and dual-target design (Fig. 1b) for the allele-editing screen. We designed this allele-editing library to leverage the mismatched gRNAs to target 18 cancer hotspot mutations, which were curated from Memorial Sloan Kettering Cancer Center (MSKCC) Cancer Hotspots (https://www.cancerhotspots.org) and cBioportal for Cancer Genomics (https://www.cbioportal.org) with reported gain-of-function or dominant-negative effects. For dual-target design, the wildtype sequence is placed at the off-target position and the mutant sequence at the on-target position (Fig. 5b).
Library sg1, sg2, sg3, sg4, sg5, and sg6: target-only design (no gRNA pooling) for GMT exploration (Fig. 2a, b). Two gRNAs, sg2 and sg6, have the crystal structure with Cas9 and their targets (PDBs: 4OO8 and 4UN3). Target sequences in each library include the on-target, all the possible 1 mismatch (1-MM, 66 sequences) and 2 mismatches (2-MM, 2,079 sequences) targets, and randomly selected 3 mismatches (3-MM, 100 sequences), 4 mismatches (4-MM, 100 sequences), 5 mismatches (5-MM, 100 sequences), and 6 mismatches (6-MM, 47 sequences) targets.

The gRNAs in these 11 libraries are categorized into 5 types: control gRNAs, benchmark gRNAs, randomly generated gRNAs, allele-editing gRNAs, and truncated gRNAs. Every control gRNA was designed with 4 types of dual-target sequences by the replacement of “NGG” with “NTT” sequence at the PAM to represent 4 combinations of cleavage events (no cleavage, left, right, and both) (Fig. 1c). The benchmark gRNAs with reported genomic off-target sites were curated from different methods detecting genome-wide off-targets induced by Cas9 editing in vitro or in vivo and used to evaluate the dual-target approach^1,12,17,36. Unreported genomic off-target sites with <5 mismatches relative to corresponding benchmark gRNAs were generated by CRISPRitz⁶³ (Fig. 1h–j). Except for the gRNAs in Libraries sg1-6, each randomly generated gRNA was designed with 7 types of off-target sequences to systematically decompose the estimated off-on ratios into single-mismatch effect, combinatorial effect, and GMT. These seven sequence types include three targets with 1-MM, 3 with 2-MM, and 1 with 3-MM. The mismatches in the 2-MM and 3-MM targets are the combinations of the mismatches in 1-MM targets (Fig. 1k). Frequencies of four nucleotide types (A, G, C, and T) at each position for randomly designed gRNAs are roughly evenly distributed (Supplementary Fig. 16). Allele-editing gRNAs were designed to test the selectivity for allele-specific targeting on 18 cancer hotspot mutations (Fig. 5b). Truncated 17–19 nt gRNAs were designed to compare to mismatched gRNAs for allele-editing assessment (Fig. 5d, e).

In the paired gRNA and dual-target libraries (Library T1, T2, T3, and Allele), each 190-bp oligonucleotide consists of an 18-bp left primer sequence, a 10-bp barcode 1, a 23-bp corresponding off-target sequence containing a PAM, a 15-bp linker to segregate two targets, a 23-bp corresponding on-target sequence containing a PAM, a 15-bp barcode 2, a 45-bp cloning linker sequence containing a vector homology sequence, an SspI enzyme recognition site and an hU6 homology sequence, a 21-bp guide sequence beginning with “g” (gN₂₀), and a 20-bp right primer sequence. In the paired gRNA and single-target library (Library S1), each 157-bp oligonucleotide consists of a longer 20-bp barcode 2 but lacks the 15-bp linker and the second 23-bp target. All other sequence components are identical between the single- and dual-target design. In the target-only libraries (Library sg1, sg2, sg3, sg4, sg5, and sg6), each 109-bp oligonucleotide consists of a 25-bp left primer sequence, a 10-bp barcode 1, a 27-bp corresponding target sequence containing a PAM, a 20-bp barcode 2, and a 27-bp right primer sequence (Supplementary Note). All the oligonucleotides in these libraries passed the filters of having no thymine homopolymers more than three nucleotides long in gRNA sequences, carrying no more than one SspI enzyme recognition site, one left primer site, and one right primer site, and no sites besides the target sequence being cut by gRNAs in the constructed plasmids.

Plasmid cloning

gRNA I and D associated with distinct repair mechanisms upon double-strand breaks were selected from a previous single-target library predicting mutations generated by CRISPR/Cas9. Dual-target I and D were designed with four types of dual-targets by the replacement of “NGG” with “NTT” sequence at the PAM as described above. For gRNA I/D and corresponding dual-target I/D pairs cloning, the 152-bp single-strand Ultramer™ DNA oligonucleotides with 4 types of dual-target I and D were synthesized (IDT) and amplified with Q5 High-Fidelity DNA polymerase (NEB #M0492). Meanwhile, gRNA I and D oligonucleotides (Sigma) were annealed and ligated to the BsmBI (NEB #R0739)-linearized LentiGuide-BSD-PacI plasmid. The LentiGuide-BSD-PacI plasmid was derived from the LentiGuide-BSD plasmid by inserting a PacI recognition site to 33-bp downstream of the scaffold sequence using Q5 site-directed mutagenesis (NEB #E0552). The resulting plasmids were then linearized by PacI (NEB #R0547), and undergone Gibson assembly (NEBuilder HiFi DNA Assembly Master Mix, NEB #E2621) with the amplified four types of dual-target I and D fragments. Assembled products were then transformed to homemade Stbl3 competent cells, and the plasmids were extracted using QIAprep Spin Miniprep Kit (QIAGEN) and confirmed by Sanger sequencing.

For plasmid library cloning, the oligonucleotide libraries with paired gRNA and dual-target design were synthesized by Twist Bioscience, and libraries with paired gRNA and single-target design or target-only design were ordered from CustomArray. The plasmid library containing pooling gRNAs and corresponding single- or dual-target pairs was prepared by undergoing an intermediate circulation, SspI enzyme linearization, and Gibson assembly into a lentiviral plasmid LentiGuide-BSD-Δscaffold. This multistep cloning procedure was adapted and modified from previously described protocols^33,34. Briefly, the oligonucleotide pools were amplified with Q5 High-Fidelity DNA polymerase for extension of the oligonucleotide sequences with overhangs complementary to a donor G-block. Gibson assembly was performed to ligate the amplified pools to the 125-nt donor gBlock fragment (IDT) encoding the gRNA scaffold at a molar ratio of 1:3 at 50 °C for 1 h, and the resulting products were incubated with Plasmid Safe ATP-Dependent DNase (Lucigen #E3101K) at 37 °C for 1 h to remove the linear fragments. The assembled circular DNA was purified with QIAquick PCR Purification Kit (QIAGEN), followed by linearization with SspI enzyme (NEB #R3132) at 37 °C for 4 h, and purification with QIAquick PCR Purification Kit. Next, the purified products were performed the second amplification with Q5 High-Fidelity DNA polymerase for the addition of overhangs complementary to gRNA-expressing plasmid LentiGuide-BSD-Δscaffold, and the PCR products were then purified with QIAquick Gel Extraction Kit (QIAGEN). This LentiGuide-BSD-Δscaffold plasmid lacking the scaffold sequence was generated by removing the gRNA scaffold from LentiGuide-BSD plasmid using Q5 site-directed mutagenesis. Gibson Assembly was then employed to fuse the second amplified oligonucleotide pools to BsmBI-linearized LentiGuide-BSD-Δscaffold at a molar ratio of 3:1 at 50 °C for 1 h. The resulting products were then purified by isopropanol precipitation with GlycoBlue^TM Coprecipitant (Thermo Fisher Scientific #AM9516), dissolved in TE buffer (pH 8.0) to a final concentration of 100 ng/μl at 55 °C for 10 min, and transformed into Endura electrocompetent cells (Lucigen #60242). The plasmid library was extracted by NucleoBond Xtra Maxi Kit (MACHEREY-NAGEL). The construction of the plasmid library containing single gRNA and corresponding targets followed the cloning processes of gRNA I/D and dual-target I/D pairs as mentioned above. In brief, oligonucleotides of each gRNA were annealed and inserted into BsmBI-linearized LentiGuide-BSD-PacI, and the constructs then were verified by Sanger sequencing. In the meantime, the target-only library was amplified with Q5 High-Fidelity DNA polymerase and purified with gel purification. Gibson Assembly was applied to ligate the amplified target-only oligonucleotide pools to their corresponding PacI-linearized single gRNA-expressing plasmids LentiGuide-BSD-PacI at a molar ratio of 6.5:1 at 50 °C for 1 h. The assembled products were purified by isopropanol precipitation and transformed into electrocompetent cells, followed by plasmid extraction as described above.

To validate the results of the high-throughput allele-editing screen, 6 genes with cancer hotspot mutations were selected. For allele-editing validation plasmid cloning, oligonucleotides containing cancer hotspot mutation-derived wildtype and mutant target sites, also referred to as wildtype and mutant target, were synthesized (IDT) and annealed, followed by insertion into BsmBI-linearized plasmid LentiEGFP-P2A-Puro. Meanwhile, 6 gRNAs were designed to target each selected cancer hotspot mutation, including one perfectly matched, two truncated, and three effective mismatched gRNAs based on the screen results. LentiCRISPR-V2-BSD (Addgene plasmid #83480) was employed to insert each gRNA according to the protocol from Feng Zhang’s lab. The ligated products were extracted using QIAprep Spin Miniprep Kit and confirmed by Sanger sequencing.

All the primers, gRNA, and fragments used for plasmid construction are listed in Supplementary Data 7.

Cell culture

HEK293T cell line was purchased from ATCC (CRL-3216) and cultured in DMEM medium (Gibco) supplemented with 10% FBS (Sigma) and 1% penicillin-streptomycin (Gibco) at 37 °C with 5% CO₂. A monoclonal HEK293T cell line expressing doxycycline-inducible Cas9 (HEK293T-iCas9) was generated by infecting parental cells with pCW-Cas9 (Addgene plasmid #50661) lentivirus. All the cell lines were routinely tested for being free of mycoplasma contamination using MycoAlert™ Mycoplasma Detection Kit (Lonza #LT07-218).

Lentivirus production

HEK293T cells (5 × 10⁶) were seeded in a 10-cm tissue culture dish 1 day before transfection. On the day of transfection, 4 μg of lentiviral plasmid, 4 μg of psPAX2 (Addgene plasmid #12260), and 2 μg of pMD2.G (Addgene plasmid #12259) in 500 μl Opti-MEM (Gibco) were mixed with 25 μl X-tremeGene HP DNA transfection reagent (Roche #06366236001) and then added to the pre-seeded HEK293T cells. The culture medium was changed with the addition of 20 μM HEPES buffer 6 h after transfection. Lentiviral supernatant was harvested and filtered through a 0.45-μM syringe filter (Millipore) 48 h after transfection. Aliquots were frozen at −80 °C for later use.

CRISPR screen and deep sequencing

HEK293T-iCas9 cells were seeded in eighteen 10 cm tissue culture dishes (4 × 10⁶ cells per dish) with three independent biological replicates 1 day before transduction. On the day of transduction, the lentiviral plasmid library was transduced into the cells at an MOI of 0.3 in the presence of 8 μg/ml polybrene (Millipore). The infected cells were passaged to 15 cm tissue culture dishes 48 h after infection, selected with 20 μg/ml blasticidin (InvivoGen #ant-bl-1) for 3 days, and then incubated with doxycycline (Sigma #D9891) to induce Cas9 expression. Cells were passed every 2–3 days with at least 2 × 10⁷ cells for each passage to maintain the library diversity and harvested (2 × 10⁷ cells per replicate) on days 0, 3, 5, and 9 after Cas9 induction. Cell pellets were frozen at −80 °C for later genomic extraction using Blood & Cell Culture DNA Midi Kit (QIAGEN).

For the transduction of lentiviral plasmids with gRNA I/D and corresponding dual-target I/D, HEK293T-iCas9 cells were seeded in six-well plates (5 × 10⁴ cells per well) with two biological replicates following the same procedure as described above. In all, 2 × 10⁶ cells were harvested on day 9 after Cas9 induction for genomic extraction using QIAamp® DNA Mini Kit (QIAGEN).

After genomic extraction, the integrated target sequences were PCR amplified using NEBNext® Ultra™ II Q5® Master Mix (NEB #M0544) and prepared for deep sequencing by 2 rounds of PCR as described previously⁶⁴. The 1^st round PCR (16 cycles) was performed with 40 μg genomic DNA divided into eight 100 μl-PCR reactions to achieve 500-fold coverage over the library. To attach the Illumina adaptor and barcode sequences, the 2^nd round PCR (12 cycles) was conducted with 5 μl of the 1^st purified PCR product. The gel-purified samples were pooled and sequenced on Illumina Miseq, Hiseq 3000 or Nextseq 500 by 75-bp paired-end sequencing at MDACC-Smithville Next Generation Sequencing Core. All the primers applied for sequencing library preparation are provided in Supplementary Data 7.

Allele-specific editing validation

To validate the allele-editing results from the high-throughput screen, 2 rounds of transduction were performed. HEK293T cells were seeded in six-well plates (1 × 10⁵ cells per well) and infected with lentivirus of LentiEGFP-P2A-Puro harboring the wildtype or mutant target sequence related to the 6 selected cancer hotspot mutations. On day 4 after puromycin (InvivoGen #ant-pr-1) selection, the resulting HEK293T cells were re-seeded in six-well plates (1 × 10⁵ cells per well), followed by infection with lentivirus of LentiCRISPR-V2-BSD expressing perfectly matched, truncated or mismatched gRNAs. The infected cells from 2 biological replicates were passaged every 2–3 days and harvested (2 × 10⁶ cells) on day 9 after blasticidin selection for genomic extraction.

Genomic DNA was isolated from cell pellets with QIAamp® DNA Mini Kit (QIAGEN) according to the manufacturer’s protocol. In total, 2 μg genomic DNA of each sample was used to amplify the integrated wildtype or mutant target sequence with Q5 High-Fidelity DNA polymerase. PCR products were purified by gel purification and undergone Sanger sequencing at MDACC-Smithville Molecular Biology Core. The wildtype and mutant allele-editing frequencies by perfectly matched, truncated, and mismatched gRNAs were analyzed through a Sanger sequencing-based CRISPR analysis online tool Inference of CRISPR Edits (ICE, https://ice.synthego.com).

Sequencing data analysis

A computational pipeline was developed to process the generated sequencing data. We first combined the partially overlapped, paired-end reads into a single sequence using FLASH v1.2.11⁶⁵ with the options “-m 20, -M 150, -x 0.1, -allow-outies” (a minimum overlap of 20 bp, a maximum overlap of 150 bp, a maximum mismatch density of 0.1 and allow outie pairs). Next, we extracted the unique barcode 1 and 2 pairs from the merged reads and compared them to the barcodes of the designed constructs in the library. Sequencing reads were assigned to a specific construct if their barcodes were perfectly matched to the 10-nt of barcode 1 and last 10 bp of 15-nt of barcode 2 of the construct. The sequences between the two barcodes were then aligned to the designed sequences for indel calling using the Smith-Waterman algorithm⁶⁶ with gap open penalty of 8, mismatch penalty of 1, match score of 2, and mismatch score of −2. Finally, the sequencing reads were assigned to one of five different indel types, i.e., wt + wt, indel + wt, wt + indel, indel + indel and large deletion, based on the alignment results. Specifically, if a deletion larger than 32-bp was identified in the alignments, the read was assigned to the large deletion category. Otherwise, we adopted a 10-bp window around the cutting site to capture the potential CRISPR-induced editing at on-target and off-target sequence. We can then easily assign the reads into four other categories: indel + wt (indels detected at on-target site only), wt + indel (indels detected at off-target site only), indel + indel (indels detected at both on- and off-target sites), and wt + wt (indels detected at neither on- nor off-target sites). Of note, we only observed rare cases (~0.005%) where two cleavage sites are inverted in our dual-target design⁶⁷. This is consistent with previous report where inversion rate of small DNA fragments with size <100 bp is much lower compared to that of larger fragments of hundreds to thousands bps⁶⁸. Thus, here we did not include this inversion type in the classification of indel types. Distributions of the number of sequenced reads per gRNA-target pair to indicate the sequencing depths of the experiments with dual-target design are shown in Supplementary Fig. 17.

Estimation of off-on ratios

We developed a statistical method to infer the read counts of different cleavage states from the observed editing types of the sequencing data. Given a gRNA-target pair, we denote (C={left[{c}_{1},{c}_{2},{c}_{3},{c}_{4},{c}_{5}right]}^{T}) to represent the observations of read counts, where ({c}_{1}) to ({c}_{5}) correspond to the editing types of wt + wt, indel + wt, wt + indel, indel + indel, and large deletion, respectively. We denote (S={left[{s}_{1},{s}_{2},{s}_{3},{s}_{4}right]}^{T}) to represent the numbers of reads in the four cleavage states, where ({s}_{1}) to ({s}_{4}) correspond to the states of “no cleavage”, “cleavage at off-target only”, “cleavage at on-target only”, and “both cleavage”, respectively. We model the observation (C) to be:

$$C={P}^{T}S+varepsilon$$

(1)

where (P) is a (4times 5) stochastic matrix and (varepsilon) is a vector of Gaussian noise. An entry ({p}_{i,j}) ((i=1,2,3,4{;; j}=1,2,ldots ,5)) in the matrix (P) represents the conditional probability of observing the (j)th editing type given the (i)th cleavage state. The sum of each row in (P) is 1, i.e., ({sum }_{j=1}^{5}{p}_{i,j}=1), (forall i). Given the model in Fig. 1e, (P) can be written as:

$$P=left[begin{array}{cc}begin{array}{ccc}{p}_{1,1} & {p}_{1,2} & {p}_{1,3}\ ,0 & {p}_{2,2} & 0\ ,0 & 0 & {p}_{3,3}end{array} & begin{array}{cc}{p}_{1,4} & {p}_{1,5}\ 0 & 0\ 0 & {p}_{3,5}end{array}\ begin{array}{ccc}0, & ,0, & 0end{array} & begin{array}{cc}{p}_{4,4} & {p}_{4,5}end{array}end{array}right]$$

(2)

The parameters in (P) were estimated from the observations of read counts on the control dual-target sequences. The first row, denoted ({P}_{1}), represents the background noise and PCR artifacts, which were estimated based on the read distribution of the NTT + NTT (no cleavage) controls. The second row, denoted ({P}_{2}), was estimated based on the observations from the NGG + NTT controls. Depending on the gRNA cutting efficiency, the reads derived from the NGG + NTT controls are subject to a mixture of the “no cleavage” and “cleavage at off-target only” states. Given the estimated ({P}_{1}) and the read counts of the five editing types on the NGG + NTT controls, ({P}_{2}) can be explicitly estimated by solving linear equations. Similarly, the parameters in the third and fourth rows in (P) were computed from the observations on the NTT + NGG and NGG + NGG controls.

Given the stochastic matrix (P), we adopted a maximum-likelihood estimation (MLE) approach to compute (S). This can be achieved by minimizing ({Vert {P}^{T}S-CVert }^{2}), under the constraint of ({s}_{i}ge 0), (forall i). With the estimated reads for the four cleavage states, the off-on ratio of a gRNA-target pair was calculated as:

$$r=frac{{s}_{2}+{s}_{4}}{{s}_{3}+{s}_{4}}$$

(3)

Where s_2, s₃ and s₄ are the estimated read counts for “cleavage at off-target only”, “cleavage at on-target only”, and “both cleavage”, respectively. Considering that the calculation of off-on ratios is associated with large variation using a limited number of reads in the denominator, we filtered out gRNA-target pairs with less than 100 total edited reads at on-target site, i.e., ({s}_{3}+{s}_{4} < 100), resulting in a total number of 10,460 gRNA-target pairs with the estimated off-on ratios (Supplementary Data 1). When computing log off-on ratio, we added a small constant c to control the variation, i.e., ({{log }}left(r+cright)). The constant c is chosen to be 0.01 because the dynamic range of off-on ratio evaluation is approximately 0.02–1.

Estimation of GMT

We assume that the intrinsic sequence of the gRNA and the mismatch context between gRNA and target sequence jointly contribute to the overall off-target effects. To decompose these two factors, we collected the off-on ratios of 3897 1-mismatch targets corresponding to 1438 gRNAs in our dataset (Supplementary Data 4). There are 12 types of unmatched nucleotide pairs (A-C, A-G, …, U-C), and 20 different positions of the mismatches, resulting in 240 mismatch types in total. Suppose ({r}_{{ij}}) represents the off-on ratio of the (i)th gRNA with a mismatch of the (j)th type, we model ({r}_{{ij}}) to be the multiplication of mismatch-dependent effect ({m}_{j}) and gRNA-intrinsic mismatch tolerance ({g}_{i}). In a log scale, this can be written as:

$${{{{{rm{Log}}}}}}({r}_{{ij}}+c)={{log }}({m}_{j})+{{log }}({g}_{i})+varepsilon$$

(4)

where (varepsilon) is the Gaussian noise and c is the small constant for controlling the variation of log off-on ratio. We then applied a gradient descent algorithm to compute the MLE estimation of ({m}_{j}) and ({g}_{i}), which represents the mismatch-dependent effect and the gRNA-intrinsic mismatch tolerance, respectively. The mismatch-dependent effects for single-mismatches are presented as a 12 (times) 20 matrix, namely M1 matrix, in which each entry represents the effects of a nucleotide mismatch at a specific position of the target sequence. The estimated ({g}_{j}) was subsequently used to explore the sequence determinants of GMT.

Prediction of GMT

To predict GMT from a gRNA sequence, we took all the 1438 gRNAs with estimated GMT scores (({g}_{j})) to train a CNN model. For each gRNA, we tested two different encoding strategies, i.e., mononucleotide and dinucleotide, to vectorize the gRNA sequence as inputs (Supplementary Fig. 8). For the mononucleotide encoding, a 20-nt gRNA sequence was binarized into a 4 (times) 20 two-dimension array, with 0 s and 1 s indicating the absence or presence of 4 different nucleotides (A, T, C, G) at every single position. For the dinucleotide encoding, 20-nt gRNA sequence was binarized into a 16 (times) 19 two-dimension array, with 0 s and 1 s indicating the absence or presence of 16 different dinucleotides (AT, AC, AG, AA, TT, TA, TG, TC, CC, CA, CG, CT, GG, GA, GT, GC) at each of the constitutive positions along the gRNA sequence. A CNN regression model was then designed using Keras (https://keras.io/) with a TensorFlow backend engine, consisting of one convolution layer and one dense layer, terminating in a single neuron.

We compared two encoding methods for data vectorization with different settings of parameters including the size, shape, and number of convolution kernels using a five-fold cross-validation strategy. The performance was assessed by computing Spearman’s correlation between the predicted and observed GMT scores. Finally, we selected the dinucleotide inputs and CNN model with three kernels in the convolutional layer. We further tested our model on two independent datasets: the TTISS dataset, which includes 59 sgRNAs with genome-wide off-targets detected by TTISS for nine different Cas9 variants¹⁸; the CHANGE-seq dataset, which includes 110 gRNAs with genome-wide SpCas9 off-targets detected by CHANGE-seq¹⁹. For each gRNA, an overall off-on ratio was calculated as the sum of detected off-target reads divided by on-target reads. We predicted the GMT scores for all the gRNAs and compared the overall off-on ratios of gRNAs with high (top 25%) and low (bottom 25%) GMT scores. Statistical significance was measured using Mann–Whitney U test.

Estimation of combinatorial effects

We denote ({delta }_{{ij}}) as the position-dependent combinatorial effect between two mismatches that occur at the (i)th and the (j)th nucleotide relative to the PAM ((i,j=0,1,2,ldots ,20)). A value of ({delta }_{{ij}},)close to 1 suggests “independence” of the combination, whereas ({delta }_{{ij}},)close to 0 suggests a strong “epistasis-like” combinatorial effect. We estimated ({delta }_{{ij}}) from the off-on ratios of 2-MM and individual 1-MM targets, as specified below.

To obtain a sufficient number of data points for a robust estimation of ({delta }_{{ij}}), we consider the mismatches located in the (i)th and the (j)th nucleotide, as well as those in the adjacent locations. Suppose there are ({N}_{{ij}}) 2-MM targets with mismatches at positions (i, j), (i, j-1), (i, j+1), (i-1, j), and (i+1, j), we denote ({x}_{{ij}}^{k}) to be the off-on ratio of the (k)th 2-MM target in this group ((k=0,1,2,ldots ,{N}_{{ij}})) and model ({x}_{{ij}}^{k}) as:

$${x}_{{ij}}^{k}={y}_{{ij}}^{k}{z}_{{ij}}^{k}{delta }_{{ij}}+varepsilon$$

(5)

where ({y}_{{ij}}^{k}) and ({z}_{{ij}}^{k}) are the off-on ratios of the 1-MM targets corresponding to the 2-MM target in the library design, and (varepsilon) is the Gaussian noise. The MLE estimate of ({delta }_{{ij}}) can be explicitly computed as:

$$hat{{delta }_{{ij}}}=mathop{sum }limits_{k=1}^{{N}_{{ij}}}left({x}_{{ij}}^{k}{y}_{{ij}}^{k}{z}_{{ij}}^{k}right)bigg/mathop{sum }limits_{k=1}^{{N}_{{ij}}}{left({y}_{{ij}}^{k}{z}_{{ij}}^{k}right)}^{2}$$

(6)

The combinatorial effects for 2-position combinations were presented as a 20 × 20 matrix, namely M2 matrix, in which each entry represents the combinatorial effects between two mismatched positions.

To cross-reference the combinatorial effects derived from our data, we also calculated the relative co-occurrence score (RCS) that represents the observed frequency of two mismatches relative to random expectation based on CHANGE-seq dataset¹⁹. Assuming that there are n off-target sites detected in the genome, the RCS is defined as:

$${RCS}=,frac{{a}_{{ij}}* n}{{b}_{i}* {c}_{j}}$$

(7)

Where ({a}_{{ij}}) is the number of off-target sites harboring mismatches at both positions i and j, ({b}_{i}) is the number of target sites harboring mismatches at position i, and ({c}_{j}) is the number of target sites harboring mismatches at position j. We computed the RCS for all the gRNAs in the CHANGE-seq dataset and took the average RSC over the gRNAs to obtain the matrix in Fig. 3e.

To explore the combinatorial effect from a biophysical point of view, we used a previous kinetic model⁴⁰ to perform simulation and to estimate ({delta }_{{ij}}) from the simulated data. In brief, the cleavage rate can be directly calculated given the free energy gain of binding the PAM, the energy gain for extending the hybrid over a match, the cost associated with extending the hybrid over an isolated mismatch, as well as the cost of extending the hybrid over neighboring mismatches. As an illustration, we plotted the combinatorial effect calculated for a specific choice of these parameters with the gain in energy due to PAM binding of 5 ({k}_{{{{{{rm{B}}}}}}}T), the gain per correctly matched hybrid base pair of 0.25(,{k}_{{{{{{rm{B}}}}}}}T), the cost of a mismatch of 4 ({k}_{{{{{{rm{B}}}}}}}T) if it is isolated and 3 ({k}_{{{{{{rm{B}}}}}}}T) if it is added to an existing bubble (Fig. 3g).

Model-based off-target prediction with MOFF

In MOFF, we integrate three factors including the individual mismatch effect (IME), the combinatorial effect (CE), and the GMT effect.

To explain the MOFF model, we start with a gRNA g and a target with a single mismatch ({m}_{1}). The expected off-on ratio, S(g, m₁), is the multiply of IME and GMT, i.e.,

$$Sleft(g,,{m}_{1}right)={s}_{1}{s}_{{{{{{{mathrm{GMT}}}}}}}}$$

(8)

where ({s}_{1}) is the effect of ({m}_{1}) computed from the M1 matrix as described in “Estimation of GMT” section, and ({s}_{{{{{{{mathrm{GMT}}}}}}}}) is the GMT effect estimated from the dinucleotide CNN regression model as described in “Prediction of GMT” section.

Next, we consider two mismatches ({m}_{1}) and ({m}_{2}). The expected off-on ratio, S(g, m₁, m₂), is the multiply of the effects of individual single mismatches and the combinatorial effect:

$$Sleft(g,,{m}_{1},{m}_{2}right)=Sleft(g,,{m}_{1}right),Sleft(g,,{m}_{2}right){delta }_{12}={s}_{1}{s}_{2}{delta }_{12}{({s}_{{{{{{{mathrm{GMT}}}}}}}})}^{2}$$

(9)

where ({delta }_{12}) is the pairwise combinatorial effect with respect to the positions of ({m}_{1}) and ({m}_{2}), as computed from M2 matrix as described in “Estimation of Combinatorial Effects” section.

When three mismatches are considered, it is ideal to estimate the combinatorial effect of all three. Unfortunately, the number of possible combinations increases exponentially, which makes experimental estimation of parameters impractical. Here, we consider a sequential model to add the 3^rd mismatch, ({m}_{3}), to the model of two mismatches (Sleft(g,{m}_{1},{m}_{2}right)):

$$Sleft(g,,{m}_{1},{m}_{2},{m}_{3}right)=Sleft(g,,{m}_{1},{m}_{2}right)Sleft(g,,{m}_{3}right){delta }_{12,3}$$

(10)

where ({delta }_{{{{{mathrm{12,3}}}}}}) is the additional combinatorial effect and is modeled to be the geometric mean of pairwise combinatorial effects of ({delta }_{13}) and ({delta }_{23}).

Combining Eqs. (8)–(10), and the above definition of ({delta }_{{{{{mathrm{12,3}}}}}}), we have:

$$Sleft(g,,{m}_{1},{m}_{2},{m}_{3}right)={s}_{1}{s}_{2}{s}_{3}{delta }_{12}sqrt{{delta }_{13}{delta }_{23}}{({s}_{{GMT}})}^{3}$$

(11)

Note that ({m}_{1},{m}_{2},{m}_{3}) are indeed unordered. With a different order in the sequential model, (Sleft(g,{m}_{1},{m}_{2},{m}_{3}right)) can also be computed as ({s}_{1}{s}_{2}{s}_{3}{delta }_{13}sqrt{{delta }_{12}{delta }_{23}}{({s}_{{GMT}})}^{3}) or ({s}_{1}{s}_{2}{s}_{3}{delta }_{23}sqrt{{delta }_{12}{delta }_{13}}{({s}_{{GMT}})}^{3}). Therefore, we compute (Sleft(g,{m}_{1},{m}_{2},{m}_{3}right),)to be the geometric mean of the scores computed with all three possible orders in the sequential model, simplified as follows:

$$Sleft(g,,{m}_{1},{m}_{2},{m}_{3}right)={s}_{1}{s}_{2}{s}_{3}{({delta }_{12}{delta }_{13}{delta }_{23})}^{2/3}{({s}_{{GMT}})}^{3}$$

(12)

Finally, we extend the three-mismatch model to k mismatches using mathematical induction approach, and define the MOFF score to be the logarithm of predicted off-on ratio:

$${S}_{{MOFF}}=mathop{sum }limits_{i=1}^{k}{{log }}({s}_{i})+frac{2}{k}mathop{sum }limits_{i=1}^{k}mathop{sum }limits_{j=1}^{i-1}{{log }}({delta }_{{ij}})+k,{{log }}({s}_{{GMT}})$$

(13)

To assess the genome-wide specificity of a given gRNA, we first mapped the gRNA to the genome to search for potential off-target sites harboring up to 6 mismatches using CRISPRitz, a software for rapid and high-throughput in silico off-target site identification⁶³. Next, we defined a MOFF-aggregate score, which is the logarithm of the sum of the MOFF-target scores for all the potential off-target sites. We note that, the current version of MOFF only considers genomic sites with mismatches, but not indels, relative to the 20-nt crRNA sequence as the potential off-targets.

Evaluation of the models and feature importance analysis

To evaluate the performance of our model, we curated three independent testing datasets generated by three different platforms, i.e., GUIDE-seq¹², TTISS¹⁸, and CHANGE-seq¹⁹. GUIDE-seq dataset includes 348 detected off-target sites for 9 gRNAs, TTISS dataset contains 630 detected off-target sites across 59 gRNAs and CHANGE-seq dataset consists of 96,555 detected off-target sites corresponding to 109 gRNAs. For each gRNA-target pair, we measured the off-target effect as off-on ratio, which is calculated as the detected off-target reads divided by on-target reads. For each gRNA, we measured its genome-wide specificity as overall off-on ratio, which is calculated as the total detected off-target reads across the genome divided by on-target reads. We reasoned that the classification of off-target and untargeted sites is highly dependent on the sensitivity of each platform, where the off-target effects indeed take continuous values that may differ by several orders of magnitude. Thus, we adopted the Spearman correlations between measured and predicted off-on ratios for quantitative evaluations.

We further evaluated the importance of different features using Gini importance, which was implemented through the Random Forest Regressor module from the scikit-learn package in Python with default parameters. For MOFF-target, we consider three features: IME, CE, and GMT. For MOFF-aggregate, we consider two features: (1) the mismatch-dependent feature, which is the sum of the predicted mismatch-dependent effects (IME + CE) without considering GMT; (2) the GMT effect corresponding to each gRNA.

Comparison of off-target prediction methods

We compared the performance of MOFF to five representative off-target prediction methods: (1) Cutting Frequency Determination (CFD) score is the multiplication of single mismatch effects derived from a cleavage dataset targeting the coding sequence of the human CD33 gene in MOLM-13 cells⁹. For the implementation of CFD, we used the Supplementary Table 19 of their original publication which includes all the single-mismatch effects. (2) Elevation is a two-layer regression model where the first layer learns to predict the effects of single-mismatch and the second layer learns how to combine single-mismatch effects into a final score²⁴. The source codes of Elevation were downloaded from https://github.com/Microsoft/Elevation. (3) CNN_std is a deep learning model to predict off-target effects using a standard convolutional neuron network²⁷ and the source codes to implement CNN_std were downloaded from https://github.com/MichaelLinn/off_target_prediction. (4) CRISPR-Net is a more recent deep learning method using a recurrent convolutional network, which shows superior performance compared to other machine learning approaches²⁹. The source codes to implement CRISPR-Net were downloaded from https://codeocean.com/capsule/9553651/tree/v1. And (5) CRISPRoff is an approximate binding energy model for the Cas9-gRNA-DNA complex, which systematically combines the energy parameters obtained for RNA-RNA, DNA-DNA, and RNA-DNA duplexes³¹. The source codes for executing CRISPRoff were downloaded from https://github.com/RTH-tools/crisproff.

We adopted the same strategy to evaluate the performance of different methods as described in “Evaluation of the models and feature importance analysis” section. We note that aggregation models for Elevation and CRISPR-Net assigned different weights to off-targets sites occurring at genic or non-genic regions, which were trained on cell viability screen data of gRNAs targeting non-essential genes; however, cell viability is not a direct indication of DNA cleavage at off-target sites and the genomic features used for training are not associated with the sequence determinants for gRNA specificity. Therefore, the performance of their models degraded when applied to the datasets that directly measure DNA cleavage at off-target sites across the genome (Supplementary Fig. 18). For a fair comparison, we used logarithm of the sum of individual scores for all potential off-target sites identified by CRISPRitz⁶³ to predict gRNA specificity for all these methods.

Application to gRNA design

We collected Avana, GeCKO-v2, and Sanger CRISPR screen data^43,69 from the Depmap portal at https://depmap.org/portal/download. The efficiency of gRNAs was predicted using the DeepHF method⁴⁴, and the off-target effect of gRNAs was predicted using MOFF-aggregate. As the analysis involves the computation on tens of thousands of gRNAs, we configured the alignment input to allow up to five mismatches for computational efficiency. All the gRNAs were then classified into three categories: Low efficiency (DeepHF score < 0.6), High efficiency High off-target (DeepHF score > 0.6 and MOFF-aggregate > 1), and High efficiency Low off-target (DeepHF score > 0.6 and MOFF-aggregate < 1).

To compare the effective size of gRNAs within different categories, we compared the average log-fold change (LFC) of gRNAs targeting 1246 core essential genes and 758 non-essential genes⁷⁰ within each category using strictly standardized mean difference (SSMD), which is widely used for quality control in the high-throughput screen data. Higher SSMD values indicate that gRNAs can better discriminate essential and non-essential genes, therefore achieving greater effective size.

To test the cross-library reproducibility for different types of gRNAs, we compared gRNAs targeting 529 cell-specific essential genes from Avana data and Sanger data. Specifically, for gRNAs in different categories, we measured the Pearson correlation between averaged LFC of gRNAs targeting the same genes in Avana and Sanger data across different cell lines. Higher correlation indicates that the effects of gRNAs are more reproducible among different libraries. The cell-specific essential gene list was downloaded from the Depmap portal at https://ndownloader.figshare.com/files/27902064.

gRNA selection for allele-specific editing

Given the local DNA sequences of the wildtype and mutant alleles, we first searched for all the possible gRNAs followed by a PAM (NGG) motif targeting the DNA sequence of the mutant allele that harbors a single mutation compared to the wildtype allele. The selected gRNAs, which we termed seed gRNAs, are perfectly matched to the mutant allele, and have a single mismatch relative to the wildtype allele. Next, we introduced all possible single mismatches to the seed gRNAs to generate the candidate gRNAs that differ by 1-nt from the mutant and 2-nt from the wildtype allele. To select the best gRNAs from these candidates, we predicted the MOFF-target scores between the gRNAs and the targeted DNA sequence from the mutant and wildtype alleles, which indicate the sensitivity and selectivity of the gRNAs, respectively. In practice, we selected gRNAs that satisfy: (1) MOFF-target scores at mutant allele >0.5 to ensure high knockout efficiency, (2) the ratio of MOFF-target scores between wildtype allele and mutant allele <0.2, for high selectivity.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link

Vasiprak Blog