Preloader

Machine learning-coupled combinatorial mutagenesis enables resource-efficient engineering of CRISPR-Cas9 genome editor activities

Generation of data input for MLDE

We used the previously published SpCas9 data8 surveying the on-target activity of Sg5 (650 empirical datapoints) and Sg8 (729 empirical datapoints) that target a red fluorescent protein (RFP) sequence as the input data for the MLDE package. Specifically, we used the dataset of on- and off-target activities measured from the same CombiSEAL library of SpCas9 variants engineered at 8 amino-acid positions (R661, Q695, K848, E923, T924, Q926, K1003, and R1060) that interact with the sgRNA–DNA complex. The on-target activity was measured in screens where the sgRNA and target site have perfectly matched protospacer sequences, while off-target activity was measured where the sgRNA is targeting a site that bears an artificially introduced synonymous mutation. We removed the extreme outliner by setting the minimal enrichment score (E-score) as −2 for both Sg5 and Sg8 before the min–max normalized to the scaled fitness score ranging between 0 and 1. First, we isolated 20% of the entire library (190 SpCas9 variants) as test data; among these selected variants, 122 of Sg5 and 136 of Sg8 have empirical measurements. We then generated input training datasets that do not overlap with the test data. The training datasets consist of 5%, 10%, 20%, 50%, and 70% of randomly drawn empirical measurements to test the minimal input for effective selection of top variants from MLDE prediction, corresponding to datasets of 33, 65, 130, 325, and 445 empirically measured Sg5 on-target activity and 37, 73, 146, 365, and 510 Sg8 on-target activity measurements. We generated three replicates for each size, subjected to either randomized or diverse selection schemes for variants. To generate the randomized dataset, we used the sample_n() function from dplyr in R to randomly select the predefined number of E-scores. Taking the above-mentioned 20% of the entire library as nonoverlapping test data allows the 70% randomly selected data not being the same for all three replicates, while maximizing the variant numbers for evaluation. Using the same method, we prepared the datasets of Sg5 off-target activities for MLDE. First, we withheld 190 variants as the test set for the off-target activities. Then we randomly sampled the remaining variants that consist of 5%, 10%, 20%, 50%, and 70% of the library to generate 3 replicates of training datasets for each size, corresponding to 41, 83, 165, 414, and 579 empirical measurements of Sg5 off-target activity. The off-target activity was derived from min–max normalized E-score after setting a lower bound of −2.5.

We sought to increase the sequence diversity of the input training data by restricting the presence of variants, whose share only 1 or 2 mismatches (mutations apart) with each other, to the minimum, given the data size. To do so, we kept randomly sampling variants to size N with available E-scores until each variant sharing no more than p1-mismatch neighbors and q2-mismatch neighbors in the subgroup N. The thresholds p and q for each dataset are listed in Supplementary Data 1. As N increases in size, it is more difficult to remove the 1- and 2-mismatch neighbors in the input dataset and the overall sequence diversity of the diverse dataset becomes similar to that of the randomly subsampled input datasets. Consequently, we generated diverse datasets of four different sizes for Sg5 and Sg8 on-target activity that correspond to approximately 5, 10, 20, and 50% of the empirical data. Because the 50% training datasets showed the same level of diversity as randomly selected data, we only run MLDE on 5, 10, and 20% datasets. The resultant diversity of the dataset, summarized as the total number of pairwise sequences with N mismatches, is listed in Supplementary Data 1.

We also used the dataset with a total of 58 SpCas9 variants bearing rational substitutions at five positions located in the PI domain that had their activities on noncanonical NGN PAMs assessed by HT-PAMDA30. The on-target activity of the variant against 4 sgRNAs representing NGAT, NGCC, NGGG, and NGTA PAMs was min–max normalized in the training data. To avoid having too few variants in the test set given the small dataset size, we withheld 29 variants (50% of the library) as test data and performed MLDE with combinations of Bepler and Georgiev and modeling parameters p1 and p2 to predict on-target activity predictions using 10, 20, 25, and 50% input (empirical data of 29 and 15 variants).

The in-house SaCas9 dataset consists of 1,296 variants that were constructed and tested in this study. Substitutions on 8 amino-acid positions (887, 888, 889, 985, 986, 988, 989, 991) that are widely scattered over the WED and PI domains were rationally chosen based on protein-structure analyses (see Supplementary Data 3 for details). The SaCas9 variants’ on-target activities against sgRNA 1, 2, and 3 were measured as the E-score derived from the high-throughput fluorescent protein disruption assay. We again withheld 20% of the empirical data (260 variants) as the test set. From the remaining variants, we generated 3 replicates of randomly selected datasets that consisted of 65, 130, 260, 648, and 907 variants that correspond to 5, 10, 20, 50, and 70% of the full library as training data for MLDE. We run MLDE using the training data of different sizes and evaluate the MLDE performance using the test-set variants.

We run MLDE according to the default parameters. Briefly, we applied the Bepler and Georgiev embedding of the full-length amino-acid sequences of SpCas9 (UniProtKB—Q99ZW2 (CAS9_STRP1)) [https://www.uniprot.org/uniprot/Q99ZW2] and SaCas9 (UniProtKB—J7RUA5 (CAS9_STAAU)) [https://www.uniprot.org/uniprot/J7RUA5] substituted with the designated variant’s amino-acid residue combination. We modified the MLDE GenerateEncodings.py so that it processed a customized input fasta file containing the protein sequences of all the variants designed in the SpCas9 as well as the SaCas9 dataset rather than generating the full set of saturated mutagenesis variants. We run the MLDE ExecuteMlde.py with default parameters on the Bepler and the Georgiev embeddings and with two different sets of parameters. We assigned them as parameters 1 and 2: parameter 1 used the neural network models “NOHidden”, “OneHidden”, “TwoHidden”, “OneConv”, and “TwoConv” available in MLDE, each with 20 rounds of hyperparameter optimization, and parameter 2 used less complex models “Linear-Tweedie”, “RandomForestRegressor”, “LinearSVR”, and “ElasticNet”, each with 50 rounds of hyperparameter optimization.

We evaluated the performance of the ML algorithm parameters and embeddings with precision, specificity, and sensitivity, using the withheld test-set variants. More specifically, we assign variants with at least 70% of the wild-type activity as positives and the rest as negatives. Thus, true positives are variants with at least 70% activity of the wild type when empirically tested with the sgRNA. Otherwise, they are true negatives. For each MLDE result, we also labeled the positives and negatives using the 70% wild-type activity threshold. We then counted the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each result and derived the performance metrics according to the formula stated below:

$${{{{{{rm{specificity}}}}}}}=frac{{{{{{rm{TN}}}}}}}{{{{{{rm{TN}}}}}}+{{{{{rm{FP}}}}}}}$$

(1)

$${{{{{{rm{sensitivity}}}}}}}=frac{{{{{{rm{TP}}}}}}}{{{{{{rm{TP}}}}}}+{{{{{rm{FN}}}}}}}$$

(2)

$${{{{{{rm{Precision}}}}}}}=frac{{{{{{rm{TP}}}}}}}{{{{{{rm{TP}}}}}}+{{{{{rm{FP}}}}}}}$$

(3)

We also applied another performance metric, enrichment, proposed by Sarfati et al.42. The enrichment reports the ratio of identifying true top 5% of hits when using the ML prediction to random selection (the null background)

$${{{{{{rm{Enrichment}}}}}}}={I}_{5}^{{{{{{{rm{prediction}}}}}}}}/{I}_{5}^{{{{{{{rm{random}}}}}}}}=frac{400cdot {I}_{5}^{{{{{{{rm{prediction}}}}}}}}}{N},$$

(4)

where N is the total size of the test set, here N is the number of all the variants in the prediction. Enrichment provides us with an estimate of identifying high-fitness variants when we select the top-5% variants by predicted fitness for downstream experimental validation. When the larger fraction of highest-fitness variants is captured in the top-5% prediction, enrichment increases from 1.

Finally, NDCG is also calculated to evaluate how well the model identifies top-ranking variants

$${{{{{{rm{NDCG}}}}}}}=left(mathop{sum }limits_{i=1}^{N}frac{{f}_{{{{{{{rm{rel}}}}}}; i}}}{{{{log }}}_{2}left({{{{{{rm{rel}}}}}}; i}+1right)}right)/left(mathop{sum }limits_{i=1}^{N}frac{{f}_{i}}{{{{log }}}_{2}left(i+1right)}right),$$

(5)

where f is the true fitness value of the variant, i is the true rank (from highest to lowest fitness), and rel i is the predicted rank from the model. NDCG compares the predicted ranking to the actual ranking, aligns with the goal of MLDE to identify high-fitness variants as top-ranking variants24,25. If the predicted ranking and the actual ranking is identical, NDCG reaches its maximum value of 1. Models that misidentify low-fitness variants as top-ranking ones would result in low NDCG.

The input data handling, statistical analyses, and graph plottings are carried out in R v4.1.2 using packages ggplot2 v3.3.5, tidyverse 1.3.1, readxl, Cairo, and stringdist.

Plasmid construction

The plasmids generated from this study (Supplementary Data 9) were done with standard molecular cloning techniques such as PCR, restriction-enzyme digestion, ligation, one-pot ligation, and Gibson assembly. Customized oligonucleotides were ordered through Genewiz. Vectors were transformed into E. coli strain DH5α-competent cells and selected with ampicillin (100 mg/ml, USB) or carbenicillin (50 mg/ml, Teknova). DNA was extracted and purified by Plasmid Mini (Takara and Tiangen) or Midi preparation (QIAGEN) kits. Sequences of the vectors were verified with Sanger sequencing.

Storage vectors AWp28 (Addgene #73850) and AWp112 were used to assemble the sgRNA chosen to target a specific gene. The sgRNA sequences used are listed in Supplementary Data 10. Oligonucleotide pairs of the sgRNA target sequences with BbsI sticky ends were synthesized, annealed, and cloned into the BbsI-digested storage vector using T4 DNA ligase (New England Biolabs). To prepare the lentiviral vector for SaCas9 variant expression, AWp124 vector was modified via Gibson assembly to remove all existing Esp3I enzyme sites. Esp3I sites were then reintroduced flanking the PI and WED regions to incorporate the intended mutations, giving the DTp2 vector. To insert the sgRNA expression cassette, they were amplified from the storage vector with flanking BamHI and EcoRI (Thermo Fisher Scientific) sites to and ligated with the digested lentiviral vector DTp2. To generate the PI and WED mutations, oligonucleotides with the WED-domain mutations were pooled in a 1:1 ratio as the forward primer, and the same was applied with the PI domain for the reverse primer. PCR amplifications were done using these pooled forward and reverse primers with the original KKH-SaCas9 template to create the pooled mutations. Using a one-pot ligation method, the pooled mutations were inserted into the Esp3I sites of DTp2. The EFS promoter drives the SaCas9 expression, together with a fluorescent protein expression from the downstream T2A-BFP. To create SaCas9-KKH-SAV2-plus (DTp47A), we incorporated the Esp3I sites similarly done with DTp2 into SaCas9-KKH-SAV2 (DTp52) via Gibson assembly, and then with one-pot ligation inserted the ‘plus’ mutations that are the N888R/A889Q. All plasmids created in this study are available from the authors.

Cell culture and transduction

HEK293T cells obtained from American Type Culture Collection (ATCC), and MHCC97L-Luc cells gifted by S. Ma (School of Biomedical Sciences, The University of Hong Kong), were maintained in Dulbecco’s Modified Eagle Medium (DMEM) supplemented with 1× antibiotic–antimycotic and 10% FBS (Thermo Fisher Scientific). OVCAR8-ADR cells gifted by T. Ochiya (Japanese National Cancer Center Research Institute, Japan), were maintained in RPMI 1640 medium supplemented with 10% FBS (Gibco). The HEK293T cells were used for lentiviral production for KKH-SaCas9 variant expression and for generating stable cell lines. OVCAR8-ADR cells were transduced with a pAWp9 vector (Addgene #73851) expressing RFP and GFP gene, driven by the hUbCp and CMV promoters, respectively, for the initial screening of KKH-SaCas9 pooled variants and for further validation. OVCAR8-ADR cells were also transduced with lentiviruses encoding RFP and GFP genes expressed from UBC and CMV promoters, respectively, and a tandem U6 promoter-driven expression cassette of sgRNA targeting the GFP site. For the initial screening, the KKH-SaCas9 variants were expressed with sgRNA targeting GFP using EFS and U6 promoters, respectively, followed by a T2A-BFP to determine KKH-SaCas9 expression. The cells were sorted with a Becton Dickinson BD Influx cell sorter. With the mutational screening, the KKH-SaCas9-selected variants were transduced into the stable OVCAR8-ADR cell lines harboring the GFP, RFP genes, and sgRNA. The MHCC97L-Luc cell lines were transduced to create the stable expression of the selected KKH-SaCas9 variants for the T7E1 and GUIDE-seq experiments. The cells were regularly tested and showed negative for mycoplasma contamination. Lentivirus production and transduction were carried out as previously described8.

Fluorescent protein disruption assay

Fluorescent protein disruption assays were conducted to determine DNA cleavage and indel-mediated disruption at the target site of the fluorescent protein, GFP, by the KKH-SaCas9 variants with the gRNA expressions, resulting in loss of cell fluorescence. The stable cell lines integrated with the GFP and RFP reporter gene, expressing the SaCas9 variants and sgRNA, were washed, resuspended with 1× PBS supplemented with 2% heat-inactivated FBS, and analyzed with Becton Dickinson LSR Fortessa Analyzer or ACEA NovoCyte Quanteon. Cells were gated on forward and side scatter, and at least 1 × 104 cells were recorded per sample for each dataset. FlowJo v10.7 was used to analyze data generated from flow cytometry experiments.

Immunoblot analysis

Immunoblots were carried out as previously described8. Anti-SaCas9 (1:1000, Cell Signaling #85687) and anti-GAPDH (1:5000, Cell Signaling #2118) primary antibodies were used, followed by HRP-linked anti-mouse IgG (1:10,000, Cell Signaling #7076) and HRP-linked anti-rabbit IgG (1:20,000, Cell Signaling #7074) secondary antibodies. The unprocessed scan of the immunoblots is available in the Source Data file.

T7 endonuclease-I assay

T7 endonuclease-I assay was performed as previously described to quantify the Cas9-induced mutagenesis in endogenous loci8. The targeted loci were amplified from 15 to 30 ng of genomic DNA extracted using dNeasy Blood and Tissue Kit (QIAGEN) using the primers as listed in Supplementary Data 11. Quantification was based on relative band intensities measured using ImageJ. Editing efficiency was estimated by the formula

$$100times (1-{(1-(b+c)/(a+b+c))}^{1/2}),$$

(6)

as previously described43, where a is the integrated intensity of the uncleaved PCR product, and b and c are the integrated intensities of each cleavage product.

GUIDE-seq

GUIDE-seq was performed as previously described8. Approximately 1.6 million MHCC97L cells stably expressing the KKH-SaCas9 variants were transduced with sgRNAs. After 72 h, electroporation was conducted according to the manufacturer’s protocol using 1100 pmol freshly annealed end-protected dsODN with 100 μl Neon tips (Thermo Fisher Scientific). The dsODN oligonucleotides used were 5′-P-G*T*TTAATTGAGTTGTCATATGTTAATAACGGT*A*T-3′ and 5′-P-A*T*ACCGTTATTAACATATGACAACTCAATTAA*A*C-3′, where P represents 5′ phosphorylation and asterisks indicate a phosphorothioate linkage. Electroporation voltage, width, and the number of pulses were 1100 V, 20 ms, and 3 pulses, respectively. Cells were harvested at day 7 post transduction of the sgRNA. Genomic DNA was extracted using dNeasy Blood and Tissue Kit (QIAGEN) according to the manufacturer’s protocol. The gDNA collected for the SaCas9 variant and the sgRNA were sequenced on Illumina NextSeq System and analyzed with GUIDE-seq software44.

Deep sequencing

Deep sequencing was carried out as previously described45. The same gDNA samples used for the T7E1 assays were amplified for the region of edit and sent for deep sequencing. About ~0.6 million reads per sample on average were used to evaluate the genomic diversity of the >10,000 cells. HEK293T cells were infected with sgRNAs and then transfected with KKH-SaCas9-derived BE4max editor, together with a fluorescent protein expression from the downstream T2A-BFP. The cells harboring both base editor and sgRNA were sorted with a cell sorter based on fluorescence. Amplicons harboring the targeted endogenous loci were generated by PCR. About ~0.2 million reads per sample on average were used to evaluate the genomic diversity of the >10,000 cells. Crispresso246 with default setting was used to quantify indels and base editing outcomes from the deep-sequencing data.

Molecular modeling

Molecular dynamic simulations were conducted on the variants using DynaMut31. The variant mutations were singly inputted into the webserver, and the structural outputs were then aligned with the crystal structure of SaCas9 (PDB: 5CZZ) [https://www.rcsb.org/structure/5CZZ] on PyMol. The predicted rotamer of the mutations as indicated by DynaMut was then used to replace the amino-acid positions on the SaCas9 crystal structure. The predicted interactions determined by DynaMut and Pymol were then indicated on the crystal structure to provide a putative representation of the SaCas9 variants.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link