Generation of large datasets for adenine and cytosine base editing via high-throughput screening with self-targeting libraries
To capture base editing outcomes of SpCas9 CBEs and ABEs across thousands of sites in a single experiment, we generated a pooled lentiviral library of constructs encoding unique 20-nt sgRNA spacers paired with their corresponding target sequences (20-nt protospacer and a downstream NGG PAM site) (Fig. 1a). Our library included 23,123 randomly generated target sequences and 5,171 disease-associated human loci with transition mutations, yielding a comprehensive and diverse dataset for machine learning (Supplementary Data 1). Oligonucleotides containing the sgRNAs and corresponding target sequences were synthesized in a pool and cloned into a lentiviral backbone containing an upstream U6 promoter and a puromycin resistance cassette. HEK293T cells were then transduced at 1000× coverage with a multiplicity of infection (MOI) of 0.5 and selected with puromycin. Next, cells were transfected with Tol2-compatible plasmids encoding blasticidin resistance and one of the four commonly used base editors: ABEmax (containing ecTadA7.10), CBE4max (containing rAPOBEC1), ABE8e (containing ecTadA-8e), and Target-AID (containing the AID ortholog PmCDA1) (Supplementary Fig. 1). Co-transfection with a Tol2 transposase plasmid allowed stable integration and prolonged expression of the base editors. After 10 days in culture, cells were harvested, and genomic DNA was collected for amplicon high-throughput sequencing (HTS) (Fig. 1b and see the ‘Methods’ section).


a The design of the self-targeting library was adapted from refs. 27,28,29,30. The lentiviral library contains the sgRNA expression cassette and the target locus on the same DNA molecule. The sgRNA (spacer and scaffold) is transcribed under the control of a U6 promoter and is designed to direct the base editor (nCas9-deaminase fusion) to the 20-nt sequence upstream of the protospacer adjacent motif (PAM). hU6 human U6 promoter, ef1α elongation factor 1α promoter, nCas9 nickase Cas9, sgRNA single-guide RNA, Puro puromycin selection marker. b Overview of library screening. c–f Base editor profiles for loci above mean editing efficiency for c ABEmax, d CBE4max, e ABE8e, and f Target-AID. Plots show the average efficiency of A-to-G or C-to-T base conversions at each position across the protospacer target sequence. The top horizontal bar illustrates the favored activity window of the respective deaminase. g–j Proportion of the different tri-nucleotide motifs for loci above mean editing efficiency for g ABEmax, h CBE4max, i ABE8e, and j Target-AID. The numbers of analyzed target sequences shown in c–j are as follows: n = 8,558 (ABEmax); 9,534 (CBE4max); 3,416 (ABE8e); and 10,177 (Target-AID).
We observed high consistency between both experimental replicates (Pearson’s r² = 0.88 (ABEmax), 0.86 (CBE4max), 0.92 (ABE8e), and 0.88 (Target-AID)) (Supplementary Fig. 2), indicating comprehensive and robust sampling of edited target sites. Mean base editing efficiencies (defined here as the fraction of mutant reads over all sampled reads of a target site) were 4.26% for ABEmax, 3.61% for CBE4max, 3.15% for ABE8e, and 3.13% for Target-AID (Supplementary Fig. 3). In line with previous studies, we observed maximum editing at position 6 (counting from the PAM-distal end) with ABEmax, CBE4max, and ABE8e, and at position 3 for Target-AID (Fig. 1c–f)7,8,9,10. Interestingly, the editing window of ABE8e was broader than that of ABEmax, and that of Target-AID was shifted PAM-distally compared to CBE4max (Fig. 1e, f). Analysis of the trinucleotide sequence context, moreover, confirmed that ecTadA7.10 of ABEmax and rAPOBEC1 of CBE4max have a preference for editing bases that are preceded by a T (Fig. 1g, h)10,11,12,13. ecTadA7.10 additionally shows an aversion to an upstream A and a preference for a downstream C. Notably, ecTadA-8e of ABE8e displayed a reduced sequence preference, although editing of bases preceded by an A was still largely disfavored (Fig. 1i). Compared to rAPOBEC1, PmCDA1 of Target-AID lacked the requirement of a preceding T for efficient editing, but motifs where the targeted base is followed by a C were disfavored (Fig. 1j).
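As a simple illustration of how such per-target and per-position efficiencies can be derived from amplicon sequencing reads, the sketch below computes the fraction of mutant reads over all sampled reads of a target site and the average conversion rate at each substrate position. This is a minimal example under assumed inputs (`reads`, `ref`) and hypothetical function names, not the analysis pipeline used in this study.

```python
def editing_efficiency(reads, ref, sub_from="A", sub_to="G"):
    """Fraction of reads carrying at least one sub_from->sub_to conversion."""
    mutant = sum(
        any(r == sub_from and q == sub_to for r, q in zip(ref, read))
        for read in reads
    )
    return mutant / len(reads) if reads else 0.0

def per_position_efficiency(reads, ref, sub_from="A", sub_to="G"):
    """Average conversion rate at each substrate position of the protospacer."""
    rates = {}
    for pos, base in enumerate(ref, start=1):  # position 1 = PAM-distal end
        if base == sub_from:
            converted = sum(read[pos - 1] == sub_to for read in reads)
            rates[pos] = converted / len(reads)
    return rates

# Toy usage: two of three reads carry an A-to-G conversion at position 6.
ref = "GTCACATGGAGTCCAAGTCA"
reads = ["GTCACGTGGAGTCCAAGTCA", "GTCACGTGGAGTCCAAGTCA", ref]
print(editing_efficiency(reads, ref))        # 0.667
print(per_position_efficiency(reads, ref))   # conversion rate per A position
```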
Development of BE-DICT, an attention-based deep learning model predicting base editing outcomes
Potentially predictive features of CRISPR/Cas9 sgRNA activity, such as the GC content and minimum Gibbs free energy of the sgRNA, did not influence base editing rates (Supplementary Fig. 4). This prompted us to utilize the comprehensive base editing data generated in the ABE and CBE target library screens for designing and training a machine learning model capable of predicting base editing outcomes at any given target site. We established BE-DICT (Base Editing preDICTion via attention-based deep learning), an attention-based deep learning algorithm that models and interprets the dependency of base editing on the protospacer target sequence. The model is based on multi-head self-attention inspired by the Transformer encoder architecture14. It takes the nucleotide sequence of the protospacer as input and computes the probability of editing for each target nucleotide as output (Fig. 2a). The formal descriptions of the model and the different computations involved are reported in Supplementary Notes 1–3. In short, BE-DICT encodes each base within the protospacer as a learned fixed-length vector representation and assigns it a weight (attention score). The training labels are dichotomous: bases with editing efficiencies at or above the mean were classified as edited, and bases below the mean as non-edited. The output is a probability score, reflecting the likelihood (between 0 and 1) with which a target base will be edited (C-to-T or A-to-G). To train and test the model, we included all target sequences with at least one classified base edit (8,558 for ABEmax; 9,534 for CBE4max; 3,416 for ABE8e; 10,177 for Target-AID). To reduce the tendency towards edited target sequences, which could result in an inherent bias of the prediction tool, we also added unedited target sequences at a ratio of 1:4 (Supplementary Data 1). For model training, we used ∼80% of the dataset and performed stratified random splits of the remaining sequences to generate an equal ratio (1:1) between the test and validation datasets. We repeated this process five times (denoted as runs), training and evaluating a separate model for each base editor in every run. BE-DICT performance was then assessed using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR). All four models achieved an AUC between 0.92 and 0.95 and an AUPR between 0.733 and 0.806 (Fig. 2b–e). Notably, at positions within the activity window with a balanced distribution of edited vs. unedited substrate bases, BE-DICT performed with significantly higher accuracy than a per-position majority class predictor, a baseline model that treats nucleotide conversion as a Bernoulli trial and uses maximum-likelihood estimation to compute the probability of editing success at each position (Fig. 2f–i, Supplementary Fig. 5).
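To make the three-block architecture concrete, the sketch below shows a minimal per-base attention classifier in PyTorch. It is a simplified illustration under assumed hyperparameters (embedding size, number of heads, a single encoder block) and is not the published BE-DICT implementation, which is formally described in Supplementary Notes 1–3.

```python
import torch
import torch.nn as nn

class PerBaseEditingModel(nn.Module):
    def __init__(self, seq_len=20, d_model=64, n_heads=4, d_ff=128):
        super().__init__()
        # (1) Embedding block: nucleotide identity plus protospacer position.
        self.nuc_emb = nn.Embedding(4, d_model)          # A, C, G, T
        self.pos_emb = nn.Embedding(seq_len, d_model)
        # (2) Encoder block: multi-head self-attention and a feed-forward
        #     network, each with a residual connection and layer normalization.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # (3) Output block: per-position classifier returning an editing
        #     probability for every protospacer position.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, tokens):              # tokens: (batch, seq_len) ints in 0-3
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.nuc_emb(tokens) + self.pos_emb(pos)
        attn_out, attn_weights = self.attn(x, x, x)   # weights ~ attention scores
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        probs = torch.sigmoid(self.classifier(x)).squeeze(-1)  # (batch, seq_len)
        return probs, attn_weights

# Usage on a dummy batch of two integer-encoded 20-nt protospacers.
model = PerBaseEditingModel()
dummy = torch.randint(0, 4, (2, 20))
probs, attn = model(dummy)   # probs[i, j]: predicted editing probability at position j
```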


a Design of an attention-based deep learning algorithm to predict base editing probabilities. Given a target sequence, the model returns a confidence score to predict the chance of target base conversions. The model has three main blocks: (1) An embedding block that embeds both the nucleotide and its corresponding position from a one-hot encoded representation to a dense fixed-length vector representation. (2) An encoder block that contains a self-attention layer (with multi-head support), layer normalization31 and residual connections, and a feed-forward network. (3) An output block that contains a position attention layer and a classifier layer. b–e The average AUC achieved across five runs (interpolated) for models trained on data from high-throughput base editing experiments. f–i Line plot of per-position accuracy of the trained models across five individual runs for each base editor in comparison to the accuracy of a majority class baseline predictor. Standard deviation is depicted as a band along the line plot.
BE-DICT can be utilized to predict editing efficiencies at endogenous loci and predominantly attends to bases flanking the target base
Base editing at endogenous loci may also be affected by protospacer sequence-independent factors, such as chromatin accessibility. We therefore tested the accuracy of BE-DICT in predicting base editing outcomes at 18 separate endogenous genomic loci for ABEmax and ABE8e, and 16 endogenous genomic loci for CBE4max and Target-AID. HEK293T cells were co-transfected with plasmids expressing the sgRNA and base editor, and genomic DNA was isolated after 4 days for targeted amplicon HTS analysis. Across all tested loci, we observed a strong correlation between experimental editing rates and the BE-DICT probability score (Pearson’s r = 0.78 for ABEmax, 0.68 for CBE4max, 0.57 for ABE8e, and 0.64 for Target-AID; Fig. 3a–d; Supplementary Data 2). Further validating our model, BE-DICT also accurately predicted base editing efficiencies from previously published experiments (Pearson’s r = 0.82 for ABEmax, 0.71 for CBE4max, 0.91 for ABE8e, and 0.76 for Target-AID; Supplementary Fig. 7; Source Data)8,15. These results demonstrate that the BE-DICT probability score can be used as a proxy to predict ABEmax and CBE4max editing efficiencies with high accuracy.


Endogenous genomic target sequences with at least two substrate nucleotides were targeted separately by co-transfection of the sgRNA and base editors a ABEmax, b CBE4max, c ABE8e, and d Target-AID. Heatmap shows the BE-DICT prediction score (green) and experimentally observed target base conversion (purple). Substrate bases for the respective base editor are outlined in bold. Pearson’s correlation (r) for all target bases is shown.
The attention-based BE-DICT model provides insights (attention scores) for each position within the protospacer with regard to the position’s influence on the editing outcome. These attention scores provide a proxy for identifying relevant motifs and sequence contexts for editing outcomes. Interestingly, we found that for all base editors (ABEmax, CBE4max, ABE8e, and Target-AID) BE-DICT attention was mainly focused on bases flanking the target base and on the target base position itself (Fig. 4a–d). In addition, we observed that base attention patterns were dependent on the position of the target base and occasionally consisted of complex gapped motifs rather than consecutive bases (Supplementary Fig. 6), underscoring the necessity of using machine learning for predicting base editing outcomes.
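As an illustration of how such per-position attention profiles can be aggregated, the sketch below averages the attention weights of all sequences predicted to be edited at a given position, using the simplified per-base model sketched above. The function name and the 0.5 decision threshold are hypothetical choices, not the published analysis code.

```python
import torch

def mean_attention_for_position(model, tokens, target_pos, threshold=0.5):
    """Mean attention profile of sequences predicted edited at `target_pos` (0-based)."""
    probs, attn = model(tokens)                    # attn: (batch, seq_len, seq_len)
    edited = probs[:, target_pos] >= threshold     # mask of predicted-edited sequences
    if edited.sum() == 0:
        return None
    # Row `target_pos` of the attention matrix describes how strongly that
    # position attends to every other position of the protospacer.
    return attn[edited, target_pos, :].mean(dim=0)  # (seq_len,)
```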


Attention weight is indicated by color from light to dark. Mean-aggregate attention scores of target sequences (BE-DICT test dataset) predicted to be edited at the respective position for a ABEmax, b CBE4max, c ABE8e, and d Target-AID are shown. The number of sequences used for the analysis of the respective target base position is indicated as n.
Development of the BE-DICT bystander module
Multiple A or C nucleotides within the editing window can lead to bystander base conversions. These are often undesired, in particular if they induce coding mutations in the targeted gene. Given that the BE-DICT per-base model predicts the ‘marginal probability’ of target base editing, i.e. a probability score for whether a single base will be edited, it does not directly predict the editing efficiency of a locus (i.e. it cannot predict co-occurrences of target base and bystander editing). Therefore, we next developed an extension module of BE-DICT that predicts the relative proportions of all different editing outcomes (combinations of target base and bystander transitions) per target locus (BE-DICT bystander module; Fig. 5a). The model is based on an encoder–decoder architecture (adapting the Transformer architecture used in the BE-DICT per-base model), which takes the nucleotide sequence of the protospacer as input and computes the probability of the different output sequences (i.e. probabilities for all combinations of sequences with target base and bystander transitions, as well as the probability of observing a wild-type sequence) (Fig. 5a). The formal description of the model is reported in Supplementary Notes 2 and 3. In short, an encoder module computes a vector representation for each nucleotide in the input protospacer sequence, and a decoder module with the same components as the encoder, except that its self-attention is masked and it contains an additional cross-attention layer, generates the output. The masked self-attention layer acts as an “autoregressive layer”, ensuring that only past information is used while computing the probability of the output. The cross-attention layer learns which parts of the input sequence are important when computing the vector representation of the nucleotides in the output sequence, subsequently allowing the model to compute the probability of each output sequence. For model training, we used the edited input sequences from the ABEmax, CBE4max, ABE8e, and Target-AID library screens that were already used to train and test the BE-DICT per-base model, and again partitioned them in an 8:1:1 ratio for training, testing, and validation. Unlike in the per-base BE-DICT model, however, the outcome is non-binary and represents the frequencies of all outcomes at the target site (the unedited read and the different edited outcomes) for a given input sequence (i.e. protospacer). The trained BE-DICT bystander module predicted various possible editing outcomes per target sequence, including combinations with multiple base conversions (Fig. 5b, c). Importantly, performance was reliable for all four base editors, as we achieved strong correlations between predicted and experimentally observed sequence proportions in the validation datasets (Pearson’s r = 0.86 for ABEmax, 0.94 for CBE4max, 0.66 for ABE8e, and 0.97 for Target-AID; Supplementary Fig. 8).
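The sketch below outlines such an encoder–decoder in simplified PyTorch code: the decoder scores a candidate outcome sequence autoregressively under a causal mask while cross-attending to the encoded protospacer. Hyperparameters, the start token, and the helper for computing an outcome’s log-probability are illustrative assumptions and not the published implementation (see Supplementary Notes 2 and 3).

```python
import torch
import torch.nn as nn

class BystanderModel(nn.Module):
    def __init__(self, seq_len=20, d_model=64, n_heads=4, d_ff=128):
        super().__init__()
        self.tok_emb = nn.Embedding(5, d_model)          # A, C, G, T, <start>
        self.pos_emb = nn.Embedding(seq_len, d_model)
        # Encoder: vector representation of each protospacer nucleotide.
        self.encoder = nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                                  batch_first=True)
        # Decoder: masked ("autoregressive") self-attention, cross-attention
        # over the encoder output, and a feed-forward network.
        self.decoder = nn.TransformerDecoderLayer(d_model, n_heads, d_ff,
                                                  batch_first=True)
        self.out = nn.Linear(d_model, 4)                 # next-nucleotide logits

    def embed(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.tok_emb(tokens) + self.pos_emb(pos)

    def forward(self, protospacer, outcome):
        # protospacer, outcome: (batch, seq_len) integer-encoded sequences.
        memory = self.encoder(self.embed(protospacer))
        # Shift the outcome right behind a <start> token so that position t
        # only sees outcome nucleotides 0..t-1 when predicting nucleotide t.
        start = torch.full((outcome.size(0), 1), 4, dtype=torch.long,
                           device=outcome.device)
        dec_in = torch.cat([start, outcome[:, :-1]], dim=1)
        L = dec_in.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"),
                                       device=outcome.device), diagonal=1)
        dec = self.decoder(self.embed(dec_in), memory, tgt_mask=causal)
        return self.out(dec)                             # (batch, seq_len, 4)

def outcome_log_probability(model, protospacer, outcome):
    """Log-probability of one candidate outcome sequence given the protospacer."""
    log_p = torch.log_softmax(model(protospacer, outcome), dim=-1)
    return log_p.gather(-1, outcome.unsqueeze(-1)).squeeze(-1).sum(dim=-1)
```

Scoring every candidate outcome in this way yields a distribution over edited and wild-type sequences for a given protospacer, which is the quantity the bystander module is trained to reproduce.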


a An extension of the BE-DICT algorithm enables the prediction of the frequency of reads with bystander mutations. b, c Graphical representations of predicted and experimentally observed allele frequencies for an ABEmax and a CBE4max target site. d Workflow for benchmarking BE-DICT against BE-Hive and DeepBaseEditor. e–h Performance evaluation of different machine learning models for ABE and CBE on prediction of the proportion of edited outcomes (e, f) and the prediction of all outcomes (g, h). Pearson’s correlation (r) was calculated by comparison of predicted versus measured base editing outcome proportions in the datasets published by Arbab et al.10 and Song et al.13, and in this study. The numbers of analyzed outcomes in the dataset from Arbab et al. are n = 7,743 (ABE edited outcomes), n = 7,537 (CBE edited outcomes), n = 9,008 (ABE all outcomes), and n = 8,895 (CBE all outcomes), arising from a total of 1,265 unique sequences for ABE and 1,358 sequences for CBE. The numbers of analyzed outcomes in the dataset from Song et al. are n = 1,767 (ABE edited outcomes), n = 2,332 (CBE edited outcomes), n = 2,204 (ABE all outcomes), and n = 2,807 (CBE all outcomes), arising from a total of 437 unique sequences for ABE and 475 sequences for CBE. The numbers of analyzed outcomes in the dataset from this study are n = 3,844 (ABE edited outcomes), n = 4,502 (CBE edited outcomes), n = 5,510 (ABE all outcomes), and n = 6,176 (CBE all outcomes), arising from a total of 1,667 unique sequences for ABE and 1,675 sequences for CBE.
Recently, two other machine learning models capable of predicting base editing outcomes have been developed: BE-Hive10, a deep conditional autoregressive model, and DeepBaseEditor13, which is based on a convolutional neural network framework with two hidden layers. In contrast to the BE-DICT bystander module, which directly predicts the proportions of all outcomes at the target locus, both models separately predict the proportions of edited outcomes and the overall editing efficiency of the target site, and combining both values is required to estimate the frequency of precise target base conversion without bystander mutations (Fig. 5d). Since BE-Hive and DeepBaseEditor have also been trained and applied on TadA7.10-ABE and APOBEC1-CBE datasets, we decided to compare their performance to our attention-based machine learning model. First, we benchmarked only the ability of the three models to predict the proportions of edited outcomes. To this end, we adapted the BE-DICT bystander model to calculate only the proportions of edited outcomes, comparable to the BE-Hive bystander and DeepBaseEditor proportion models. When applied to the high-throughput datasets of the three studies, all models achieved similarly good correlations with the experimentally observed values using Pearson’s correlation (Fig. 5e, f; Supplementary Fig. 10a, b) or Spearman’s correlation (Supplementary Fig. 9). Next, we compared the ability of the three models to predict the proportions of all outcomes (including the wild-type sequence) at a target locus. Again, predicted values correlated well with the experimentally observed values for all three models (Fig. 5g, h; Supplementary Fig. 10c, d). Interestingly, the performance of the three models was not substantially affected by differences in the experimental setups of the three datasets (Fig. 5e–h; Supplementary Fig. 10), suggesting that they can tolerate variations in experimental procedures between laboratories. Confirming this hypothesis, when BE-DICT was retrained on the ABE datasets of Song et al.13 (HT_ABE_Train), correlations between predicted and experimentally observed editing outcomes on the HT_ABE_Test dataset of Song et al. increased only incrementally to r = 0.94 (Supplementary Fig. 11). Altogether, we conclude that the three machine learning models operate robustly on different experimental datasets and with comparable accuracy.
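For illustration, the short example below shows, with hypothetical numbers, how the two predictions made by BE-Hive or DeepBaseEditor would be combined to estimate the absolute frequency of the precise edit without bystander mutations, as outlined in Fig. 5d.

```python
# Illustrative arithmetic with hypothetical values (not measured data).
overall_efficiency = 0.40      # predicted fraction of edited reads at the locus
precise_among_edited = 0.55    # predicted proportion of the precise outcome
                               # among all edited outcomes
precise_frequency = overall_efficiency * precise_among_edited   # = 0.22
# The BE-DICT bystander module instead predicts outcome proportions over all
# reads (including the wild-type sequence) in a single step, so no such
# combination of two separate predictions is needed.
```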

