
Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production

Chemicals, reagents, and media

The E. coli RL08ara (ref. 21) and CM24 (ref. 8) assay media used in this study have the same composition as Miller LB, except with 10 g/L peptone instead of 10 g/L tryptone. CM24 medium was supplemented with 1% w/v glucose and sterile filtered using a 2 µm filter. E. coli RL08ara assay medium was sterilized by autoclaving. Both media were adjusted to pH 7.0 prior to sterilization.

Individual fatty alcohol standards were prepared at a concentration of 100 mg/mL by dissolving alcohols ranging from C3 to C17 in 200 proof ethanol. Then, alcohols were mixed to make 10 mg/mL standards of even-chain alcohols (C4, C6, C8, C10, C12, C14, and C16) and odd-chain alcohols (C3, C5, C7, C9, C11, C13, C15, C17). All unique biological materials are available upon request.

Measuring in vivo fatty alcohol titers

We measured in vivo alcohol titers produced by each enzyme variant using gas chromatography (GC). Overnight cultures were started in LB + Kanamycin from individual transformant colonies, grown for 16–20 h, and diluted into 50 mL of E. coli RL08ara Assay Medium + Kanamycin in a 250 mL baffled shake flask to a final OD of about 0.01. The medium had a 20% (10 mL) dodecane overlay and was supplemented with 1 mL of 50% v/v glycerol. The cultures were grown at 37 °C and 250 rpm for 45 min, and we then induced protein expression by adding 500 µL of 100 mM IPTG (final concentration 100 µM IPTG). As a control, each batch also included blank cultures prepared by mixing media, dodecane, glycerol and antibiotic in the same amounts as the expression cultures, but without any cells added. The expression cultures were incubated for 18 h at 30 °C after induction.

Afterwards, we cooled the expression cultures on ice to prevent evaporation. Then, we added 150 µL of the 10 mg/mL odd-chain internal standard mixture to each culture flask and mixed vigorously to make an emulsion. Immediately after mixing, we transferred 5 mL of the emulsion to a glass centrifuge tube pre-loaded with 1 mL of n-hexanes. We vortexed the tubes for 20 s, shook them for 20 s, and vortexed for another 20 s. Then, we centrifuged the samples for about 10 min until the organic and aqueous layers separated, and transferred about 900 µL of the organic layer to a GC vial for analysis by GC-FID.

We analyzed all GC samples using a Shimadzu Model 2010 GC-FID system with an AOC-20i autosampler and a 60 m × 0.53 mm ID Stabilwax column (Restek 10658). The oven temperature program used to analyze samples from RL08ara and CM24 was based on Mehrer et al. (ref. 8) and was as follows: hold at 45 °C for 10 min, ramp to 250 °C at 12 °C/min, hold at 250 °C for 10 min. In some individual experiments we shortened the hold time. Each run included standards of the odd-chain internal standard mixture and the even-chain standard mixture to control for any changes in the retention times of the analytes. We estimated the concentration of each even-chain fatty alcohol from the two odd-chain internal standards that bracketed it: we averaged their areas (\(A_{i-1}\) and \(A_{i+1}\)) and their concentrations (\(C_{i-1}\) and \(C_{i+1}\)) to obtain a response factor, and used that factor to convert the area of the even-chain species (\(A_{i}\)) to its concentration in the original medium (\(C_{i}\)) per the following equation:

$$C_{i} = A_{i} \cdot \frac{\mathrm{avg}\left(C_{i-1}, C_{i+1}\right)}{\mathrm{avg}\left(A_{i-1}, A_{i+1}\right)}, \quad \left(i = 2, 4, 6, 8, 10, 12, 14, 16\right)$$

(1)
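As a worked illustration of Eq. (1), the bracketing calculation can be sketched in a few lines of Python. The function name and the peak areas below are hypothetical. The 30 mg/L internal standard concentration is an assumption obtained by referencing the 150 µL of 10 mg/mL standard to the 50 mL of medium and ignoring the overlay volume.

```python
def bracketed_concentration(area_analyte, area_lower, area_upper,
                            conc_lower, conc_upper):
    """Eq. (1): convert an even-chain peak area to a concentration using the
    two odd-chain internal standards that bracket it."""
    avg_conc = (conc_lower + conc_upper) / 2.0
    avg_area = (area_lower + area_upper) / 2.0
    if avg_area <= 0:
        raise ValueError("internal standard areas must be positive")
    return area_analyte * avg_conc / avg_area


# Hypothetical peak areas (arbitrary units) for C12, bracketed by C11 and C13.
# Assumed internal standard concentration: 30 mg/L for both standards.
c12 = bracketed_concentration(
    area_analyte=5000.0,
    area_lower=2400.0,   # C11 internal standard
    area_upper=2600.0,   # C13 internal standard
    conc_lower=30.0,
    conc_upper=30.0,
)
print(c12)  # 60.0 (mg/L)
```

Because both the areas and the concentrations of the bracketing standards are averaged, the response factor interpolates between neighboring chain lengths instead of relying on a single standard.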

Aerobic alcohol production in BL21(DE3)

We cloned the initial seed sample ACR chimeras into the pET28 backbone and transformed them into BL21(DE3). Cultures were started in LB + Kanamycin from individual transformant colonies and grown overnight for 16–20 h. We diluted the cultures 100-fold into 5 mL cultures of LB + Kanamycin in culture tubes. We grew the cultures for 2.5–3 h, measured the ODs, and then induced with 5 µL of 100 mM IPTG and incubated for 24 h at 20 °C with shaking at 250 rpm.

Following protein expression, we incubated the cultures on ice for 1.5–2.5 h. Nonanol (C9) and heptadecanol (C17) were used as internal standards: we prepared a solution of 5 µM nonanol and 5 µM heptadecanol in hexanes and added 1 mL of it to each 5 mL expression culture. We then vortexed the samples and centrifuged them (1000 × g for 10 min) to separate the phases. In total, 900 µL of the organic layer was extracted for analysis by GC-FID. Titers of fatty alcohols were determined using an external standard curve with standards of each of the even-chain fatty alcohols in hexanes, and dividing by the extraction ratio (5) to convert from the concentration in the organic phase to the original concentration in the media.

Anaerobic alcohol production in CM24

ACR chimeras were cloned into the pBTRCK plasmid backbone and transformed into CM24 along with seFadBA (g130, pACYC-seFadBA) and tdTER (g131, pTRC99A-tdTER-fdh) (ref. 8). We started overnight cultures from individual colonies in LB + Kanamycin + Carbenicillin + Chloramphenicol. The following day, after 16–20 h, 600 µL of each overnight culture was diluted in 30 mL of CM24 Assay Medium + Kanamycin + Carbenicillin + Chloramphenicol with a 20% (6 mL) dodecane overlay in a 50 mL serum vial, which was sealed. We grew the cultures for 2 h at 30 °C, and then induced by injecting 300 µL of 100 mM IPTG (for a final IPTG concentration of ~100 µM) through the septum with a needle. Cultures were then incubated at 30 °C for 48 h.

Following expression, we cooled the cultures on ice and added 180 µL of an internal standard mixture (the same fatty alcohol mixture used for quantitation of alcohols in RL08ara). We mixed the samples thoroughly and extracted 5 mL of the emulsion with 1 mL of hexane per the same protocol as RL08ara above.

Structural modeling and SCHEMA library design

We used the MODELLER (ref. 34) homology modeling software to build 100 models of each of the acyl-thioester reductase domains of MA-ACR, MB-ACR, and MT-ACR using the following PDB entries as templates: 3M1A-A, 3RKR-A, 3RIH-A, 3AFM-B, 3AFN-B, and 4BMV-A. We built a contact map by determining which pairwise amino acid contacts (defined as two amino acids with any atoms within 4.5 Å of each other) were present in each model, and weighted each contact by the percentage of models in which the contact was present.
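The contact weighting described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes each model has already been parsed into one coordinate array per residue, and the function name and toy coordinates are hypothetical.

```python
import numpy as np


def contact_frequencies(models, cutoff=4.5):
    """Return an (L, L) matrix whose entry [i, j] is the fraction of models in
    which residues i and j have any pair of atoms within `cutoff` angstroms.

    models: list of models; each model is a list of L arrays of shape
            (n_atoms, 3) holding the atom coordinates of one residue.
    """
    n_res = len(models[0])
    counts = np.zeros((n_res, n_res))
    for residues in models:
        for i in range(n_res):
            for j in range(i + 1, n_res):
                # All atom-atom distances between residue i and residue j.
                diff = residues[i][:, None, :] - residues[j][None, :, :]
                d_min = np.sqrt((diff ** 2).sum(axis=-1)).min()
                if d_min <= cutoff:
                    counts[i, j] += 1
                    counts[j, i] += 1
    return counts / len(models)


# Toy example: two "models" of a three-residue chain, one atom per residue.
model_a = [np.array([[0.0, 0.0, 0.0]]), np.array([[3.0, 0.0, 0.0]]),
           np.array([[10.0, 0.0, 0.0]])]
model_b = [np.array([[0.0, 0.0, 0.0]]), np.array([[6.0, 0.0, 0.0]]),
           np.array([[10.0, 0.0, 0.0]])]
freq = contact_frequencies([model_a, model_b])
print(freq[0, 1], freq[1, 2], freq[0, 2])  # 0.5 0.5 0.0
```

Each entry of the resulting matrix is a weight between 0 and 1, so contacts that appear in only a few of the 100 models contribute proportionally less.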

We determined the crossover between the aldehyde-reductase domain and the acyl-thioester reductase (ATR) domain by aligning the sequences of MA-ACR, MB-ACR, and MT-ACR and selecting a crossover point at the conserved LDPDL motif, ~350–360 residues from the N-termini. Then, we used SCHEMA-RASPP to determine 7 additional crossover locations within the ATR domain that were compatible with Golden Gate assembly.

Gene assembly and strain construction

All ATR enzymes tested were cloned into the pBTRCK plasmid backbone and transformed into E. coli RL08ara (ref. 21). We obtained the three natural parent sequences from prior studies (refs. 8, 9). We amplified the AHR and ATR domains of each of the natural sequences, as well as the plasmid backbone, by PCR using primers (Supplementary Table 7) that contained Golden Gate overhangs. We used Phusion Hot Start Flex 2X Master Mix (NEB) for all PCR reactions. Then, we used Golden Gate assembly to combine the pieces and synthesize the domain-shuffled variants. Golden Gate reactions were carried out using either commercial Golden Gate assembly mix (NEB) or an in-house mixture of components from NEB (T4 DNA ligase buffer, BsaI-HF v2 and T4 DNA ligase).

We designed plasmids containing each of the 24 blocks determined by RASPP such that each block was flanked by BsaI restriction sites. The plasmids were synthesized by Twist Bioscience. The blocks (including the BsaI sites) were amplified by PCR and cloned into a backbone vector harboring the AHR domain of MA-ACR by Golden Gate assembly. For sequences that we studied in vitro, we amplified the whole FAR sequence and used Golden Gate assembly to add the insert into a pET28 backbone.

Greedy algorithm to design an informative seed sample

We sought to identify the set of 20 chimera sequences that is maximally informative of the full chimera landscape. We quantify “informativeness” as the Gaussian mutual information I(S;L) between the chosen sequences S and the full landscape L. This mutual information simplifies to the Gaussian entropy H(S) because S is a subset of L. Entropy is a submodular set function and can therefore be efficiently optimized using a greedy algorithm.

We started with our three parent sequences and scanned over all possible chimera sequences \(s_{i}\) to determine which resulted in the largest Gaussian entropy \(H(S \cup \{s_{i}\})\). This top sequence was added to the chosen set of sequences S and the greedy sequence selection process was repeated until 20 sequences were chosen.
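The greedy procedure can be sketched as below. This is an illustration under stated assumptions, not the study's implementation: the covariance matrix K is built here from a linear kernel on random binary encodings with a small diagonal term for numerical stability, and the function names are hypothetical. The Gaussian entropy of a subset S with covariance \(K_{SS}\) is \(\tfrac{1}{2}\log\det(2\pi e\,K_{SS})\).

```python
import numpy as np


def gaussian_entropy(K, idx):
    """Differential entropy of a Gaussian restricted to the sequences in idx."""
    sub = K[np.ix_(idx, idx)]
    _, logdet = np.linalg.slogdet(sub)
    return 0.5 * (len(idx) * np.log(2.0 * np.pi * np.e) + logdet)


def greedy_select(K, start, n_total):
    """Starting from `start`, repeatedly add the candidate that gives the
    largest entropy H(S ∪ {s_i}) until n_total sequences are chosen."""
    chosen = list(start)
    while len(chosen) < n_total:
        remaining = [i for i in range(K.shape[0]) if i not in chosen]
        best = max(remaining, key=lambda i: gaussian_entropy(K, chosen + [i]))
        chosen.append(best)
    return chosen


# Toy landscape: 15 hypothetical sequences with random binary encodings.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(15, 8)).astype(float)
K = X @ X.T + 0.1 * np.eye(15)   # assumed kernel; the jitter keeps K invertible

# Indices 0-2 stand in for the three parents.
chosen = greedy_select(K, start=[0, 1, 2], n_total=6)
print(chosen)
```

Each step only requires evaluating the entropy of the current set plus one candidate, which is what makes the greedy search tractable over a large chimera library.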

Sequence-function machine learning

We modeled the sequence-function landscape using a combination of a Gaussian Naïve Bayes (GNB) classifier to distinguish inactive versus active sequences and Gaussian process (GP) regression to model a sequence’s fatty alcohol titer.

The active/inactive classifier was trained on chimera sequence-function data using scikit-learn’s Naïve Bayes classifier. We categorized sequences as active if their alcohol titer was above a certain threshold; otherwise, they were considered inactive. The amino acid sequences for each tested chimera were one-hot encoded and used as inputs for the classifier. The resulting model provides a prediction of the probability that a sequence is an active enzyme.
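A minimal sketch of this classification step is shown below, assuming scikit-learn's GaussianNB is the Naïve Bayes variant used. The toy sequences, titers, and the activity threshold are hypothetical placeholders and are not data from this study.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """One-hot encode an aligned amino acid sequence into a flat vector."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for pos, aa in enumerate(seq):
        x[pos, ALPHABET.index(aa)] = 1.0
    return x.ravel()


# Hypothetical aligned toy sequences and titers (mg/L).
seqs = ["ACDA", "ACDC", "GCDA", "GHDA", "GHKA", "AHKC"]
titers = np.array([12.0, 9.5, 0.2, 0.0, 0.1, 7.0])
THRESHOLD = 1.0  # hypothetical activity cutoff

X = np.array([one_hot(s) for s in seqs])
y = (titers > THRESHOLD).astype(int)   # 1 = active, 0 = inactive

clf = GaussianNB().fit(X, y)
# Column 1 of predict_proba corresponds to class 1 (active).
p_active = clf.predict_proba(X)[:, 1]
print(np.round(p_active, 3))
```

In practice the probabilities would be computed for untested sequences, and only those predicted to be active would be passed on to the regression model.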

We also trained a GP regression model on the active sequences’ fatty alcohol titers. Our GP regression model used a homogeneous linear kernel to define the covariance between pairs of sequences

$$k_{i,j} = \sigma^{2}\, \mathbf{x}_{i} \cdot \mathbf{x}_{j}$$

(2)

where \(\sigma^{2}\) is a tunable variance hyperparameter, and \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\) are the encodings for sequences i and j, respectively. The Hamming kernel one-hot encoded each amino acid option at each sequence position, while our structure kernel one-hot encoded amino acid combinations at each residue-residue pair that was contacting in the three-dimensional structure. We calculated the GP’s posterior mean and variance following Algorithm 2.1 in Rasmussen & Williams (ref. 35; Supplementary Method 1).
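For readers unfamiliar with Algorithm 2.1, a minimal NumPy sketch of the Cholesky-based posterior with the linear kernel of Eq. (2) is given below. The observation-noise variance (`noise`, fixed at 1 here) is an assumption that the text does not specify, and the function name and toy data are hypothetical.

```python
import numpy as np


def gp_posterior(X_train, y_train, X_test, sigma2, noise=1.0):
    """Posterior mean and variance of a GP with kernel k = sigma2 * x_i . x_j,
    computed through a Cholesky factorization of the training covariance."""
    K = sigma2 * X_train @ X_train.T               # train-train covariance
    K_s = sigma2 * X_train @ X_test.T              # train-test covariance
    k_ss = sigma2 * np.sum(X_test * X_test, axis=1)  # prior test variances

    L = np.linalg.cholesky(K + noise * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = k_ss - np.sum(v * v, axis=0)
    return mean, var


# Toy encodings and titers (hypothetical values).
X_train = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]], dtype=float)
y_train = np.array([1.0, 2.0, 0.5, 1.5])
X_test = np.array([[1, 1, 1], [0, 1, 0]], dtype=float)

mean, var = gp_posterior(X_train, y_train, X_test, sigma2=2.0)
print(mean, var)
```

The Cholesky route avoids forming an explicit matrix inverse, which is the main numerical advantage of this formulation.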

We used leave-one-out cross-validation to scan variance (\(\sigma^{2}\)) hyperparameter values ranging from \(10^{-6}\) to \(10^{5}\) and selected values that maximized the correlation coefficient and minimized the mean squared error (Supplementary Fig. 6). When these two objectives could not be realized simultaneously, we chose \(\sigma^{2}\) values that balanced them. We then used the chosen \(\sigma^{2}\) value to fit the GP model on all the data and predict the activities of all untested sequences that the GNB classifier labeled as active.
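The hyperparameter scan can be sketched as follows. This is an illustrative outline only: the grid spacing (one value per decade), the fixed unit noise variance, the function names, and the synthetic data are all assumptions.

```python
import numpy as np


def loo_predictions(X, y, sigma2, noise=1.0):
    """Leave-one-out GP mean predictions with kernel sigma2 * x_i . x_j."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        K = sigma2 * X[keep] @ X[keep].T + noise * np.eye(int(keep.sum()))
        k_star = sigma2 * X[keep] @ X[i]
        preds[i] = k_star @ np.linalg.solve(K, y[keep])
    return preds


def scan_sigma2(X, y, grid):
    """Return (sigma2, Pearson r, MSE) for each value in the grid."""
    results = []
    for s2 in grid:
        p = loo_predictions(X, y, s2)
        r = np.corrcoef(p, y)[0, 1]
        mse = float(np.mean((p - y) ** 2))
        results.append((float(s2), float(r), mse))
    return results


# Synthetic data: 12 hypothetical sequences with 6 binary features.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(12, 6)).astype(float)
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=12)

grid = np.logspace(-6, 5, 12)   # 1e-6 ... 1e5, one value per decade
results = scan_sigma2(X, y, grid)
best_by_mse = min(results, key=lambda t: t[2])
print(best_by_mse)
```

Reporting both the correlation and the mean squared error for every grid value makes it straightforward to pick a compromise when the two criteria favor different values.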

Upper-confidence bound optimization

We utilized UCB optimization to select informative sequences to build and test for the next round. For UCB rounds 2–10, we trained the active/inactive GNB classifier and the alcohol titer GP regression model on all prior data. We then applied the GNB and GP models to make functional predictions over all untested chimeras. We chose a panel of sequences to test using a “batch mode” UCB selection strategy36, while excluding any sequences that were predicted to be inactive from the GNB classifier. We first chose the sequence that maximized the GP upper confidence bound (mean + one standard deviation). This is the UCB optimal sequence. We then retrained the GP model with the assumption that the UCB optimal sequence’s true titer was equal to its predicted titer. We then recalculated the UCBs and chose the new UCB optimal sequence. This process enables selection of multiple UCB optimal sequences per round, and it was repeated until 10–12 sequences were chosen per batch. The details of each round of UCB optimization can be found in Supplementary Table 4.
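The batch selection loop can be sketched as below. This is a simplified illustration under assumptions, not the study's code: the GP uses the linear kernel with fixed, arbitrary hyperparameters, the active/inactive mask stands in for the GNB predictions, and all names and data are hypothetical.

```python
import numpy as np


def gp_mean_std(X_tr, y_tr, X_te, sigma2=1.0, noise=1.0):
    """Posterior mean and standard deviation for a linear-kernel GP."""
    A = sigma2 * X_tr @ X_tr.T + noise * np.eye(len(X_tr))
    K_s = sigma2 * X_tr @ X_te.T
    mean = K_s.T @ np.linalg.solve(A, y_tr)
    var = sigma2 * np.sum(X_te ** 2, axis=1) - np.sum(
        K_s * np.linalg.solve(A, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def batch_ucb(X_tr, y_tr, X_cand, predicted_active, batch_size):
    """Pick a batch by repeatedly taking the UCB-optimal candidate and then
    retraining as if its true titer equaled its predicted titer."""
    X_tr, y_tr = X_tr.copy(), y_tr.copy()
    available = [i for i in range(len(X_cand)) if predicted_active[i]]
    batch = []
    while available and len(batch) < batch_size:
        mean, std = gp_mean_std(X_tr, y_tr, X_cand[available])
        best = int(np.argmax(mean + std))       # UCB = mean + one std
        pick = available.pop(best)
        batch.append(pick)
        X_tr = np.vstack([X_tr, X_cand[pick]])  # add the "believed" observation
        y_tr = np.append(y_tr, mean[best])
    return batch


rng = np.random.default_rng(2)
X_tr = rng.integers(0, 2, size=(5, 4)).astype(float)
y_tr = rng.normal(size=5)
X_cand = rng.integers(0, 2, size=(10, 4)).astype(float)
mask = np.array([True] * 7 + [False] * 3)   # stand-in for GNB predictions

batch = batch_ucb(X_tr, y_tr, X_cand, mask, batch_size=4)
print(batch)
```

Adding the predicted titer as a pseudo-observation leaves the posterior mean unchanged but shrinks the variance near the chosen sequence, which steers later picks in the batch toward different regions of sequence space.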

The first UCB round was performed slightly differently than the others because we were still refining our method. For the first UCB round, we trained GP regression models on alcohol titers from both BL21(DE3) and CM24 strains. We chose sequences that maximized the sum of the BL21(DE3) and CM24 UCB scores and selected a panel of ten chimeras using the batch mode UCB approach described above.

Measuring in vivo enzyme expression levels using SDS-PAGE

To verify that increases in fatty alcohol titers were due to enzyme activity, we performed additional characterization of the protein expression levels for the parents and selected chimeras. To estimate the expression level of the ATR enzyme, we performed additional replicates using the same expression conditions as were used during UCB optimization. Then, after extracting the fatty alcohols, we suspended the remaining 5 mL pellet in 1 mL of media. We normalized the ODs of the suspensions to an OD of 10 and pelleted and froze 500 µL of the OD 10 culture. We later thawed the frozen pellets and lysed them using 250 µL lysis buffer (3872 µL 100 mM Tris pH 7.4, 120 µL Bugbuster, 4 µL lysozyme and 4 µL DNAse I).

We prepared a standard curve using dilutions of purified MA-ACR. We added 3 µL of each MA-ACR dilution to 12 µL of SDS master mix (which consisted of 5 parts 2X SDS mix and 1 part 1 M DTT) and mixed them in a 1:1 ratio (volume:volume) with empty vector lysate. The other lysates were mixed with 2X SDS buffer and 3 µL 100 mM Tris pH 7.4 (to ensure equal volumes of lysate between the standards and the samples). We heat denatured the lysates (at 85 °C for 2–5 min) and analyzed them by SDS-PAGE.

We used FIJI, an image analysis package (ref. 37), to estimate the intensities of the ATR band in the MA-ACR standards and generate a standard curve (Supplementary Fig. 3). We made new standard curves for each replicate to reduce gel-to-gel variability, and only compared samples to standards on the same PAGE gel. Expression levels are reported as µg/mL of ATR (at an OD of 20).

Biosynthesis of fatty acyl-ACP substrates

We synthesized the acyl-ACP substrates by functionalizing purified E. coli ACP with a 4ʹ-phosphopantetheine arm using a phosphopantetheinyl transferase (SfP) from Bacillus subtilis, and then attaching the acyl chain to the thiol end of the arm using the acyl-ACP synthetase (AasS) from Vibrio harveyi (ref. 38).

Expression of V. harveyi AasS, B. subtilis SfP and E. coli ACP

The enzymes needed to produce palmitoyl-ACP were expressed using the method in Hernández-Lozada et al. (ref. 39) with some minor modifications. The cells were grown for 2 h at 37 °C (200 rpm) and then induced with 1 mM IPTG (final concentration), omitting the culture-cooling step used in Hernández-Lozada et al. AasS and SfP were expressed at 18 °C overnight (18–24 h), and ACP was expressed at 20 °C overnight (18–24 h); cells were then harvested by centrifugation. We also purified the proteins using the method from Hernández-Lozada et al.; however, rather than using dialysis, we used Amicon filter columns to carry out buffer exchange. The final concentrations of the proteins were determined using Bradford assays.

Functionalization of E. coli ACP

To cleave the His-tag from the apo-ACP, we added 700 µL of 2.1 mg/mL TEV protease to the 4 mL ACP solution. The reaction was incubated overnight (16–20 h) at 20 °C with shaking at 250 rpm. At the conclusion of the digestion, we stored the mixture in 50% glycerol at −80 °C. Later, to purify the cleaved apo-ACP, we thawed the digestion and ran it over parallel gravity columns packed with Nickel Sepharose Fast Flow resin. We pooled the flow-through and buffer exchanged it into 50 mM Na2HPO4 pH 8 + 10% glycerol using an Amicon filter unit (MWCO 3 kDa). The concentration of the cleaved apo-ACP was determined by a Bradford assay.

The conditions for the reactions to generate holo-ACP were: 500 µM apo-ACP, 5 µM SfP, 5 mM Coenzyme A, and 10 mM MgCl2 in 100 mM Na2HPO4 pH 8. The reactions were run in 500 µL aliquots in 1.5 mL Eppendorf tubes, which were shaken in a beaker at 37 °C for 1 h.

We dissolved sodium palmitate in water heated to 65 °C to a concentration of 100 mM. After the holo-ACP reactions were finished, we added palmitate, ATP, and AasS to the reaction mixture (along with enough buffer to double the reaction volume) to give final concentrations of 5 mM palmitate, 5 µM AasS and 10 mM ATP. The reactions were incubated overnight (16–20 h) at 37 °C. Then, we pooled the reactions and purified the palmitoyl-ACP by running the mixture through a gravity column packed with Nickel Sepharose Fast Flow resin. We buffer exchanged the purified palmitoyl-ACP into 100 mM Na2HPO4 + 10% glycerol pH 8.

Purification of ATRs

We expressed parental ATRs (A-AAAAAAAA, A-BBBBBBBB, and A-TTTTTTTT) and purified them per the same method as E. coli ACP, except for the buffer exchange step. We buffer exchanged them into 20 mM Tris, 50 mM NaCl pH 7 using an Amicon filter unit (30 kDa MWCO). Then, we added glycerol to the proteins (about 15% v/v for the three parents). We expressed ATR-83 at 30 °C rather than 20 °C but purified it in the same manner, though we added more glycerol to the purified ATR-83 (final concentration ~50% v/v glycerol). We determined the concentration of the enzymes by Bradford assays.

In vitro enzyme kinetics on palmitoyl-ACP and palmitoyl-CoA

We determined the activity of the above ATRs in a 96-well plate-based assay using 5,5′-dithiobis(2-nitrobenzoic acid) (DTNB) to monitor the conversion of palmitoyl-ACP to hexadecanol and free holo-ACP, measuring the absorbance at 412 nm. We tested palmitoyl-ACP concentrations up to 40 µM, as this concentration should be within the physiological range in cells (ref. 40). Reactions contained 1 µM of the respective ATR and 200 µM NADPH in 20 mM Tris + 50 mM NaCl pH 7, in a total reaction volume of 100 µL. The concentration of DTNB was 250–252 µM (the difference is due to slightly different preparations of the NADPH/DTNB master mix on different dates).

To gauge activity of the ATRs on CoAs in vitro, we carried out reactions using palmitoyl-CoA as a substrate. The in vitro assay used to determine CoA activity was identical to that used for ACP activity above.

Computational docking and analysis of interfacial charge

We used the RosettaDock (refs. 30, 41) application to perform local docking simulations to dock a structure of palmitoyl-ACP (from PDB entry 6DFL) to MA-ACR. We did not include the acyl chain in the docking simulations. We ran 1000 docking simulations and selected a model based on minimizing the total energy and the interface score. Then, using PyMOL, we determined which residues in the model of MA-ACR were within a 10 Å radius of the ACP molecule. The number of charged residues within that radius was then determined, and the net interface charge was defined as the number of positive residues minus the number of negative residues.
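The net interface charge calculation reduces to a simple count once the residues within 10 Å of ACP have been listed. The sketch below assumes that arginine and lysine are counted as positive, aspartate and glutamate as negative, and histidine as neutral; the text does not state how histidine was treated, and the residue list is hypothetical.

```python
POSITIVE = {"ARG", "LYS"}   # assumption: histidine is treated as neutral
NEGATIVE = {"ASP", "GLU"}


def net_interface_charge(residue_names):
    """Number of positive residues minus number of negative residues."""
    pos = sum(1 for r in residue_names if r in POSITIVE)
    neg = sum(1 for r in residue_names if r in NEGATIVE)
    return pos - neg


# Hypothetical three-letter codes of residues within 10 Å of the docked ACP.
interface = ["ARG", "LYS", "GLU", "ALA", "LYS", "ASP", "SER"]
charge = net_interface_charge(interface)
print(charge)  # 1  (3 positive - 2 negative)
```

Since ACP is an acidic protein, a more positive value from this count would be expected to correspond to a more electrostatically complementary interface.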

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
