Machine learning modeling for solubility prediction of recombinant antibody fragment in four different E. coli strains

Protein expression

The expression of the scFv protein was assessed by SDS-PAGE in four E. coli strains before optimization. Western blotting with an anti-His-tag monoclonal antibody confirmed the expression of His-tagged scFv in all strains studied here (Fig. 2).

Predictive modeling and optimization methods

Response surface methodology modeling

Based on published data, four numerical factors (post-induction time, concentration of inducer, post-induction temperature, and optical cell density) and one categorical factor (strain) were selected for statistical optimization. As presented in Table 1, a five-level CCD with a total of 232 runs was employed (Supplementary Table 1). The dependent response (soluble production of scFv) was correlated with the independent numerical factors (coded values) in the different strains using the following predicted equations:

$$
\begin{aligned}
& \left( \text{Y} \right)^{0.5} : \\
& \text{BW}25113\left( \text{DE}3 \right) \\
& \text{Y} = -36.1443A + 137.399B - 6761.67C + 2608.58D \\
& \qquad - 0.501425AB + 17.572AC - 35.8366AD - 32.7466BC \\
& \qquad + 48.2716BD - 2588.53CD + 2.22388A^{2} - 2.01959B^{2} \\
& \qquad + 7078.58C^{2} - 1294.25D^{2} + 420.482
\end{aligned}
$$

$$
\begin{aligned}
& \text{Origami}\left( \text{DE}3 \right) \\
& \text{Y} = 209.604A - 40.9083B - 12140.1C - 91.7678D \\
& \qquad - 1.83904AB - 108.763AC - 16.7602AD + 167.013BC \\
& \qquad - 17.7474BD - 162.912CD - 1.43726A^{2} - 0.861316B^{2} \\
& \qquad + 5869.2C^{2} + 447.055D^{2} + 4861.88
\end{aligned}
$$

$$
\begin{aligned}
& \text{SHuffle T}7 \\
& \text{Y} = 173.319A + 120.981B + 13795C + 1830.34D - 1.46395AB \\
& \qquad - 34.9288AC - 66.2975AD - 129.442BC - 7.38606BD \\
& \qquad - 4836.21CD - 2.36028A^{2} + 0.178132B^{2} - 4977.67C^{2} \\
& \qquad + 1802.92D^{2} - 7034.69
\end{aligned}
$$

$$
\begin{aligned}
& \text{BL}21\left( \text{DE}3 \right) \\
& \text{Y} = -9.03435A - 2.64382B - 7902.75C - 687.916D + 1.13725AB \\
& \qquad - 134.612AC + 37.2579AD + 45.6254BC + 85.7176BD \\
& \qquad - 5129.02CD + 1.81334A^{2} - 1.73011B^{2} + 8359.54C^{2} \\
& \qquad + 853.879D^{2} + 4084.09
\end{aligned}
$$

In the above equations, Y denotes the response (soluble production of anti EpEX-scFv), and A, B, C, and D denote post-induction time, post-induction temperature, cell density before induction, and IPTG concentration, respectively.
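As a minimal sketch of how one of these second-order models can be evaluated, the Python snippet below stores the BW25113(DE3) coefficients exactly as printed above and sums the polynomial for given factor settings. The function name `predict_response` and the dictionary layout are illustrative assumptions, not part of the original analysis. The text does not make clear whether the printed coefficients apply to coded or actual factor levels, or whether the result is Y or its square root (the equations are headed by (Y)^0.5), so both points should be checked against the design software output before using such a function.

```python
# Coefficients of the BW25113(DE3) equation as printed in the text.
# Key convention (an assumption of this sketch): "" is the intercept,
# "A" a linear term, "AB" an interaction, "AA" the square of A.
BW25113_COEFFS = {
    "": 420.482,
    "A": -36.1443, "B": 137.399, "C": -6761.67, "D": 2608.58,
    "AB": -0.501425, "AC": 17.572, "AD": -35.8366,
    "BC": -32.7466, "BD": 48.2716, "CD": -2588.53,
    "AA": 2.22388, "BB": -2.01959, "CC": 7078.58, "DD": -1294.25,
}


def predict_response(coeffs, A, B, C, D):
    """Evaluate a full quadratic model in four factors.

    A: post-induction time, B: post-induction temperature,
    C: cell density before induction, D: IPTG concentration.
    """
    levels = {"A": A, "B": B, "C": C, "D": D}
    total = 0.0
    for term, coef in coeffs.items():
        value = coef
        # Multiply the coefficient by each factor named in the term key.
        for factor in term:
            value *= levels[factor]
        total += value
    return total
```

For example, `predict_response(BW25113_COEFFS, A=0, B=0, C=0, D=0)` returns the intercept, 420.482. The other three strains can be handled by supplying their own coefficient dictionaries in the same layout.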

According to the ANOVA results, the significant F value (15.78) together with the non-significant lack of fit indicates that the model is valid for predicting soluble production of scFv. The low p-value (Prob > F < 0.0001) of the model confirms its significance. An R² (coefficient of determination) of 0.950 implies that 95.0% of the variability in the response can be described by the model. Furthermore, the difference between the predicted R² (0.7487) and the adjusted R² (0.7906) is less than 0.2, which indicates reasonable agreement between the two. The plot in Supplementary Fig. S1 further supports this agreement. The accuracy and predictability of the selected model were also validated by the normal probability plot of the studentized residuals (Supplementary Fig. S1). Based on the ANOVA results, the proposed model fits the experimental data well and can therefore be used to navigate the design space (Table 2).

Table 2 Analysis of variance for the experimental results of the central-composite design for soluble production of anti EpEX-scFv.

As shown in Table 2, three linear terms (post-induction time (A), concentration of inducer (D) and strain (E)) were significant for soluble production of scFv, whereas post-induction temperature (B) and optical cell density (C) had no significant effect on scFv solubility. All interaction terms except temperature-optical cell density (BC) were significant, as shown by their p-values (less than 0.05). Two quadratic terms (A² and D²) were not significant according to Table 2. Overall, post-induction time largely affects soluble production of anti EpEX-scFv.

Two-dimensional graphs were used to study the significant interactive effects between pairs of independent variables (A and D (Fig. 3), A and B (Supplementary Fig. S2), A and C (Supplementary Fig. S3), B and D (Supplementary Fig. S4) and C and D (Supplementary Fig. S5)) in the different strains while keeping the other two numerical factors at their middle levels. Fig. 3 and Supplementary Figs. S2 and S3 show that increasing the post-induction time increased solubility in three strains, BW25113(DE3), Origami(DE3) and BL21(DE3), and decreased it in SHuffle T7. Moreover, increasing the concentration of inducer significantly decreased solubility in Origami(DE3) and SHuffle T7, and at similar post-induction times the decrease was more substantial in SHuffle T7 than in Origami(DE3) (Fig. 3). Increasing the temperature also had a negative effect on scFv solubility in Origami(DE3) (Supplementary Fig. S2). As illustrated in Supplementary Fig. S3, more soluble protein was obtained in BW25113(DE3) when protein production was induced at higher OD600 nm, whereas the amount of soluble scFv obtained in Origami(DE3) and SHuffle T7 was negatively affected by increasing the OD600 nm before induction. ANOVA also indicated a significant interaction between temperature and inducer concentration (p-value of 0.0017) (Table 2). As depicted in Supplementary Fig. S4, when post-induction time (A) and optical cell density (C) were kept constant at their middle values (16 and 0.7, respectively), raising the temperature increased solubility in BW25113(DE3) and SHuffle T7. In BW25113(DE3), increasing the IPTG concentration decreased the soluble fraction at lower temperature but had a positive effect on protein solubility at higher temperature.
The dependence of scFv solubility on OD600 nm before induction (C) and IPTG concentration (D), with post-induction time (A) and temperature (B) kept constant (16 h and 30 °C, respectively), is illustrated in Supplementary Fig. S5. According to this graph, at the higher IPTG concentration (0.8) an increase in OD600 nm decreased solubility in BL21(DE3) and SHuffle T7, whereas at the lower inducer concentration (0.4) increasing the OD600 nm enhanced protein solubility. Interestingly, Supplementary Fig. S5 also shows that increasing the OD600 nm at both IPTG concentration levels increased solubility in BW25113(DE3) and decreased it in Origami(DE3). The interactive effects between each independent numerical variable and strain type were studied while keeping the other three numerical factors at their middle levels. As depicted in Fig. 4 and confirmed by the ANOVA results, post-induction time was the most influential factor for soluble production of scFv in the four strains studied here.

Artificial neural network modeling

Artificial neural network (ANN) models can predict the behavior of nonlinear multivariate systems. A multilayer feed-forward neural network trained with a quasi-Newton algorithm was used in the present work. The same DoE used to build the RSM model was also employed to develop the ANN-based model. The experimental data were divided into three subsets, training, testing and validation (70%, 15% and 15% of the data, respectively) (Table 3). A small amount of noise was added to the data set and the weights were regularized to prevent overfitting of the training data and to produce smoother responses. The network topology determines the accuracy of the model prediction. To find the optimal ANN structure, the number of hidden layers (1–5) and the number of neurons (8–48) were varied. The input layer had 8 neurons, and the scaling layer was set to automatic with 8 neurons. For the perceptron layers, different architectures were investigated, and the best results were achieved with 15, 10 and 3 neurons in the first, second and last hidden layers, respectively. The activation function in all hidden layers was the hyperbolic tangent. The scaled outputs from the hidden layers were connected to an unscaling layer with one neuron to produce outputs in the original units. Moreover, model selection was carried out to obtain the network architecture with the best generalization. Finally, the performance of the developed network was examined based on the NRMSE and R² of the testing data. The fit of the model was confirmed by its overall R² of 0.87. The NRMSE value (0.288) also indicates a good prediction of outputs.

Table 3 The number and percentage of experimental data used for training, testing and validation in artificial neural network.
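The data split and network shape described above can be illustrated with a short, standard-library Python sketch. It only shows a random 70/15/15 partition of 232 runs and a forward pass through an 8-15-10-3-1 network with hyperbolic tangent hidden layers and a linear output neuron. The weights here are random and untrained, quasi-Newton training, input scaling, output unscaling, noise injection and regularization are all omitted, and the names `split_indices`, `init_params` and `forward` are hypothetical. The subset sizes produced by rounding are an assumption of this sketch; the actual counts are those given in Table 3.

```python
import math
import random

# Layer widths taken from the text: 8 inputs, hidden layers of 15, 10 and 3
# neurons, and a single output neuron.
LAYER_SIZES = [8, 15, 10, 3, 1]


def split_indices(n_samples, rng, fractions=(0.70, 0.15, 0.15)):
    """Shuffle run indices and split them into training/testing/validation."""
    order = list(range(n_samples))
    rng.shuffle(order)
    n_train = round(fractions[0] * n_samples)
    n_test = round(fractions[1] * n_samples)
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    validation = order[n_train + n_test:]  # whatever remains
    return train, test, validation


def init_params(rng, sizes=LAYER_SIZES):
    """Random (untrained) weights and zero biases, for illustration only."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Forward pass: tanh in every hidden layer, linear output neuron."""
    h = list(x)
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        h = z if i == last else [math.tanh(v) for v in z]
    return h[0]


rng = random.Random(0)
train_idx, test_idx, val_idx = split_indices(232, rng)
params = init_params(rng)
# One hypothetical, already scaled input row with 8 features.
y_demo = forward(params, [0.0, 0.5, -0.5, 1.0, 0.0, 0.0, 1.0, 0.0])
```

In practice the weights would be fitted on the training subset and the architecture chosen by comparing errors on the held-out subsets, as the text describes.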

Comparison of predictive capabilities and validation of the RSM and ANN-based models

In the current study, the effectiveness of the empirical models was statistically evaluated by comparing estimated and actual responses using R² and error analyses. A dataset of 145 data points was randomly selected from the total dataset. The experimental responses along with the predicted values for soluble production of scFv are given in Supplementary Table 2. For this random dataset, the R² values for the ANN and RSM models were 0.913 and 0.856, respectively, meaning that the models describe about 91% and 86% of the variation in the actual values. The NRMSE was higher for the RSM model (0.264) than for the ANN model (0.154), which means that the predictive capacity of the ANN model is higher than that of the RSM model. According to the comparative plot of predicted and actual values, the ANN model fitted the experimental responses with excellent accuracy, and the RSM-based prediction of soluble scFv yield deviated more than the ANN-based one (Fig. 5). To validate the models, the optimum conditions predicted by the RSM model (Table 4) were tested experimentally; densitometric analysis gave 112.4 mg/L for the soluble fraction, in good agreement with the predicted value of 97.9 mg/L. When the same variable levels were entered into the ANN model, the maximum predicted response was 106.1 mg/L, which was closer to the experimental result (112.4 mg/L) than the RSM prediction (97.9 mg/L). This reaffirms the higher accuracy of the ANN model.

Table 4 Optimum condition and strain for soluble production of anti EpEX-scFv.
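The comparison metrics used in this section can be written out as a short Python sketch. The text does not state how the NRMSE was normalized, so dividing the RMSE by the range of the observed values is an assumption made here; other conventions (mean or standard deviation) give different numbers. The function names are hypothetical.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def nrmse(actual, predicted):
    """RMSE divided by the range of the observed values (assumed convention)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (max(actual) - min(actual))


def relative_error(observed, predicted):
    """Absolute deviation of a prediction as a fraction of the observed value."""
    return abs(observed - predicted) / observed


# Validation values reported in the text (mg/L).
rsm_error = relative_error(112.4, 97.9)
ann_error = relative_error(112.4, 106.1)
```

With the reported validation values, the RSM prediction deviates from the experimental result by about 12.9% and the ANN prediction by about 5.6%, which matches the conclusion that the ANN prediction was closer.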
