
Identification of stress response proteins through fusion of machine learning models and statistical paradigms

This section explains the proposed methodology, which follows a stepwise approach. Each step of the model is illustrated in Fig. 1. First, a robust, diverse, and homology-restricted dataset is accumulated. Next, a comprehensive feature extraction methodology is formulated to ensure that all the crucial, obscure features needed to decipher the attributes of each protein are extracted. Machine learning models are then trained on the obtained feature vectors, and the accuracy of each model is evaluated to identify the most accurate one. Finally, the best-performing model is integrated into a web server for public use.

Figure 1. Proposed stepwise approach.

Benchmark dataset collection

The protein sequences of the benchmark dataset were extracted from the well-known UniProt database13. Reviewed stress response protein sequences of different organisms (cell lines) were retrieved using the UniProt keyword "Stress response" (KW-0346); the query returned 7092 reviewed positive protein sequences. The inverse query returned 7500 reviewed negative protein sequences. The data contained protein sequences in FASTA format, each composed of amino acid letters, and the sequences were of non-uniform length. However, this raw data might contain homologous sequences. To address this, redundancy was removed from the dataset using the CD-HIT suite. A cutoff value of 0.7 was set to form clusters of homologous sequences with similarity greater than or equal to 70%, as reported in14. This produced 6140 clusters from the positive samples and 7163 clusters from the negative samples, and a representative protein sequence was taken from each cluster. The ratio of positive to negative protein sequences is illustrated in Fig. 2.

Figure 2. Ratio of positive and negative datasets.

Sample encoding

Machine-learning algorithms learn from numeric representations of data rather than from raw sequences, as expounded in15,16. Thus, to represent sequences as vectors, the pseudo amino acid composition (PseAAC) was proposed17. The idea of PseAAC is popular in bioinformatics research18,19 and has been used in numerous biomedicine and drug development studies20,21 as well as in other disciplines of computational proteomics. An extensive list of references is provided in a survey paper22. Since it has been widely and progressively utilized, several open-access tools23,24 were developed to implement different PseAAC-based feature extraction methods. Inspired by the achievements of PseAAC-based feature extraction, multiple predictors were proposed by researchers25,26. Specific components of protein groups have been utilized to enable vector encoding of samples so that they can be processed by machine learning algorithms. These vector encoding techniques are used in various genomic research contributions as illuminated in13, including a robust web server called Pse-In-One27. Both protein/peptide samples and protein groups can be encoded into a fixed-length feature vector as8

$$\mathrm{L}_{\xi=7}(\mathrm{I})=\left[\Psi_{1}\;\Psi_{2}\;\cdots\;\Psi_{u}\;\cdots\;\Psi_{\Omega}\right]^{\mathrm{D}}$$

The components \(\Psi_{j}\) (j = 1, 2, …, Ω) incorporate the properties of the protein's sequence. We encoded the PseAAC values using the simplest encoding, as shown in Table 1.

Table 1 Amino acid encodings for feature extraction.

For feature extraction, we used measurable statistical moments to capture the characteristics of each sequence, as discussed in the following sections.

Statistical moments

Statistical moments were used in this study for feature extraction. Statistical moments describe different aspects of a dataset and are defined by their moment functions and distribution polynomials. Specific moments were incorporated, namely raw, central, and Hahn moments28,29. Raw moments are location- and scale-variant; they are used to calculate the mean and to determine the asymmetry of the dataset based on its probability distribution. Central moments are location-invariant because they are calculated about the centroid; they provide information regarding the mean, variance, and distribution of the data around the mean30,31. Hahn moments, based on Hahn polynomials, capture variation in size and position. All these moments provide significant information about the sequence order and composition of the data32. The benefit of selecting these measurable moments is that they uncover sensitive hidden patterns in the peptide sequence33. Some of these moments, especially the Hahn moments, require a two-dimensional square matrix as input. For this purpose, each one-dimensional protein sequence is mapped onto a two-dimensional array. A square matrix, L′, is formed, which can be expressed as

$$L^{\prime}=\left(\begin{array}{ccc}c_{11}& \cdots & c_{1h}\\ \vdots & \ddots & \vdots \\ c_{g1}& \cdots & c_{gh}\end{array}\right)$$

where each \(c_{ij}\) is an amino acid residue. The square root of the length of each sequence is computed and ceiled, and a square matrix of that dimension is formed. Each element of the square matrix is then filled, sequentially, with the residues of the respective sequence. Moments up to degree 3 were computed; using the components of L′, the raw moments are determined as

$$G_{xy}=\sum_{l=1}^{h}\sum_{n=1}^{h} l^{x}\, n^{y}\, \beta_{ln}$$

where (x + y) is the degree of the moment; the moments up to degree 3 are G00, G10, G20, G30, G01, G11, G21, G02, G12, and G03. Further, the central moments are determined as

$$H_{xy}=\sum_{l=1}^{h}\sum_{n=1}^{h}(l-\bar{x})^{x}(n-\bar{y})^{y}\beta_{ln}$$

where \(\bar{x}\) and \(\bar{y}\) are the centroid coordinates of the data, obtained from the first-order raw moments.

Furthermore, discrete Hahn moments are defined for two-dimensional data and require a square matrix as input. Computing Hahn moments does not alter the information content of the data; owing to their orthogonality, Hahn moments are reversible and the sequence can be regenerated using the inverse Hahn transform. These moments capture both the composition and the relative positioning of amino acid residues. Hahn moments were computed through

$$N_{h}^{z,t}(j,A)=(A+t-1)_{h}\,(A-1)_{h}\sum_{i=0}^{h}(-1)^{i}\frac{(-h)_{i}(-j)_{i}(2A+z+t-h-1)_{i}}{(A+t-1)_{i}(A-1)_{i}}\frac{1}{i!}$$

where the definition of the operators used is discussed in detail by Akmal et al.34. The orthogonal normalized Hahn moments for 2D data are further computed as

$$E_{xy}=\sum_{n=0}^{H-1}\sum_{l=0}^{H-1}\beta_{ln}\,\tilde{e}_{x}^{\,z,t}(n,H)\,\tilde{e}_{y}^{\,z,t}(l,H),\qquad x,y=0,1,\ldots,H-1$$

Thus, all three types of moments were computed using their respective equations. Each type yielded 10 moments (up to degree 3), so a total of 30 moments were computed for each sample.
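As an illustration only, the sketch below maps a sequence onto the square matrix L′ and computes the raw and central moments up to degree 3. The function and variable names are ours, the numeric residue encoding is assumed to follow Table 1, and the Hahn moments (computed analogously from the Hahn polynomials) are omitted for brevity.

```python
import math
import numpy as np

def sequence_to_square_matrix(sequence, encoding):
    """Map a 1-D protein sequence onto the 2-D square matrix L' described above."""
    side = math.ceil(math.sqrt(len(sequence)))        # ceiled square root of sequence length
    matrix = np.zeros((side, side), dtype=float)
    for idx, residue in enumerate(sequence):
        row, col = divmod(idx, side)                  # fill the matrix row by row, sequentially
        matrix[row, col] = encoding.get(residue, 0)   # numeric code of the residue (Table 1)
    return matrix

def raw_and_central_moments(matrix, max_degree=3):
    """Compute the 10 raw (G_xy) and 10 central (H_xy) moments with x + y <= max_degree."""
    h = matrix.shape[0]
    l = np.arange(1, h + 1).reshape(-1, 1)            # row indices
    n = np.arange(1, h + 1).reshape(1, -1)            # column indices
    total = matrix.sum() or 1.0                       # guard against an all-zero matrix
    l_bar = (l * matrix).sum() / total                # centroid coordinates, taken from
    n_bar = (n * matrix).sum() / total                # the first-order raw moments
    raw, central = [], []
    for x in range(max_degree + 1):
        for y in range(max_degree + 1 - x):           # x + y <= 3 gives 10 moments each
            raw.append(float(((l ** x) * (n ** y) * matrix).sum()))
            central.append(float((((l - l_bar) ** x) * ((n - n_bar) ** y) * matrix).sum()))
    return raw, central
```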

Computing position relative incidence matrix (PRIM) by original and reverse sequence

The relative positioning of amino acid residues in a polypeptide chain plays a pivotal role in determining the biological function and physicochemical characteristics of that peptide. The proposed model therefore exploits the distinctive arrangement of proteins and the relative positions of the amino acids in the residue chain.

The relative positioning statistics of all residues are used to assemble a 20 × 20 position relative incidence matrix (PRIM), from which features are extracted. The PRIM is given as

$$M_{PRIM}=\left[\begin{array}{cccccc}M_{1\to 1}& M_{1\to 2}& \cdots & M_{1\to y}& \cdots & M_{1\to 20}\\ M_{2\to 1}& M_{2\to 2}& \cdots & M_{2\to y}& \cdots & M_{2\to 20}\\ \vdots & \vdots & & \vdots & & \vdots \\ M_{x\to 1}& M_{x\to 2}& \cdots & M_{x\to y}& \cdots & M_{x\to 20}\\ \vdots & \vdots & & \vdots & & \vdots \\ M_{20\to 1}& M_{20\to 2}& \cdots & M_{20\to y}& \cdots & M_{20\to 20}\end{array}\right]$$

where each element \(M_{i\to j}\) is the sum of all positions of the jth amino acid relative to the first occurrence of the ith amino acid. As there are 20 amino acids, this 20 × 20 matrix yields 400 coefficients. For the reverse position relative incidence matrix (RPRIM), the same process is applied to the reversed proteomic sequence, and RPRIM is given as

$$M_{RPRIM}=\left[\begin{array}{cccccc}M_{1\to 1}& M_{1\to 2}& \cdots & M_{1\to y}& \cdots & M_{1\to 20}\\ M_{2\to 1}& M_{2\to 2}& \cdots & M_{2\to y}& \cdots & M_{2\to 20}\\ \vdots & \vdots & & \vdots & & \vdots \\ M_{x\to 1}& M_{x\to 2}& \cdots & M_{x\to y}& \cdots & M_{x\to 20}\\ \vdots & \vdots & & \vdots & & \vdots \\ M_{20\to 1}& M_{20\to 2}& \cdots & M_{20\to y}& \cdots & M_{20\to 20}\end{array}\right]$$

To reduce the dimensionality of both 20 × 20 matrices and to extract more significant and meaningful information from PRIM and RPRIM, 30 moments were again computed for each matrix.
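To make the construction concrete, the following sketch builds PRIM and RPRIM under one plausible reading of the definition above (positions of the jth residue accumulated relative to the first occurrence of the ith residue); the exact accumulation rule used by the authors may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, A..Y

def compute_prim(sequence):
    """20 x 20 position relative incidence matrix (PRIM)."""
    first_pos = {}
    for pos, residue in enumerate(sequence, start=1):
        first_pos.setdefault(residue, pos)            # remember first occurrence of each residue
    prim = [[0] * 20 for _ in range(20)]
    for i, aa_i in enumerate(AMINO_ACIDS):
        if aa_i not in first_pos:
            continue                                  # residue i absent: its row stays zero
        origin = first_pos[aa_i]
        for pos, residue in enumerate(sequence, start=1):
            if residue in AMINO_ACIDS:
                j = AMINO_ACIDS.index(residue)
                prim[i][j] += pos - origin            # position of j relative to the first i
    return prim

def compute_rprim(sequence):
    """RPRIM: the same computation applied to the reversed sequence."""
    return compute_prim(sequence[::-1])
```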

Determination of frequency vector

PRIM and RPRIM mainly provide information regarding the relative positioning of residues in the amino acid sequences; however, amino acid frequencies within a sequence also play an important role. To elucidate the compositional configuration of a primary sequence, another vector, the frequency vector, is formulated. The vector is of length 20, where each index i represents the ith amino acid (from A to Y) and each coefficient measures the occurrence frequency of the corresponding amino acid. The frequency vector is represented as

$$\xi =\{\tau_{1},\tau_{2},\ldots,\tau_{20}\}$$

where each \(\tau_{i}\) characterizes the occurrence frequency of the respective ith amino acid.
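In code, the frequency vector is simply a per-residue count over the 20-letter amino acid alphabet, for example:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_vector(sequence):
    """20-element vector of occurrence counts, one per amino acid (A..Y)."""
    return [sequence.count(aa) for aa in AMINO_ACIDS]
```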

Computing accumulative absolute position incidence vector (AAPIV) by original and reverse sequence

The frequency vector only captures the composition of amino acids, whereas PRIM and RPRIM only provide information on relative amino acid positioning. To encode the absolute positions of amino acids in a sequence, the accumulative absolute position incidence vector (AAPIV) is used. It provides an estimate of the absolute positioning of residues: the ordinal value of each residue is accumulated into a 20-length vector at the respective coefficient, where each index represents the respective amino acid from A to Y. Thus, an arbitrary ith element of AAPIV is calculated as

$$\mu_{i}=\sum_{k=1}^{n}p_{k}$$

where \(p_{k}\) represents the ordinal value (position) of the kth occurrence of the ith amino acid. Similarly, the reverse accumulative absolute position incidence vector (RAAPIV) is computed using the same mechanism on the reversed sequence, unravelling further obscure features. The generic representation of AAPIV and RAAPIV is

$$\Lambda =\{\eta_{1},\eta_{2},\eta_{3},\ldots,\eta_{20}\}$$
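A minimal sketch of AAPIV and RAAPIV, assuming 1-based ordinal positions and the same 20-letter alphabet as above:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aapiv(sequence):
    """Accumulate the ordinal (1-based) positions of every occurrence of each amino acid."""
    vector = [0] * 20
    for pos, residue in enumerate(sequence, start=1):
        if residue in AMINO_ACIDS:
            vector[AMINO_ACIDS.index(residue)] += pos
    return vector

def raapiv(sequence):
    """RAAPIV: the same accumulation on the reversed sequence."""
    return aapiv(sequence[::-1])
```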

Model training and optimization

This study focuses on a specific type of protein, namely those pertaining to the stress response. Three different classification algorithms were analyzed for the prediction of stress response proteins. A feature vector was assembled from the raw, central, and Hahn moments of the two-dimensional depiction of the protein sequence, along with the PRIM moments, RPRIM moments, frequency vector, AAPIV, and RAAPIV. This yielded a feature vector of length 150, which was fed as input to all three classification algorithms.
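Putting the pieces together, the 150 coefficients break down as 30 moments of the sequence matrix, 30 PRIM moments, 30 RPRIM moments, and the 20-element frequency vector, AAPIV, and RAAPIV. A sketch of the assembly, reusing the helper functions sketched above and assuming a hypothetical `hahn_moments` helper for the 10 Hahn moments of each matrix:

```python
import numpy as np

def extract_feature_vector(sequence, encoding):
    """Assemble the 150-dimensional feature vector described in the text."""
    matrices = (
        sequence_to_square_matrix(sequence, encoding),
        np.array(compute_prim(sequence), dtype=float),
        np.array(compute_rprim(sequence), dtype=float),
    )
    fv = []
    for matrix in matrices:
        raw, central = raw_and_central_moments(matrix)
        # hahn_moments is a hypothetical helper returning the 10 Hahn moments of the matrix
        fv.extend(raw + central + hahn_moments(matrix))    # 10 + 10 + 10 = 30 per matrix
    fv.extend(frequency_vector(sequence))                   # 20
    fv.extend(aapiv(sequence))                              # 20
    fv.extend(raapiv(sequence))                             # 20
    return np.array(fv, dtype=float)                        # length 150
```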

Random forest

First, the random forest (RF) algorithm was used, a well-known algorithm for regression and classification problems. During training, it operates by generating a forest of decision trees from a feature matrix and outputs the predicted class as the mode of the classes predicted by all trees in the forest. Random forest is non-parametric, has high classification accuracy, and is capable of determining the set of coefficients that are most crucial for predicting the class with maximum accuracy35. The feature vectors (as input matrix) and their corresponding class labels (as expected output) are used to train the random forest predictor. The architecture of the random forest is shown in Fig. 3.

Figure 3. Architecture of the random forest classifier for the proposed prediction model.

Each of the n decision trees uses a subset of the feature vector to classify the input, and a voting algorithm decides the final predicted class based on the majority vote.

In the present study, the RF implementation from Scikit-Learn was used with default parameters, except for n_estimators, which was set to 50.
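A minimal training sketch consistent with this setup (the `X`/`y` arrays stand for the 150-dimensional feature vectors and their binary labels; the train/test split is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: (n_samples, 150) feature matrix, y: labels (1 = stress response protein, 0 = otherwise)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=50)   # default parameters except n_estimators = 50
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```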

Artificial neural network

An artificial neural network is a connectionist network of neurons. Input neurons receive the transformed input, while each subsequent neuron receives the outputs of the neurons in the preceding layer. The output of each neuron is calculated as the activation of the weighted sum of its inputs, as shown in Fig. 4. The formed feature vectors are clamped to the input layer of the neural network36. An optimized number of hidden-layer neurons is used. During each epoch, backpropagation along with the gradient descent technique is used to find the most optimal neuron weights. The gradient descent method uses the gradients of the cost function to take a step towards the optimal solution with respect to a parameter \(\theta\) as

Figure 4. Architecture of the ANN classifier for the proposed prediction model.

$$\theta = \theta - a\,\nabla_{\theta}M(\theta)$$

Here, \(M(\theta)\) is the objective function, \(\theta \in \mathcal{R}^{d}\), \(a\) is the learning rate, and \(\nabla_{\theta}M(\theta)\) is the gradient of the objective function. The learning rate is problem-specific and its value usually depends on the cost function; it governs the step size of gradient descent at each iteration. The learning rate is usually a constant, but it can be variable, in which case it is adaptively adjusted to find the most optimal point37. To find the most optimal point, the gradient is calculated at consecutive points while the weights are readjusted and the learning rate is fine-tuned38. Once the network is fully trained, it can be used to predict outcomes for unknown data.
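As an illustration only (in practice the framework's backpropagation routine applies this update internally), the parameter update can be written as:

```python
import numpy as np

def gradient_descent(grad_fn, theta, learning_rate=0.001, epochs=100):
    """Repeatedly apply theta <- theta - a * grad M(theta)."""
    for _ in range(epochs):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# toy objective M(theta) = ||theta||^2, whose gradient is 2 * theta
theta_opt = gradient_descent(lambda t: 2 * t, np.array([1.0, -2.0]))
```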

In the present study, a fully connected neural network with dense layers was built using Keras. One hidden layer with 50 neurons was employed, while the size of the input layer was 150 (equal to the feature vector length). Two output neurons were used for binary classification with one-hot encoding. ReLU was used as the activation function for the hidden layer and sigmoid for the output layer. The learning rate was set to 0.001.
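A sketch of this architecture in Keras is given below; the SGD optimizer and binary cross-entropy loss are our assumptions, since the text only specifies gradient descent with backpropagation:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

model = Sequential([
    Dense(50, activation="relu", input_shape=(150,)),   # single hidden layer, 50 neurons
    Dense(2, activation="sigmoid"),                      # two one-hot encoded output neurons
])
model.compile(optimizer=SGD(learning_rate=0.001),        # learning rate 0.001, as in the text
              loss="binary_crossentropy",                # assumed loss (not stated in the text)
              metrics=["accuracy"])
# model.fit(X_train, y_train_onehot, epochs=..., validation_split=...)
```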

Support vector machine

Lastly, a support vector machine (SVM) was used, which is also a supervised learning model known for classification and regression tasks. SVMs have been abundantly and successfully deployed to solve numerous classification problems39. An SVM works on the principle of finding a hyperplane that separates the classes of a dataset, as shown in Fig. 5.

Figure 5. Architecture of the SVM classifier for the proposed prediction model.

The hyperplane is adjusted with the help of support vectors so that the distance between the hyperplane and the nearest training data points is maximal. This provides a large margin and hence improves accuracy by reducing the generalization error.

For the present study, the linear SVM implementation from Scikit-Learn, which is based on libsvm, was used with default parameters.
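A corresponding sketch (SVC with a linear kernel is Scikit-Learn's libsvm-based route to a linear SVM; the arrays are those from the earlier split):

```python
from sklearn.svm import SVC

svm = SVC(kernel="linear")       # libsvm-based linear SVM, other parameters left at defaults
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```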
