
Computational identification of multiple lysine PTM sites by analyzing the instance hardness and feature importance

Dataset construction

Accurate identification of protein post-translational modifications often requires a rigorously processed benchmark dataset. As this study concerns a multi-class, multi-label classification problem, several steps have been followed to construct five valid benchmark datasets.

Primary data collection and preprocessing

In the current study, human protein sequences have been utilized for prediction model development and benchmarking. A total of 9380 protein sequences have been collected from the Universal Protein Resource (UniProt)25 by applying various constraints (accessed 22 September 2020). First, under the ‘Advanced Search’ option, the ‘PTM/Processing’ and ‘Modified residue [FT]’ fields were selected, with ‘Evidence’ kept as ‘Any assertion method’. A second query field, ‘Organism [OS]’, was set to ‘Homo sapiens (Human) [9606]’ from the suggested terms. Finally, a third query field was added to restrict the results to ‘Reviewed’ entries. As this study is concerned with a multi-label classification problem, 5 different types of K-PTMs (i.e. acetylation, crotonylation, methylation, succinylation, and glutarylation) have been considered for the dataset construction. After applying a preliminary selection process with the specific keywords of each K-PTM, 1841 proteins have been obtained. For formulating peptide samples meticulously and comprehensively, Chou’s scheme26 has been adopted. According to this scheme, a peptide segment can generally be expressed as,

$$\begin{aligned} P_\zeta(K)=Q_{-\zeta}\, Q_{-(\zeta-1)} \ldots Q_{-2}\, Q_{-1}\, K\, Q_{+1}\, Q_{+2} \ldots Q_{+(\zeta-1)}\, Q_{+\zeta} \end{aligned}$$

(1)

where the symbol K denotes the responsible residue ’lysine’ at the centre, the subscript \(\zeta\) is an integer, \(Q_{-\zeta}\) denotes the \(\zeta\)th leftward amino acid residue from the centre, \(Q_{+\zeta}\) the \(\zeta\)th rightward one, and so forth. In this study, a peptide sequence \(P_\zeta(K)\) is primarily categorized into two types,

$$\begin{aligned} P_\zeta(K) \in \left\{ \begin{array}{ll} P_\zeta^+(K), &{} \text{if its centre is a K-PTM site} \\ P_\zeta^-(K), &{} \text{if its centre is a non-K-PTM site} \end{array} \right. \end{aligned}$$

(2)

where \(P_\zeta^+(K)\) is the positive subset of the peptides and \(P_\zeta^-(K)\) the negative subset of the peptides with a lysine (K) residue at the centre, and the symbol \(\in\) indicates the set-theoretic membership relation. To form equal-sized K-PTM sites, a \((2\zeta+1)\)-tuple peptide window with K at its centre has been employed. During segmentation, missing amino acids at the right or left end have been filled with the nearest residue5. After the peptide fragments have gone through some screening, such as the elimination of redundant sequences, the primary dataset has been constructed in the following form,

$$\begin{aligned} S_\zeta(\text{K}) = S_\zeta^+(\text{K}) \cup S_\zeta^-(\text{K}) \end{aligned}$$

(3)

where the positive subset \(S_\zeta^+(\text{K})\) contains the peptide samples that carry one or more modifications (i.e. acetylation, crotonylation, methylation, succinylation, glutarylation) with K at the centre, while the negative subset \(S_\zeta^-(\text{K})\) contains only the false K-PTM samples with no modification at all. The sliding window method10 was adopted to segment the protein sequences with different window sizes, where \(\zeta = 1, 2, 3, \ldots, 24\). Based on the Accuracy value, the window size was selected as \((2\zeta+1)=49\) with \(\zeta = 24\) (i.e. 24 amino acid residues on each side of the centre). It should be mentioned that only window sizes below 51 were considered, owing to the limited lengths of the protein sequences10. Therefore, Eq. (1) reduces to,

$$\begin{aligned} P(K)=Q_{-24}\, Q_{-23} \ldots Q_{-2}\, Q_{-1}\, K\, Q_{+1}\, Q_{+2} \ldots Q_{+23}\, Q_{+24} \end{aligned}$$

(4)

Following the aforestated process, 5059 K-PTM samples and 81507 Non-K-PTM samples have been obtained.
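To make the segmentation step concrete, the following minimal Python sketch extracts one \((2\zeta+1)\)-mer per lysine with nearest-residue padding, as described above; the function name and exact padding rule beyond what is stated in the text are our assumptions, not the authors' released code.

```python
def extract_windows(sequence, zeta=24):
    """Return one (2*zeta + 1)-mer centred on each lysine (K), per Eq. (4).

    Window positions that run off either end of the protein are filled
    with the nearest existing residue, as stated in the text. Sketch only.
    """
    windows = []
    for i, residue in enumerate(sequence):
        if residue != 'K':
            continue
        pad_left = sequence[0] * max(0, zeta - i)                        # fill left overhang
        pad_right = sequence[-1] * max(0, i + zeta + 1 - len(sequence))  # fill right overhang
        windows.append(pad_left + sequence[max(0, i - zeta):i + zeta + 1] + pad_right)
    return windows
```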

Data imbalance management and benchmark dataset formation

It can be observed that the primary dataset is highly imbalanced, with a ratio of roughly 1:16 between K-PTM and Non-K-PTM sites. The instance hardness (IH) based undersampling technique has been employed to reduce this skewness27. Later, at the classification level, a cost-sensitive SVM classifier has been utilized to address the remaining imbalance in each K-PTM dataset.

Instance hardness undersampling

Smith, Martinez, and Giraud-Carrier proposed the instance hardness (IH) undersampling technique for binary classification problems27,28. In this study, this technique has been adopted by measuring the hardness of the sequence-coupling information extracted from the primary dataset using Eqs. (10), (11) and (12). The detailed methodology of the vectorized sequence-coupling feature extraction technique is discussed in the “Feature construction” section. From Fig. 1, it can be observed that one or more modifications occur in 5059 ’K-PTM’ samples, whereas 81507 ’Non-K-PTM’ samples lack any modification. The objective here is to find the most suitable peptide samples representing no modification at all. In this work, the hardness of an instance in the coupling feature set measures how likely it is to be misclassified. Higher hardness values indicate samples that are noisy or lie on the border between the ’K-PTM’ and ’Non-K-PTM’ classes, so that a learning algorithm would have to overfit to classify them correctly28. For a peptide sample \((x_i, y_i)\), \(p(y_i|x_i, h)\) denotes the conditional probability of label \(y_i\) for the input feature vector \(x_i\) given the learning algorithm h. The higher the value of \(p(y_i|x_i,h)\), the more likely h assigns the correct label to \(x_i\), and the opposite holds for smaller values of \(p(y_i|x_i,h)\)27,28. The hardness of an instance \((x_i, y_i)\), with respect to h, is defined as,

$$\begin{aligned} I_h[(x_i,y_i)]=1-p(y_i|x_i,h) \end{aligned}$$

(5)

Let \(\mathscr{H}\) be the set of weak learners and \(p(h|t)\) the corresponding weight of \(h \in \mathscr{H}\), where \(t = \{(x_i,y_i) : x_i \in X \wedge y_i \in Y\}\). Hence, the hardness of an instance in the data sample takes the following form,

$$\begin{aligned} I[(x_i,y_i)] &= \sum_{h \in \mathscr{H}}\big(1-p(y_i|x_i,h)\big)\,p(h|t) \\ &= \sum_{h \in \mathscr{H}} p(h|t)-\sum_{h \in \mathscr{H}} p(y_i|x_i,h)\,p(h|t) \\ &= 1-\sum_{h \in \mathscr{H}} p(y_i|x_i,h)\,p(h|t) \end{aligned}$$

(6)

Following this concept, the imbalanced dataset has been resampled by eliminating data points with high instance hardness values from the majority class until the desired balancing ratio of 1:1 has been reached. To estimate the hardness of an instance, the cost-sensitive support vector machine29,30,31, discussed later in this study, has been utilized. It should be mentioned that the scikit-learn library32 has been used to implement the instance hardness (IH) based undersampling technique. Finally, 5059 positive and 5059 negative samples have been obtained, and the original peptide sequences in the form of Eqs. (3) and (4) have been retrieved from the returned indices of the resampled dataset. The final benchmark datasets have been constructed by mapping the samples labeled ’K-PTM’ and ’Non-K-PTM’ into each individual class, which takes the following form,

$$\begin{aligned} \left\{ \begin{array}{l} S_\zeta(\text{acetylation}) = S_\zeta^+(\text{acetylation}) \cup S_\zeta^-(\text{acetylation}) \\ S_\zeta(\text{crotonylation}) = S_\zeta^+(\text{crotonylation}) \cup S_\zeta^-(\text{crotonylation}) \\ S_\zeta(\text{methylation}) = S_\zeta^+(\text{methylation}) \cup S_\zeta^-(\text{methylation}) \\ S_\zeta(\text{succinylation}) = S_\zeta^+(\text{succinylation}) \cup S_\zeta^-(\text{succinylation}) \\ S_\zeta(\text{glutarylation}) = S_\zeta^+(\text{glutarylation}) \cup S_\zeta^-(\text{glutarylation}) \end{array} \right. \end{aligned}$$

(7)
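For concreteness, the undersampling step above can be sketched as follows, assuming the imbalanced-learn package (a scikit-learn-compatible implementation of IH thresholding); the names X_coup and y stand for the sequence-coupling feature matrix of Eqs. (10)–(12) and the binary K-PTM/Non-K-PTM labels, and are assumptions of this illustration.

```python
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.svm import SVC

# Hardness is estimated with a cost-sensitive RBF-SVM; probability=True is
# required so the estimator can output p(y_i | x_i, h) for Eqs. (5)-(6).
iht = InstanceHardnessThreshold(
    estimator=SVC(kernel='rbf', class_weight='balanced', probability=True),
    sampling_strategy='auto',   # drop the hardest majority samples down to 1:1
    cv=5,
    random_state=42,
)
X_res, y_res = iht.fit_resample(X_coup, y)
kept = iht.sample_indices_      # indices used to retrieve the original peptides
```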

A comprehensive summary of the dataset preparation is presented in Fig. 1. The numbers of samples in the benchmark datasets are outlined in Table 1, and their detailed sequences and positions in the proteins are given in the Supplementary File. The distributions of the different types of modifications in the benchmark datasets are tabulated in Table 2. It can be observed that our benchmark datasets contain 4089 samples belonging to one type of K-PTM, 861 to two types, 77 to three types, 26 to four types, and 6 to five types of modification.

Table 1 Number of samples in the benchmark dataset for different K-PTMs.
Table 2 K-PTM distributions in the training set.
Cost-sensitive classifiers

We have handled the imbalance between the K-PTM and Non-K-PTM sites by utilizing the instance hardness undersampling technique. However, it can be observed from Table 1 that some skewness still exists between the positive and negative sites of each of the five modifications. Therefore, further adjustments are needed to deal with this issue. We have utilized five cost-sensitive SVM classifiers to mitigate the imbalance in the five datasets of Table 1. A detailed discussion of the support vector machine prediction algorithm and the proposed model development is presented in the “Support vector machine” and “Model development and validation” sections, respectively.

Feature construction

Along with the growth of biological sequence data, several encoding methods have been developed for extracting pertinent features hidden in the sequences. After preliminary analysis, it has been observed that the amino acid factors, encoded binary features, pairs of k-spaced amino acids, and the vectorized sequence-coupling12,15,20 technique are more appropriate than other encoding methods for representing the protein sequences around multiple lysine modification sites.

Amino acid factors

Five multidimensional attributes20,33, which include polarity, secondary structure, molecular volume, electrostatic charge, and codon diversity34, have been constructed from AAIndex by using multivariate statistical analysis12. These five transformed properties can be introduced as amino acid factors (AAF)34. Since the AAF can reduce the dimensionality of the feature space of physicochemical properties efficiently, it has been utilized in many biological studies12,34. The dimensionality of feature vectors has been calculated as follows,

$$\begin{aligned} D = \text{peptide sequence length} \times \text{number of factors} \end{aligned}$$

(8)

With a peptide sequence of length 49 and the previously described five amino acid factors, \(49 \times 5 = 245\)-dimensional features have been derived using this formula.
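As an illustration, the AAF encoding can be implemented as below; the two example rows follow Atchley and colleagues' published factor scores, while the table and function names are ours and the remaining rows must be filled in from the AAindex-derived analysis cited above.

```python
import numpy as np

# Hypothetical lookup table: five factor scores per residue (polarity,
# secondary structure, molecular volume, codon diversity, electrostatic
# charge). Only two rows are shown; the full table comes from the
# multivariate analysis of AAindex cited in the text.
AAF_TABLE = {
    'A': [-0.591, -1.302, -0.733, 1.570, -0.146],
    'C': [-1.343, 0.465, -0.862, -1.020, -0.255],
    # ... remaining residues to be filled from the same analysis
}

def encode_aaf(peptide, table=AAF_TABLE):
    """Eq. (8): 49 residues x 5 factors = 245-dimensional vector."""
    return np.array([table[aa] for aa in peptide]).ravel()
```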

Binary encoding

Binary encoding12 represents amino acid position and composition by using 20 binary bits per amino acid12, with one additional bit conjoined to handle the sliding-window padding. Over the 21-character alphabet ‘ACDEFGHIKLMNPQRSTVWYZ’, each residue inside a sequence fragment is encoded as a 21-dimensional binary vector12. For instance, residues ‘A’, ‘G’ and ‘Z’ are encoded as ‘100000000000000000000’, ‘000001000000000000000’ and ‘000000000000000000001’, respectively. Accordingly, each peptide segment is expressed as a \(49 \times 21 = 1029\)-dimensional feature vector.
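A direct transcription of this one-hot scheme follows; the function name is ours, and the alphabet is the 21-character string given above.

```python
ALPHABET = 'ACDEFGHIKLMNPQRSTVWYZ'   # 20 amino acids plus one extra symbol

def encode_binary(peptide):
    """One-hot encode a 49-mer into a 49 x 21 = 1029-dimensional vector."""
    vec = []
    for aa in peptide:
        bits = [0] * len(ALPHABET)
        bits[ALPHABET.index(aa)] = 1   # single 1 at the residue's alphabet position
        vec.extend(bits)
    return vec
```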

Pairs of k-spaced amino acids

The composition of k-spaced amino acid pairs (CKSAAP) encoding technique12,21,35 calculates the occurrence frequencies of k-spaced amino acid pairs in a segmented protein sample, which can capture its short linear motif information12,30. For instance, the encoding of a peptide segment is a 441-dimensional feature vector when \(k=0\). This can be defined as,

$$\begin{aligned} \left( N_{AnA}/N_{Total},\; N_{AnC}/N_{Total},\; \ldots,\; N_{YnY}/N_{Total} \right)_{441} \end{aligned}$$

(9)

where n stands for any amino acid, \(N_{Total}\) is the total occurrence frequency of all k-spaced amino acid pairs35, and \(N_{AnA}\) is the occurrence frequency of the AnA pairs in the segment20 when \(k=0\). In this study, after merging the 441-dimensional feature vectors for \(k = 0, 1, 2, 3, 4\), a total of 2205-dimensional features have been formed.
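The CKSAAP computation can be sketched as follows, assuming the same 21-character alphabet as the binary encoding (which yields the 441 pairs per k stated above); the names are illustrative.

```python
from itertools import product

AA21 = 'ACDEFGHIKLMNPQRSTVWYZ'
PAIR_INDEX = {''.join(p): i for i, p in enumerate(product(AA21, repeat=2))}  # 441 pairs

def encode_cksaap(peptide, k_max=4):
    """Eq. (9) for k = 0..4: frequencies of k-spaced residue pairs,
    441 features per k, 5 x 441 = 2205 features in total."""
    features = []
    for k in range(k_max + 1):
        counts = [0] * len(PAIR_INDEX)
        n_total = len(peptide) - k - 1      # N_Total: all k-spaced pairs in the segment
        for i in range(n_total):
            counts[PAIR_INDEX[peptide[i] + peptide[i + k + 1]]] += 1
        features.extend(c / n_total for c in counts)
    return features
```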

Sequence coupling

The pseudo amino acid composition, or PseAAC10,36,37, has been designed to preserve sequence-pattern information, which is a much harder task for any existing machine learning algorithm38. In this study, sequence-coupling information incorporated into Chou’s general PseAAC has been adopted for extracting features from the peptide sequences5,15,18. It can be defined as,

$$\begin{aligned} P(K) = P^+(K) - P^-(K) \end{aligned}$$

(10)

where,

$$\begin{aligned} P^+(K)=\Big[ P_{-24}^{C^{+}}\; P_{-23}^{C^{+}} \ldots P_{-1}^{+}\; P_{+1}^{+} \ldots P_{+23}^{C^{+}}\; P_{+24}^{C^{+}} \Big]^T \end{aligned}$$

(11)

$$\begin{aligned} P^-(K)=\Big[ P_{-24}^{C^{-}}\; P_{-23}^{C^{-}} \ldots P_{-1}^{-}\; P_{+1}^{-} \ldots P_{+23}^{C^{-}}\; P_{+24}^{C^{-}} \Big]^T \end{aligned}$$

(12)

where \(P_{-24}^{C^{+}}\) in Eq. (11) denotes the conditional probability of amino acid \(Q_{-24}\) at the leftmost position given that its adjacent right member is \(Q_{-23}\), and so forth5,18. In contrast, only \(P_{-1}^{+}\) and \(P_{+1}^{+}\) are non-contingent probabilities, as K is the adjoining member of the amino acids at both positions \(Q_{-1}\) and \(Q_{+1}\). All the conditional probability values have been extracted from the positive training dataset. The probability values in Eq. (12) are defined identically to those of Eq. (11), except that they are derived from the negative training dataset. Thus, after omitting K from the centre, \((49-1)=48\)-dimensional features have been obtained.
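One plausible reading of Eqs. (10)–(12) is sketched below: position-specific conditional probabilities are estimated separately from the positive and negative training peptides, and the final 48-dimensional feature is their element-wise difference. The table layout and function names are our assumptions, not the authors' published code.

```python
import numpy as np
from collections import Counter

def fit_coupling_tables(peptides):
    """Count residues (and residue/neighbour pairs) per position from one
    training subset; the conditioning neighbour is the adjacent residue
    nearer the central K, as in Eqs. (11)-(12)."""
    L, c = len(peptides[0]), len(peptides[0]) // 2
    single = [Counter() for _ in range(L)]
    pair = [Counter() for _ in range(L)]
    for p in peptides:
        for j in range(L):
            single[j][p[j]] += 1
            nb = j + 1 if j < c else j - 1
            pair[j][(p[j], p[nb])] += 1
    return single, pair, len(peptides)

def coupling_vector(pep, single, pair, n):
    """48-dim profile of one 49-mer: conditional probabilities for the
    flanking residues, non-contingent probabilities at -1/+1, K omitted."""
    L, c = len(pep), len(pep) // 2
    v = []
    for j in range(L):
        if j == c:
            continue                                  # skip the central K
        if j in (c - 1, c + 1):
            v.append(single[j][pep[j]] / n)           # non-contingent probability
        else:
            nb = j + 1 if j < c else j - 1
            denom = single[nb][pep[nb]]
            v.append(pair[j][(pep[j], pep[nb])] / denom if denom else 0.0)
    return np.array(v)

# Eq. (10): pos_train / neg_train are assumed lists of 49-mer strings.
# sp, pp, n_pos = fit_coupling_tables(pos_train)
# sn, pn, n_neg = fit_coupling_tables(neg_train)
# x = coupling_vector(pep, sp, pp, n_pos) - coupling_vector(pep, sn, pn, n_neg)
```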

Feature ensembling

Initially, the four aforestated feature encoding techniques (i.e. AAF, BE, CKSAAP, and sequence coupling) have been implemented separately to encode the training peptides. However, for extracting more PTM-contextual information from the protein sequences, the encoded features have been ensembled serially and scaled through standardization. Finally, \((49 \times 5) + (49 \times 21) + (441 \times 5) + 48 = 3527\)-dimensional features have been obtained.
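In code, the serial fusion and scaling amount to a column-wise concatenation followed by standardization; the four matrix names refer to the per-encoding outputs of the sketches above and are assumptions of this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# X_aaf (n, 245), X_bin (n, 1029), X_cksaap (n, 2205), X_coup (n, 48):
# assumed per-sample matrices from the four encoders sketched above.
X = np.hstack([X_aaf, X_bin, X_cksaap, X_coup])   # 245 + 1029 + 2205 + 48 = 3527
X = StandardScaler().fit_transform(X)             # zero mean, unit variance
```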

Feature selection

Since the dimensionality of the encoded features is high, irrelevant and redundant features should be removed to avoid learning complexity. For this reason, the analysis of variance (ANOVA) F-test statistic22,39 has been adopted. It tests the null hypothesis (i.e. the means of all groups are equal) against the alternative hypothesis (i.e. at least one group mean differs from the others). The one-way ANOVA F statistic can be defined as,

$$\begin{aligned} F= \frac{(n-k)\sum n_i(\overline{Y}_{i.}-\overline{Y}_{..})^2}{(k-1)\sum (n_i-1)s_i^2} \end{aligned}$$

(13)

where \(n=\sum_{i=1}^k n_i\), \(\overline{Y}_{i.}=Y_{i.}/n_i\), \(\overline{Y}_{..}=Y_{..}/n\), and \(s_i^2=\sum_{j=1}^{n_i} (Y_{ij}-\overline{Y}_{i.})^2/(n_i-1)\).

It should be mentioned that the dot in \(Y_{i.}\) indicates aggregation over the j index39, with \(Y_{i.}=\sum_{j=1}^{n_i} Y_{ij}\) and \(Y_{..}=\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}\). The calculated F values are used to rank the features: the higher the F value of a feature, the better its discriminative capability.
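In scikit-learn, f_classif computes exactly this one-way ANOVA F statistic per feature, so the ranking-and-selection step can be sketched as follows (keeping the 100 best features, the count used later for model training).

```python
from sklearn.feature_selection import SelectKBest, f_classif

# X: fused 3527-dim feature matrix; y: binary labels for one K-PTM type.
selector = SelectKBest(score_func=f_classif, k=100)
X_top = selector.fit_transform(X, y)
ranking = selector.scores_.argsort()[::-1]   # feature indices, highest F first
```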

Support vector machine

The support vector machine (SVM)29,30,31, one of the dominant statistical learning algorithms, was adopted as the core prediction algorithm. It seeks the optimal hyperplane with the largest margin between two groups18,40 by solving the constrained optimization problem described below,

$$\begin{aligned} \max_{\alpha}\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j k(x_i,x_j) \end{aligned}$$

(14)

subject to \(\sum_{i=1}^{n}y_i\alpha_i=0\) and \(0\le \alpha_i\le C\) for all \(i=1,2,3,\ldots,n\). After introducing the kernel function, the discriminant function of the SVM takes the following form,

$$\begin{aligned} f(x)=\sum_{i=1}^{n}\alpha_i y_i k(x,x_i)+b \end{aligned}$$

(15)

In this paper, the radial basis function kernel18,41, given by \(k(x_i,x_j)=\exp(-\gamma \Vert x_i-x_j \Vert^2)\) with \(\gamma>0\)42, was applied to construct the SVM classifiers. As the benchmark dataset was highly imbalanced, the different error cost (DEC)18 method was used to tackle the class imbalance problem24,43. In this approach, the SVM soft-margin objective function is adjusted to allocate two misclassification costs12: \(C^+\) for the positive class instances and \(C^-\) for the negative class instances,

$$\begin{aligned} C^+ = C \cdot W^+, \qquad C^- = C \cdot W^- \end{aligned}$$

(16)

In Eq. (16), \(W^+\) is the weight for the positive instances and \(W^-\) is the weight for the negative instances, defined by \(W^+ = \frac{M}{2 M_1}\) and \(W^- = \frac{M}{2 M_2}\), where M is the total number of samples, \(M_1\) the number of positive-class samples, and \(M_2\) the number of negative-class samples.
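The DEC weighting of Eq. (16) can be sketched as below; note that scikit-learn's class_weight='balanced' computes the same \(M/(2M_i)\) weights for a binary problem, and gamma='auto' reproduces libSVM's default of 1/(number of features).

```python
import numpy as np
from sklearn.svm import SVC

def dec_svm(y, C=1.0):
    """Cost-sensitive RBF-SVM with the DEC costs of Eq. (16):
    C+ = C * W+ for the positive class, C- = C * W- for the negative."""
    M, M1, M2 = len(y), int(np.sum(y == 1)), int(np.sum(y == 0))
    weights = {1: M / (2 * M1), 0: M / (2 * M2)}   # W+ and W-
    return SVC(C=C, kernel='rbf', gamma='auto', class_weight=weights)
```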

Evaluation metrics

As shown in Table 1 and the Supplementary Material, there are 10118 peptide samples in total, of which 4154 are labelled ‘acetylation’, 208 ‘crotonylation’, 325 ‘methylation’, 1253 ‘succinylation’, 236 ‘glutarylation’, and 5059 ‘Non-K-PTM’. Since a sample can carry more than one label, metrics for multi-label systems5,18 have been utilized instead of the ordinary metrics for single-label systems9,10,12,44. Following Chou’s formulation45, the metrics for multi-label systems can be defined as,

$$\begin{aligned} \left\{ \begin{array}{l} \text{Aiming} = \frac{1}{N}\sum_{i=1}^{N}\left( \frac{\Vert Y_i \cap Y_i^\prime \Vert}{\Vert Y_i^\prime \Vert}\right) \\ \text{Coverage} = \frac{1}{N}\sum_{i=1}^{N}\left( \frac{\Vert Y_i \cap Y_i^\prime \Vert}{\Vert Y_i \Vert}\right) \\ \text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\left( \frac{\Vert Y_i \cap Y_i^\prime \Vert}{\Vert Y_i \cup Y_i^\prime \Vert}\right) \\ \text{Absolute-True} = \frac{1}{N}\sum_{i=1}^{N} \Delta(Y_i, Y_i^\prime) \\ \text{Absolute-False} = \frac{1}{N}\sum_{i=1}^{N}\left( \frac{\Vert Y_i \cup Y_i^\prime \Vert - \Vert Y_i \cap Y_i^\prime \Vert}{L}\right) \end{array} \right. \end{aligned}$$

(17)

where N and L are the total numbers of samples and labels in the system, respectively5,18, \(\cup\) and \(\cap\) denote the union and intersection of set theory, \(\Vert \cdot \Vert\) is the operator that counts the number of elements in a set, \(Y_i\) and \(Y_i^\prime\) denote the subsets containing all the experimentally observed labels and all the predicted labels for the \(i\)th sample, respectively, and

$$\begin{aligned} \Delta(Y_i,Y_i^\prime)= \left\{ \begin{array}{ll} 1, &{} \text{if all labels in } Y_i^\prime \text{ and } Y_i \text{ are identical} \\ 0, &{} \text{otherwise} \end{array} \right. \end{aligned}$$

The metrics defined above have been applied effectively in several multi-label systems5,18.
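A direct transcription of Eq. (17) into Python is given below; Y_true and Y_pred are sequences of label sets and L is the number of labels in the system. This is our own rendering of the formulas, not the authors' evaluation code.

```python
def multilabel_metrics(Y_true, Y_pred, L):
    """Chou's five multi-label metrics of Eq. (17)."""
    N = len(Y_true)
    aim = cov = acc = a_true = a_false = 0.0
    for yt, yp in zip(Y_true, Y_pred):
        inter, union = len(yt & yp), len(yt | yp)
        aim += inter / len(yp) if yp else 0.0     # Aiming term
        cov += inter / len(yt) if yt else 0.0     # Coverage term
        acc += inter / union if union else 1.0    # Accuracy term
        a_true += 1.0 if yt == yp else 0.0        # Delta term (Absolute-True)
        a_false += (union - inter) / L            # Absolute-False term
    return {'Aiming': aim / N, 'Coverage': cov / N, 'Accuracy': acc / N,
            'Absolute-True': a_true / N, 'Absolute-False': a_false / N}
```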

Model development and validation

In this study, five separate SVM classifiers18 have been used to predict acetylation, crotonylation, methylation, succinylation, and glutarylation sites. Each classifier performs binary classification on the corresponding benchmark dataset described in Table 1. For all five K-PTM types, the necessary features have been extracted by integrating the multiple encoding methods, and the 100 optimal features selected with the ANOVA F-test have been used to train the models, as shown in Fig. 1. The radial basis function (RBF) kernel40,46 has been used for each SVM classifier. As the exact 5-way splits of the dataset are not fixed40, five complete runs of 5-fold cross-validation have been executed5,18,47. The misclassification costs have been calculated according to Eq. (16) to handle the data imbalance issue. In this study, libSVM’s default parameters (i.e. \(C=1\) and \(\gamma = 1/\text{number of features}\)) have been selected to train the models. Eventually, after training the five binary SVM classifiers with the appropriate hyperparameters, the multi-label predictor iMul-kSite has been constructed by combining the outputs of these classifiers40, as depicted in Fig. 1. The five repetitions of the 5-fold cross-validation40 have produced five sets of values for the metrics defined in the previous section, and the average of each multi-label metric has been taken to evaluate the final model. It should be mentioned that Matlab 2019a and Python 3.7.3 have been utilized to implement the system.
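The cross-validation loop for one binary classifier can be sketched as follows; the exact folds are assumptions, since the paper's splits are not published, and dec_svm refers to the DEC sketch above.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# Five complete runs of 5-fold cross-validation for one K-PTM classifier;
# X_top and y come from the feature-selection step.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
fold_scores = []
for tr, te in rskf.split(X_top, y):
    clf = dec_svm(y[tr]).fit(X_top[tr], y[tr])
    fold_scores.append(clf.score(X_top[te], y[te]))
# Per sample, the five classifiers' positive outputs are combined into a
# label set and scored with the multi-label metrics of Eq. (17).
```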
