
Subnetwork representation learning for discovering network biomarkers in predicting lymph node metastasis in early oral cancer

The proposed method works in three stages: (1) subnetwork extraction using graph embedding technique, (2) construction of subnetwork level representation, and (3) integration of subnetwork level representation into the master-level decision.

Subnetwork extraction using graph embedding technique

Extracting subnetworks from a given PPI network while taking its biological significance into account is an important task in constructing subnetwork-level representations. Essentially, the problem can be viewed as clustering the nodes of a PPI network represented as an adjacency matrix (Fig. 1). The sparsity of this network representation is useful for defining clusters, but at the same time it poses a major challenge to the generalization of machine learning models. DeepWalk is a powerful tool for this problem, deploying neural-network-based representation learning techniques such as Word2Vec18, 22. It works as a graph embedding tool and performs well on node classification18. In the study of Perozzi et al.18, DeepWalk was compared with five other methods on a multi-label classification task, in which some node labels of a labeled graph are hidden and must be recovered through node clustering. DeepWalk outperformed all competing methods under various experimental conditions, reaching a Macro-F1 score of up to 43.05%.

DeepWalk receives the sparse representation of the PPI network and generates a dense representation of the individual nodes, encoding the relationships between nodes in a continuous vector space with a reduced number of dimensions (Fig. 1). Using the encoded vector as a new representation of each node, we can solve the subnetwork extraction problem by transforming it into a clustering problem. We used a Gaussian Mixture Model (GMM) with the Bayesian Information Criterion (BIC) (Eq. 1) to estimate the optimal number of clusters for a given PPI network, and each resulting cluster can be considered a subnetwork (Fig. 2). For this step, the Python library scikit-learn-0.19.2 was used23.

$$\begin{aligned} BIC = \ln {(n)}(kd) - 2\ln {\left( p(x\mid \widehat{\theta },M)\right) } \end{aligned}$$

(1)

where x is the observed data, n is the number of data points in x, and k is the number of clusters. d is the number of dimensions of the latent representation generated by DeepWalk. \(p(x\mid \widehat{\theta }, M)\) is the maximum value of the GMM likelihood function, and \(\widehat{\theta }\) is the parameter value that maximizes it. The model with the lowest BIC value is considered optimal.
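As an illustration of this step, the GMM/BIC model selection can be sketched with scikit-learn as follows. This is a minimal sketch, assuming a NumPy array named embeddings of shape (number of genes, d) holding the dense node vectors produced by DeepWalk; the candidate range of cluster numbers, the file name, and the variable names are illustrative, not taken from the authors' code.

import numpy as np
from sklearn.mixture import GaussianMixture

# embeddings: (n_genes, d) dense node vectors produced by DeepWalk (illustrative placeholder)
embeddings = np.load("deepwalk_embeddings.npy")  # hypothetical file name

best_gmm, best_bic = None, np.inf
for k in range(2, 11):  # candidate numbers of clusters; this range is an assumption
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(embeddings)
    bic = gmm.bic(embeddings)  # scikit-learn's BIC, analogous to Eq. (1)
    if bic < best_bic:
        best_gmm, best_bic = gmm, bic

# Each cluster of the best model is treated as one subnetwork
subnetwork_labels = best_gmm.predict(embeddings)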

Figure 1

Extracting subnetworks using the graph embedding technique. This involves (1) generating an adjacency matrix from a given PPI network, (2) random walk sampling from the PPI graph, and (3) generating a word2vec representation of the sampled walks to produce a dense representation of each gene.

Figure 2

Subnetwork clustering using latent representation. This involves (1) fitting a Gaussian mixture model to the dense representation of the PPI network over a wide range of component numbers, (2) evaluating each model with the BIC criterion, and (3) choosing the best model to create the subnetworks of the given PPI network.

Figure 3

Constructing subnetwork level representation. This includes (1) calculating the sSAS representation for each optimized subnetwork and (2) integrating the representations into the subnetwork-level representation for each sample.

In order to improve interpretability and reduce noise, we used the Hallmark Gene Set (HGS) collection from the Molecular Signatures Database (MSigDB)24 to limit the gene space. HGS is a well-curated collection of functional gene sets, in which genes associated with a common cancer phenotype are grouped together. HGS comprises 50 gene sets containing a total of 4,384 genes. For each gene set, we first generated a PPI network graph using the protein-protein interactions between its genes. Here, PPI networks were extracted from BioGRID21 using only high-confidence protein interactions. For each PPI network graph, we applied DeepWalk to create a vector space and then applied GMM to create subnetworks (Figs. 1 and 2). By defining subnetworks within each HGS gene set, the genes in each subnetwork are closely linked not only in terms of the PPI network but also in terms of cancer phenotype. In summary, 279 subnetworks were obtained, each assigned to one of the 50 HGS gene sets.
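For concreteness, the DeepWalk embedding of one gene-set PPI graph can be approximated by truncated random walks followed by a Word2Vec skip-gram model, which is the core idea of DeepWalk18. The sketch below uses networkx and gensim (version 4 or later) with illustrative walk parameters and a toy graph; it is not the authors' implementation.

import random
import networkx as nx
from gensim.models import Word2Vec

def deepwalk_embed(graph, num_walks=10, walk_length=40, dim=64):
    # Truncated random walks over the graph; parameter values are illustrative
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    # Skip-gram (sg=1) over the walks, as in DeepWalk's use of Word2Vec
    model = Word2Vec(walks, vector_size=dim, window=5, min_count=0, sg=1, workers=1)
    return {node: model.wv[str(node)] for node in graph.nodes()}

# Toy PPI graph for one gene set; the edges are placeholders, not BioGRID data
g = nx.Graph([("TP53", "MDM2"), ("MDM2", "CDKN1A"), ("TP53", "CDKN1A")])
embeddings = deepwalk_embed(g)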

Construction of subnetwork level representation

Constructing a subnetwork-level representation of the cancer transcriptome requires integrating gene expression levels and the PPI network between genes so that the activity of each subnetwork can be quantified. SAS is one of the most effective tools for this purpose19. SAS takes RNA-seq samples and subnetworks generated from PPI networks as inputs and quantifies subnetwork-level activation for each sample. As defined in Eqs. (2a), (2b), (2c) and (2d), SAS is a single value, the Subnetwork Activation Score, used as the subnetwork-level representation of the transcriptome. It is a nonlinear combination of gene expression values whose coefficients are defined by the closeness centrality of each gene within the given PPI network.

$$\begin{aligned}&ACT_{i,j} = N_{i,j} * \frac{{(c_{i}r_{i}+c_{j}r_{j})}^2}{2(r_{i}+r_{j})} \end{aligned}$$

(2a)

$$\begin{aligned}&SAS = \sum _{i}\sum _{j}ACT_{i,j} \end{aligned}$$

(2b)

$$\begin{aligned}&N_{i,j} = \frac{a_{ij}}{\sum _{s}\sum _{t}a_{st}} \end{aligned}$$

(2c)

$$\begin{aligned}&a_{ij} = {\left\{ \begin{array}{ll} 1 &{} \text {if gene i and j are connected} \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

(2d)

\(ACT_{i,j}\) represents the edge-level activation score between two genes i and j. \(r_{i}\) represents the expression level of gene i (TPM in this case). \(c_{i}\) represents the closeness centrality of gene i within the given subnetwork PPI network. SAS is the total activation score of the subnetwork. \(a_{i,j}\) indicates whether two genes are linked within the given PPI network, and \(N_{i,j}\) is the normalization term for \(a_{i,j}\).
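A minimal sketch of Eqs. (2a)-(2d) is given below, assuming a networkx graph named subnet for one subnetwork and a dictionary expr mapping each gene to its TPM value (both placeholders). The normalization of Eq. (2c) is implemented here over the unordered edges of the subnetwork; the paper's exact indexing convention may differ.

import networkx as nx

def subnetwork_activation_score(subnet, expr):
    # subnet: networkx Graph of one subnetwork; expr: gene -> TPM (placeholders)
    closeness = nx.closeness_centrality(subnet)          # c_i in Eq. (2a)
    edges = list(subnet.edges())
    n_edges = len(edges)                                 # denominator of Eq. (2c)
    sas = 0.0
    for i, j in edges:
        r_i, r_j = expr[i], expr[j]
        c_i, c_j = closeness[i], closeness[j]
        if r_i + r_j == 0:
            continue                                     # skip unexpressed pairs
        act = (1.0 / n_edges) * (c_i * r_i + c_j * r_j) ** 2 / (2 * (r_i + r_j))  # Eq. (2a)
        sas += act                                       # Eq. (2b)
    return sas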

While SAS does not use sample labels when calibrating subnetwork representations, our goal is to predict metastatic potential in early OTSCC, so we modified SAS to better serve this purpose and named it supervised SAS (sSAS). sSAS inherits the basic idea of SAS but calculates the coefficients (e.g., \(c_{i}\) and \(N_{i,j}\)) in a different way. Rather than deriving the coefficients directly from the network topology, they are treated as latent variables and estimated by maximizing a log-likelihood function (Eq. 3f) designed to minimize prediction errors for the label of each sample. As shown in Eqs. (3a) and (3b), sSAS is defined as the logit of a logistic regression problem rather than a single activation score. x is defined as a vector of nonlinear combinations of gene expression over paired combinations of genes, and \(\theta\) is the latent weight vector corresponding to x. The problem is defined as follows.

First, the \(ACT_{i,j}\) term is divided into three parts: \(\frac{r_{i}^2}{r_{i}+r_{j}}\), \(\frac{r_{j}^2}{r_{i}+r_{j}}\), and \(\frac{r_{i}r_{j}}{r_{i}+r_{j}}\). All coefficients are then treated as latent variables \(w_{ij1}\), \(w_{ij2}\), and \(w_{ij3}\), and a linear combination of the three terms replaces \(ACT_{i,j}\) (Eq. 3a); we call it \(sACT_{i,j}\), a supervised \(ACT_{i,j}\). The SAS term is likewise changed to a supervised form (sSAS) so that the latent weights can be estimated from the target variable (i.e., the sample label) (Eq. 3b). The original observations are transformed into a vector x of nonlinear combinations of gene expression (Eq. 3c), and their weights are defined by the model parameter \(\theta\) (Eq. 3d). Based on this, the logistic function \(q_{k}(x)\) is defined to represent the estimated probability that observation x has target label k (Eq. 3e). Finally, a log-likelihood function \(l(\theta _{k})\) is defined so that the model parameter \(\theta _{k}\) can be estimated by maximizing \(l(\theta _{k})\) (Eqs. 3f and 3g); a minimal code sketch follows the equations below.

$$\begin{aligned}&sACT_{i,j} = w_{i j 1}\left( \frac{r_{i}^2}{r_{i}+r_{j}}\right) + w_{i j 2}\left( \frac{r_{j}^2}{r_{i}+r_{j}}\right) + w_{i j 3}\left( \frac{r_{i}r_{j}}{r_{i}+r_{j}}\right) \end{aligned}$$

(3a)

$$\begin{aligned}&sSAS = \ln {\frac{q}{1-q}} = \sum _{i}\sum _{j}sACT_{i,j} \end{aligned}$$

(3b)

$$\begin{aligned}&x = \left\langle \frac{r_{1}^2}{r_{1}+r_{2}}, \frac{r_{2}^2}{r_{1}+r_{2}}, \frac{r_{1}r_{2}}{r_{1}+r_{2}}, \ldots \right\rangle \end{aligned}$$

(3c)

$$\begin{aligned}&\theta = \langle w_{ij1}, w_{ij2}, w_{ij3}, \ldots \rangle \end{aligned}$$

(3d)

$$\begin{aligned}&q_{k}(x) = \frac{1}{1+e^{-\theta _{k}^{T}x}} \end{aligned}$$

(3e)

$$\begin{aligned}&l(\theta _{k}) = \sum _{m}y_{mk}\ln {q_{k}(x_{m})} + (1-y_{mk})\ln (1-q_{k}(x_{m})) \end{aligned}$$

(3f)

$$\begin{aligned}&y_{mk} = {\left\{ \begin{array}{ll} 1 &{} \text {if the label of sample m is k} \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

(3g)
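Since the log-likelihood in Eq. (3f) is the standard cross-entropy objective of logistic regression, the latent weights can be estimated with an off-the-shelf logistic regression on the feature vector x of Eq. (3c). The sketch below is a minimal illustration with synthetic data; the variable names are hypothetical, and scikit-learn's LogisticRegression adds an intercept and L2 regularization by default, a minor deviation from Eq. (3f).

import numpy as np
from sklearn.linear_model import LogisticRegression

def ssas_features(expr_matrix, edges, gene_index):
    # Build the feature vector x of Eq. (3c) for every sample.
    # expr_matrix: (n_samples, n_genes) TPM values; edges: (gene_i, gene_j) pairs
    # of the subnetwork; gene_index: gene -> column index (all placeholders).
    feats = []
    for i, j in edges:
        r_i = expr_matrix[:, gene_index[i]]
        r_j = expr_matrix[:, gene_index[j]]
        denom = r_i + r_j + 1e-12                         # avoid division by zero
        feats.extend([r_i ** 2 / denom, r_j ** 2 / denom, r_i * r_j / denom])
    return np.column_stack(feats)

# Toy usage with synthetic data (two genes, one edge), purely illustrative
rng = np.random.default_rng(0)
expr_matrix = rng.gamma(2.0, 10.0, size=(20, 2))          # fake TPM values
y = rng.integers(0, 2, size=20)                           # fake binary sample labels
X_sub = ssas_features(expr_matrix, [("GENE_A", "GENE_B")], {"GENE_A": 0, "GENE_B": 1})
clf = LogisticRegression(max_iter=1000).fit(X_sub, y)     # maximizes Eq. (3f)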

The representation of a subnetwork t for a sample m is defined in Eqs. (4a) and (4b). In the case of multiple classes, the model parameter for each class \(\theta _{k}\) is estimated independently in a one-versus-the-rest manner and then consolidated into \(p_{mtk}\) as in Eq. (4b). In our scheme, therefore, the subnetwork-level representation of a sample is the probability distribution estimated from the given data at each subnetwork by a logistic regression model (Eq. 4c and Fig. 3). For example, if the RNA-seq samples have k classes of labels, each RNA-seq sample will have a vector of dimension \(279 \times k\), since 279 subnetworks were used in this study. scikit-learn-0.19.2 was used for this step23.

$$\begin{aligned}&q_{mtk} = \frac{1}{1+e^{-\theta _{k}^{T}x_{m}}} \end{aligned}$$

(4a)

$$\begin{aligned}&p_{mtk} = \frac{q_{mtk}}{\sum _{r}{q_{mtr}}} \end{aligned}$$

(4b)

$$\begin{aligned}&Sub_{mt} = \langle p_{mt1}, p_{mt2}, \ldots , p_{mtk} \rangle \end{aligned}$$

(4c)
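The per-subnetwork representation of Eqs. (4a)-(4c) can then be assembled as in the sketch below, assuming one fitted binary (one-versus-the-rest) logistic model per class for each subnetwork; the function and variable names are illustrative.

import numpy as np

def subnetwork_representation(binary_models, X_sub):
    # binary_models: one fitted one-vs-rest logistic model per class k;
    # X_sub: (n_samples, n_features) sSAS features of one subnetwork (placeholders)
    q = np.column_stack([m.predict_proba(X_sub)[:, 1] for m in binary_models])  # Eq. (4a)
    p = q / q.sum(axis=1, keepdims=True)                                        # Eq. (4b)
    return p                                              # rows are Sub_mt, Eq. (4c)

# Concatenating Sub_mt over all subnetworks yields a vector of length
# (number of subnetworks) * k per sample, i.e. 279 * k in this study:
# full_repr = np.hstack([subnetwork_representation(models[t], feats[t]) for t in range(279)])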

Integration of subnetwork level representation into master-level decision

The remaining problem is to incorporate the constructed subnetwork-level representations into a single master-level decision. We solved this ensemble learning problem using an attention layer built into a neural network. The attention mechanism originates from the sequence-to-sequence mapping problem in machine translation25. In the work of Bahdanau et al., the attention layer was inserted between the encoder and decoder layers to act as memory25; in other words, it is trained to dictate which context to focus on at a specific point in time and which to ignore. The attention mechanism has been applied to various tasks and has been shown to exhibit excellent performance26, 27. Choi et al.27 also suggested that the attention mechanism can make models more explainable. In our model, the attention layer acts as a master-level decision agent trained to decide which subnetwork to focus on, based on the certainty computed from each subnetwork-level representation (Fig. S1). As shown in Eqs. (5a) and (5b), the attention layer takes the negative Shannon entropy28 of each subnetwork-level representation. Since the entropy of a probability distribution represents its level of uncertainty, the negative entropy quantifies how certain each subnetwork-level predictor is for the given classification task.

The negative entropy values of each subnetwork are concatenated into a single vector (Eq. 5b) and passed to the next fully connected (FC) layer. Softmax activation is then applied, resulting in a proportional distribution that constitutes the attention layer (Eqs. 6a and 6b). Hence, the learnable parameter here is the matrix W (Eq. 6a), which is trained to decide which subnetworks to focus on based on the vector C. The actual decision-making process is described in Eqs. (6c) and (6d) (Fig. S2); it is essentially the weighted sum of the subnetwork-level representations for each class, where the weights are learned by the attention mechanism. The prioritization of features by the model is instance-wise, so each sample receives different attention values depending on its subnetwork-level representation. The Python libraries tensorflow-1.10.029 and keras-2.2.230 were used for this step; a minimal sketch follows the equations below.

$$\begin{aligned}&c_{t} = \sum _{k}{p_{tk}\ln {p_{tk}}} \end{aligned}$$

(5a)

$$\begin{aligned}&C = \langle c_{1}, c_{2}, \ldots , c_{t} \rangle \end{aligned}$$

(5b)

$$\begin{aligned}&H = \text {softmax}(WC^{T}) \end{aligned}$$

(6a)

$$\begin{aligned}&H = \langle h_{1}, h_{2}, \ldots , h_{t} \rangle \end{aligned}$$

(6b)

$$\begin{aligned}&d_{k} = \sum _{t}{h_{t}*p_{tk}} \end{aligned}$$

(6c)

$$\begin{aligned}&f_{k} = \frac{e^{d_{k}}}{\sum _{s}{e^{d_{s}}}} \end{aligned}$$

(6d)
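The attention-based integration of Eqs. (5a)-(6d) can be sketched with the Keras functional API as follows. This is a minimal illustration written against TensorFlow 2 and its bundled Keras rather than the tensorflow-1.10.0/keras-2.2.2 versions cited above; the layer names, the choice of K = 2 classes, and the use of Lambda layers are assumptions made for the sketch, not the authors' implementation.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

T, K = 279, 2  # number of subnetworks and classes; K = 2 is an assumption

p_in = keras.Input(shape=(T, K), name="subnetwork_probs")            # p_mtk
# Eqs. (5a)-(5b): negative Shannon entropy of each subnetwork's distribution
c = layers.Lambda(lambda p: tf.reduce_sum(p * tf.math.log(p + 1e-12), axis=-1),
                  name="neg_entropy")(p_in)                          # shape (batch, T)
# Eqs. (6a)-(6b): fully connected layer followed by softmax gives the attention weights H
h = layers.Dense(T, use_bias=False, name="attention_W")(c)
h = layers.Softmax(name="attention_weights")(h)                      # shape (batch, T)
# Eq. (6c): attention-weighted sum of the subnetwork-level representations per class
d = layers.Lambda(lambda args: tf.einsum("bt,btk->bk", args[0], args[1]),
                  name="weighted_sum")([h, p_in])                    # shape (batch, K)
# Eq. (6d): softmax over classes yields the master-level decision
f = layers.Softmax(name="master_decision")(d)

model = keras.Model(p_in, f)
model.compile(optimizer="adam", loss="categorical_crossentropy")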
