
A machine learning framework for predicting drug–drug interactions

Data

Known drug–drug interactions and drug–gene interactions are extracted from DrugBank [27]. Because we use drug target profiles to represent drugs and drug pairs, only drugs known to target at least one human gene are studied in this work. In total, we extract 6066 drugs and 2940 targeted human genes from DrugBank [27]. There are 915,413 drug–drug interactions and 23,169 drug–gene interactions associated with these drugs. As drug–drug interaction prediction is essentially a binary supervised learning problem, we use the 915,413 drug pairs as the positive training data and randomly sample another 915,413 drug pairs from the 6066 drugs as the negative training data. The two classes are ensured to have no overlap.
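The negative sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes drug pairs are unordered and that any pair absent from the known interactions may serve as a negative. The function name `sample_negative_pairs` and the toy drug identifiers are hypothetical.

```python
import random

def sample_negative_pairs(drugs, positive_pairs, n_samples, seed=0):
    """Randomly draw unordered drug pairs that are not known interactions.

    Assumption: a pair missing from `positive_pairs` is treated as negative.
    """
    rng = random.Random(seed)
    # Store pairs as frozensets so (a, b) and (b, a) count as the same pair.
    positives = {frozenset(p) for p in positive_pairs}
    max_pairs = len(drugs) * (len(drugs) - 1) // 2
    if n_samples > max_pairs - len(positives):
        raise ValueError("not enough non-interacting pairs to sample")
    negatives = set()
    while len(negatives) < n_samples:
        pair = frozenset(rng.sample(drugs, 2))
        if pair not in positives:  # guarantees no overlap with the positives
            negatives.add(pair)
    return negatives

drugs = ["D1", "D2", "D3", "D4", "D5"]  # toy identifiers
positive_pairs = [("D1", "D2"), ("D2", "D3"), ("D4", "D5")]
negative_pairs = sample_negative_pairs(drugs, positive_pairs, n_samples=3)
```

Rejection sampling of this kind is cheap when the positive pairs are a small fraction of all possible pairs, as is the case for 915,413 interactions among 6066 drugs.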

The comprehensive database [28] provides a large repository of drug–drug interactions from experiments and text mining, some of which come from scattered databases such as DrugBank [27], KEGG [29], OSCAR [30] (https://oscar-emr.com/), VA NDF-RT [31] and so on. After removing the drug–drug interactions that already exist in DrugBank [27], we obtain 13 external datasets in total as positive independent test data, the largest of which contains 8188 drug–drug interactions from KEGG [29]. To estimate the risk of model bias, we randomly sample 8188 drug pairs as negative independent test data. These drug pairs do not overlap with the training data or the positive independent test data.

To quantitatively estimate the intensity with which two drugs perturb each other’s efficacy, we build comprehensive physical protein–protein interaction (PPI) networks from existing databases (HPRD [32], BioGRID [33], IntAct [34], HitPredict [35]). In total, we obtain 171,249 physical PPIs. From NetPath [36], we obtain 27 immune signaling pathways, with IL1–IL11 merged into one pathway for simplicity. From Reactome [37], we obtain 1846 human signaling pathways.

Drug target profile-based feature construction

Drugs act on their target genes to produce desirable therapeutic efficacies. In most cases, drug perturbations can propagate to other genes through PPI networks or signaling pathways, and thereby unintentionally produce synergy or antagonism with drugs that target the indirectly affected genes. In this study, we represent drugs and drug pairs using drug target profiles only. For each drug \(d_i\) in the DDI-associated drug set \(D\), its targeted human gene set is denoted as \(G_{d_i}\). The entire target gene set is defined as follows.

$$G=\bigcup_{d_i \in D} G_{d_i}$$

(1)

For each drug \(d_i\), the drug target profile is formally defined as follows.

$$V_{d_i}[g] = \begin{cases} 1, & g \in G_{d_i} \wedge g \in G \\ 0, & g \notin G_{d_i} \wedge g \in G \end{cases}$$

(2)

Then the drug target profile of a drug pair \((d_i, d_j)\) is defined by combining the target profiles of \(d_i\) and \(d_j\) as follows.

$$V_{(d_i,d_j)}[g]=V_{d_i}[g]+V_{d_j}[g],\quad g\in G$$

(3)

The genes \(g \notin G\) are discarded. This simple feature representation intuitively reveals the co-occurrence patterns of the genes that a drug or drug pair targets. As an example, assume that the entire gene set is \(G=\{TF, ALB, XDH, ORM1, ORM2\}\), that the drug Patisiran (DB14582) targets the genes {ALB, ORM1, ORM2} and that the drug Bismuth Subsalicylate (DB01294) targets the genes {ALB, TF}. Then Patisiran is represented by the vector [0, 1, 0, 1, 1] and Bismuth Subsalicylate by the vector [1, 1, 0, 0, 0]. The drug pair (Patisiran, Bismuth Subsalicylate) is represented by the combined vector [1, 2, 0, 1, 1], which is used as the input of the base learner. All data, including the training set and the test set, share the same feature descriptors. Note that all target genes are used to represent drugs and drug pairs without giving priority or importance to individual features, because the known target genes are very sparse and many target genes are unknown. If feature selection with importance weights were conducted, many drugs and drug pairs would be represented by null vectors.
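The feature construction of Formulas (2) and (3) can be reproduced in a few lines. The sketch below uses the Patisiran and Bismuth Subsalicylate example from the text; the function names `target_profile` and `pair_profile` are hypothetical, and the ordering of the gene list is an assumption chosen to match the vectors given above.

```python
def target_profile(targets, gene_order):
    """Binary target profile of one drug over the ordered gene set G (Formula 2)."""
    return [1 if g in targets else 0 for g in gene_order]

def pair_profile(targets_i, targets_j, gene_order):
    """Element-wise sum of the two drug target profiles (Formula 3)."""
    v_i = target_profile(targets_i, gene_order)
    v_j = target_profile(targets_j, gene_order)
    return [a + b for a, b in zip(v_i, v_j)]

# Gene order assumed so that the vectors match the example in the text.
gene_order = ["TF", "ALB", "XDH", "ORM1", "ORM2"]
patisiran = {"ALB", "ORM1", "ORM2"}
bismuth_subsalicylate = {"ALB", "TF"}

print(target_profile(patisiran, gene_order))              # [0, 1, 0, 1, 1]
print(target_profile(bismuth_subsalicylate, gene_order))  # [1, 1, 0, 0, 0]
print(pair_profile(patisiran, bismuth_subsalicylate, gene_order))  # [1, 2, 0, 1, 1]
```

An entry of 2 in the pair vector marks a gene targeted by both drugs, 1 a gene targeted by exactly one of them, and 0 a gene targeted by neither.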

L2-regularized logistic regression as base learner

L2-regularized logistic regression [38], well known for fitting large training data quickly and for penalizing potential noise and overtraining, is adopted as the base learner in this study. Given the training data x and labels y, with each instance \(x_i\) corresponding to a class label \(y_i\), i.e., \((x_i, y_i),\ i=1,2,\ldots,l;\ x_i \in R^n;\ y_i \in \{-1,+1\}\), the decision function of logistic regression is defined as \(f(x)=\frac{1}{1+\exp(-y\omega^T x)}\). L2-regularized logistic regression derives the weight vector \(\omega\) by solving the optimization problem

$$\min_{\omega} \frac{1}{2}\omega^{T}\omega + C\sum_{i=1}^{l} \log\left(1 + e^{-y_i \omega^{T} x_i}\right)$$

(4)

where \(C\) denotes the penalty parameter, which weights the second (loss) term against the first (regularization) term to limit the effect of noise, outliers and overtraining. The optimization problem (4) is solved via its dual form

$$\begin{aligned} & \min_{\alpha} \frac{1}{2}\alpha^{T} Q\alpha + \sum_{i:\alpha_i > 0} \alpha_i \log\alpha_i + \sum_{i:\alpha_i < C} (C - \alpha_i)\log(C - \alpha_i) - \sum_{i=1}^{l} C\log C \\ & \text{s.t.}\quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l \end{aligned}$$

(5)

where \(\alpha_i\) denotes the Lagrange multiplier and \(Q_{ij}=y_i y_j x_i^T x_j\). To simplify parameter tuning, the penalty parameter \(C\) defined in Formula (4) is chosen from the set \(\{2^i \mid -16 \le i \le 16,\ i \in I\}\), where \(I\) denotes the set of integers.
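To make Formula (4) and the candidate grid for C concrete, the sketch below evaluates the primal objective and minimizes it with plain gradient descent on toy data. This is an illustration only: it works on the primal problem rather than the dual form (5), and the function names, step size, iteration count and toy drug-pair vectors are all assumptions rather than details from the original work.

```python
import math

def primal_objective(w, X, y, C):
    """Formula (4): 0.5 * w^T w + C * sum_i log(1 + exp(-y_i * w^T x_i))."""
    reg = 0.5 * sum(wk * wk for wk in w)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wk * xk for wk, xk in zip(w, x_i))
        loss += math.log1p(math.exp(-margin))
    return reg + C * loss

def gradient(w, X, y, C):
    """Gradient of Formula (4) with respect to w."""
    grad = list(w)  # the gradient of the regularization term is w itself
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wk * xk for wk, xk in zip(w, x_i))
        coeff = -C * y_i / (1.0 + math.exp(margin))
        for k, xk in enumerate(x_i):
            grad[k] += coeff * xk
    return grad

def fit(X, y, C, step=0.1, n_iter=200):
    """Plain gradient descent; adequate for toy data only (assumed settings)."""
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        g = gradient(w, X, y, C)
        w = [wk - step * gk for wk, gk in zip(w, g)]
    return w

# Candidate values for C: 2^i for every integer i in [-16, 16].
c_grid = [2.0 ** i for i in range(-16, 17)]

# Toy drug-pair vectors with labels in {-1, +1} (hypothetical data).
X = [[1, 2, 0, 1, 1], [0, 1, 1, 0, 0], [2, 0, 0, 0, 1], [0, 0, 1, 1, 0]]
y = [+1, -1, +1, -1]
w = fit(X, y, C=1.0)
```

In practice each value in `c_grid` would be evaluated, for example by cross-validation, and a dedicated solver would replace the fixed-step gradient descent, which is not stable for the larger values of C.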

Metrics for model performance and intensity of drug–drug interactions

Metrics for binary classification

Frequently used performance metrics for supervised classification include the area under the Receiver Operating Characteristic curve (ROC-AUC), sensitivity (SE), precision (PR), Matthews correlation coefficient (MCC), accuracy and F1 score. ROC-AUC is calculated from the outputs of the decision function \(f(x)\), whereas all the other metrics are calculated from the confusion matrix \(M\). The element \(M_{i,j}\) records the number of instances of class \(i\) that are classified as class \(j\). From \(M\), we first define several intermediate variables in Formula (6). We then define the performance metrics \(PR_l\), \(SE_l\) and \(MCC_l\) for each class label in Formula (7). The overall accuracy and MCC are defined by Formula (8).

$$\begin{aligned} & p_l = M_{l,l},\quad q_l = \sum_{i=1,i\ne l}^{L}\sum_{j=1,j\ne l}^{L} M_{i,j},\quad r_l = \sum_{i=1,i\ne l}^{L} M_{i,l},\quad s_l = \sum_{j=1,j\ne l}^{L} M_{l,j} \\ & p = \sum_{l=1}^{L} p_l,\quad q = \sum_{l=1}^{L} q_l,\quad r = \sum_{l=1}^{L} r_l,\quad s = \sum_{l=1}^{L} s_l \end{aligned}$$

(6)

$$\begin{aligned} & PR_l = \frac{p_l}{p_l + r_l},\quad l = 1,2,\ldots,L \\ & SE_l = \frac{p_l}{p_l + s_l},\quad l = 1,2,\ldots,L \\ & MCC_l = \frac{p_l q_l - r_l s_l}{\sqrt{(p_l + r_l)(p_l + s_l)(q_l + r_l)(q_l + s_l)}},\quad l = 1,2,\ldots,L \end{aligned}$$

(7)

$$\begin{aligned} & Acc = \frac{\sum_{l=1}^{L} M_{l,l}}{\sum_{i=1}^{L}\sum_{j=1}^{L} M_{i,j}} \\ & MCC = \frac{pq - rs}{\sqrt{(p + r)(p + s)(q + r)(q + s)}} \end{aligned}$$

(8)

where \(L\) denotes the number of labels and equals 2 in this study. The F1 score is defined as follows.

$$F1\ \text{score} = \frac{2 \times PR_l \times SE_l}{PR_l + SE_l},\quad l = 1\ \text{denotes the positive class}$$

(9)
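Formulas (6) to (9) translate directly into code. The sketch below is a minimal illustration with a made-up confusion matrix; the function names are hypothetical, classes are indexed from 0 rather than 1, and returning 0 when a denominator vanishes is an assumption not stated in the text.

```python
import math

def intermediates(M, l):
    """p_l, q_l, r_l and s_l of Formula (6) for class index l."""
    L = len(M)
    p = M[l][l]
    q = sum(M[i][j] for i in range(L) for j in range(L) if i != l and j != l)
    r = sum(M[i][l] for i in range(L) if i != l)
    s = sum(M[l][j] for j in range(L) if j != l)
    return p, q, r, s

def mcc_from(p, q, r, s):
    denom = math.sqrt((p + r) * (p + s) * (q + r) * (q + s))
    return (p * q - r * s) / denom if denom else 0.0

def class_metrics(M, l):
    """PR_l, SE_l and MCC_l of Formula (7)."""
    p, q, r, s = intermediates(M, l)
    pr = p / (p + r) if p + r else 0.0
    se = p / (p + s) if p + s else 0.0
    return pr, se, mcc_from(p, q, r, s)

def overall_metrics(M):
    """Accuracy and overall MCC of Formula (8)."""
    L = len(M)
    acc = sum(M[l][l] for l in range(L)) / sum(sum(row) for row in M)
    # Sum each intermediate variable over all classes, as in Formula (6).
    p, q, r, s = (sum(v) for v in zip(*(intermediates(M, l) for l in range(L))))
    return acc, mcc_from(p, q, r, s)

def f1_score(pr, se):
    """Formula (9), computed for the positive class."""
    return 2 * pr * se / (pr + se) if pr + se else 0.0

# Toy confusion matrix: rows are true classes, columns are predicted classes,
# and index 0 is the positive class (made-up counts).
M = [[90, 10],
     [20, 80]]
pr, se, mcc_pos = class_metrics(M, 0)
acc, mcc = overall_metrics(M)
f1 = f1_score(pr, se)
```

For this toy matrix the positive class has 90 true positives, 20 false positives and 10 false negatives, so the precision is 90/110, the sensitivity is 90/100 and the accuracy is 170/200.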

Metrics for intensity of drug–drug interactions

Two drugs perturb each other’s efficacy through their targeted genes, and the association between the targeted genes determines the interaction intensity of the two drugs. If two drugs target common genes, or different genes connected via short paths in PPI networks, we deem it a close interaction; if two drugs target different genes connected via long paths in PPI networks or across signaling pathways, we deem it a distant interaction; otherwise, the two drugs may not interact. If two drugs target common genes, the interaction can be regarded as the most intensive, and the intensity can be measured by the Jaccard index. Given a drug pair \((d_i, d_j)\), the Jaccard index between the two drugs is defined as follows

$$Jaccard(d_i,d_j)=\frac{|G_{d_i}\cap G_{d_j}|}{|G_{d_i}\cup G_{d_j}|}$$

(10)

where \(G_{d_i}\) and \(G_{d_j}\) denote the target gene sets of \(d_i\) and \(d_j\), respectively. The larger the Jaccard index is, the more intensively the drugs interact. We use the threshold \(\xi\) to measure the level of interaction intensity. We further estimate the percentage of drug pairs whose interaction intensity is at least \(\xi\) as follows

$$Sim_U=\frac{|\{(d_i,d_j) \mid Jaccard(d_i,d_j)\ge \xi,\ (d_i,d_j)\in U\}|}{|U|}$$

(11)

where \(U\) denotes the set of drug–drug interactions. If \(\xi =\min_{\forall (d_i,d_j)\in U}\frac{1}{|G_{d_i}\cup G_{d_j}|}\), then \(Sim_U\) gives the percentage of drug pairs that target at least one common gene.
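Formulas (10) and (11) can be sketched as below. The target sets of Patisiran and Bismuth Subsalicylate follow the earlier example in the text; the third drug `drug_X`, the set U and the function names are hypothetical and serve only to illustrate the computation.

```python
def jaccard(targets_i, targets_j):
    """Jaccard index between two target gene sets (Formula 10)."""
    union = targets_i | targets_j
    return len(targets_i & targets_j) / len(union) if union else 0.0

def sim_u(pairs, targets, xi):
    """Fraction of pairs in U whose Jaccard index is at least xi (Formula 11)."""
    hits = sum(1 for d_i, d_j in pairs
               if jaccard(targets[d_i], targets[d_j]) >= xi)
    return hits / len(pairs)

targets = {
    "Patisiran": {"ALB", "ORM1", "ORM2"},
    "Bismuth Subsalicylate": {"ALB", "TF"},
    "drug_X": {"XDH"},  # hypothetical drug added for illustration
}
U = [("Patisiran", "Bismuth Subsalicylate"),
     ("Patisiran", "drug_X"),
     ("Bismuth Subsalicylate", "drug_X")]

# Smallest possible non-zero Jaccard index over U: with this threshold,
# sim_u counts exactly the pairs that share at least one target gene.
xi = min(1 / len(targets[a] | targets[b]) for a, b in U)
share_common_gene = sim_u(U, targets, xi)
```

Here only the pair (Patisiran, Bismuth Subsalicylate) shares a gene (ALB), so one of the three pairs passes the threshold.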

Two drugs may also interact through protein–protein interactions between their target genes, even if they do not target common genes. In these cases, we need to consider all the paths between two target genes in PPI networks. Given a gene pair \((g_i,g_j)\), we use a breadth-first graph search algorithm to search for all the paths between them in human PPI networks, denoted as \(P_{(g_i,g_j)}\). The lengths of the shortest path and the longest path are denoted as \(S_{(g_i,g_j)}\) and \(L_{(g_i,g_j)}\), respectively. We use the distance between target genes, in terms of path length in PPI networks, to define the distance between drugs. The average number of paths \(Avg_{(d_i,d_j)}\), the shortest distance \(S_{(d_i,d_j)}\) and the longest distance \(L_{(d_i,d_j)}\) between drugs \(d_i\) and \(d_j\) are defined as follows.

$$\begin{aligned} & Avg_{(d_i,d_j)} = \frac{\sum_{(g_i,g_j),\, g_i \in G_{d_i} \wedge g_j \in G_{d_j}} \left| P_{(g_i,g_j)} \right|}{\left| \{ (g_i,g_j) \mid g_i \in G_{d_i} \wedge g_j \in G_{d_j} \} \right|} \\ & S_{(d_i,d_j)} = \min_{\forall (g_i,g_j),\, g_i \in G_{d_i} \wedge g_j \in G_{d_j}} S_{(g_i,g_j)} \\ & L_{(d_i,d_j)} = \max_{\forall (g_i,g_j),\, g_i \in G_{d_i} \wedge g_j \in G_{d_j}} L_{(g_i,g_j)} \end{aligned}$$

(12)

\(Avg_{(d_i,d_j)}\) indicates the number of paths through which two drugs interact. \(S_{(d_i,d_j)}\) indicates the most economical and effective way in which two drugs interact. \(L_{(d_i,d_j)}\) indicates how far two drugs can alter each other’s efficacy, i.e., the action range between two drugs. These three metrics are proposed to measure the interaction intensity between two drugs. In particular, \(S_{(d_i,d_j)}=0\) indicates that drugs \(d_i\) and \(d_j\) target common genes, and \(Avg_{(d_i,d_j)}=0\) indicates that there are no paths between drugs \(d_i\) and \(d_j\), so the two drugs do not interact.
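The shortest distance \(S_{(d_i,d_j)}\) of Formula (12) can be sketched with a standard breadth-first search. The sketch covers only the shortest distance; enumerating all paths for \(Avg_{(d_i,d_j)}\) and \(L_{(d_i,d_j)}\) is not shown, since the text does not specify how paths are bounded. The adjacency-dictionary representation, the gene names and the function names are assumptions for illustration.

```python
from collections import deque

def shortest_path_length(ppi, source, target):
    """Breadth-first search for the shortest path length between two genes.

    `ppi` maps each gene to the set of its interaction partners.
    Returns None when the two genes are not connected.
    """
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        gene, dist = queue.popleft()
        for neighbor in ppi.get(gene, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def shortest_drug_distance(ppi, targets_i, targets_j):
    """Minimum shortest-path length over all target gene pairs of two drugs."""
    lengths = [shortest_path_length(ppi, g_i, g_j)
               for g_i in targets_i for g_j in targets_j]
    lengths = [d for d in lengths if d is not None]
    return min(lengths) if lengths else None

# Toy undirected PPI network: g1 - g2 - g3, with g4 isolated (hypothetical genes).
ppi = {"g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2"}, "g4": set()}

d_far = shortest_drug_distance(ppi, {"g1"}, {"g3"})           # two hops via g2
d_common = shortest_drug_distance(ppi, {"g1", "g2"}, {"g2"})  # shared target g2
d_none = shortest_drug_distance(ppi, {"g1"}, {"g4"})          # no connecting path
```

A distance of 0 corresponds to drugs with a common target gene, and `None` corresponds to the case in which no path connects any pair of target genes.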

Assume there are \(K\) signaling pathways in total. If a target gene \(g_j\) of drug \(d_i\) is located in a signaling pathway \(Sig_k\), denoted as \(g_j \in Sig_k\), the pathway set associated with \(g_j\) is defined as \(Sig_{g_j}=\{Sig_k \mid g_j \in Sig_k,\ k=1,2,\dots,K\}\). The signaling pathways targeted by \(d_i\) are defined as \(\bigcup_{g_j\in G_{d_i}} Sig_{g_j}\), and the common target signaling pathways between \(d_i\) and \(d_j\) are then defined as \(Sig_{(d_i,d_j)}=\bigcup_{g_j\in G_{d_i}} Sig_{g_j} \cap \bigcup_{g_j\in G_{d_j}} Sig_{g_j}\). The common target cellular processes between \(d_i\) and \(d_j\) are constructed in the same way, except that the signaling pathways are replaced with the GO terms of biological processes in the GOA database [39].
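The common target signaling pathways reduce to two set operations, a union over target genes followed by an intersection between drugs. The sketch below assumes pathways are given as a dictionary from pathway name to member genes; the pathway names, gene names and function names are hypothetical.

```python
def pathways_of_drug(targets, pathways):
    """Pathways that contain at least one target gene of the drug."""
    return {name for name, genes in pathways.items() if targets & genes}

def common_pathways(targets_i, targets_j, pathways):
    """Pathways targeted by both drugs."""
    return pathways_of_drug(targets_i, pathways) & pathways_of_drug(targets_j, pathways)

# Toy pathway membership (hypothetical names and genes).
pathways = {
    "pathway_1": {"g1", "g2"},
    "pathway_2": {"g3"},
    "pathway_3": {"g2", "g4"},
}

shared = common_pathways({"g1"}, {"g2", "g3"}, pathways)
print(shared)  # {'pathway_1'}
```

The same two functions apply to common cellular processes if the dictionary maps GO biological process terms to their annotated genes.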
