Biomarker selection and a prospective metabolite-based machine learning diagnostic for Lyme disease

LCMS analysis

Serum sample LCMS data acquired previously was utilized24. Detailed methods for metabolite extraction and LCMS analysis can be found in the cited publication.

Data partitioning

Early Lyme disease and healthy control serum samples, previously analyzed by LCMS as two separate batches, were utilized in this study24. These two independently processed batches formed our 118 training samples and 118 sequestered test samples, respectively. Samples were categorized by health state label: EDL, ELL, healthy control non-endemic (HCN), healthy control endemic site 1 (HCE1), and healthy control endemic site 2 (HCE2). Training samples comprised 30 EDL, 30 ELL, 28 HCN, and 30 HCE1. Test samples comprised 40 EDL, 40 ELL, 30 HCE1, and 8 HCE2. We label a sample as Lyme disease if it belongs to the ELL or EDL group, and as healthy if it belongs to the HCE1, HCN, or HCE2 group.

Untargeted and targeted peak identification

For untargeted feature selection, raw data files were converted into mzML format using MSConvert (ProteoWizard) and then processed using XCMS (3.6.2) in R (3.6.1)25,34. Peak detection was performed using the centWave algorithm35. Default parameters were used except for ppm = 30, peakwidth = c(10,30), and noise = 2000. Peak alignment by retention time was carried out using the obiwarp method with binSize = 0.6, specifying centerSample as the sample measured in the middle of the LCMS run36. Quality control included manual inspection of plots of total ion counts and of specified peaks by retention time. Peaks were grouped using the peak density method with default parameters except bw = 5 and minfrac = 0.425.

Features selected by kFFS were manually inspected to assess peak quality, whether the monoisotopic peak had been chosen, possible adducts, and the feature's presence in both runs. After manual evaluation, good-quality features were targeted in both the training and test sets using Skyline with the suggested settings28,37. Each peak was manually evaluated to ensure correct integration before exporting peak area values.

Cleaning, imputing, and normalizing

As a first step, any metabolite missing in more than 80% of samples within either class (healthy or Lyme disease) was to be removed. No features in our list met this criterion, so none were removed. Missing values were then imputed by the KNN algorithm38. KNN imputes a missing value in a sample by finding the sample's k nearest neighbors and taking the mean of the feature across those neighbors. Wahl et al. conclude that KNN imputation performs well across several evaluation schemes while requiring fewer computational resources39. Modified versions of the KNN imputation algorithm, such as normalized No-Skip KNN (NS-KNN), have been proposed and may outperform the standard algorithm on real datasets when a significant portion of the missing data is of the Missing Not at Random (MNAR) type38. For this application we used \(k=5\) and implemented the algorithm via the Python package missingpy.
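The neighbor-mean imputation described above can be sketched as follows. We used missingpy in the paper; scikit-learn's KNNImputer implements the same idea and is used here purely for illustration, on a hypothetical toy matrix.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy peak-area matrix: rows are samples, columns are features,
# with one missing measurement marked as NaN (illustrative data only).
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 1.9, 3.0],
    [0.9, 2.1, 3.2],
    [5.0, 6.0, 7.0],
])

# Find each sample's k nearest neighbors on the observed features and
# replace the missing value with the mean of those neighbors' values.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```

Here the first sample's two nearest neighbors are the second and third rows, so the missing value is imputed as their mean in that column.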

Once imputed, the samples were transformed by the \(\log_2\) transform, standardization, or median-fold change normalization, or left as raw peak areas40. Standardization shifts and scales each feature so that its mean is 0 and its variance is 1 across samples. The transformation scheme was chosen to maximize the classification accuracy of the SSVM model on the training data, among other candidate schemes such as quantile normalization26; see the Supplemental_Data directory in our GitHub repository for our complete transformation/imputation experiment29.

Sparse support vector machines

We classify samples into two classes, healthy, \(C_-\), and Lyme disease, \(C_+\), using a variation of SVM called SSVM8,41. Each sample \(\mathbf{x}\) can be viewed as a vector in \(\mathbb{R}^n\), where n is the number of features/biomarkers/measurements. SVM classifies samples by constructing a hyperplane \(\mathbf{H}\subset \mathbb{R}^n\) which best separates the training samples into \(C_-\) and \(C_+\). SSVM alters SVM by finding a hyperplane which, in addition to separating the two classes, uses relatively few features compared to the entire feature space. Explicitly, we solve the convex optimization problem

$$\min_{\mathbf{w},\,\boldsymbol{\xi},\,b}\; \lVert \mathbf{w}\rVert_1 + C\,\mathbf{e}^T\boldsymbol{\xi} \quad \text{subject to} \quad \mathbf{Y}\left(\mathbf{X}\mathbf{w}-b\,\mathbf{e}\right)+\boldsymbol{\xi} \ge \mathbf{e}, \qquad \boldsymbol{\xi}\ge \mathbf{0}, \tag{1}$$

where \(\mathbf{X}\) is the \(m\times n\) matrix whose ith row \(\mathbf{X}^{(i)}\in \mathbb{R}^n\) is the feature vector of the ith sample, \(\mathbf{Y}\) is the \(m\times m\) diagonal matrix whose entries are \(+1\) or \(-1\) according to the class labels of the samples, \(\boldsymbol{\xi}\in \mathbb{R}^m\) is the vector of penalties for samples violating the hyperplane boundary, C is a tuning parameter balancing the misclassification rate against sparsity, \(\mathbf{e}\) is the vector of all 1's of the appropriate dimension, \(\mathbf{w}\) is the normal vector to the hyperplane \(\mathbf{H}\), and b is the scalar affine shift of the hyperplane \(\mathbf{H}\). It is known that minimizing the 1-norm of \(\mathbf{w}\) promotes sparsity in the components of \(\mathbf{w}\)42,43. That is, \(\mathbf{w}\) will have relatively few large components while its many other components will be near zero; see Fig. 2. It appears to be a special feature of SSVM that there is an abrupt drop in feature size, often on the order of a 100–1000 factor reduction; see Fig. 2. Features corresponding to large components of \(\mathbf{w}\) are chosen to build a sparse model. We solve (1) by first transforming the convex optimization problem into a linear program via a simple substitution and then applying a primal-dual interior point method using our own in-house Python package calcom, provided in our GitHub repository29,44,45.
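The substitution mentioned above writes \(\mathbf{w}=\mathbf{u}-\mathbf{v}\) with \(\mathbf{u},\mathbf{v}\ge 0\), so that \(\lVert\mathbf{w}\rVert_1 = \mathbf{e}^T(\mathbf{u}+\mathbf{v})\) at the optimum and (1) becomes a linear program. A minimal sketch using SciPy's LP solver is shown below; this is an illustrative stand-in, not the calcom implementation, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def ssvm_fit(X, y, C=1.0):
    """Sparse (L1-regularized) SVM as a linear program.

    Solves   min ||w||_1 + C * sum(xi)
             s.t. y_i (x_i . w - b) + xi_i >= 1,  xi >= 0
    via the substitution w = u - v with u, v >= 0.
    """
    m, n = X.shape
    # Variable vector z = [u (n), v (n), b (1), xi (m)].
    c = np.concatenate([np.ones(2 * n), [0.0], C * np.ones(m)])
    # Margin constraints rewritten as A_ub @ z <= -1.
    Yx = y[:, None] * X
    A_ub = np.hstack([-Yx, Yx, y[:, None], -np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    return z[:n] - z[n:2 * n], z[2 * n]   # (w, b)
```

On toy data where only one feature separates the classes, the recovered \(\mathbf{w}\) concentrates its weight on that feature, illustrating the sparsity induced by the 1-norm.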

k-fold feature selection (kFFS)

We selected features/biomarkers using a new method: kFFS. First, we randomly partitioned the training samples into k non-overlapping, equally sized parts. We then chose \(k-1\) parts as a training set for an SSVM classifier and validated the classifier on the withheld part. There are k ways to choose \(k-1\) parts from k parts, yielding a k-fold experiment (k-fold cross-validation). For each fold we extracted features, ordered them by the absolute value of their weight in the SSVM model, took the top \(p\le n\) features, collected them into a common list, and ordered the list by feature occurrence across the k folds; see Fig. 5a. For the results of this paper we used \(k=5\) and \(p=5\). Using multiple folds for feature selection brings in features from sub-populations of the data that may not be captured by using the training set as a whole. Ordering by frequency shows which of those features generalize to the entire training set.
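The kFFS loop can be sketched as follows. An L1-penalized LinearSVC stands in for SSVM here, and the function name is ours; the structure (k folds, top-p features per fold, ranking by occurrence) mirrors the description above.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def kffs(X, y, k=5, p=5):
    """k-fold feature selection: train a sparse linear classifier on each
    fold's training part, take the top-p features by absolute weight, and
    rank features by how often they recur across the k folds."""
    counts = Counter()
    for train_idx, _ in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000)
        clf.fit(X[train_idx], y[train_idx])
        top = np.argsort(-np.abs(clf.coef_[0]))[:p]   # top-p by |weight|
        counts.update(top.tolist())
    return counts.most_common()   # [(feature_index, occurrence), ...]
```

A feature appearing in all k folds' top-p lists is the strongest candidate to generalize across the whole training set.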

Figure 6: Fivefold classification accuracy of the SSVM model for different values of the hyper-parameter C. The solid line indicates the mean accuracy across the five folds while the shaded region indicates 1 standard deviation of the accuracy.

Batch correction

For batch correction we used an IFR technique, which we simply call IFR, to remove features discriminating between the HCN and HCE1 control groups in the training set23. Specifically, we perform kFFS (\(k=2\), \(p=5\)) between the training HCE1 and HCN groups, obtain a set of discriminatory features, remove these features, and repeat the process until the mean 2-fold cross-validation accuracy of the SSVM classifier drops below 60%; see Algorithm 1.
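The IFR loop can be sketched as below. Again an L1-penalized LinearSVC stands in for SSVM, and the function name and threshold parameterization are ours; the logic (discriminate the two control groups, remove the top features, stop when batch discrimination falls below 60%) follows the description above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def ifr_batch_correct(X, batch_labels, p=5, acc_threshold=0.60):
    """Iterative feature removal for batch correction.

    Repeatedly trains a sparse linear classifier to discriminate the two
    batches, removes the top-p features by absolute weight, and stops once
    the mean 2-fold cross-validation accuracy drops below the threshold.
    Returns the indices of the features kept."""
    keep = np.arange(X.shape[1])
    while keep.size > p:
        clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000)
        acc = cross_val_score(clf, X[:, keep], batch_labels, cv=2).mean()
        if acc < acc_threshold:
            break                      # batches no longer distinguishable
        clf.fit(X[:, keep], batch_labels)
        top = np.argsort(-np.abs(clf.coef_[0]))[:p]
        keep = np.delete(keep, top)    # drop the batch-discriminating features
    return keep
```

On data with a single strong batch-effect feature, that feature is removed in the first pass and the loop terminates once only noise remains.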

To evaluate the efficacy of IFR for batch correction we utilized the visualization tool UMAP. UMAP attempts to embed data into a lower dimensional space so that it is approximately uniformly distributed and its local geometry is preserved27. UMAP does so by representing each k-neighborhood of a sample as a weighted graph, “gluing” these graphs together over all samples, and then approximating the resulting global structure in a lower dimensional space.

If a point has most of its neighbors from the same class or batch, then the point will be pulled in that direction in the embedding, making UMAP a useful tool for visualizing batch effects. We used the Python package umap-learn with parameters min_dist = 0.1, n_neighbors = 15, and n_components = 2 for our UMAP visualizations. See Tran et al. for UMAP applied to several genomics data sets46.

Algorithm 1: Iterative feature removal (IFR) for batch correction.

Classification

Once we removed features for batch effects, we restricted the training data to the remaining features and then either \(\log_2\) transformed, standardized, median-fold change normalized, or did not transform the training data. Once transformed, we imputed the training samples using the KNN algorithm. We performed a fivefold cross-validation experiment with an SSVM classifier while varying the hyper-parameter C in Eq. (1). C was chosen to be as small as possible (promoting sparsity) while simultaneously yielding high accuracy and small variance; see Fig. 6.
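The C-selection rule (smallest C whose cross-validated accuracy is close to the best observed) can be sketched as follows. The function name, grid, and tolerance are illustrative choices of ours, and an L1-penalized LinearSVC again stands in for SSVM.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def choose_C(X, y, C_grid=(0.01, 0.1, 1.0, 10.0), tol=0.02, k=5):
    """Return the smallest C whose mean k-fold CV accuracy is within
    `tol` of the best mean accuracy over the grid (smaller C -> sparser)."""
    means = []
    for C in C_grid:
        clf = LinearSVC(penalty="l1", dual=False, C=C, max_iter=5000)
        means.append(cross_val_score(clf, X, y, cv=k).mean())
    best = max(means)
    for C, m in zip(C_grid, means):
        if m >= best - tol:
            return C, m
```

Preferring the smallest adequate C trades a negligible amount of accuracy for a sparser, more interpretable biomarker panel.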

We classified test samples by first restricting both the training data and the test data to the selected features found by the methods above; we restricted the samples by targeting these features in Skyline. Once these new feature sets were obtained, they were \(\log_2\) transformed, and an SSVM classifier was trained and tuned on all of the training samples. We then evaluated the performance of the classifier on the sequestered test samples via a confusion matrix; see Fig. 5b for a diagram of the classification pipeline.

Metabolite class validation

Confirmation of the chemical structure of selected molecular features (MF) was performed by LCMS/MS. MS/MS spectra were manually evaluated using MassHunter Qualitative software (Agilent Technologies)47. MS/MS spectra were compared with available spectra in the METLIN and NIST databases. The level of structural identification followed the refined Metabolomics Standards Initiative guidelines proposed by Schymanski et al.31.