Preloader

Multiomics implicate gut microbiota in altered lipid and energy metabolism in Parkinson’s disease

Study subjects, clinical data, and sampling

This study was conducted in accordance with the Declaration of Helsinki and was approved by the ethics committee of the Hospital District of Helsinki and Uusimaa. All participants provided written informed consent.

The present study uses bacterial abundance data that was used in a previously published study by Aho et al.14. The study subjects and associated clinical data in the present study is similar to the data referred to previously in that study as “follow-up” timepoint, with minor changes specific to the present study: of the original 128 subjects, 61 control subjects and 63 PD patients were used in the present study, i.e. 3 control subjects and 1 PD subject less than in the original cohort (C75, C82, C123, and P119). This difference in sample numbers was due to insufficient metabolite data available to perform the study.

For DNA sequencing, the stool samples were collected at home by the study subjects using tubes containing DNA stabilizer (PSP Spin Stool DNA Plus Kit, STRATEC Molecular), which were stored for a maximum of three days in a freezer until transport. At the clinic, the received samples were stored at −80 °C and later transferred to the sequencing centre, where they were also stored at −80 °C until further processing14.

For serum samples, blood was drawn at the study visit and, after processing, immediately transferred to −20 °C and subsequently to −80 °C. Samples were shipped overnight on dry ice from Helsinki to Manchester for analysis.

Sample preparation and metabolomics methods

Metabolomics sample preparation

Untargeted metabolite profiling was performed on serum samples that were collected from participants and stored at −80 °C prior to analysis. Complementary coverage of metabolites was obtained using ultra-high performance liquid chromatography mass spectrometry (UHPLC-MS) and gas chromatography mass spectrometry (GC-MS). The procedures were adapted from the Dunn58 and Begley59 protocols as summarized here:

Metabolites were extracted from the serum samples by individually adding 400 µL of cold methanol to 200 µL of serum. This was followed by vortexing and centrifugation (17,500 × g) to yield a metabolite rich supernatant that was split into two aliquots and lyophilised for 12 h. Resultant metabolite pellet was stored at −80 °C until analysis. A pooled QC standard was also generated by combining 20 µL aliquots of each sample into a pooled vial with subsequent 200 µL aliquots from the pool, being extracted identical to each sample.

LC-MS method parameters

Processed metabolite pellets were defrosted at 4 °C and subsequently reconstituted in 100 µL of 95:5 H2O:MeOH (v/v). UHPLC-MS analysis was performed using an Accela UHPLC with cooled auto sampler system coupled to an electrospray LTQ-Orbitrap XL hybrid mass spectrometer (ThermoFisher, Bremen, Germany). Analysis was carried out in positive and negative ESI modes while samples were completely randomised to negate for any bias. The mobile phases and gradient elution profile were as tabulated in Table 5. From each sample vial, 10 µL of the extract was injected onto a Hypersil GOLD UHPLC C18 column (length 100 mm, diameter 2.1 mm, particle size 1.9 µm, Thermo-Fisher Ltd. Hemel Hempsted, UK) held at a constant temperature of 50 °C with a solvent flow rate of 400 µL min−1.

Table 5 Mobile phases and gradient elution profile.

Prior to analysis, LTQ-Orbitrap XL was calibrated according to manufacturer’s instructions using caffeine (20 µg mL−1), the tetrapeptide MRFA (1 µg ml−1) and Ultramark 1621 (0.001%) in an aqueous solution of acetonitrile (50%), methanol (25%) and acetic acid (1%). The data acquisition was performed in centroid mode with 30 K mass resolution and scan rate of 400 ms per scan. The masses were measured between 100 and 1200 m/z range with source gases set at sheath gas = 40 arb units, aux gas = 0 arb units, sweep gas = 5 arb units. The ESI source voltage was set to 3.5 V, and capillary ion transfer tube temperature set at 275 °C.

LC-MS data processing

Xcaliber software (v.3.0; Thermo-Fisher Ltd. Hemel Hempsted, UK) was used as the operating software for the Thermo LTQ-Orbitrap XL mass spectrometer. Data processing was initiated by the conversion of the standard UHPLC.raw files into the.mzML format using Proteowizard60. Subsequently, peak picking was carried out in RStudio61 using the XCMS62 package for data deconvolution (http://masspec.scripps.edu/xcms/xcms.php). The output data was a matrix of mass spectral features with accurate m/z and retention time pairs. Any missing values after deconvolution were replaced using k-nearest neighbours algorithm. Peaks with relative standard deviation of more than 20% within pooled QCs were removed. The remaining data was normalised with total ion count to account for injection to injection signal variations, log10 transformed, and pareto scaled prior to statistical analysis.

GC-MS method parameters

Analysis of serum samples was also carried out on a Agilent 7250 GC-Time-of-Flight mass spectrometer coupled to a Gerstel-MPS autosampler. Two step derivatization of metabolite pellets thawed at 4 °C was carried out as described in the Begley protocol59. The source temperature was set to 230 °C and quad temperature was at 150 °C. The total run time was 25 min for 10 µL sample injected each time. The sample was injected in split mode with 20:1 split ratio and split flow of 20 mL per minute. Agilent CP8944 VF-5 ms column was used for separation (30 m × 250 µm × 0.25 µm). With a 5 min solvent delay at the start of run, gradient elution method was used to elute and separate analytes from serum. The oven temperature was ramped from 70 °C to 300 °C with an increase of 14 °C per minute. At 300 °C the temperature was held for 4 min before dropping back to starting conditions.

GC-MS data processing

The raw data files were in Agilent.D format that were converted to.mzML format using Proteowizard60. Peak picking was carried out in RStudio61 using an in-house script for the eRah63 package for GC-MS peak picking and deconvolution. The peaks were annotated using eRah’s MassBank library. Any missing values after deconvolution were replaced using k-nearest neighbours algorithm. Peaks with relative standard deviation of more than 20% within pooled QCs were removed. The remaining data was normalised with total ion count to account for injection to injection signal variations, log10 transformed, and pareto scaled prior to statistical analysis.

All metabolites successfully annotated within both the LC-MS and GC-MS analysis were assessed and scored at MSI level 3 putative identification according to rules set out by the Chemical Analysis Working Group of the Metabolite Standards Initiative18.

Sample preparation and DNA sequencing

DNA extraction was performed according to the manufacturer’s instructions. We then amplified the V3-V4 regions of the bacterial 16 S rRNA gene, using two technical replicates (25 μL reactions) per biological sample, and a mixture of the universal bacterial primers 341F1–4 (5′ CCTACGGGNGGCWGCAG 3′) and 785R1–4 (5′ GACTACHVGGGTATCTAATCC 3′) with partial Illumina TruSeq adapter sequences added to the 5′ ends (F1; ATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, F2; ATCTACACTCTTTCCCTACACGACGC TCTTCCGATCTgt, F3; ATCTACACTCTTTCCCTACACGACGCTCTTCCGA TCTagag, F4; ATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTtagtgt and R1; GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT, R2; GTGACT GGAGTTCAGACGTGTGCTCTTCCGATCTa, R3; GTGACTGGAGTTCAGACGT GTGCTCTTCCGATCTtct, R4; GTGACTGGAGTTCAGACGTGTGCTCTTCCGA TCTctgagtg)14. The small letters in the above sequences are additional nucleotides introduced for purposes of mixing in the sequencing process. We performed two-step PCRs followed by quantification, pooling, and purification14. The resulting PCR products were then sequenced with Illumina MiSeq (v3 600 cycle kit), with 325 bases and 285 bases for the forward and reverse reads, respectively.

Bioinformatics and statistical data analysis

Statistical analysis of metabolomics data

In this untargeted profiling study, we detected a total of 7585 features combined between LC-MS positive ionisation mode, LC-MS negative ionisation mode and GC-MS data. After processing the raw data and applying QC based relative standard deviation filtering, in LC-MS positive ionisation mode 5897 features remained, and in LC-MS negative ionisation mode data 1260 features remained. In GC-MS data 428 features were retained for statistical analysis. As described earlier in this article, these features were scaled and transformed prior to any statistical analysis.

(i) Classification by disease status

Individual dataset from LC-MS positive mode, LC-MS negative mode, and GC-MS were first analysed separately and then standardised combined data was used to generate a model based on the whole metabolome measured by 7585 features in total. Individual datasets from each technique were analysed using partial least squares discriminant analysis (PLS-DA) to classify between PD and control using the Metaboanalyst package in R. The models were validated by 60:40 split of data with 250 bootstraps and also by comparing correct classification rates with permuted models (Supplementary Table 7 (MODELS)).

In order to investigate the classification of samples into PD and control, support vector machines (SVM) were used on 7585 metabolite features. Using Python via Orange user interface, SVM models were generated for this analysis. The data were pre-treated as described earlier in this article. The data were then split into train (60% data) and test (40% data), and resampling was repeated 100 times. The SVM model was generated with RBF-kernel, cost (C) was set to 1.5, and regression loss epsilon was set at 0.10. We accounted for potential effects of hypercholesterolaemia, gender, age at sampling, BMI, defecations per week, weekly stool characteristics average, clinical scores (GDS15, MMSE, NMSQ, NMSS, RBDSQ, Rome-III constipation score, SDQ, Wexner total, Progression score, and Rome-III IBS criteria) as well as 68 dietary and lifestyle related variables as described in Supplementary table 7 (MODELS). The top 10% variables were selected after ranking them by ReliefF algorithm, and models were regenerated to rule out the possibility of an effect due to a large number of metabolite features.

(ii) Key predictive metabolite feature selection

To select metabolite features (i.e., features predictive of PD status) contributing to the SVM models, the mSVM-RFE algorithm was used64. The iterative algorithm worked backwards from an initial set of features consisting of all the metabolite features in the dataset. At each iterative round, firstly a simple linear SVM was fitted, then features were ranked based on their weights in the SVM solution and lastly, the algorithm eliminated the feature with the lowest weight. In order to stabilize these feature rankings, at each iteration cross validation resampling was used. By using k-fold cross validation (k = 10) multiple SVM-RFE iterations were carried out. From the resultant ranked feature list, the top 10% of the features were selected (the 139 key predictive metabolite features) for further interpretation as they contributed the most towards SVM models.

(iii) Effects of PD medications, UPDRS-III score, and time since onset of motor symptoms on key metabolite features selected

During the selection of the key predictive metabolite features, potential effects from clinical variables like PD medication could not be taken directly into account in those models, since they are restricted to the PD group. However, such an approach would leave open the question regarding possible effects of those variables on the metabolite predictors. Thus, to investigate the potential effects of PD medications, UPDRS-III score, and time since motor symptoms on PD vs HC classification (i.e. the 139 metabolite features), and to investigate associations of drug dosages to the key metabolite features, partial least squares and logistic regression were carried out within-PD for continuous and categorical responses, respectively. The variables that had continuous scaled values were subjected to partial least squares regression with 20 PCs, for 1000 iterations. The variables that had categorical values were subjected to logistic regression with ridge penalization with cost value of 1. All partial least squares and logistic regression models were validated by leave-one-out cross-validation (LOOCV). Partial least squares R2 and root mean squared error (RMSE) and logistic regression classification accuracy were used for model evaluation (Supplementary Table 7 (MODELS)). The top 10% of variables were selected after ranking them by ReliefF algorithm, and models were regenerated to rule out the possibility of an effect due to a large number of metabolite features. Our results indicate that the choice of 139 metabolite features was not affected by PD medication use or dosage. One possible exception could be COMT inhibitors, which may have a weak effect and may be associated with the PD metabolome in serum: using the 139 key metabolites, we were able to classify between a PD subject with and without COMT inhibitor treatment with an accuracy of 63%, which is low. Thus, these results suggest by extension that drug use did not affect the choice of the 139 key predictive metabolite features in any substantial way during the HC vs PD predictive feature analysis. Supplementary Table 7 (MODELS) includes information about variables and adjusted variables, for each model.

(iv) Effects of clinical variables on metabolome within PD

We investigated potential effects and association of clinical variables within the PD cohort. All metabolite features (not just the key predictive metabolite features) were regressed against clinical features of Parkinson’s viz. GDS-1565 (depression), MMSE66 (cognition), NMSS67 (non-motor symptoms), RBDSQ68 (REM-sleep behaviour disorder), Rome-III constipation score69, Rome-III IBS status, Wexner score70 (constipation), SCS-PD71 (drooling), SDQ72 (dysphagia), UPDRS II-III, Hoeh & Yahr scale, and progression category from Aho et al.14, while adjusting for model covariates (see Supplementary Table 7 (MODELS)). All RBDSQ and Progression categories were adjusted for age at sampling and time since motor onset. SCS-PD, SDQ, UPDRS-II, UPDRS-III and Hoehn & Yahr were also adjusted for LEDD. UPDRS-III was additionally adjusted for beta blockers. Wexner score and Rome-III IBS status and Rome-III constipation scores were also adjusted for anticholinergic medication, constipation medication, opioids, and tricyclic medications, as well as dietary fibre intake. SCS-PD was additionally adjusted for anticholinergic and tricyclics. GDS-15 was additionally adjusted for SSRI medications. MMSE was additionally adjusted for anticholinergic and tricyclic medications.

Partial least squares and logistic regression was carried out for continuous and categorical responses, respectively. The features that had continuous scaled values were subjected to partial least squares regression with 20 PCs, for 1000 iterations. The features that had categorical values were subjected to logistic regression with ridge penalization with cost value of 1. All partial least squares and logistic regression models were validated by splitting data into 60:40 training: test sets and resampling for 100 times. Partial least square R2 and root mean squared error (RMSE) and logistic regression classification accuracy were used for model evaluation (Supplementary Table 7 (MODELS)). Top 10% of variables (759 variables) were selected after ranking them by ReliefF algorithm, and models were regenerated to rule out the possibility of an effect due to a large number of metabolite features.

(v) Pathway analysis

In this data driven approach, we have interrogated data generated from an untargeted profiling study. It is often impractical to identify each peak in a metabolomics profiling study as it could contain upwards to 5000 features in a single sample. To identify them accurately, the only option is to purchase commercial standards and perform MS/MS analysis in both samples and standards and then match fragmentation spectra. This could be relevant when performing a targeted analysis with a defined set of metabolites. Computationally predicted m/z based identification alone is not adequate for pathway analysis due to multiple metabolite matches to single m/z. Thus, we have employed mummichog analysis that does not depend on identification of metabolites and then mapping on pathways. Instead, mummichog leverages the collective power in the organisation of metabolic networks. If a list of m/z values truly reflects a biological activity, the true metabolites that are represented by these m/z values should show enrichment on a local structure of a metabolic network. If the measured m/z matches to a falsely represented metabolite, the distribution will be observed randomly. The overall significance of mapping and pathway enrichment is estimated by ranking the p-values from the real data among the p-values from permutation data to adjust for type I error, along with penalisation. Thus, a robust functional metabolic network gives us insight into our data more than identifying a handful of features. Mummichog73 (v.1.0.9) pathway analysis was used to predict network activity from pre-processed UHPLC-MS metabolomics data. The full metabolite data set consisting of 5897 and 1260 features from LC-MS positive and negative mode, respectively, was used as an input. Pathway enrichment analysis was performed on annotated 428 GC-MS features using MetaboAnalystR74.

Data pre-processing for 16S rRNA gene sequence data

The raw sequence data amounted to a total of 34,701,899 reads. In brief, primers were removed from the reads using cutadapt before further processing14. We then used mothur to pair, quality trim, taxonomically classify, and finally cluster the reads into OTUs, following mothur’s Standard Operating Procedure (SOP) for MiSeq. The following customizations were made to the SOP parameters: insert = 40 and deltaq = 10 in make.contigs; maxlength = 450 in the first screen.seqs step; start = 6428 and end = 23,440 in the second screen.seqs step; diffs = 4 in pre.cluster. Singleton sequences were also removed with split.abunds (cutoff = 1) before running classify.seqs. The reference databases used were the full-length SILVA alignment release 128 for align.seqs and the RDP 16S rRNA reference (PDS) version 16 for classify.seqs. The final, processed data set (without sequencing blanks) consisted of 18,867 278 reads14.

Metabolome-microbiome correlation analysis

For correlation analysis between metabolites and bacterial taxa at genus, family, and phylum levels, we used the fido75 package (v.0.1.13; the package was formerly known as stray) for the R Statistical Programming Software76 (v.3.6.0). fido provides a framework for inferring multinomial logistic-normal models which can account for zeros and compositional constraints, as well as sampling and technical variation present in sequence count data75. For the present study we used the function orthus from fido which enables joint modelling of multivariate count data (e.g. 16S rRNA gene amplicon sequence data) and multivariate Gaussian data (e.g. metabolomics data on the log-scale).

For samples j {1, …, N} we denote by Yj the observed D1 -dimensional vector of sequence counts, Zj the standardized (i.e. Z-transformed) and log10 -transformed P -dimensional vector of observed metabolite abundances, and Xj a Q -dimensional vector of covariates. Using this notation, the orthus likelihood model is given by

$$begin{array}{l}Y_j sim {{{mathrm{Multinomial}}}}left( {pi _j} right)\ pi _j = phi ^{ – 1}left( {eta _j} right)\ left[ {begin{array}{*{20}{c}} {eta _j^T} \ {Z_j^T} end{array}} right]^T sim Nleft( {{{Lambda }}X_j,{{Sigma }}} right)end{array}$$

(1)

with priors ({{Lambda }} sim N({{Theta }},{{Sigma }},{{Gamma }})) and ({{Sigma }} sim)Inverse Wishart (left( {{{Xi }},v} right)) and with (phi ^{ – 1}) denoting the inverse additive log-ratio (ALR) transform with respect to the D -th taxa77. Of note, the ultimate inference is invariant to the chosen ALR transform. This represents a joint linear model over the latent relative abundances of microbial taxa and metabolite abundances. For computational scalability this model was inferred using the multinomial-Dirichlet Bootstrap approximation to the marginal posterior density p (π | Y) that is available in fido. The multinomial-Dirichlet bootstrap approximates the true marginal posterior density using the posterior of a Bayesian multinomial-Dirichlet model centred at the maximum a posteriori (map) estimate of π. In brief, this is accomplished as follows: for each sample j, the marginal posterior distribution pj | Y) is approximated as the posterior of a Bayesian multinomial Dirichlet model (p(tilde pi mid tilde Y)) where (tilde Y = pi _j^{map}mathop {sum}nolimits_{i = 1}^D {Y_{ij}}). Here, the Dirichlet parameters αj were all taken to be 0.5; this can be thought of as a probabilistic equivalent to using a pseudo-count of 0.5 yet also producing quantified uncertainty due to multivariate counting. The prior parameters were chosen as

$${{Theta }} = 0_{([D – 1 + P] times Q)},{{Gamma }} = I_q$$

(2)

and v = D + P + 9. Finally, we set the prior

$${{Xi }} = (v – D – P),{it{BlockDiagonal}}left( {GG^T,I_P} right)$$

(3)

where G is the (D − 1) × D ALR contrast matrix given by (G = left[ {I_{D – 1} – 1} right]). This choice of Θ, Γ, ({{Xi }}), and v represents the weak prior belief that the correlation between the absolute abundance of taxa is, on average, small. This prior is closely related to the sparse penalization used by SparCC78. Using this model, priors, and inference, we sampled 2000 independent samples from the posterior distribution p (Λ, Σ | Y, Z, X).

Variable selection for these metabolite feature/microbiota correlation models was performed as follows: we selected the relevant variables based on multivariate analyses of the communities’ Bray-Curtis dissimilarity measure using PERMANOVA, a semi-parametric approach that does not assume multivariate normality. These analyses were performed separately for the within-Controls and within-PD groups. First, all clinical and technical variables of interest were analysed on their own in a univariable model (i.e. a model with a single explanatory variable). Given that the amount of variance explained in microbiome models is always very low, the choice of variables at this step was based solely on achieving statistical significance equal to or lower than 0.05. The variables that passed this alpha threshold of significance were then combined into a multivariable model (i.e. a model with more than one explanatory variable) using marginal testing. Those that retained significance at 0.05 or less in this full model were considered for the Bayesian covariance models. Of the latter, only those variables in common between the metabolite-based variable selection (see section iii above) and the microbiota-based variable selection were used for the Bayesian covariance models. Thus, for the three within-Parkinson’s covariance models (i.e. using only PD subjects at three taxonomic levels), the models were adjusted for COMT inhibitor medication use. For the three within-Controls covariance models we used intercept-only models.

The matrix Σ represents a ([D − 1] + Q) × ([D − 1] + Q) covariance matrix encoding all possible covariances between ALR coordinates and metabolites. For model interpretation and inference, each posterior sample of the upper (D − 1) × Q submatrix was transformed to a D × Q matrix representing the covariance between microbial composition (now represented with respect to centred log-ratio coordinates, or CLR) and metabolite abundances. Covariances were transformed to correlations using the function cov2cor in the R programming language. For the purposes of this study, we considered only those correlations that had a posterior mean equal to or larger than 0.3 and that had a 95% credible region not including zero. Conditioned on our chosen priors, this decision boundary can be thought of as limiting our results to correlations which we believe (with at least 95% certainty) are non-zero.

Covariance modelling was performed for bacterial genus, family, and phylum levels, using only metabolite abundance data at “Peak ID” level. This means that, although several of their corresponding MSI level 3 putative identifications were nominally the same, these were not merged before analysis, because they have different retention times and there is non-negligible uncertainty in their identification. After the correlations were calculated, we broadly assigned metabolite class information to these metabolites for Table 4 and Table 2 to aid in interpretation. These class assignments were then used to produce the Cytoscape79 network visualisations (v.3.8.0), both because classes simplify the networks and because they are more plausible than the putative MSI level 3 identifications. These classes were assigned by searching each putative identification of a metabolite feature against the Human Metabolome Database16 (HMDB) and the Kyoto Encyclopedia of Genes and Genomes80 (KEGG) entry (Table 6).

Table 6 Key to Figures 1 and 2.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link