Early prediction of preeclampsia in pregnancy with cell-free RNA

Clinical study design

The discovery and validation 1 cohorts were collected as part of a longitudinal, prospective study. We enrolled pregnant mothers (aged 18 years or older) receiving routine antenatal care on or before 12 weeks of gestation at Lucile Packard Children’s Hospital at Stanford University, following study review and approval by the Institutional Review Board (IRB) at Stanford University (21956). All individuals signed informed consent before enrolment. Whole-blood samples for plasma isolation were then collected at three distinct time points during their pregnancy course and once (or twice for two individuals) post-partum. To split the larger Stanford cohort into discovery and validation 1, we first allocated samples by sequencing batch, of which there were three. We allocated the sequencing batch with the most preeclampsia samples to discovery, to ensure sufficient statistical power, and the batch with the second most preeclampsia samples to validation 1. Within sequencing batches, samples were randomly allocated by individual such that all samples from the same individual were in the same batch. For the final sequencing batch, we randomly allocated individuals to either discovery or validation 1 such that all samples from one individual were part of the same group, and we maintained at least a 1:2 case-to-control ratio in both groups.

The validation 2 cohort was collected as part of the Global Alliance to Prevent Prematurity and Stillbirth (GAPPS) Pregnancy Biorepository at Yakima Valley Memorial Hospital, Swedish Medical Center and the University of Washington Medical Center under review and approval by Advarra IRB (CR00195799). Samples were processed and sequenced at Stanford under the same IRB as above (21956). All individuals signed informed consent before enrolment. Whole-blood samples for plasma isolation were collected at a single time point (or two time points in the case of two individuals with preeclampsia) before or at 16 weeks of gestation.

For all three cohorts, we chose a case-to-control ratio of approximately 1:2 to increase statistical power. We also ensured that case and control groups were matched for race and ethnicity, and that no included individuals had chronic hypertension or gestational diabetes. No other matching or exclusion criteria were used; we performed no further sample selection before sample processing. Mothers were defined as having preeclampsia on the basis of current American College of Obstetricians and Gynecologists (ACOG) guidelines (see below). Mothers were defined as controls if they had uncomplicated term pregnancies and either normal spontaneous vaginal or caesarean deliveries. For mothers who developed preeclampsia, all antenatal samples included in this study were collected before clinical diagnosis.

We processed samples from 88 individuals (60 normotensive, 28 with preeclampsia) in the discovery cohort, 43 individuals (34 normotensive, 9 with preeclampsia) in the validation 1 cohort and 87 individuals (61 normotensive, 26 with preeclampsia) in the validation 2 cohort. For some cohorts, only a subset of these individuals was included in the final analysis after filtering samples on the basis of pre-defined quality metrics (see ‘Sample quality filtering’, Supplementary Note 1 and Supplementary Table 1).

We tested for within-cohort (normotensive versus preeclampsia) and across-cohort differences in demographic variables using a two-sided chi-squared test for categorical variables and ANOVA for continuous variables. We then applied Bonferroni correction and reported differences as significant if adjusted P ≤ 0.05. Investigators were blinded during data collection because pregnancy outcomes were not yet known. Investigators were not blinded during data analysis because the analysis methods required knowledge of outcome (that is, supervised learning).
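The demographic comparison described above can be sketched in Python with SciPy; the data layout (contingency tables for categorical variables, per-group value arrays for continuous ones) is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway

def demographic_tests(categorical_tables, continuous_groups, alpha=0.05):
    """Two-sided chi-squared tests for categorical variables, one-way ANOVA
    for continuous variables, then Bonferroni correction across all tests."""
    pvals = []
    for table in categorical_tables:      # each: groups x levels contingency table
        _, p, _, _ = chi2_contingency(table)
        pvals.append(p)
    for groups in continuous_groups:      # each: list of per-group value arrays
        _, p = f_oneway(*groups)
        pvals.append(p)
    adjusted = np.minimum(np.array(pvals) * len(pvals), 1.0)  # Bonferroni
    return adjusted, adjusted <= alpha
```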

Definition of preeclampsia

Preeclampsia was defined per the ACOG guidelines27 on the basis of two diagnostic criteria: (1) new-onset hypertension developing on or after 20 weeks of gestation; and (2) new-onset proteinuria or in its absence, thrombocytopenia, impaired liver function, renal insufficiency, pulmonary oedema or cerebral or visual disturbances.

New-onset hypertension was defined as systolic and/or diastolic blood pressure of at least 140 or 90 mmHg, respectively, on at least two separate occasions between four hours and one week apart. Proteinuria was defined as either 300 mg of protein in a 24-h urine collection or a protein/creatinine ratio of 0.3 mg dl−1 in an individual urine sample or, if these were not available, more than 1 mg of protein in a random urine specimen as measured by dipstick. Thrombocytopenia, impaired liver function and renal insufficiency were defined as a platelet count of less than 100,000 per µl, liver transaminase levels two or more times higher than normal and a serum creatinine level of higher than 1.1 mg dl−1, respectively.

Symptoms were defined as severe per the ACOG guidelines. Specifically, preeclampsia was defined as severe if any of the following symptoms were present and diagnosed as described above: new-onset hypertension with systolic and/or diastolic blood pressure of at least 160 and/or 110 mmHg, respectively, thrombocytopenia, impaired liver function, renal insufficiency, pulmonary oedema, new-onset headache unresponsive to medication and unaccounted for otherwise or visual disturbances.

Finally, a pregnant mother was considered to have early-onset preeclampsia if onset occurred before 34 weeks of gestation and late onset thereafter.
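For clarity, the diagnostic logic above can be written as a simple rule check. This is an illustrative sketch only, not a clinical tool; the function and argument names are our own, and it assumes the two-measurement requirement for hypertension has been verified upstream.

```python
def meets_preeclampsia_criteria(sbp, dbp, ga_weeks, proteinuria=False,
                                platelets=None, transaminase_fold=None,
                                creatinine=None, other_severe_features=False):
    """Hedged sketch of the ACOG-based definition used above.
    Criterion 1: new-onset hypertension (SBP >= 140 or DBP >= 90 mmHg)
    on or after 20 weeks of gestation.
    Criterion 2: proteinuria or, in its absence, another end-organ sign."""
    hypertension = ga_weeks >= 20 and (sbp >= 140 or dbp >= 90)
    end_organ = (
        proteinuria
        or (platelets is not None and platelets < 100_000)        # thrombocytopenia
        or (transaminase_fold is not None and transaminase_fold >= 2)  # liver
        or (creatinine is not None and creatinine > 1.1)          # renal
        or other_severe_features  # pulmonary oedema, cerebral/visual disturbances
    )
    return hypertension and end_organ
```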

Sample preparation

Plasma processing

At Lucile Packard Children’s Hospital, blood samples were collected in either EDTA-coated (368661, Becton-Dickinson) or Streck cfRNA BCT (218976, Streck) tubes at ≤12, 13–20 and ≥23 weeks of gestation, and post-partum for each participant. Within 30 min, tubes were then centrifuged at 1,600g for 30 min at room temperature. Plasma was transferred to 2-ml microfuge tubes and centrifuged at 13,000g for another 10 min in a microfuge. One-millilitre aliquots were then transferred to 2-ml Sarstedt screw cap microtubes (50809242, Thermo Fisher Scientific) and stored at −80 °C until analysis.

At GAPPS, blood samples were collected in EDTA-coated tubes at ≤16 weeks of gestation from a network of collection sites. Per standard operating procedure, tubes were then centrifuged within 2 h of collection at 2,500 rpm for 10 min at room temperature in a swinging bucket rotor. Plasma was transferred to 2-ml cryovials in at most 1-ml aliquots and stored at −80 °C until analysis. Sample volume was also recorded.

cfRNA isolation

In 96-sample batches, cfRNA from 1-ml plasma samples was extracted in a semi-automated manner using the Opentrons 1.0 system and Norgen Plasma/Serum Circulating and Exosomal RNA Purification 96-Well Kit (Slurry Format) (29500, Norgen). Samples were subsequently treated with Baseline-ZERO DNAse (DB0715K, Lucigen) for 20 min at 37 °C. DNAse-treated cfRNA was then cleaned and concentrated into 12 µl using Zymo RNA Clean and Concentrator-96 kits (R1080).

After cfRNA extraction from plasma samples, isolated RNA concentrations were estimated for a randomly selected 11 samples per batch using the Bioanalyzer RNA 6000 Pico Kit (5067-1513, Agilent) per the manufacturer’s instructions.

Sequencing library preparation

cfRNA sequencing libraries were prepared with the SMARTer Stranded Total RNAseq Kit v2 – Pico Input Mammalian Components (634419, Takara) from 4 µl of eluted cfRNA according to the manufacturer’s instructions. Samples were barcoded using the SMARTer RNA Unique Dual Index Kit – 96U Set A (634452, Takara), and then pooled in an equimolar manner and sequenced on Illumina’s NovaSeq platform (2 × 75 bp) to a mean depth of 54, 33 and 38 million reads per sample for discovery, validation 1, and validation 2 cohorts, respectively. Some samples (12, 61 and 0 for discovery, validation 1 and validation 2 cohorts) were not sequenced owing to failed library preparation.

Bioinformatic processing

For each sample, raw sequencing reads were trimmed using Trimmomatic (v.0.36) and then mapped to the human reference genome (hg38) with STAR (v.2.7.3a). Duplicate reads were then removed by GATK’s (v.4.1.1) MarkDuplicates tool. Finally, mapped reads were sorted and quantified using htseq-count (v.0.11.1) generating a counts table (genes × samples). Read statistics were estimated using FastQC (v.0.11.8).

Across samples, the bioinformatic pipeline was managed using Snakemake (v.5.8.1). Read and tool performance statistics were aggregated across samples and steps using MultiQC (v.1.7). After sample quality and gene filtering, all gene counts were adjusted to log2-transformed counts per million reads (CPM) with trimmed mean of M values (TMM) normalization41.

Sample quality filtering

For every sequenced sample, we estimated three quality parameters as previously described42,43. To estimate RNA degradation in each sample, we first counted the number of reads per exon and then annotated each exon with its corresponding gene ID and exon number using htseq-count. Using these annotations, we measured the frequency of genes for which all reads mapped exclusively to the 3′-most exon as compared to the total number of genes detected. RNA degradation for a given sample can then be approximated as the fraction of genes in which all reads mapped to the 3′-most exon. To estimate the number of reads that mapped to genes, we summed counts for all genes per sample using the counts table generated from bioinformatic processing above. To estimate DNA contamination, we quantified the ratio of reads that mapped to intronic as compared to exonic regions of the genome.

After measuring these metrics across nearly 700 samples, we empirically estimated RNA degradation and DNA contamination’s 95th percentile bound. We considered any given sample an outlier, low-quality sample if its value for at least one of these metrics was greater than or equal to the 95th percentile bound or if no reads were assigned to genes.
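As a sketch, the outlier rule above can be written as follows, assuming per-sample arrays for the degradation fraction, intron-to-exon ratio and total gene-assigned reads (names are illustrative):

```python
import numpy as np

def flag_low_quality(degradation, intron_exon_ratio, reads_per_sample):
    """Flag samples whose RNA degradation or DNA contamination metric is at
    or above the empirical 95th percentile, or with no reads assigned to genes."""
    deg_bound = np.percentile(degradation, 95)
    dna_bound = np.percentile(intron_exon_ratio, 95)
    return (
        (degradation >= deg_bound)
        | (intron_exon_ratio >= dna_bound)
        | (reads_per_sample == 0)
    )
```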

Once values for each metric were estimated across the entire dataset, we visualized: (1) whether low-quality samples clustered separately using hierarchical clustering (average linkage, Euclidean distance metric); and (2) whether sample quality drove variance in gene measurements using principal component analysis (PCA). These analyses were performed in Python (v.3.6) using Scikit-learn for PCA (v.0.23.2), Scipy for hierarchical clustering (v.1.5.1) and nheatmap for heat map and clustering visualization (v.0.1.4).

After confirming sample quality, 404 samples from 199 mothers (142 normotensive, 57 with preeclampsia) were included in the final analysis (Supplementary Table 1). Specifically, 209, 106 and 89 samples from 73, 39 and 87 participants (49, 32 and 61 normotensive; 24, 7 and 26 with preeclampsia) were included in discovery, validation 1 and validation 2, respectively.

Gene filtering

We performed filtering to identify well-detected genes across the entire cohort. Specifically, we required that a given gene be detected at a level of at least 0.5 CPM in at least 75% of discovery samples after removing outlier samples. Following this step, we retained 7,160 genes for differential expression analysis.
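A minimal sketch of this detection filter, assuming a genes × samples CPM matrix:

```python
import numpy as np

def filter_genes(cpm, min_cpm=0.5, min_frac=0.75):
    """Keep genes detected at >= min_cpm CPM in >= min_frac of samples.
    `cpm` is a (genes x samples) matrix of counts per million."""
    detected_frac = (cpm >= min_cpm).mean(axis=1)
    return detected_frac >= min_frac
```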

Differential expression analysis

Differential expression analysis was performed in R using Limma (v.3.38.3). To identify gene changes associated with preeclampsia across gestation and post-partum, we used a mixed-effects model. We performed differential expression analysis using two design matrices: (1) to examine the interaction between time to preeclampsia onset (or delivery for normotensive) and preeclampsia symptoms (that is, preeclampsia with or without severe symptoms); and (2) to examine the interaction between time to preeclampsia onset (or delivery for normotensive) and preeclampsia broadly. In both design matrices, we included time to preeclampsia onset or delivery for normotensive (continuous variable), whether a sample was collected post-partum (binary variable), the interaction between time and preeclampsia symptoms for (1) or preeclampsia for (2), the interaction between whether a sample is post-partum and preeclampsia symptoms for (1) or preeclampsia for (2), and 7–8 confounding factors.

In (1), we defined preeclampsia symptoms categorically using three levels: normotensive, preeclampsia without severe symptoms and preeclampsia with severe symptoms. In (2), we defined whether a sample was preeclampsia using a binary indicator variable (0 = normotensive, 1 = preeclampsia). The 7–8 confounding variables included were maternal race (categorical variable), maternal ethnicity (binary variable), fetal sex (binary variable), maternal pre-pregnancy BMI group (categorical variable), maternal age (continuous variable, only included in design 1) and sequencing batch (categorical variable). We defined time to preeclampsia onset or delivery as the difference between gestational age at onset or delivery and gestational age at sample collection. We defined BMI group as follows: obese (BMI ≥ 30), overweight (25 ≤ BMI < 30), healthy (18.5 ≤ BMI < 25) and underweight (BMI < 18.5). We chose to model time to preeclampsia onset or delivery as a continuous variable, specifically a natural cubic spline with four degrees of freedom, to account for the range across which samples were collected (one to three months per collection period). We also blocked for participant identity (categorical variable), modelling it as a random effect to account for auto-correlation between samples from the same person.

Per the Limma-Voom guide, to account for sample auto-correlation over time, we ran the function voomWithQualityWeights twice. We first ran it without any blocking on participant identity, and used this base estimation to approximate sample auto-correlation on the basis of participant identity using the function duplicateCorrelation. After estimating correlation, voomWithQualityWeights was run again, this time blocking for participant identity and including the estimated auto-correlation level. A linear model was then fit for each gene using lmFit and differential expression statistics were approximated using Empirical Bayes (eBayes) methods. For comparing preeclampsia with versus without severe symptoms, we contrasted the relevant coefficients (makeContrasts) and then applied Empirical Bayes as opposed to directly after lmFit.

DEGs were then identified using the relevant design matrix coefficients with Benjamini–Hochberg multiple hypothesis correction at a significance level of 0.05. For design 1, we identified DEGs for three comparisons: preeclampsia without severe symptoms versus normotensive (1,759 DEGs); severe preeclampsia versus normotensive (1,198 DEGs); and preeclampsia with versus without severe symptoms (503 DEGs). We found 544 genes in common between the preeclampsia without and with severe symptoms versus normotensive comparisons. These 544 DEGs are examined in Fig. 2 and the related main text. For design 2, we identified DEGs for preeclampsia versus normotensive alone (330 DEGs), which we used as the initial gene set for building a logistic regression model (see Supplementary Note 2). Finally, we removed the effect of sequencing batch on logCPM values estimated with TMM normalization for the discovery cohort using the limma-voom function removeBatchEffect.

log2-transformed fold change and CV estimation

We define log2-transformed fold change (log2(FC)) as the difference in median gene level (logCPM; see ‘Bioinformatic processing’) between preeclampsia and normotensive samples for a given sample collection period (that is, ≤12, 13–20 and ≥23 weeks of gestation, or post-partum). In cases in which a given person had multiple samples included in a specific collection period, we only used the values associated with the first collected sample to avoid artificially reducing within-group (preeclampsia or normotensive) variance due to auto-correlation among samples from the same person.

We then quantified the relative dispersion around the estimated log2(FC) for each gene using an approximation for CV. Specifically, we consider CV to be the ratio between an error bound, ∂, and the estimated log2(FC). We defined the error bound, ∂, as the one-sided error bound associated with the lower (or upper in the case of negative log2(FC) values) 95% CI as estimated by bootstrapping. This definition of ∂ as a one-sided bound that approaches 0 (equivalent to no FC) allowed us to explore how confident we could be in an estimated log2(FC). For instance, a CV = 1 would indicate that at the boundary of proposed values, the log2(FC) for a given gene becomes effectively 0. Similarly, a CV > 1 would indicate even less confidence in a proposed average log2(FC) and indicate that at the boundary, the estimated log2(FC) changes signs (that is, a negative log2(FC) becomes a positive one or vice versa).
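A sketch of this CV approximation for one gene, using a bootstrap over per-group logCPM arrays (the group arrays, iteration count and seed are illustrative assumptions):

```python
import numpy as np

def log2fc_and_cv(pe_logcpm, nt_logcpm, n_boot=1000, seed=0):
    """Point-estimate log2(FC) (difference of group medians) and an
    approximate CV: the distance from the estimate to the bootstrap 95% CI
    bound nearest 0, divided by |log2(FC)|.  A bound of exactly 0 gives
    CV = 1; a bound crossing 0 gives CV > 1."""
    rng = np.random.default_rng(seed)
    fc = np.median(pe_logcpm) - np.median(nt_logcpm)
    boots = np.empty(n_boot)
    for i in range(n_boot):
        pe = rng.choice(pe_logcpm, size=len(pe_logcpm), replace=True)
        nt = rng.choice(nt_logcpm, size=len(nt_logcpm), replace=True)
        boots[i] = np.median(pe) - np.median(nt)
    # one-sided bound towards 0: lower bound if fc > 0, upper bound otherwise
    bound = np.percentile(boots, 2.5) if fc > 0 else np.percentile(boots, 97.5)
    return fc, abs(fc - bound) / abs(fc)
```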

Hierarchical clustering analysis

For each sample collection period, hierarchical clustering was performed using DEGs with |log2(FC)| ≥ 1 and CV < 0.5 (or 0.4 in the case of the 13–20 weeks of gestation time point, so that the number of genes used did not exceed the number of samples). For each gene that passed these thresholds, we calculated a z-score across all samples (at most one per individual, the earliest collected sample in a given group) in each sample collection period using the function StandardScaler in Scikit-learn (v.0.23.2). Average-linkage hierarchical clustering with a Euclidean distance metric was then performed for both rows (gene z-scores) and columns (samples in the same collection group) for a given matrix in Python using Scipy (v.1.5.1). Clustering and corresponding heat maps were visualized using nheatmap (v.0.1.4).

Longitudinal dynamics analysis

To group gene changes by longitudinal behaviour, we performed k-means clustering on a matrix in which each row was a DEG and each column was the estimated log2(FC) for a given sample collection period (N genes × 4 time points). We measured the sum of squared distances for every ‘k’ between 1 and 16 (42), in which 16 represents the maximum possible k (4 time points with 2 possibilities each, log2(FC) > 0 or log2(FC) < 0). We then identified the optimal k clusters by using the well-established elbow method to identify the smallest ‘k’ that best explained the data, visually described as the elbow (or knee) of a convex plot like that for the sum of squared distances versus k (Extended Data Fig. 4a, c). To do so, we either visually inspected and identified the elbow (Fig. 2d) or used the function KneeLocator as implemented in the package Kneed (v.0.7.0) (Fig. 4a). We used visual inspection for Fig. 2d as we observed that given two k-values (for example, k = 2, 3) with a similar sum of squared distances, KneeLocator would choose the larger value, whereas we preferred a simpler model. Having identified the optimal number of clusters, k, we labelled every DEG with its assigned cluster and visualized average behaviour (median) and the 95% CI (bootstrapped using 1,000 iterations) per cluster using Seaborn line plot (v.0.10.0).
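The elbow heuristic can be sketched without the Kneed dependency: for a convex, decreasing inertia curve, the elbow is roughly the point farthest below the chord joining the curve's endpoints (a simplified form of the Kneedle idea behind KneeLocator; the inertia values used for illustration are made up):

```python
import numpy as np

def elbow_k(ks, inertias):
    """Return the k at the elbow of a convex, decreasing inertia curve:
    the point farthest below the chord joining the first and last points,
    after normalizing both axes to [0, 1]."""
    ks = np.asarray(ks, dtype=float)
    y = np.asarray(inertias, dtype=float)
    x_n = (ks - ks.min()) / (ks.max() - ks.min())
    y_n = (y - y.min()) / (y.max() - y.min())
    # for a decreasing curve the normalized chord is y = 1 - x,
    # so (1 - x_n) - y_n is the signed vertical distance below it
    dist = (1 - x_n) - y_n
    return int(ks[np.argmax(dist)])
```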

To confirm that the identified patterns were not spurious (that is, an artifact of the k-means clustering algorithm), we permuted the data columns (log2(FC) per time point) for each gene thereby scrambling any time-related structure while preserving its overall distribution. We then visualized the result using Seaborn line plot as described above. After permutation, we observed no longitudinal patterns, which were instead replaced by nearly flat, uninformative trends (Extended Data Fig. 4b, d).

Correlation analysis

To verify DEGs identified in the discovery cohort, we compared log2(FC) values for the discovery cohort as compared to both validation 1 and validation 2 cohorts. We calculated the percentage of genes for which the log2(FC) had the same sign across cohorts (that is, both positive or both negative) and the Spearman correlation using the function scipy.stats.spearmanr. We did not calculate log2(FC) values for DEGs at ≤12 weeks of gestation in validation 1 because of small sample numbers (3 preeclampsia samples before 12 weeks).

We also sought to compare whether symptom severity (preeclampsia without or with severe features) correlated with log2(FC) magnitude for the 544 DEGs identified as common to all preeclampsia in design 1. To do so, we calculated the slope of a best-fit line in which x and y were defined as the log2(FC) values versus normotensive for preeclampsia without (x) and with (y) severe features. We would expect a slope > 1 or < 1 if log2(FC) magnitudes for preeclampsia with severe symptoms were, on average, larger or smaller, respectively, than for preeclampsia without severe symptoms. Similarly, a slope = 1 would reflect that symptom severity did not correlate with log2(FC) magnitude for the 544 DEGs tested.

Finally, to confirm that the identified correlations were significant, we permuted the data columns (log2(FC) per cohort) for each gene, thereby scrambling any structure while preserving its overall distribution. We then calculated the same statistics. After permutation, we observed about 50–55% log2(FC) agreement, as expected at random, a Spearman correlation of 0 and a slope of 0.
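The agreement statistics above can be sketched as follows, assuming paired arrays of per-gene log2(FC) values from two cohorts (or, for the severity analysis, from the two preeclampsia comparisons):

```python
import numpy as np
from scipy.stats import spearmanr

def agreement_stats(fc_x, fc_y):
    """Percentage of genes whose log2(FC) sign agrees, Spearman correlation,
    and the slope of the least-squares best-fit line of fc_y on fc_x."""
    same_sign = 100.0 * np.mean(np.sign(fc_x) == np.sign(fc_y))
    rho, _ = spearmanr(fc_x, fc_y)
    slope = np.polyfit(fc_x, fc_y, 1)[0]
    return same_sign, rho, slope
```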

Defining cell-type- and tissue-specific gene profiles

Cell-type gene profiles were identified as previously described26 and briefly summarized below. We also briefly describe an adapted, similar method to derive tissue gene profiles.

On the tissue level, for genes and tissues (and some blood and immune cell types) measured in the HPA (v.19)35, we calculated a Gini coefficient per gene as a measure of tissue specificity. As a measure of inequality, Gini coefficient values closer to 1 represent genes that are tissue-specific. We defined a given gene Y as specific to tissue X included in the HPA if Gini(Y) ≥ 0.6 and max expression for Y is in tissue X. The aforementioned method can identify genes that are expressed in several tissues (for example, group-enriched) as opposed to only one. Specifically, it is possible to have a gene Y where Gini(Y) ≥ 0.6 and the gene is expressed in more than 1 tissue (for example, enrichment in placenta and muscle). To this end, when tracking a single cell type or tissue’s trajectory over gestation (for example, Fig. 4b), where the specificity of a given gene profile is especially important, we imposed a further constraint to ensure that any gene signal only reflects the named tissue (for example, any gene named placenta-specific is specific to the placenta alone). Specifically, we required that genes be annotated by HPA as ‘Tissue-enriched’ or ‘Tissue-enhanced’ and term this reference ‘HPA strict’.
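A minimal sketch of the per-gene Gini calculation over tissue expression values, using the standard sorted-cumulative-sum formula (a gene would then be called specific to the tissue of maximal expression if its Gini coefficient meets the 0.6 cut-off):

```python
import numpy as np

def gini(expression):
    """Gini coefficient of a gene's expression across tissues:
    0 = perfectly uniform expression, values near 1 = tissue-specific."""
    x = np.sort(np.asarray(expression, dtype=float))
    n = len(x)
    if x.sum() == 0:
        return 0.0
    cum = np.cumsum(x)
    # closed form: G = (n + 1 - 2 * sum(cum) / cum[-1]) / n
    return (n + 1 - 2 * (cum.sum() / cum[-1])) / n
```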

On the cell-type level, we identified cell-type-specific gene profiles using both Tabula Sapiens v.1.0 (TSP) and individual cell atlases. We used individual cell atlases to identify gene profiles for cell types from missing tissues in TSP (for example, placenta, brain) or tissues that are known to be important in preeclampsia with additional annotations in individual single-cell atlases (for example, liver, kidney). First, for genes and cell types measured in TSP, we defined a given gene Y as specific to cell type X included in TSP if Gini(Y) ≥ 0.8 and max mean expression for Y is in cell type X. We combined subtype annotations for neutrophil and endothelial cells into single parent categories called ‘neutrophil’ and ‘endothelial cell’ respectively; as subtype annotations were based on manifold clustering, it was unclear whether they were truly distinct enough to be distinguished at a whole-body level for our purposes. Next, for genes and cell types described in individual tissue single-cell atlases, we required that a gene be differentially expressed in the specific single-cell atlases and tissue-specific per the HPA (Gini ≥ 0.6). The following single-cell atlases were used for each organ: (1) placenta44,45; (2) liver46; (3) kidney47; (4) heart48; and (5) brain49.

We then created an augmented reference, which we term TSP+. For TSP+, we took the union of TSP and individual atlas gene annotations. A small number of genes had conflicting double annotations in TSP as compared to at most one individual tissue single-cell atlas. In these rare instances, which most often occurred for genes related to cell types missing in TSP (for example, placental or brain cell types), we used the individual single-cell atlas label.

Determining cell type and tissue of origin

We determined whether a given cell type or tissue was enriched in preeclampsia by comparing preeclampsia DEGs with cell-type and tissue gene profiles using a one-sided hypergeometric test. For every tissue (HPA) or cell type (TSP+) with at least two DEGs specific to it, we performed the following. First, we defined a hypergeometric distribution (scipy.stats.hypergeom (v.1.5.1)) with the following parameters, where category refers to tissue when using HPA and cell type when using TSP+: M = number of genes specific to any category; n = number of genes specific to this category; N = number of DEGs in this k-means longitudinal cluster that are specific to any category. Next, we estimated a P value using the survival function (1 − cumulative distribution function (CDF)) for the specified distribution; that is, the P value is the cumulative probability, P(X > k − 1), that the distribution takes a value greater than k − 1, where k is the number of DEGs specific to this category. Finally, once we had estimated a P value for every cell type (TSP+) or tissue (HPA) identified in each DEG k-means longitudinal cluster, we adjusted for multiple hypotheses using Benjamini–Hochberg correction with a significance threshold of 0.05.
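A sketch of this enrichment test and the subsequent correction, using scipy.stats.hypergeom; the Benjamini–Hochberg helper is our own illustrative implementation:

```python
import numpy as np
from scipy.stats import hypergeom

def enrichment_pvalue(M, n, N, k):
    """One-sided hypergeometric test: P(X >= k) that at least k of the N
    cluster DEGs are specific to this category, given that n of the M
    background genes are.  Uses the survival function, P(X > k - 1)."""
    return hypergeom(M, n, N).sf(k - 1)

def benjamini_hochberg(pvals):
    """BH adjustment; returns adjusted p-values in the original order."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    # enforce monotonicity from the largest rank downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted
```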

Defining relative signature score per cell type or tissue

We define a signature score as the sum of logCPM values over all genes in a given tissue or cell-type gene profile26. We required that a cell-type or tissue gene profile have at least five specified genes to be considered for signature scoring in cfRNA. Genes were defined as specific to a given tissue on the basis of the reference HPA strict, and to a given cell type on the basis of the reference TSP+ (see ‘Defining cell-type- and tissue-specific gene profiles’ for details).

To account for our previous observation that baseline cfRNA levels vary between individuals (a consequence of biological and technical factors, for example, sample processing), we chose to calculate relative as opposed to absolute signature scores. For each individual whose post-partum sample passed sample quality control (QC) (see ‘Sample quality filtering’ for details), we estimated a relative signature score, defined as the difference between the signature score at a given gestational time point and that of the post-partum sample. For the discovery and validation 1 cohorts, 49 normotensive individuals and 24 individuals with preeclampsia had a post-partum sample that survived sample QC. After this normalization, all post-partum samples had a similar baseline (0). We note that one can define a relative signature score based on any sampled time point for a given person; we chose the post-partum sample because we were interested in tracking maternal organ health over gestation.

Finally, we scaled (that is, z-score) the relative signature scores for a given cell type or tissue by dividing by the interquartile range, a robust alternative to standard deviation, using the sklearn.preprocessing class, RobustScaler. This accounted for differing gene profile lengths and gene expression levels, and allowed us to compare both different cell-type and tissue contributions and case groups per cell type or tissue.
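For one individual's sample series, the relative signature score and IQR scaling can be sketched as follows (the dict-of-arrays layout and the IQR computed across that individual's samples are simplifying assumptions; in the study, scaling used RobustScaler across all samples):

```python
import numpy as np

def relative_signature_scores(logcpm, profile_genes, postpartum_idx):
    """`logcpm`: dict of gene -> per-sample logCPM array for one individual.
    Signature score = sum of logCPM over the profile's genes; relative
    score = score at each sample minus the post-partum score; finally
    scale by the interquartile range (a robust spread estimate)."""
    score = np.sum([logcpm[g] for g in profile_genes], axis=0)
    relative = score - score[postpartum_idx]
    iqr = np.percentile(relative, 75) - np.percentile(relative, 25)
    return relative / iqr if iqr > 0 else relative
```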

Having defined a relative signature score per cell type and tissue, we visualized average behaviour (median) and the 75% CI, a non-parametric estimation of standard deviation, (bootstrapped relative signature score per case group and time point using 1,000 iterations) using Seaborn line plot (v.0.10.0).

Functional enrichment analysis

Functional enrichment analysis was performed using the tool GProfiler (v.1.0.0) for the following data sources: Gene Ontology biological processes and cellular compartments (GO:BP and GO:CC, released 2021-05-01), Reactome (REAC, released 2021-05-07) and the Kyoto Encyclopedia of Genes and Genomes (KEGG FTP, released 2021-05-03). To identify GO terms, we excluded electronic GO annotations (IEA) and used a custom background of only the 7,160 genes that were included in the differential expression analysis. We then performed the recommended multiple hypothesis correction (g:SCS) with an experiment-wide significance threshold of α = 0.05 (ref. 50).

Logistic regression feature selection and training

To build a robust classifier that can identify mothers at risk of preeclampsia at or before 16 weeks of gestation, we first pre-selected features using the set of 330 DEGs when contrasting preeclampsia versus normotensive (see design 2 in ‘Differential expression analysis’ and Supplementary Note 2) as a starting point.

We normalized gene measurements using a series of steps. First, to correct for batch effect, in which we define batch as a set of samples processed at the same time by a distinct group (for example, discovery cohort = batch, Del Vecchio cohort = batch), we centred the data by subtracting the median logCPM per gene for a given cohort. Next, we scaled gene values for each cohort using its corresponding interquartile range in the discovery cohort. Finally, to account for sampling differences across samples, we used an approach similar to when analysing quantitative PCR with reverse transcription (RT–qPCR) data, and normalized data using multiple internal control (that is, housekeeping) genes. On a per-sample basis, we subtracted the median, normalized logCPM value (centred and scaled) for all internal control genes, which we define as 66 genes for which the measured value did not change across preeclampsia versus normotensive comparisons (all genes with adjusted P > 0.99 for preeclampsia versus normotensive; design 2). When calculating the median value for all internal control genes, we excluded any 0 logCPM values as these are likely to have been the consequence of technical dropout.
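These three normalization steps can be sketched on a genes × samples logCPM matrix (the per-gene discovery-cohort IQR and the housekeeping-gene indices are assumed given):

```python
import numpy as np

def normalize_cohort(logcpm, discovery_iqr, housekeeping_idx):
    """Sketch of the normalization: (1) centre each gene on its per-cohort
    median; (2) scale by the per-gene discovery-cohort IQR; (3) subtract,
    per sample, the median normalized value of the internal-control genes.
    `logcpm`: genes x samples matrix; `discovery_iqr`: per-gene IQR."""
    centred = logcpm - np.median(logcpm, axis=1, keepdims=True)
    scaled = centred / discovery_iqr[:, None]
    hk = scaled[housekeeping_idx, :]
    # exclude controls whose raw logCPM was 0 (likely technical dropout)
    hk_masked = np.where(logcpm[housekeeping_idx, :] == 0, np.nan, hk)
    sample_offset = np.nanmedian(hk_masked, axis=0)
    return scaled - sample_offset[None, :]
```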

Model training then used the discovery cohort alone split into 80% for hyperparameter tuning and 20% for model selection and consisted of two stages—further feature pre-selection based on two metrics followed by the construction of a logistic regression model with an elastic net penalty. Using a split discovery cohort for training mitigated overfitting even though all discovery samples were used for differential expression, which defined the initial feature set.

For feature pre-selection, we calculated log2(FC) values using the 80% discovery split for all 330 genes for preeclampsia versus normotensive. We focused on two practical metrics measured across the 80% split of discovery samples collected on or before 16 weeks of gestation: gene change size (|log2(FC)|) and gene change stability (CV). All model hyperparameters were then tuned using AUROC as the outcome metric and fivefold cross-validation. Next, we selected the best model, including tuned feature pre-selection cut-offs, again using AUROC. Specifically, we calculated an AUROC score for the 80% and the 20% discovery splits separately, and the selected model achieved the best score on both splits.

Finally, we tuned the probability threshold, P, at which a sample is labelled as at risk of preeclampsia if P(PE) ≥ P, using the entire discovery cohort. To do so, we constructed a receiver operating characteristic (ROC) curve and calculated the false positive rate (FPR) and true positive rate (TPR) at different thresholds, P_i. We identified the threshold, P_i, at which FPR = 10%, and rounded it to the nearest 0.05 (for example, 0.37 would become 0.35). This yielded a tuned threshold of P = 0.35. All classifications as negative or positive were then made based on this threshold.
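One reasonable reading of this threshold tuning can be sketched with plain NumPy: choose the lowest cut-off whose FPR stays at or below 10% (to maximize TPR under the FPR constraint), then round to the nearest 0.05:

```python
import numpy as np

def tune_threshold(y_true, y_prob, target_fpr=0.10):
    """Scan candidate cut-offs from high to low; keep the lowest one whose
    FPR does not exceed target_fpr, then round to the nearest 0.05
    (so, for example, 0.37 would become 0.35)."""
    candidates = np.unique(y_prob)[::-1]       # high to low
    negatives = (y_true == 0)
    best = candidates[0]
    for t in candidates:
        fpr = np.mean(y_prob[negatives] >= t)
        if fpr > target_fpr:
            break
        best = t
    return round(round(best / 0.05) * 0.05, 2)
```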

To understand the importance of each gene feature, we trained a separate logistic regression model for a subset of all possible feature subsets (307 combinations out of a total of 262,143 for 1–17 genes). No feature pre-selection was performed for this sub-analysis. All model hyperparameters were tuned as previously described. We defined a gene subset as weakly predictive if the model yielded an AUROC > 0.5 on the test set (validation 2).

In all cases, performance metrics were assessed as described below (see next section) and used Scikit-learn (v.0.23.2).

Performance metric analysis

Model performance was assessed using several statistics including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and AUROC. Given a 2 × 2 confusion matrix in which rows 1 and 2 represent true negatives and positives and columns 1 and 2 represent negative and positive predictions, respectively, we can define the value in row 1, column 1 as true negatives (TN); row 1, column 2 as false positives (FP); row 2, column 1 as false negatives (FN); and row 2, column 2 as true positives (TP). We can then define the following proportions: (1) sensitivity = TP/(TP + FN); (2) specificity = TN/(TN + FP); (3) PPV = TP/(TP + FP); (4) NPV = TN/(TN + FN). For each proportion, we calculated 90% CIs using the Jeffreys interval51 and the function proportion_confint from statsmodels.stats.proportion. We also approximated the AUROC and its corresponding 90% CI using the Scikit-learn function roc_auc_score and the binormal approximation, respectively.
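The Jeffreys interval used for these proportions can be computed directly from the beta distribution (equivalent in form to statsmodels' proportion_confint with method='jeffreys'; the boundary handling below follows the common convention of clipping to 0 or 1 at the extremes):

```python
from scipy.stats import beta

def jeffreys_interval(successes, total, alpha=0.10):
    """90% (by default) Jeffreys CI for a proportion: quantiles of
    Beta(successes + 0.5, failures + 0.5)."""
    dist = beta(successes + 0.5, total - successes + 0.5)
    lo = dist.ppf(alpha / 2) if successes > 0 else 0.0
    hi = dist.ppf(1 - alpha / 2) if successes < total else 1.0
    return lo, hi
```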

Statistical analyses

All P values reported herein were calculated using the non-parametric Mann–Whitney rank test unless otherwise stated. One-sided tests were performed where required based on the hypothesis tested.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
