
Detection of COVID-19 using multimodal data from a wearable device: results from the first TemPredict Study

Study design

We began recruiting participants on March 19, 2020. Recruitment was rolling and stopped on September 23, 2020. Participants initially consented to participate through August 30, 2020. Because the pandemic continued, we invited participants to consent to extend their study participation through November 30, 2020.

We recruited adults from the broader population who already possessed Oura Rings by sending them invitations within the Oura App on their smartphones. Prospective participants could tap on this invitation, which linked to the UCSF consent survey online. We did not have a recruitment ceiling for participants who met eligibility criteria and possessed their own Oura Rings. We recruited frontline healthcare workers at participating sites by enlisting leadership at each institution and obtaining IRB review at each institution. We mailed sites recruitment materials, including study flyers and Oura Ring sizing kits, which contained plastic rings for healthcare workers to try on to determine their size. These kits also included instructions about how to reach the UCSF consent survey online. In this survey, prospective participants could review and download the study consent form, indicate their Oura Ring size, and enroll in the study. All participants provided informed consent to participate in the study. Our target recruitment for healthcare workers to whom the study would provide Oura Rings was n = 3,400. We recruited a subset of participants (n = 10,021) located within the U.S. to complete mail-based antibody testing using dried blood spot (DBS) cards.

Sites

We recruited participants at the following healthcare sites: the University of California San Francisco hospitals at Mission Bay and Parnassus; Zuckerberg San Francisco General Hospital and Trauma Center; Stanford Medical Center; Santa Clara Valley Medical Center; Northwestern McGaw Medical Center; Beth Israel Deaconess Medical Center-Harvard Medical School; Stony Brook University Renaissance School of Medicine; Stony Brook Medical Center; Weill Cornell Medicine; New York-Presbyterian Queens; New York-Presbyterian Brooklyn Methodist Hospital; University of Miami Health System; University of Texas Southwestern Medical Center Dallas; Tufts Medical Center; Jamaica Hospital Medical Center; University of California Los Angeles Medical Center; Boston Medical Center; Kaiser Permanente San Diego Medical Center; Florida Atlantic University; and American Medical Response (ambulance). Participants who already owned an Oura Ring and whom we recruited from the existing user base via the Oura App were distributed globally.

Antibody testing

We aimed to recruit 10,000 participants for DBS COVID-19 antibody testing using the following selection criteria, in order of priority: (a) reported (via daily survey) a positive COVID-19 diagnosis; (b) had a high illness probability based on symptoms reported in the daily survey and had extensive Oura Ring data available (minimum 100,000 observations and maximum 2-day gap); (c) were located in a COVID-19 hotspot ZIP code (40-mile radius of ZIP 10010 or 48206) at the time of consenting; (d) were located in the top 50 hotspot counties; (e) had Oura Ring data going back to at least April 15, 2020; (f) had a moderate illness probability based on reported daily symptoms and a minimum of 100,000 Oura Ring data observations available; (g) had a modest illness probability based on reported daily symptoms and Oura Ring data spanning March to May 2020; (h) remaining participants, sorted by illness probability.

Participants

Eligible participants were frontline healthcare workers at one of several participating healthcare institutions whom the study provided with Oura Rings, or adults who possessed an Oura Ring of their own that they used for participation. Eligible participants were at least 18 years of age, possessed a smartphone that could pair with their Oura Ring, and could communicate in English. The University of California San Francisco (UCSF) Institutional Review Board (IRB; IRB# 20-30408) and the U.S. Department of Defense (DOD) Human Research Protections Office (HRPO; HRPO# E01877.1a) approved all study activities, and all research was performed in accordance with relevant guidelines and regulations and the Declaration of Helsinki. All participants provided electronic informed consent. We did not compensate participants for participation.

Procedures

Prospective participants visited a survey hosted on the UCSF Qualtrics platform; after reviewing information about the study, they could download a PDF of the study consent form and, if they wished to enroll, provide digital consent. Participants then completed a baseline survey, wherein they entered demographic and health information (see “Measures”). We asked participants who were waiting for an Oura Ring to download the Oura App and, upon receiving an Oura Ring in the mail, to pair the Oura Ring with the Oura App and opt to share their Oura data with UCSF (from within the Oura App). We asked participants who already possessed an Oura Ring to share their Oura data with the research team from within the Oura App. Participants were presented with a daily in-app message that linked to a brief survey asking them to report whether they were experiencing potential COVID-19 symptoms and whether they had received COVID-19 testing or diagnoses (see “Measures”). We also asked participants to complete monthly surveys that included questions about health behavior and diagnoses, mental health and psychological stress, and COVID-19 exposure. In June and July of 2020, we mailed the first of two DBS cards to participants who consented to complete mail-based antibody testing. We mailed participants a second DBS card roughly 8 weeks after we received their first completed DBS card by mail (see “Measures”).

Measures

We collected the following measures.

Baseline self-report survey

Participants reported on demographic factors including age, biological sex, race/ethnicity, educational background, anthropometric information, country and state of residence, and other factors that are not the focus of this manuscript.

Daily self-report surveys

Participants reported on whether they had experienced any of the following symptoms since they last completed the survey: fever, chills, fatigue, general aches and pains, dry cough, sore throat, cough with mucus, cough with blood, shortness of breath, runny/stuffy nose, swollen/red eyes, headache, unexpected loss of smell or taste, loss of appetite, nausea/vomiting, and/or diarrhea. Participants also reported on whether they had received any new testing results for COVID-19 and could indicate the type of testing (nasal or oral swab specimen, antibody blood test, saliva/spit specimen, stool specimen, or other with the ability to specify) and the date they provided the test specimen. Participants also reported on the results of their test (positive, negative, or indeterminate).

Monthly self-report surveys

Participants reported on any medical diagnoses (including COVID-19), as well as COVID-19 exposure, health behavior, alcohol and drug use, prescription medication information, and mental health. For each diagnosis they endorsed, they also reported on whether their diagnosis was confirmed with testing. If a COVID-19 or flu diagnosis was confirmed with testing, participants answered questions that were identical to those in the daily survey (type, date, and results of testing).

Dried blood spot (DBS) antibody testing

To obtain specimens for SARS-CoV-2 antibody testing, we mailed kits for obtaining dried blood spots to 10,021 participants (TropBio Filter Paper Disks, Cellabs). We prioritized sending kits to willing participants who had higher-quality Oura Ring data, completed more symptom surveys, were located in the U.S. (due to the cost and regulatory complexity of international shipping), and were within specific geographic locations (greater prevalence of and/or exposure to SARS-CoV-2). The collection kit included tabs for obtaining up to six dried blood spots. We instructed participants to dry their blood spots overnight before returning them by mail in plastic specimen bags containing a desiccant. We processed DBS with eluent and tested them using the Ortho Clinical Diagnostics VITROS® SARS-CoV-2 Total Assay [9]. To validate testing using dried blood spots, we performed several steps, including preparing dried blood spots from whole blood obtained from individuals diagnosed with COVID-19 who tested positive on serum testing, for comparison to the testing methods on which the assay was originally developed and validated. In validation testing using these dried blood spots, we found that dried blood spot sample collection reduced the sensitivity of the antibody testing to about 90% compared with the same assay performed on plasma collected by standard methods. For comparison, the sensitivity of the Ortho VITROS® SARS-CoV-2 Total Assay is reported to be 98.8% in patients confirmed to be SARS-CoV-2 positive by PCR [21] using serum specimens. While the assay's normalized signal-to-cutoff (S/CO) ratios provide clear separation between reactive and non-reactive serum specimens, we found some overlap in S/CO results on DBS specimens from individuals with COVID-19 confirmed by PCR testing or by serum SARS-CoV-2 antibody testing with the Ortho VITROS® SARS-CoV-2 Total Assay.
We therefore designated an indeterminate test range where there was evidence of overlap in S/CO values between individuals with and without prior SARS-CoV-2 infection.

Oura ring data

All participants wore the Oura Ring Gen2 (ouraring.com), a commercially available wearable sensor device (Oura Health, Oulu, Finland), on a finger of their choosing. The Oura Ring connects to the Oura App (available from the Google Play Store and the Apple App Store) via Bluetooth. Users wear the ring continuously during daily activities in both wet and dry environments. The Oura Ring assesses temperature by using a negative temperature coefficient (NTC) thermistor (resolution of 0.07 °C) on the internal surface of the ring. The sensor registers dermal temperature readings from the palm side of the finger base every 60 s. The Oura Ring assesses heart rate (HR), heart rate variability (HRV), and respiratory rate (RR) by extracting features from a photoplethysmogram (PPG) signal sampled at 250 Hz. The Oura Ring calculates HR and HRV in the form of the root mean square of the successive differences in heartbeat intervals (RMSSD) at 5 min resolution. The Oura Ring also estimates RR at 30 s resolution. The PPG-derived metrics (HR, HRV, RR) are calculated from inter-beat intervals (IBI), which are only available during periods of sleep. Tri-axial accelerometers estimate activity metrics as metabolic equivalents (MET) reported at 60 s resolution during both sleep and wake periods and sleep stages at 5 min resolution. The Oura Ring generates all of these metrics, which we will refer to as the five data streams, on device. The Oura Ring does not continuously record or store PPG for analysis.

Variable creation

We created several variables for these analyses as follows.

Diagnosis determination (DX)

We identified COVID-19 cases based on data from daily and monthly surveys, with confirmation from study-provided SARS-CoV-2 antibody testing on dried blood spot (DBS) specimens when available (n = 3664 participants had SARS-CoV-2 antibody results from submitted specimens at the time of case definition). A total of n = 704 participants self-reported having COVID-19.

Confirmed positive cases

These were participants who reported a positive COVID-19 test result on an oral or nasopharyngeal swab, saliva, stool, or antigen test. We identified the diagnosis date as the earliest reported positive test date across surveys to capture the first test-positive date. Confirmed positive cases did not provide discordant reports across surveys or test types.

Confirmed negative cases

These were participants who tested negative on study-provided DBS antibody testing and who did not report positive COVID-19 test results in any study survey.

Test ambiguous cases

These were participants who had a negative result on study-provided DBS antibody testing (n = 9) or who self-reported a negative antibody test result after a reported positive swab, saliva, antigen, or stool test (with an 11-day buffer to allow time for seroconversion). These were suspected false positive tests.

Survey ambiguous cases

These were participants who reported conflicting results for the same test in the same 4-day period across different survey types (which we considered to be reporting errors that we reconciled by contacting participants to confirm reporting).

DX-generated case lists

At the time we generated the positive case lists, n = 210 participants reported a positive swab test on a daily survey, and an additional 108 reported a positive swab test on monthly surveys (that they did not report in daily surveys); n = 4 reported positive saliva tests (without a prior swab test). Where reports conflicted (where test results differed for the same test type reported within an interval of ± 4 days across different survey instruments), or for participants who reported a positive COVID-19 test in the very early study weeks before test type questions were added to the daily survey, we followed up with participants to obtain additional testing-related information; we completed this follow-up with n = 113 participants. During follow-up, two participants reported a positive swab test and one reported a positive antigen test that they had not previously reported on another survey type. After removing test ambiguous (n = 11) and survey ambiguous (n = 7) cases, the final list of COVID-19 confirmed positive cases included 306 participants positive by swab (n = 302), saliva (n = 3), or antigen (n = 1) test. Among these, we confirmed n = 45 with positive study-provided DBS antibody testing.

DX region

We defined a time of probable COVID-19 infection (DX region) by proximity to the confirmatory COVID-19 test date as 14 days before (DX – 14), and 7 days after (DX + 7) the testing date.
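As a minimal illustration (not the study's code; the function name is ours), the DX region window can be computed directly from the confirmatory test date:

```python
from datetime import date, timedelta

def dx_region(dx_date):
    """Return (start, end) of the probable-infection window:
    14 days before (DX - 14) through 7 days after (DX + 7)
    the confirmatory COVID-19 test date."""
    return dx_date - timedelta(days=14), dx_date + timedelta(days=7)
```

For example, a test date of July 1, 2020 yields a window from June 17 through July 8, 2020.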

Self-report (per daily surveys) symptom onset date determination (SX)

Daily surveys include a list of symptoms that may be endorsed each day. To determine the date of symptom onset (SX), we focused on four core symptoms associated with COVID-19: fever, fatigue, dry cough, and unexpected loss of smell or taste. For a window surrounding the diagnosis date, spanning 14 days prior to DX through 7 days post-DX, we looked for participants who self-reported a transition from “no symptoms” to one or more core symptoms with no more than 2 consecutive missed survey responses in the vicinity of this transition. We defined symptom onset as the first day in the DX region in which a participant reported a core symptom following one or more days of “no symptoms.” We considered participants who completed the daily surveys during the window around the diagnosis date, but who did not endorse any of the four core symptoms, to be asymptomatic. We did not attempt to establish symptom onset dates for participants who completed fewer than three symptom surveys in the DX region, or those who did not endorse the “no symptoms” option at least once prior to the first reported symptom.
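The onset rule above can be sketched as follows. This is a hypothetical helper, not the authors' code: we assume daily responses are represented as sets of reported symptoms (an empty set for an explicit “no symptoms” response, `None` for a missed survey), and days with only non-core symptoms leave the tracked state unchanged:

```python
# Core COVID-19 symptoms used for onset determination (per the text)
CORE_SYMPTOMS = {"fever", "fatigue", "dry cough", "loss of smell or taste"}

def symptom_onset(daily_reports, dx_region):
    """Return the first day in dx_region with a core symptom reported
    after one or more 'no symptoms' days, tolerating up to 2
    consecutive missed surveys near the transition; None otherwise.

    daily_reports: ordered list of (day, symptoms) pairs, where
    symptoms is a set of reported symptoms, an empty set for an
    explicit 'no symptoms' response, or None for a missed survey.
    """
    saw_no_symptoms = False
    consecutive_missed = 0
    for day, symptoms in daily_reports:
        if symptoms is None:              # missed survey
            consecutive_missed += 1
            if consecutive_missed > 2:    # gap too large: reset state
                saw_no_symptoms = False
            continue
        consecutive_missed = 0
        if symptoms & CORE_SYMPTOMS:      # core symptom reported
            if saw_no_symptoms and day in dx_region:
                return day
            saw_no_symptoms = False
        elif not symptoms:                # explicit 'no symptoms'
            saw_no_symptoms = True
    return None
```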

Physiological (per Oura data) disruption date determination (PX)

To develop a method for imputing illness onset for individuals with incomplete or missing symptom reports—and, more generally, to decide where in the time series to search for informative, infection-related patterns—we designated a data-driven, physiological disruption (PX) date for each participant of interest (n = 73). We derived PX from two of the five Oura data streams (HR and RR time series). We compared the resultant PX values with 1) SX for those 41 individuals with the highest-confidence symptom self-reports and 2) the dates of coincident, physiological changes across the other three streams for all these 73 participants.

To impute a single date for physiological disruption, we first designated 21-day, individually curated baseline periods for a subset of participants who tested positive for COVID-19 (n = 73; see COVID-19 infection data availability; see also Algorithmic description). We then computed means over all the values taken by the HR and RR time series in these baseline periods. We compared the baseline means to each of the 21 mean daily heart rate and mean daily respiratory rate values that characterized the respective individuals’ DX regions; this allowed us to assign dates for the maximal observed heart rate and respiratory rate deviations in the extended neighborhood surrounding illness onset. Maximal deviations in one’s daily heart rates need not occur simultaneously with those of their respiratory rates. Thus, in keeping with our intention to designate a single physiological (per Oura data) disruption date per individual, we assigned as PX the temporal midpoint between the maximal HR- and maximal RR-derived deviations.
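Setting aside the peak-based refinement described below, the midpoint assignment can be sketched as follows. The function is hypothetical; day indices run 0–20 across the DX region, and rounding the midpoint down when the two deviation days differ by an odd number is our assumption:

```python
import numpy as np

def px_day(hr_region, rr_region, hr_base_mean, rr_base_mean):
    """Sketch of the PX imputation: given 21 mean daily HR and RR
    values spanning the DX region and baseline means from the 21-day
    baseline period, assign PX as the temporal midpoint between the
    days of maximal absolute HR and RR deviation from baseline."""
    hr_dev = np.abs(np.asarray(hr_region, float) - hr_base_mean)
    rr_dev = np.abs(np.asarray(rr_region, float) - rr_base_mean)
    hr_day = int(np.argmax(hr_dev))   # day of maximal HR deviation
    rr_day = int(np.argmax(rr_dev))   # day of maximal RR deviation
    return (hr_day + rr_day) // 2     # temporal midpoint (floored)
```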

We observed that substantial deviations in both the average daily heart rates and respiratory rates in the vicinity of the diagnosis date (DX) were overwhelmingly of finite duration—typically lasting for several days—among the aforementioned subset of participants (see COVID-19 infection data availability). We also observed shorter-duration (i.e., overnight, etc.) fluctuations in the HR and RR series in the days surrounding DX, but these fluctuations did not necessarily reflect infection-related disruptions. In order to penalize the influence of the latter on PX determination, and to instead emphasize the most salient, trend-like deviations (e.g., a clear rise in the respiratory rate over three consecutive days), we operationalized our PX imputations into several steps.

First, we treated each 21-day sequence of absolute deviations (for heart rate and respiratory rate) as its own time series signal. We then identified the peaks associated with each such signal, according to criteria for numerical analysis defined by the SciPy 1.6.3 package for Python. We did not impose constraints regarding the minimal threshold for peak detection, relative peak heights, or distances by which successive peaks should be separated; we did exclude “peaks” corresponding to the first two days (DX − 14 to DX − 12) of the 21-day sequence, as it would take at least three observations to establish coherent trends. The highest peaks represented the maximal deviation dates for an individual’s heart rate and respiratory rate signals. In cases where SciPy did not detect peaks, or detected peaks only within the first two days of a given signal, we assessed maximal deviation dates from the full set of 21 values for that signal.
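A sketch of this step using `scipy.signal.find_peaks`, imposing only the constraints stated above (no height or distance criteria; the first two day indices excluded; argmax fallback when no usable peak exists). The function name and exact call are our assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

def maximal_deviation_day(deviations):
    """deviations: 21 absolute daily deviations spanning the DX
    region (index 0 = the first day, DX - 14).  Detect peaks with
    SciPy defaults, discard peaks in the first two days, and return
    the day index of the highest remaining peak; fall back to the
    argmax over all 21 values if no usable peak is found."""
    x = np.asarray(deviations, dtype=float)
    peaks, _ = find_peaks(x)        # default criteria, no thresholds
    peaks = peaks[peaks >= 2]       # exclude the first two days
    if peaks.size == 0:
        return int(np.argmax(x))    # fallback: full 21-value argmax
    return int(peaks[np.argmax(x[peaks])])
```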

Oura data preparation

The Oura Ring records five physiological metrics (data streams) on the scale of minutes. For the present analyses, we aggregated data from each of the five streams within 30-min, consecutive time intervals that overlapped by 15 min. We chose these time frames to balance computational resolution (i.e., the inherent tradeoff between an ability to work with fine-scale features and data-architecture costs and considerations) with expectations based on physical and physiological limitations for the observability of illness-related changes. Within each time interval, we extracted a set of summary statistics from the available physiological metrics, including their mean, standard deviation, and 25th and 75th percentile values. For example, the Oura dermal temperature metric, T, is natively sampled once per minute; we aggregated the (up to 30) temperature samples in each 30-min interval to compute the temperature-derived variables Tmean, Tstd, Tper25, and Tper75 (temperature mean, standard deviation, 25th and 75th percentiles, respectively). These “derived” variables compactly summarized all dermal temperature measurements falling within the respective intervals, replacing the high-resolution temperature time series in our algorithmic computations.
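The windowing and summary-statistic extraction can be sketched as follows, assuming raw samples carry timestamps in seconds; the function and variable names are ours, not the study's:

```python
import numpy as np

def sketch_stream(timestamps, values, win=30 * 60, step=15 * 60):
    """Aggregate one raw physiological stream into overlapping 30-min
    windows stepped by 15 min, emitting per-window summary statistics
    (mean, standard deviation, 25th and 75th percentiles) — a sketch
    of the 'data sketch' construction described above."""
    t = np.asarray(timestamps, float)
    v = np.asarray(values, float)
    out = []
    start = 0.0
    while start + win <= t[-1]:
        mask = (t >= start) & (t < start + win)
        if mask.any():
            w = v[mask]
            out.append((w.mean(), w.std(),
                        np.percentile(w, 25), np.percentile(w, 75)))
        else:
            out.append((np.nan,) * 4)   # no samples in this window
        start += step
    return out
```

For the temperature stream, the four entries per window correspond to Tmean, Tstd, Tper25, and Tper75.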

Potential artifacts in wearable data include records saved during non-active wear times (e.g., elevated temperature readings, saved while an Oura Ring is charging). We preprocessed the dermal temperature and MET data to determine times when participants’ rings were actively worn, versus non-wear times, by comparing MET values against a fixed threshold of 0.5. We treated values below this threshold as non-wear and discarded both MET and dermal temperature measurements during these non-wear periods.
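A minimal sketch of this non-wear filter, assuming the MET and dermal temperature samples are time-aligned one-to-one (the function name is ours):

```python
import numpy as np

MET_WEAR_THRESHOLD = 0.5  # fixed threshold from the text

def mask_non_wear(met, temp):
    """Discard MET and dermal temperature samples recorded during
    non-wear periods: MET values below 0.5 are treated as non-wear,
    and both streams are dropped at those times."""
    met = np.asarray(met, float)
    temp = np.asarray(temp, float)
    worn = met >= MET_WEAR_THRESHOLD
    return met[worn], temp[worn]
```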

COVID-19 infection data availability

Beginning with a cohort of 306 participants for whom there was a confirmed self-reported diagnosis of COVID-19 based on a positive oral or nasal swab, saliva, or stool specimen tested using PCR or antigen assays (see “Diagnosis determination (DX)”), we identified participants for inclusion in the training dataset. We selected individuals for whom we had Oura data available on at least 20 days (consecutive or not) that would be usable as baseline (at least 17 days prior to DX) and who had at least 7 days of data prior to DX and 14 days following DX. Additionally, we excluded individuals who reported 4 or more concurrent symptoms (see “Self-report (per daily surveys) symptom onset date determination (SX)”) or exhibited dermal temperatures above 38 degrees Celsius in the baseline period, so as to screen for potential confounding illness. For algorithm training, we restricted analysis to the 73 participants who met the above criteria and had complete heart rate, respiratory rate, and temperature data. For independent validation, we considered only participants outside this training set with confirmatory antibody results following their DX and physiology data available for i) at least 13 of their baseline days, and ii) no fewer than 20 of their DX region days. Participants in the independent validation set therefore met minimum data requirements similar to those described above, but we omitted restrictions on symptoms and elevated dermal temperatures in the baseline period. Ten participants met the secondary criteria for inclusion in the independent validation set.

Algorithmic description

We created a machine learning pipeline that detected physiological features distinguishing COVID-19 illness from non-illness. This pipeline had three constituent parts: (1) a data processing module, (2) a short-time classification and detection module, and (3) a post-detection “trigger” logic module.

The data processing steps were to a) gather and b) normalize individuals’ data. We refer to the compressed time series described in “Oura data preparation” as data sketches. The data sketches consisted of aggregated statistics extracted from successive 30-min intervals; we created data sketches for each of the five physiological data streams. Additionally, we generated several new variables that capture longer-duration trends by applying moving-average filters across the data sketches. This allowed us to learn illness-related features that occur over multiple time scales. Measurements of physiological signals may have distinct characteristics during wake versus sleep; therefore, for trend assessment, we calculated separate variables from measurements taken during wake and sleep. We set our moving-average filters to calculate 1-, 2-, or 3-day trends from the “asleep” and “awake” data sketches. For dermal temperature, variables included a 3-day moving average and moving standard deviation of Tper25 during wake intervals, and a 3-day moving average of Tper75 during sleep intervals. For activity, our primary trend variable was a 3-day moving average of the 75th-percentile MET value during wake intervals. The full set of variables supplied as inputs to the pipeline is listed in Supplementary Table 1.

Baseline physiology can vary greatly between individuals. We therefore normalized all physiological variables (data sketches and trend variables) according to each individual’s baseline values. To do so, we subtracted individuals’ baseline means and divided by their baseline standard deviations (z-score). We nominally estimated baseline mean and standard deviation values using the 21-day baseline period data (see Physiological (per Oura data) disruption date determination (PX)). In several cases, missing data precluded the use of the full 21 days for baseline estimation, and we therefore designated an alternative baseline period for normalization for these cases. Specifically, in these cases, we estimated the baseline mean and standard deviation using available baseline data in the 49 days prior to DX.
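The per-individual normalization is a standard z-score against baseline statistics; a minimal sketch (our use of the population standard deviation is an assumption):

```python
import numpy as np

def baseline_zscore(series, baseline):
    """Normalize a physiological variable to an individual's own
    baseline: subtract the baseline mean and divide by the baseline
    standard deviation.  `baseline` holds the participant's 21-day
    baseline data (or, when those days are incomplete, available
    data from the fallback window of 49 days prior to DX)."""
    base = np.asarray(baseline, float)
    mu, sigma = base.mean(), base.std()
    return (np.asarray(series, float) - mu) / sigma
```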

We trained a set of five Random Forest models on the normalized data sketches and trend time series. The classifier training samples consisted of data from overlapped 30-min intervals from individuals assigned to the training set, and each of the five models was differentiated by the distinct time frame it treated as the positive class. We then used the set of trained classifier models to predict a preliminary score at each interval assessing whether the individual appeared sick, and we computed a detection score for each interval as the fraction of “sick” intervals among the last 48 intervals (i.e., nominally 12 h of real-world data).
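The rolling detection score can be sketched as follows; how the first 47 intervals (where fewer than 48 predictions exist) are handled is our assumption:

```python
import numpy as np

def detection_scores(sick_flags, window=48):
    """Given per-interval binary 'appears sick' predictions (one per
    overlapped 30-min interval), compute the rolling detection score:
    the fraction of sick intervals among the last `window` intervals
    (nominally 48, i.e. ~12 h of overlapped 15-min steps).  Early
    intervals average over however many predictions exist so far."""
    flags = np.asarray(sick_flags, float)
    scores = np.empty_like(flags)
    for i in range(len(flags)):
        recent = flags[max(0, i - window + 1): i + 1]
        scores[i] = recent.mean()
    return scores
```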

The ground-truth target labels for this classifier were provided by treating all time intervals in each individual as not sick (“negative” training samples) across up to 73 days pre-COVID (from −90 days to −17 days with respect to PX) and as sick (“positive” training samples) over five distinct time frames in the vicinity of PX, representing progressive phases of illness.

The five Random Forest models were trained such that each model encompassed a different positive time frame near PX. The negative training samples were held constant for all five models. The first of these models was trained on data sketch and trend variable values drawn from the range PX − 3 to PX − 1; the second covered PX − 2 to PX; the third, PX − 1 to PX + 1; the fourth, PX to PX + 2; and the fifth, PX + 1 to PX + 3. In this way, we learned patterns relevant to infection at each of several “early-stage” time frames in the vicinity of illness onset. For example, one of the earliest signs of oncoming illness in our data may be encoded in aberrant nightly heart rates, while temperature disruptions arise as a more important sign as the disease progresses through the incubation period. Rather than build in constraints based on clinically recognized illness patterns or prior knowledge from previous research, we allowed the classifier architecture to discern patterns directly from data. Once patterns distinguishing illness from non-illness were identified in each of the five time frames independently, via training, we ran all five classifiers concurrently to search for instances of those patterns during testing, as noted above. The classification and detection process ran continuously across each individual’s time series data. This basic architecture is adapted from previous work on pre-symptomatic detection of infection using animal data sets [22].
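The five positive windows and the shared negative window can be written out explicitly. The labeling helper below is a sketch of this scheme, not the authors' code; window endpoints are taken as inclusive day offsets relative to PX:

```python
# Positive-label windows per model, as inclusive (start, end) day
# offsets relative to PX; the shared negative window spans PX - 90
# to PX - 17 for all five models.
POSITIVE_WINDOWS = [(-3, -1), (-2, 0), (-1, 1), (0, 2), (1, 3)]
NEGATIVE_WINDOW = (-90, -17)

def label_for_model(day_offset, model_idx):
    """Return 1 (sick), 0 (not sick), or None (interval unused) for
    a given day offset from PX under model `model_idx` (0..4)."""
    lo, hi = POSITIVE_WINDOWS[model_idx]
    if lo <= day_offset <= hi:
        return 1
    if NEGATIVE_WINDOW[0] <= day_offset <= NEGATIVE_WINDOW[1]:
        return 0
    return None
```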

To flag individuals as potentially infected, we carried out several post-detection “trigger” operations on the five sets of detection scores reported by the classification algorithms. First, we created a new set of scores, at the same 30-min resolution, by computing their envelope (the maximum score across the five classifier models). This served to make salient all those places in the time series where any of the learned feature sets would suggest illness. Next, we binned all the detection scores for a given individual, summing all the values at the original 30-min resolution to form new, aggregate scores associated with each 24-h window. We compared these daily, aggregated scores against a fixed threshold (here, a value of 10) to determine whether our pipeline would pronounce each 24-h bin a “trigger” opportunity (a reputed “sick day”).
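A sketch of this trigger logic, assuming 96 overlapped 30-min intervals per 24-h bin (one interval every 15 min) and a strict comparison against the threshold; both assumptions are ours:

```python
import numpy as np

TRIGGER_THRESHOLD = 10  # fixed daily threshold from the text

def daily_triggers(model_scores, intervals_per_day=96):
    """model_scores: array of shape (5, n_intervals) holding the five
    classifiers' detection scores at 30-min resolution.  Compute the
    envelope (per-interval max across models), sum within each 24-h
    bin, and flag days whose aggregate score exceeds the threshold."""
    env = np.max(np.asarray(model_scores, float), axis=0)
    n_days = len(env) // intervals_per_day
    daily = (env[: n_days * intervals_per_day]
             .reshape(n_days, intervals_per_day)
             .sum(axis=1))
    return daily > TRIGGER_THRESHOLD, daily
```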

Performance evaluation

We evaluated the detection performance of our pipeline via five-fold cross-validation using data from the identified training cohort (n = 73). We calculated ROC curves and their corresponding AUCs using the short-time detection scores generated at 30-min intervals with 15-min overlap. We evaluated the ROC curves against the negative and positive ground-truth target labels defined above (see “Algorithmic description”).

Where we report ROC curve evaluations for DX rather than PX, we defined the positive labels as DX − 3 to DX − 1, DX − 2 to DX, and so on; we applied the same method for SX performance evaluation. We report a 95% confidence interval with each AUC; we assumed the data points were normally distributed because of the large number of datapoints represented in each curve (see Supplementary Table 2). We set the degrees of freedom used in the confidence intervals to half the total number of datapoints to account for the overlap of the time series data.
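As one illustrative (not definitive) reading of this interval construction, a normal-approximation 95% interval using an effective sample size of half the datapoints might be computed as follows; the specific standard-error formula (a Wald-style `sqrt(auc * (1 - auc) / n_eff)`) is our assumption:

```python
import math

def auc_confidence_interval(auc, n_points, z=1.96):
    """Normal-approximation confidence interval for an AUC, using an
    effective sample size of half the datapoints to account for the
    overlap of successive 30-min intervals.  Sketch only: the
    standard-error formula is an assumption, not taken from the study."""
    n_eff = n_points / 2  # halve the count for overlapped intervals
    se = math.sqrt(auc * (1 - auc) / n_eff)
    return auc - z * se, auc + z * se
```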

We report sensitivity and specificity estimates based on the outcome of the post-detection trigger logic at the fixed threshold. We evaluated these outcomes to assess which individuals would have been triggered, at least once, within their DX regions. The 21 days defined as the DX region formed the basis for sensitivity estimates, and the 21 days of baseline from 6 to 3 weeks prior to DX formed the basis for specificity estimates.
