Epidemiological data processing
The main unstructured text sources of epidemiological data were Wikipedia web pages and the published literature. Wikipedia collated a series of country-by-country reports that reviewed the early confirmed cases in each country during the early outbreak of COVID-19. By April 19, 2020, COVID-19 outbreaks had been reported in 168 countries. The epidemiological literature was another source of unstructured text, containing de-identified transmission and infection details, and can be considered more reliable than Wikipedia. Tens of thousands of COVID-19-related articles had been published before June 2020, but few of them concerned epidemiological research. These articles were reviewed, and 33 articles that described early-outbreak COVID-19 cases in detail were collated; their bibliographic information is listed at https://github.com/BioMedBigDataCenter/KGCoV/blob/main/data/curated_articles.csv. Because one web page or article may describe several individuals, every person, whether infected, potentially infected, or in contact with infected people, was assigned a unique case ID. Case information was extracted from the unstructured text in a double-blinded manner by two junior curators and rechecked by a senior curator. Each junior curator extracted basic case information (report date, location, gender, age), contact history, travel history, clinical information (patient status, clinical symptoms, onset date), and other information. The two junior curators worked in parallel. If their results were the same, the case information was accepted as qualified. If the results differed, the senior curator re-curated the case and retained the qualified information directly.
As some information may be lost or corrupted during structuring, the original text was kept in the “description” field so that it can be traced in the subsequent matching step.
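The double-curation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the curators' actual tooling: the function name `reconcile`, the dictionary layout, and the `senior_review` callback are hypothetical; only the "description" field name comes from the text.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, Optional[str]]


def reconcile(first: Record, second: Record,
              senior_review: Callable[[Record, Record], Record],
              source_text: str) -> Record:
    """Merge two independent curations of one case into a single record."""
    if first == second:
        # Both junior curators agree: accept the record as qualified.
        merged = dict(first)
    else:
        # Disagreement: the senior curator re-curates and decides.
        merged = dict(senior_review(first, second))
    # Keep the original text so the record can be traced later.
    merged["description"] = source_text
    return merged
```

Storing the source text alongside the structured fields means a later matching step can fall back to the raw wording when a structured value is missing or doubtful.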
The main structured epidemiological data came from Xu et al.16, who collected and collated epidemiological data from multiple sources such as government reports and news. We kept Xu et al.’s data updated until November 6, 2020, and found that most of the data date from before June 2020. The structured data were integrated with the unstructured information, so that both types of data were organized under one data model for recording case entries. The fields included report date, gender, age, country, location, contact information, travel history, clinical symptoms, description, and information source. The data model was compatible with Xu et al.’s data model, and all COVID-19 cases were structured and characterized using this unified model.
After data acquisition and structuring, quality control focused on a few data fields, such as location, report date, age, and gender. All values of the location field were standardized with the Google Places API, which yielded a controlled vocabulary of country names based on ISO 3166 and located each value as precisely as possible, including country, province or state, city, and lower levels. The values of the date field were converted to a standard date format. The values of the age field were converted to numeric values or ranges, and generic descriptions of age, such as “adult” or “child”, were ignored. The values of the gender field were unified as “male” or “female”.
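The field-level normalization of date, age, and gender described above might look like the following sketch. The accepted input date formats, the gender synonyms, and all function names are assumptions made for illustration; the source does not specify them, and location standardization through the external API is left out.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Assumed input formats; the real sources may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")

# Assumed synonyms that map onto the two unified values.
GENDER_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_age(raw: str) -> Optional[Tuple[float, float]]:
    """Return (low, high); a single age gives low == high.

    Generic descriptions such as "adult" or "child" return None.
    """
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*", raw)
    if not m:
        return None
    low = float(m.group(1))
    high = float(m.group(2)) if m.group(2) else low
    return (low, high)


def normalize_gender(raw: str) -> Optional[str]:
    """Unify gender as "male" or "female"; anything else becomes None."""
    return GENDER_MAP.get(raw.strip().lower())
```

Representing every age as a range keeps single ages and age brackets comparable in the later matching step.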
Genomic data processing
The main genomic data were obtained from GISAID’s EpiCoV thematic database (https://www.gisaid.org/). All genomes were de-duplicated according to genome sequence, submission date, gender, patient age, submitting lab, and originating lab. Sample and host information were extracted together with the genome sequencing records and were treated as genome-related epidemiological data. Functional data on the SARS-CoV-2 protein domains were obtained from UniProt18.
The raw genome sequences were filtered using BLAT (v. 36x5)19, and genomes with greater than 95% similarity to the reference genome NC_045512.2 were retained for further analysis. Whole-genome variations and amino acid variations of these genomes were annotated using the multiple sequence alignment tool MAFFT (v. 7.453)20 and ANNOVAR (2019 Oct 24 version)21. Because nonsynonymous mutations change the amino acid sequences of proteins, which may alter the infectivity and lethality of the novel coronavirus22,23, the amino acid variations were mapped to protein domains based on their position (offset on locus). In this way, genomes, proteins, and functional domain information were linked with genomic variation annotations and amino acid variations. The data sources are shown in Table 1.
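Mapping an amino acid variation to protein domains by position amounts to an interval lookup. The sketch below assumes 1-based inclusive domain coordinates; the `Domain` type, the function name, and the domain names and coordinates in `EXAMPLE_DOMAINS` are hypothetical placeholders, not values taken from UniProt or from the study.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    protein: str
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Placeholder coordinates for illustration only.
EXAMPLE_DOMAINS = [
    Domain("S", "example_domain_1", 300, 550),
    Domain("S", "example_domain_2", 600, 700),
]


def domains_for_variant(protein: str, position: int,
                        domains: List[Domain]) -> List[str]:
    """Return the names of all domains of `protein` that contain `position`."""
    return [d.name for d in domains
            if d.protein == protein and d.start <= position <= d.end]
```

A variation that falls outside every annotated domain simply returns an empty list, so it stays linked to its protein without a domain annotation.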
Data matching
Once the epidemiological data and genomic information from heterogeneous sources had been structured and standardized, the two types of data were matched using anonymized information about the infected persons. In this study, four indicators were used as matching criteria: date, location, gender, and age. For epidemiological data the date was the report date, and for genomic data it was the collection date. Location values were structured to the lowest administrative division available, from country to province/state to city. The combination of the four indicators was considered reasonable for identifying an individual case or genome in the early-phase outbreak data, and could therefore be used to link the two datasets. To evaluate the sufficiency and effectiveness of the matching criteria, the percentage of unique values was calculated over all case and genome records using three indicators (date-location-gender or date-location-age) or four indicators (date-location-gender-age). The results (Fig. 1a) showed that the percentage was higher with four indicators than with three in both the genome dataset and the case dataset. The temporal distributions of case report dates and genome collection dates were counted every fortnight (Fig. 1b). The number of cases increased rapidly between March and June 2020, which brought great difficulties to data matching and case collection; the distributions for some countries are also provided (https://github.com/BioMedBigDataCenter/KGCoV/blob/main/data/descriptive_statistics.xlsx).
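The uniqueness check for the indicator combinations can be sketched as below. The text does not define "unique" precisely; this sketch assumes it means the fraction of records whose indicator combination occurs exactly once in the dataset. The function name and the record layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence


def unique_ratio(records: Sequence[Dict[str, Optional[str]]],
                 keys: Iterable[str]) -> float:
    """Fraction of records whose combination of `keys` occurs exactly once."""
    key_list = list(keys)
    combos = [tuple(r.get(k) for k in key_list) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records) if records else 0.0


# Toy records, invented purely to exercise the function.
toy = [
    {"date": "2020-03-01", "location": "A", "gender": "male", "age": "30"},
    {"date": "2020-03-01", "location": "A", "gender": "male", "age": "45"},
    {"date": "2020-03-01", "location": "A", "gender": "female", "age": "30"},
    {"date": "2020-03-02", "location": "B", "gender": "female", "age": "60"},
]
three_a = unique_ratio(toy, ["date", "location", "gender"])
three_b = unique_ratio(toy, ["date", "location", "age"])
four = unique_ratio(toy, ["date", "location", "gender", "age"])
```

On these toy records each three-indicator combination leaves half of the entries ambiguous, while all four indicators together distinguish every entry, which is the same direction of effect the text reports for the real data.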


Distribution of data rank and date. (a) The proportion of unique entries among the top 10 ranked case or genome data. The ratio of unique case or genome entries to all entries was calculated for three combinations of the four indicators. (b) The temporal distribution of the number of cases and genomes. Blue represents cases and orange represents genomes. Note that because the cases come from multiple sources, the number of cases shown is higher than the actual number of cases and does not represent the actual number of reported cases.
The matching process included two steps, as shown in Fig. 2. First, an initial qualified matching dataset was collated by manual curation. Second, the characteristics of the four indicators were used to design and develop in-house inference scripts.


Manual curation flow chart.
Manual curation
Early epidemiological reports were often more detailed and less ambiguous than later reports, and the first cases in an area had higher exposure and were often well studied. Therefore, cases were grouped by country and sorted by date in ascending order. Two indicators were used to identify earlier cases: “rank” and “percentile”. Both were considered when matching genomes and cases. First, the top 5 ranked genomes in every country were extracted from the genomic data, and the case data were filtered by the country name of each corresponding genome. Second, the four indicators mentioned above were used to match the selected genome data of each country against the selected case data. If the date, location, gender, and age in the case dataset were the same as in the genome dataset, the case was considered to match the genome directly. Otherwise, if the report date of a case was within ± 3 days of the collection date of the genome, the location was the same, and the gender and age did not conflict (that is, the values of these fields were the same in the case and genome data and could not be empty at the same time), the case was considered a potential match for the genome. The number of potential matches was then considered: if only one case potentially matched the genome, the match was accepted. If there were multiple cases, the final match was the case with the minimum absolute difference between case rank and genome rank (Fig. 2). With this elaborate procedure, 285 genome-case pairs were deduced, and manual curation was also used to verify the accuracy of the script-based matching. In this manually curated dataset, the collection date and report date of most matched pairs were the same day (107/285, 37.54%) or one day apart (112/285, 39.30%).
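The decision rule used during manual curation can be written down as a small sketch. It is an illustration under stated assumptions, not the curators' procedure in code: the `Entry` type and function names are hypothetical; "do not conflict" is read here as "equal, or missing on one side", with at least one of gender and age present on both sides; and when several direct matches exist, the rank rule is applied to them as well, which the text does not specify.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    id: str
    day: date               # report date (case) or collection date (genome)
    location: str
    gender: Optional[str]
    age: Optional[str]
    rank: int               # position in the per-country date ordering


def _no_conflict(a: Optional[str], b: Optional[str]) -> bool:
    return a is None or b is None or a == b


def is_potential_match(case: Entry, genome: Entry, max_days: int = 3) -> bool:
    if case.location != genome.location:
        return False
    if abs((case.day - genome.day).days) > max_days:
        return False
    if not (_no_conflict(case.gender, genome.gender)
            and _no_conflict(case.age, genome.age)):
        return False
    # Assumed reading of "cannot be empty at the same time":
    # gender or age must be present in both records.
    both_gender = case.gender is not None and genome.gender is not None
    both_age = case.age is not None and genome.age is not None
    return both_gender or both_age


def pick_case(genome: Entry, cases: List[Entry]) -> Optional[Entry]:
    """Return the matched case for a genome, or None if nothing qualifies."""
    direct = [c for c in cases
              if c.gender is not None and c.age is not None
              and (c.day, c.location, c.gender, c.age)
              == (genome.day, genome.location, genome.gender, genome.age)]
    candidates = direct or [c for c in cases if is_potential_match(c, genome)]
    if not candidates:
        return None
    # Several candidates: choose the smallest |case rank - genome rank|.
    return min(candidates, key=lambda c: abs(c.rank - genome.rank))
```

With a single candidate the `min` call returns that candidate, so the "only one case" and "multiple cases" branches of the text share one code path.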
Inference
The manually curated dataset was mostly extracted from the top 5 ranked genome-related epidemiological data. All potential matching pairs were then inferred by scripts from the whole genome-related and case-related epidemiological data. The inference criteria were that the location is the same, the dates are the same day or one day apart, and gender and age do not conflict between the two datasets.
A matched case and genome were assumed to form one genome–case pair cluster, containing both epidemiological and genomic information about a single patient. If there was one genome and one case in a cluster, it was very likely that the genome was sampled from the case in the cluster. If there was more than one case, it was difficult to identify a specific case in the cluster, and more detailed information was needed, such as contact, travel, genome variation, and virus typing information. In that situation, we can only provide information about the cases to which the genome might correspond.
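The inference step and the resulting clusters can be illustrated with a short sketch. This is not the in-house script itself: the function names and record layout are hypothetical, and missing gender or age values are assumed to count as non-conflicting.

```python
from datetime import date
from typing import Dict, List, Optional


def _no_conflict(a: Optional[str], b: Optional[str]) -> bool:
    return a is None or b is None or a == b


def infer_clusters(genomes: List[dict], cases: List[dict],
                   max_days: int = 1) -> Dict[str, List[str]]:
    """Map each genome id to the ids of all cases that satisfy the criteria."""
    clusters: Dict[str, List[str]] = {}
    for g in genomes:
        hits = [c["id"] for c in cases
                if c["location"] == g["location"]
                and abs((c["date"] - g["date"]).days) <= max_days
                and _no_conflict(c.get("gender"), g.get("gender"))
                and _no_conflict(c.get("age"), g.get("age"))]
        if hits:
            clusters[g["id"]] = hits
    return clusters


def is_one_to_one(case_ids: List[str]) -> bool:
    """True when the genome very likely came from the single listed case."""
    return len(case_ids) == 1
```

Clusters with more than one case are kept rather than discarded, so that contact, travel, variation, or typing information can be used later to narrow them down.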

