Of the 1098 individuals analysed, 336 were typed with the MassARRAY EUROFORGEN NAME assay20, and 762 were typed with the custom AmpliSeq EUROFORGEN NAME panel21. Only the 102 loci included in both assays were used for the population genetic and ancestry analyses below. The physical positions and rs-numbers of the loci included in the AmpliSeq design are shown in Supplementary Table S1. All samples were also typed for the 165 AIMs of the Precision ID Ancestry Panel in previous studies6,21 and in this work. Two AIMs, rs12913832 and rs4833103, were present in both the EUROFORGEN NAME panel and the Precision ID Ancestry Panel. Both performed better in the EUROFORGEN NAME panel, so their results from the Precision ID Ancestry Panel were not used. Of the 1098 individuals, 28 had no genotype call at more than 10% of the loci and were excluded from further analysis. Data for the remaining 1070 individuals were used for the downstream analyses.
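The call-rate exclusion described above can be sketched as follows. This is a minimal illustration in Python, not the authors' pipeline; the function name `call_rate_filter`, the use of `None` to mark a missing call, and the toy profiles are assumptions made for the example.

```python
def call_rate_filter(profiles, max_missing=0.10):
    """Split profiles into kept and excluded sets by fraction of missing calls.

    profiles: dict mapping sample ID -> list of genotype calls (None = no call).
    A profile is excluded only when MORE than max_missing of its loci lack a call.
    """
    kept, excluded = {}, {}
    for sample_id, calls in profiles.items():
        missing = sum(1 for c in calls if c is None) / len(calls)
        target = excluded if missing > max_missing else kept
        target[sample_id] = calls
    return kept, excluded


# Hypothetical toy data: 10 loci per sample.
toy = {
    "S1": ["AA"] * 10,                # no missing calls
    "S2": ["AG"] * 9 + [None],        # exactly 10% missing, so kept
    "S3": ["GG"] * 8 + [None] * 2,    # 20% missing, so excluded
}
kept, excluded = call_rate_filter(toy)
```

The threshold is applied strictly ("more than 10%"), so a profile with exactly 10% missing calls is retained in this sketch.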
The data obtained with the EUROFORGEN NAME and Precision ID Ancestry panels were tested separately for Hardy–Weinberg equilibrium (HWE). For the EUROFORGEN NAME panel, the AIM rs7873963 was in Hardy–Weinberg disequilibrium in five populations (Pcor = 4.9E−04). There was an excess of homozygotes for the T allele, caused by a deletion downstream of the locus that was associated with the C allele. Only samples typed with the MassARRAY assay were affected by the deletion; the locus was in HWE in the populations typed with the AmpliSeq EUROFORGEN NAME panel. The locus, which was also in linkage disequilibrium (LD) with another locus (see below), was excluded from further population genetic analysis.
HWE was also assessed for the markers in the Precision ID Ancestry Panel. After Bonferroni correction, the AIM rs310644 was in Hardy–Weinberg disequilibrium in the Pakistani and Portuguese populations (Pcor = 3.07E−4). Among the Portuguese individuals, 74 had the TT genotype and two had the CC genotype, while no heterozygous individuals were observed. Among the Pakistani individuals (N = 72), 43 had the TT genotype, 13 the CC genotype, and 16 the CT genotype.
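As an illustration of how such a deviation can be checked, the sketch below applies a chi-square goodness-of-fit test with one degree of freedom to the Pakistani genotype counts for rs310644 given above. The test used in the study and its Bonferroni correction may differ (an exact test is a common choice), so the uncorrected P value computed here is not expected to reproduce the reported Pcor. The function name `hwe_chi_square` is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test for a biallelic locus (assumes both alleles are observed).

    Returns the statistic, the expected genotype counts, and the uncorrected
    P value for one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, expected, p_value


# Pakistani counts from the text: 43 TT, 16 CT, 13 CC (N = 72).
chi2, expected, p_value = hwe_chi_square(43, 16, 13)
```

With these counts, 29.75 heterozygotes are expected under HWE against 16 observed, which gives a chi-square statistic of about 15.4 and reflects the heterozygote deficit.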
Linkage disequilibrium (LD) analysis was performed on the combined dataset of 265 AIMs, corresponding to 34,980 pairs of loci. Besides LD most likely due to physical linkage, LD between alleles on different chromosomes was also observed. Supplementary Tables S4 and S5 show the pairs of loci that were in statistically significant LD in the different populations. Several loci in the EUROFORGEN NAME panel showed statistically significant LD. The HaploView software was used to evaluate whether these loci could belong to haplotype blocks. The analysis showed that two groups of markers on chromosome 4 (rs4975193–rs1757928–rs337277–rs1699387, and rs17616434–rs4833103), one group on chromosome 7 (rs9649356–rs1227171), one group on chromosome 10 (rs2031581–rs2765650), and one group on chromosome 12 (rs10862511–rs10506882) seemed to form haplotype blocks. On chromosome 6, rs1406045 (typed with the EUROFORGEN NAME panel) and rs4463276 (typed with the Precision ID Ancestry Panel) were in LD, as were rs621341 (typed with the EUROFORGEN NAME panel) and rs6754311 (typed with the Precision ID Ancestry Panel) on chromosome 2 (Supplementary Table S4). To ensure marker independence, one locus in each pairwise comparison was eliminated from the population genetic analyses. The performance of the loci was evaluated in terms of heterozygote balance, locus balance, noise level, and the number of genotype drop-outs, and for each pair the locus with the best performance was retained. If the loci performed equally well, preference was given to the locus with the shortest read length (Supplementary Table S6). After evaluating the LD, the final numbers of loci for further genetic analysis were 72 for the EUROFORGEN NAME panel and 161 for the Precision ID Ancestry Panel. The combined dataset included 233 SNP markers.
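The locus counts in this section can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are chosen for illustration.

```python
# Loci available before LD filtering.
euroforgen_shared = 102   # loci common to the MassARRAY and AmpliSeq assays
precision_id = 165        # AIMs in the Precision ID Ancestry Panel
overlap = 2               # rs12913832 and rs4833103 occur in both panels

combined = euroforgen_shared + precision_id - overlap

# Number of unordered locus pairs tested for LD: n * (n - 1) / 2.
n_pairs = combined * (combined - 1) // 2

# Loci retained after LD filtering, as reported in the text.
final_combined = 72 + 161
```

This reproduces the 265 AIMs and 34,980 locus pairs of the LD analysis and the 233 markers of the final combined dataset.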
Genetic structure
The population variation of reference groups from Sub-Saharan Africa (N = 606), Europe (N = 604), the Middle East (N = 134), South-Central Asia (N = 689), and East Asia (N = 504) was analysed. Figure 1 shows a PCA plot of the combined dataset with 233 AIMs. PCA plots in which each population is highlighted can be found in the supplementary materials (Supplementary Figures S1–S14). The Sub-Saharan African, European, South Asian, and East Asian individuals were separated from each other by PC1 and PC2. The Middle Eastern individuals were located between the South Asian and the European individuals, with a small overlap with the European individuals. The North African individuals were situated between the Sub-Saharan African and the Middle Eastern individuals, while the NE African individuals were found between the North African and Sub-Saharan African individuals. Supplementary Figure S15 shows a similar analysis based on the 72 EUROFORGEN NAME markers only. The PCA showed that the Middle Eastern individuals had a larger overlap with the Southern European populations from Greece and Albania than with the Danish individuals (Supplementary Figures S2, S3, and S5). There was a substantial overlap between the Middle Eastern and South-Central Asian populations, mainly involving individuals from Afghanistan.


PCA plot of the results obtained with the combined dataset of 233 AIMs included in the EUROFORGEN NAME panel and the Precision ID Ancestry Panel. The PCA was performed using a custom script written in R v. 3.5.0 with the ‘adegenet’ v. 2.1.2 and ‘ade4’ v. 1.7-15 R packages43,44.
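The PCA in this work was run in R with adegenet and ade4; the sketch below is not that script. It is a minimal Python illustration of how a first principal component can separate two groups, under the assumption that genotypes are coded as counts (0, 1, or 2) of one allele per locus. It uses plain power iteration on centred, unscaled counts, and the function name and toy genotypes are hypothetical.

```python
import math


def first_pc_scores(genotypes, n_iter=200):
    """Scores of each individual on the first principal component.

    genotypes: list of rows, one per individual, holding allele counts (0/1/2).
    Columns are centred but not scaled; the leading eigenvector of X^T X is
    found by power iteration (assumes the first locus is polymorphic).
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    v = [1.0] + [0.0] * (m - 1)   # starting vector
    for _ in range(n_iter):
        scores = [sum(a * w for a, w in zip(row, v)) for row in x]          # X v
        v = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T X v
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(a * w for a, w in zip(row, v)) for row in x]


# Hypothetical toy data: three individuals from each of two groups, five loci.
group_a = [[2, 2, 0, 0, 1], [2, 1, 0, 0, 1], [2, 2, 0, 1, 1]]
group_b = [[0, 0, 2, 2, 1], [0, 1, 2, 2, 1], [0, 0, 2, 1, 1]]
pc1 = first_pc_scores(group_a + group_b)
scores_a, scores_b = pc1[:3], pc1[3:]
```

In the toy data the two groups fall on opposite sides of zero on the first component, which is the kind of separation the PCA plot displays for the continental groups. The overall sign of a principal component is arbitrary.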
To evaluate the genetic structure of the populations, STRUCTURE analyses were performed using K = 3 to K = 7. Figure 2 shows the results for K = 4 to K = 6 for the 233 loci in the combined dataset. The most likely number of clusters was K = 4, corresponding to the Sub-Saharan, East Asian, South-Central Asian, and European populations. Co-ancestry contributions from Sub-Saharan, European, and South-Central Asian populations were observed among individuals from North-East Africa and North Africa, whereas the Middle Eastern individuals shared cluster memberships primarily with the European populations and, to a lesser degree, with South-Central Asians. With K = 6, an additional component was observed for the Middle Eastern, North-East African, and European individuals. For the Middle Eastern individuals, the component differed from those of the North-East African and North African populations mainly due to the Sub-Saharan contribution to the latter populations, and it differed from the clusters of the European populations due to the South-Central Asian contribution to the cluster. Some variation within the European cluster was also observed at K = 6. South Europeans shared more cluster membership with the Middle Eastern, North-East African, and North African populations than the North Europeans did. The STRUCTURE analysis performed with EUROFORGEN NAME markers only showed a similar pattern (Supplementary Figure S16).


Diagram of the STRUCTURE analysis with runs of K = 4 to 6 of the combined dataset of 233 AIMs. The reference data are from the 1000 Genomes Project. Population abbreviations are the same as those described in Supplementary Materials Table S2. The membership proportions were plotted using CLUMPP v.1.1.22248 and Distruct v. 1.1.2349.
Population assignment based on z-score and LR
Based on the STRUCTURE and PCA results, the 14 populations typed in this work were grouped into five meta-populations: (1) a European meta-population including individuals from Albania, Denmark, Greece, Portugal, and Slovenia, (2) a Middle Eastern meta-population including individuals from Afghanistan, Iran, Iraq, Syria, and Turkey, (3) a North-East African meta-population including individuals from Eritrea and Somalia, (4) a North-African meta-population including individuals from Morocco, and (5) a South-Central Asian meta-population including individuals from Pakistan.
A z-score test was performed for each of the 1070 individuals using the GenoGeographer software and the cross-validation method22,23. This was done for the EUROFORGEN NAME panel (72 loci), the Precision ID Ancestry Panel (161 loci), and the combined dataset (233 loci). The AIM profiles were tested against both the individual’s meta-population of origin and the four other meta-populations. Table 1 shows the results of the z-score tests. The results of the test of each AIM profile against each meta-population with the three sets of AIMs were categorised as either “Accepted”, “Ambiguous”, or “Rejected” (Fig. 4).
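The idea behind a z-score outlier test can be sketched as follows. This is a simplified illustration, not the GenoGeographer implementation: it assumes biallelic loci in HWE, treats the reference allele frequencies as known and strictly between 0 and 1, and scores a profile by the negative log of its genotype probability. The published method may differ in detail, for example in how uncertainty in the reference allele frequencies is handled. All names and toy values are hypothetical.

```python
import math


def genotype_probs(p):
    """HWE probabilities of carrying 0, 1, or 2 copies of an allele with frequency p."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def z_score(profile, freqs):
    """Standardised score of a profile in one population.

    profile: allele counts (0/1/2) per locus; freqs: allele frequency per locus
    (each strictly between 0 and 1). Large positive values mean the profile is
    less probable than expected for a member of the population.
    """
    observed = expected = variance = 0.0
    for g, p in zip(profile, freqs):
        probs = genotype_probs(p)
        scores = [-math.log(pr) for pr in probs]
        mean = sum(pr * s for pr, s in zip(probs, scores))
        observed += scores[g]
        expected += mean
        variance += sum(pr * (s - mean) ** 2 for pr, s in zip(probs, scores))
    return (observed - expected) / math.sqrt(variance)


def is_outlier(z, threshold=1.64):
    """One-sided test at P < 0.05, matching the threshold quoted in the text."""
    return z > threshold


# Hypothetical population in which one allele has frequency 0.9 at 20 loci.
freqs = [0.9] * 20
z_typical = z_score([2] * 20, freqs)    # homozygous for the common allele
z_atypical = z_score([0] * 20, freqs)   # homozygous for the rare allele
```

In this toy case the first profile is not flagged as an outlier, while the second is rejected for the population.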
Irrespective of the origin of the sample, the number of AIM profiles categorised as “Ambiguous” was lower with the combined set of markers than with the Precision ID Ancestry Panel. The reduction in the number of ambiguous profiles was most pronounced for individuals from the Middle East and South-Central Asia (Table 1). In both cases, the population assignments primarily changed from “Ambiguous” to “Accepted”. For example, 47.4% of the Middle Eastern individuals were classified as “Ambiguous” with the Precision ID Ancestry Panel, while only 36.7% were classified as “Ambiguous” with the combined panel. The percentage of Middle Eastern individuals in the “Accepted” category increased from 38.5% with the Precision ID Ancestry Panel to 49.9% with the combined panel. Furthermore, fewer Middle Eastern individuals categorised as “Accepted” or “Ambiguous” likely belonged to the European meta-population based on the genotypes generated with the combined panel (1.1% and 8.6%, respectively) than based on the genotypes generated with the Precision ID Ancestry Panel (3.3% and 15.6%, respectively).
For the North African and the North-East African meta-populations, the number of profiles assigned to the “Rejected” category increased when the combined panel was used. Among the North African individuals, four profiles classified as “Accepted” and two profiles classified as “Ambiguous” with the Precision ID Ancestry Panel were classified as “Rejected” with the combined panel. Among the North-East African individuals, three profiles (one classified as “Accepted” and two as “Ambiguous”) were classified as “Rejected” when the combined panel was used. These AIM profiles were outliers in all reference populations (z-scores > 1.64; P < 0.05) with the combined panel.
Figure 3 shows the distribution of the log LRs for all individuals with z-scores ≤ 1.64 (P ≥ 0.05) for their populations of origin. Overall, the combined panel (red distribution in Fig. 3) led to an increase in LRs compared to those of the two panels separately. The increase in LR for the combined panel was greatest when the AIM profiles of individuals from North Africa and North-East Africa were compared with those from individuals from Europe, the Middle East, and South-Central Asia, while it was smallest when the AIM profiles of individuals from (1) Europe and the Middle East and (2) the Middle East and South-Central Asia were compared.
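A log LR of the kind plotted in Fig. 3 can be illustrated with a short sketch. It assumes biallelic loci in HWE, independent loci, and known allele frequencies strictly between 0 and 1 in both populations; it is not the GenoGeographer implementation, and the function names, profile, and frequencies are hypothetical.

```python
import math


def genotype_prob(g, p):
    """HWE probability of carrying g (0, 1, or 2) copies of an allele with frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[g]


def log10_lr(profile, freqs_num, freqs_den):
    """log10 of P(profile | numerator population) / P(profile | denominator population)."""
    return sum(
        math.log10(genotype_prob(g, pn) / genotype_prob(g, pd))
        for g, pn, pd in zip(profile, freqs_num, freqs_den)
    )


# Hypothetical four-locus profile and allele frequencies in two populations.
profile = [2, 2, 1, 0]
pop_a = [0.9, 0.8, 0.5, 0.1]
pop_b = [0.2, 0.3, 0.5, 0.7]

lr_a_vs_b = log10_lr(profile, pop_a, pop_b)
lr_b_vs_a = log10_lr(profile, pop_b, pop_a)
```

Because the per-locus contributions add on the log scale, loci with similar frequencies in both populations (the third locus here) contribute nothing. This is consistent with the small LR gain reported for comparisons between closely related meta-populations.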


Distributions of log LRs for the individuals with z-score ≤ 1.64 (P ≥ 0.05) for the population of origin (listed in the plot headings). The colours refer to the panels used. The curves are based on smoothed kernel density estimates for the 1070 individuals. The hypothesized meta-population in the numerator of the LRs is given by the heading of each plot, while the hypothesized meta-population in the denominator is indicated to the left of the ordinate. R v. 3.5.0 and the ‘ggplot2’ v. 3.2.1 R package (https://ggplot2.tidyverse.org/index.html) were used to visualise the LRs.


Diagrammatic presentation of the decisions for classification of the results of investigations of ancestry with AIMs.

