Patterns of Genetic Structure and Evidence of Gene Flow between Arabian Peninsula and European Populations

Abstract
Genetic interaction has been observed between Asian and European populations. Although genetic admixture among Eurasians, particularly between East Asians and Northwestern Europeans, has been reported, population admixture and gene flow between Arabian Peninsula and European populations have been poorly studied at the genome level. Here, we compared whole-exome sequencing data from 1208 individuals from Arabian Peninsula, African, European, Caucasian, East Asian and South Asian populations to identify genetic structure and gene flow between them. We show that there is less differentiation between Arabian Peninsula and Italian samples than between the Arabian Peninsula and the other Europeans in this study. Italian samples also exhibit higher similarity to Qatari individuals in copy number variation (deletions and duplications). Arabian Peninsula ancestry extends into South Asian, Caucasian and some European populations. Large identity by descent tracts (≥ 2.0 cM) were identified between Arabian Peninsula individuals and individuals from Kenya and Nigeria. Outgroup f3-statistics suggest that, within Europe, Southern Europeans share more genetic drift with the Arabian Peninsula than other European regions do. It is possible that these patterns reflect Arab migration to the Italian island of Sicily, perhaps dating back to 831-1072 AD. Our results reveal the genetic structure of Afro-Eurasian populations, with different levels of Southern European admixture resulting from genetic interaction with the Arabian Peninsula.


Introduction
The genetic structure of biological populations varies across the world as a result of the interaction between population admixture, movement, gene flow and natural selection. It is believed that early modern humans left Africa via the Nile Valley toward the Middle East, and across the Red Sea to the Arabian Peninsula (AP), settling in these places as early as 125,000 years ago [1,2]. Their descendants then spread into Southern Asia and Australia [3], Europe, and eventually the Americas [4], forming the basis of modern human population structure and opening a continuing path for admixture.
In addition, genome-wide studies have revealed that European Romani individuals fall between South Asian and non-Roma European populations relative to populations from the present-day Punjab state of India, Central Asia, Pakistan and the Caucasus [11][12][13]. Furthermore, gene flow from North African groups has affected the gene pool of different human populations in southern Europe [6].
Although genetic interaction among Eurasian groups, particularly between North Europeans and East Asians, has been reported [14,15], AP gene flow into Europeans has not yet been well studied. To address this gap, the present research comprises a whole-exome sequencing (WES) analysis of 1208 individuals from Arabian Peninsula, African, European, Caucasian, East Asian and South Asian populations, with the aims of determining the level of admixture of the AP with European populations and distinguishing the patterns of gene flow between them. Our main objectives were to evaluate genetic structure and phylogenetic relationships, and to quantify the extent and pattern of recent gene flow between AP and European populations.

... using Burrows-Wheeler Aligner (BWA) (https://sourceforge.net/projects/bio-bwa) [16]. All SAM files were converted to BAM files using SAMtools [17], followed by sorting and indexing. PCR duplicates were marked in the BAM files using the MarkDuplicates tool from Picard Tools (https://github.com/broadinstitute/picard). Indels were realigned and base qualities were recalibrated using the IndelRealigner and BaseRecalibrator commands, respectively, of GATK 3.4 (www.broadinstitute.org/gatk). Finally, SNPs were called across all individuals and filtered using UnifiedGenotyper with the "EMIT_ALL_SITES" option and the VariantFiltration command in GATK 3.4. The resulting VCF file was filtered for MAF ≥ 0.05 and max-missing = 0.90 using VCFtools v.0.1.13 [18]. After the application of quality control filters, 203,256 high-quality WES SNVs were retained for our analyses.
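The final site-level filter (MAF ≥ 0.05, max-missing = 0.90) can be illustrated with a minimal sketch. It is not the VCFtools implementation: the genotype encoding (0, 1 or 2 alternate alleles, `None` for a missing call) and the function names `site_passes` and `filter_sites` are assumptions made for illustration. It follows the VCFtools convention in which max-missing = 0.90 keeps sites where at least 90% of genotypes are called.

```python
from typing import List, Optional, Sequence

# Hypothetical encoding: number of alternate alleles (0, 1, 2); None = missing call.
Genotype = Optional[int]


def site_passes(genotypes: Sequence[Genotype],
                min_maf: float = 0.05,
                min_call_rate: float = 0.90) -> bool:
    """Return True if a biallelic site passes the call-rate and MAF thresholds."""
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    # Proportion of individuals with a non-missing genotype at this site.
    call_rate = len(called) / len(genotypes)
    if call_rate < min_call_rate:
        return False
    # Alternate allele frequency among called diploid genotypes.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf >= min_maf


def filter_sites(matrix: Sequence[Sequence[Genotype]]) -> List[int]:
    """Return the indices of the sites (rows) that pass both filters."""
    return [i for i, row in enumerate(matrix) if site_passes(row)]
```

In this sketch a monomorphic site or a site with 20% missing calls is dropped, while a common variant with complete data is kept.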

Admixture analysis
To investigate potential genetic admixture between AP, European and worldwide population samples, we used the block relaxation algorithm implemented in ADMIXTURE v.1.3.0 [19] to estimate individual ancestry proportions given k ancestral components. We applied the default cross-validation parameters (folds = 5) with values of k ranging from 2 to 23. The cross-validation error values calculated by ADMIXTURE, used to evaluate the fit of different values of k, indicated that k = 12 was optimal for these samples (Figure S1).
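Selecting k by minimum cross-validation error can be sketched as follows. The sketch assumes the ADMIXTURE log lines have the form `CV error (K=3): 0.52305`; the function names and any error values passed in are hypothetical and are not taken from this study.

```python
import re
from typing import Dict, Iterable

# Assumed format of the cross-validation line in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def parse_cv_errors(log_lines: Iterable[str]) -> Dict[int, float]:
    """Collect the cross-validation error reported for each value of K."""
    errors: Dict[int, float] = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors


def best_k(errors: Dict[int, float]) -> int:
    """Return the K with the lowest cross-validation error."""
    if not errors:
        raise ValueError("no cross-validation errors found")
    return min(errors, key=lambda k: errors[k])
```

Applied to the logs of runs with k from 2 to 23, `best_k` would return the value of k at the minimum of the curve shown in Figure S1.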

Copy number variations (CNVs) analysis
CNVs were identified using CNVkit [20], a command-line toolkit for inferring and visualizing copy number from WES data relative to a reference human genome (hg19). Five samples from each population were randomly selected so that the heatmap could show more detail. BAM files were used as input, and CNVs were called for each individual with default CNVkit settings. CNVs on the X and Y chromosomes were excluded.

Principal-components analysis (PCA)
PCA was used to investigate the affinities and relationships among human populations. We performed PCA on all samples using smartpca from the EIGENSTRAT package [21], and the first two principal components were compared graphically.

Influence of recent migration on maximum-likelihood phylogeny
TreeMix v.1.13 [22] was used to estimate the historical relationships and migration among populations on a maximum-likelihood phylogeny. We tested the fit of a model with 3 migration events (-m 3). TreeMix was run using SNPs grouped in windows of 500 (-k 500), with samples grouped by location and population. Sample-size adjustment was turned off because the number of samples per population was ≥ 7.
... shared genetic drift between the AP and population X after divergence from the ancestors of the Mbuti, used as the outgroup. Standard errors were estimated using a weighted block jackknife approach [25] over 5-Mb blocks.
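The outgroup f3 computation can be sketched as follows, assuming the standard form f3(Mbuti; AP, X), i.e. the mean over SNPs of (p_Mbuti − p_AP)(p_Mbuti − p_X). This is an illustration only: it omits the finite-sample bias correction, and it uses a simple delete-one block jackknife rather than the weighted version cited above. All function and parameter names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def f3_products(p_out: Sequence[float],
                p_a: Sequence[float],
                p_x: Sequence[float]) -> List[float]:
    """Per-SNP terms (p_out - p_a) * (p_out - p_x) of the outgroup f3 statistic."""
    if not (len(p_out) == len(p_a) == len(p_x)) or not p_out:
        raise ValueError("frequency vectors must be non-empty and of equal length")
    return [(o - a) * (o - x) for o, a, x in zip(p_out, p_a, p_x)]


def f3_with_jackknife(p_out: Sequence[float],
                      p_a: Sequence[float],
                      p_x: Sequence[float],
                      block_ids: Sequence[int]) -> Tuple[float, float]:
    """Return (f3 estimate, standard error) from a delete-one block jackknife.

    block_ids assigns each SNP to a genomic block (for example a 5-Mb window).
    Simplification: blocks are weighted equally, unlike the weighted jackknife.
    """
    products = f3_products(p_out, p_a, p_x)
    if len(block_ids) != len(products):
        raise ValueError("one block id per SNP is required")
    total, n = sum(products), len(products)
    block_sum: Dict[int, float] = {}
    block_n: Dict[int, int] = {}
    for prod, block in zip(products, block_ids):
        block_sum[block] = block_sum.get(block, 0.0) + prod
        block_n[block] = block_n.get(block, 0) + 1
    g = len(block_sum)
    if g < 2:
        raise ValueError("at least two blocks are needed for a jackknife")
    estimate = total / n
    # Re-estimate f3 with each block left out in turn.
    leave_out = [(total - block_sum[b]) / (n - block_n[b]) for b in block_sum]
    mean_leave_out = sum(leave_out) / g
    variance = (g - 1) / g * sum((t - mean_leave_out) ** 2 for t in leave_out)
    return estimate, sqrt(variance)
```

A larger estimate indicates more drift shared by the AP and population X since their split from the outgroup; dividing the estimate by its standard error gives an approximate Z-score.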
Regions of genomic identity by descent (IBD) were identified [26] with the Beagle [27] implementation of fastIBD [28].

We applied PCA and the clustering algorithm ADMIXTURE to study the population structure and the relationship of the AP to European and worldwide populations in a WES data set of 1208 samples. In Figure 1A, ... (Figure 1B). The long tails exhibited by the AP populations in the PCA plot suggest admixture events or gene flow from other populations (Figure 1B).

Genome-wide ancestry analysis of the AP and worldwide population
In the ADMIXTURE analysis, the lowest cross-validation error was found at K = 12 (Figure S1). These results show that AP ancestral components extend into South Asian, Caucasian and some European populations, especially Italian individuals, which is consistent with the PCA results and suggests admixture or gene flow events (Figure 1C). All AP subgroups showed similar ancestral patterns.
However, we observed small proportions of other ancestral components in AP individuals (Figure 1C).
To test for signatures of recent admixture among the modern human populations in this study, we built a maximum-likelihood tree using the TreeMix approach [29]. TreeMix uses a model that allows for both population splits and gene flow to better capture historical relationships between populations. We first generated a tree with no migration events (Figure 2A). This tree showed a close relationship among the AP populations, which confirmed our PCA and ADMIXTURE results.
Furthermore, the tree shows that the European and Caucasian populations share drift with the AP populations (Figure 2A). Italian individuals were close to the AP branch but showed substantial divergence. German individuals showed much greater apparent divergence among ...

Whole-Exome CNV analysis of AP and worldwide population
CNVs are a category of structural variation defined by the gain or loss of large regions of genomic sequence [30,31] and can be used to measure genetic relatedness. In the present study, ... (Table S3). Within Asia, the highest sharing was found with Pakistani ... (Figure 4A).
The use of WES data and a larger sample size allows us to better investigate the level of admixture and to characterize recent gene flow by searching for long segments of genetic identity by descent (IBD). Large IBD tracts (≥ 2.0 cM) were identified between the AP and both African populations in this study, and the degree of recent gene flow with Eurasian populations ranged from essentially none with East Asians to very high with Caucasian and Italian individuals (Figure 4B). These results are consistent with the geographical proximity between the AP and each respective group.
Among the Eurasian populations, Pakistan shares the second-highest level of IBD with the AP, behind the Caucasian population (Figure 4B).
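The IBD comparison described above amounts to keeping tracts of at least 2.0 cM and totalling them per partner population. A minimal sketch follows, assuming each segment has already been converted to a hypothetical (population 1, population 2, length in cM) record. It does not reproduce fastIBD output parsing or any normalization by the number of individual pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (population of individual 1, population of individual 2, length in cM).
Segment = Tuple[str, str, float]


def total_ibd_with(target: str,
                   segments: Iterable[Segment],
                   min_cm: float = 2.0) -> Dict[str, float]:
    """Sum the length of IBD tracts >= min_cm shared between target and each other population."""
    totals: Dict[str, float] = defaultdict(float)
    for pop1, pop2, length_cm in segments:
        if length_cm < min_cm:
            continue  # discard short tracts
        if pop1 == target and pop2 != target:
            totals[pop2] += length_cm
        elif pop2 == target and pop1 != target:
            totals[pop1] += length_cm
    return dict(totals)
```

Ranking the resulting totals, ideally after dividing by the number of individual pairs compared, gives the kind of per-population ordering reported in Figure 4B.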

Discussion
Analyzing detailed genetic diversity among human ethnic groups can be biomedically beneficial. It also helps to identify stratification within populations and interactions among populations, and to infer shared ancestry through time and across geographic regions. In the present study, WES data revealed that recent genetic admixture has occurred and has been prevalent on the European continent [32]. Admixture was detected between Southern European (Italian) and AP populations (Figure 1C), which are geographically distant from each other and generally considered well-differentiated populations. We note that this admixture might not have come directly from the AP; instead, it could have come from Caucasian people [33] living in Eastern Europe. Qatari samples overlapped with all Saudi Arabian samples and with parts of the Italian, South Asian and African populations. A previous study reported that the Qatari can be separated into three founder populations: Arab; Iranian ("Persian") and Central Asian; and Bantu-speaking African [34]. According to our findings, the Qatari may have more than three founder populations, and South Asian and European populations may be among the original sources of these founders (Figure 1C).

Differences in the copy number of genomic segments can result in changes in gene expression and phenotypic variation through gene disruption and altered gene dosage [35,36]. Based on CNV analysis, our study revealed similarity in CNV structure between Southern European (Italian) and AP populations [37]. ... [39]. In addition, Caucasian individuals are closely related to AP populations, in agreement with continuous gene flow, as indicated by the IBD and f3-ratio statistics [40,41]. The AP is also similar to African populations, a finding that supports the view that the Red Sea coasts may have been important in this southern expansion [42]. IBD analysis revealed that Kenya and Nigeria exhibit the highest IBD sharing with the AP (≥ 2 cM). This may reflect not only recent gene flow between populations, but also common ancestry or ancient admixture.

Conclusion
Our findings contribute to an improved understanding of the history of human migration and the evolutionary mechanisms that have shaped the genetic structure of populations in Afro-Eurasia. Our study has confirmed that Southern European, South Asian and Caucasian populations carry ancestry from the AP.

Acknowledgment
Thanks are owed to the University of Tabriz Research Computing support staff for analytic assistance.

Conflict of Interest
The author(s) declare that there are no competing interests.

Author Contributions
HC and RJO performed data analysis, manuscript preparation and manuscript revision. Both authors have read and approved the final manuscript.

Data Availability Statement
Data are available in the Supporting Information.