An Automated Approach to Identify RNA Editing Sites

Abstract
RNA editing is a type of post-transcriptional modification of RNA sequences. One way to detect RNA editing is to compare mature mRNA (or cDNA) with the sequences of the corresponding coding regions. In most existing studies, the coding region sequences were extracted from the reference genome, so SNPs were also detected during the comparison. In this study, both the coding region sequences and the mature mRNAs or cDNAs came from the same genome. Therefore, the variations detected between the mature mRNAs and the coding regions are either RNA editing sites or sequencing errors. We developed an automated computational approach to identify RNA editing sites and the clusters with a high frequency of RNA editing sites. The results provide a candidate list of genes that are most likely to contain coding regions with RNA editing sites. The results also show that most of the "A-to-G" editing sites are located in the 3' regions, followed by transcript and exonic regions. Moreover, we provide a visualization of the editing sites within genes and chromosomes. Because experimental studies to identify RNA editing sites are very resource intensive in terms of cost, time, and effort, our results can be used to define the initial candidate list of genes that should be experimentally tested.

Several methods exist to identify RNA editing sites. The separate samples and pooled samples methods depend on RNA sequencing data without the need for matched genome sequencing [4]. GIREMI is another method that uses allelic linkage and generalized linear models to differentiate between RNA editing sites and genetic variations in a single RNA-seq sample [5]. RNA Editor is a method that uses a clustering algorithm to identify the distribution of editing sites [6]. The authors of [7] developed a method to predict the distribution of RNA editing sites using two parameters, Hits Per Billion-mapped-bases (HPB) and Potential SNP Score (PPS).
The current work aims to develop an automated approach to identify RNA editing sites and the clusters with a high frequency of RNA editing sites.

Materials and Methods
RNA-seq data were obtained from the School of Medicine at the University of Pittsburgh. The RNA-seq data were generated from isolated hepatocytes (the cells of the main parenchymal tissue of the liver, as distinct from whole liver tissue), which excluded all other cell types from the liver. The sequencing was performed on three unrelated mice with a B6 background. The dataset includes 197123 records.
Every record has detailed information such as chromosome, region, reference allele, gene name, and gene version.
Mouse SNPs and gene annotation information were extracted from the Ensembl database V80, a publicly accessible database in which sequence data are integrated with gene annotation and which aims to predict gene locations [8].
A comprehensive relational database was built to integrate the required information about mouse SNPs and gene annotation. This database expedites searching and querying the mouse data and permits efficient comparisons with our dataset. To compare our generated mouse database with the data from the School of Medicine, we used the chromosome as the first matching criterion and the region as the second criterion, which allowed us to identify the strand (forward or reverse) and the distribution of editing sites.
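The two-step matching described above can be sketched as follows. This is a minimal illustration, not the database queries used in the study: the gene names, coordinates, and the helper names build_index and match_site are hypothetical, and the sketch assumes that "region" can be treated as a start/end interval on a chromosome.

```python
from collections import defaultdict

# Hypothetical annotation rows: (gene, chromosome, start, end, strand).
# Real rows would come from the Ensembl-derived relational database.
ANNOTATION = [
    ("GeneA", "4", 1000, 5000, "+"),
    ("GeneB", "4", 8000, 9000, "-"),
    ("GeneC", "8", 200, 900, "+"),
]


def build_index(annotation):
    """Group gene intervals by chromosome (the first matching criterion)."""
    index = defaultdict(list)
    for gene, chrom, start, end, strand in annotation:
        index[chrom].append((start, end, strand, gene))
    return index


def match_site(index, chrom, position):
    """Return (gene, strand) pairs whose interval contains the position
    (the second matching criterion, assumed here to be an interval test)."""
    return [
        (gene, strand)
        for start, end, strand, gene in index.get(chrom, [])
        if start <= position <= end
    ]


index = build_index(ANNOTATION)
print(match_site(index, "4", 8500))
```

Grouping by chromosome first keeps each lookup restricted to the genes on one chromosome, which mirrors the order of the two criteria in the text.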
Three methods of analysis were performed on the genes. The first is based on the total number of unique genes: it counts the number of occurrences of each gene regardless of the count of each occurrence. The second is based on the total number of editing sites (events, not counts). The third is based on the ratio of count to coverage of the editing sites. For each kind of analysis, we identified the list of top ranked genes according to a certain threshold. Additionally, we performed a statistical analysis on the editing sites and determined their location distribution, that is, the range of the editing sites in bp and their assignment to the 3', 5', CDS, and exon regions. We then visualized our results to show the overall picture of RNA editing sites and their density. The three methods are described in more detail below.
In the first method, the total number of editing sites in each gene was considered regardless of the count for each occurrence. For example, if editing site number 1 has a count of 500, this editing site is counted once. In the second method, the count of editing sites in each occurrence was considered. In the third method, the ratio of count to coverage was used to select the top ranked genes.
We used a threshold value of 0.45, which means that we considered only the editing sites with a ratio >= 0.45. In the corresponding plots, the x-axis represents the genome position and the y-axis represents the count of editing sites.
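The three scoring methods can be illustrated with a short sketch. The records, gene names, and the function names score_genes and top_ranked are hypothetical. The text does not state how the ratio is aggregated per gene, so counting the sites that pass the 0.45 threshold is an assumption made only for this example.

```python
from collections import defaultdict

# Hypothetical records: (gene, site position, count, coverage).
RECORDS = [
    ("GeneA", 101, 500, 1000),
    ("GeneA", 250, 20, 100),
    ("GeneB", 77, 90, 100),
    ("GeneB", 300, 5, 50),
    ("GeneC", 12, 50, 100),
]

RATIO_THRESHOLD = 0.45  # threshold value stated in the text


def score_genes(records, threshold=RATIO_THRESHOLD):
    """Compute one score per gene for each of the three methods."""
    sites = defaultdict(int)        # method 1: each site counted once
    counts = defaultdict(int)       # method 2: sum of the per-site counts
    ratio_sites = defaultdict(int)  # method 3 (assumed): sites with ratio >= threshold
    for gene, _position, count, coverage in records:
        sites[gene] += 1
        counts[gene] += count
        if coverage > 0 and count / coverage >= threshold:
            ratio_sites[gene] += 1
    return dict(sites), dict(counts), dict(ratio_sites)


def top_ranked(scores, n):
    """Return the n genes with the highest score (ties broken by gene name)."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))[:n]


sites, counts, ratio_sites = score_genes(RECORDS)
print(top_ranked(sites, 2), top_ranked(counts, 2), top_ranked(ratio_sites, 2))
```

In this toy input, a site with a count of 500 adds 1 to the method 1 score and 500 to the method 2 score, which is the distinction the example in the text draws.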
After comparing the three methods, we found some discrepancies in the distribution of editing sites. The reason for these discrepancies is that the list of top ranked genes differs among the three methods. To identify the best method, we compared our results with the clinical results and found that combining the three methods produces better results.
To get a complete view of the distribution of editing sites in the different regions, we combined the three lists of top ranked genes into one set, as given by the following expression:
Top ranked genes = Top ranked genes list 1 ∪ Top ranked genes list 2 ∪ Top ranked genes list 3    (1)
where Top ranked genes list 1 is the list of top ranked genes obtained using method 1, and so on, and ∪ denotes set union.
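Expression (1) is a plain set union, which can be sketched in a few lines. The gene names and the function name combine_top_ranked are hypothetical and serve only to show how duplicates across lists collapse into a single entry.

```python
def combine_top_ranked(*gene_lists):
    """Union of any number of top ranked gene lists, as in expression (1)."""
    combined = set()
    for genes in gene_lists:
        combined |= set(genes)
    return sorted(combined)  # sorted only to make the output reproducible


# Hypothetical top ranked lists from methods 1, 2, and 3.
list1 = ["GeneA", "GeneB", "GeneC"]
list2 = ["GeneB", "GeneD"]
list3 = ["GeneA", "GeneE"]

print(combine_top_ranked(list1, list2, list3))
```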
Finally, we determined the distribution of editing sites in each of the selected top ranked genes (gene-based analysis).

Analysis Using First Method
The top 40 ranked genes according to method 1 are shown in Table 1. Table 2 shows the top ranked genes according to the second method. As examples of the distribution of editing sites in each chromosome using method 2, we provide a representation of chromosomes 4 and 8. The distributions of editing sites in methods 1 and 2 differ because the two methods use different criteria and therefore select different sets of genes. Table 3 shows the top ranked genes according to method 3.
In our study, we found 7954 distinct genes expressed in primary hepatocytes. RNA editing is known to be a relatively rare event in the RNA pool, occurring only in certain RNA molecules at certain adenosine residues. Thus, identifying the edited genes and the RNA editing sites is very challenging. Analysis of our RNA-seq data found A-to-G mismatch sites, which are the potential editing sites. However, experimental testing of every single gene is very time and resource consuming. Confirming the editing events therefore requires a candidate list of genes that are most likely to contain coding regions with RNA editing sites.
In our computational study, we used three methods to obtain the candidate genes and assigned a score to each gene. Based on these methods, we selected the top 40 ranked genes using each method. We then intersected the top 40 lists and retained the genes that appear in at least two of the three methods, which gave a list of 20 genes. These candidate genes were tested in laboratory experiments, and the experiments positively confirmed our list of 20 genes. This implies that our computational methods are very helpful and save time and effort: experimental testing is narrowed to 20 of 7954 genes (20/7954 ≈ 0.0025, about 0.25% of the expressed genes).
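The selection of genes that appear in at least two of the three lists, and the resulting fraction of genes to test, can be sketched as follows. The function names and the toy gene lists are hypothetical; only the figures 20 and 7954 come from the text.

```python
from collections import Counter


def genes_in_at_least(gene_lists, min_lists=2):
    """Return genes that appear in at least min_lists of the given lists."""
    tally = Counter(gene for genes in gene_lists for gene in set(genes))
    return sorted(gene for gene, n in tally.items() if n >= min_lists)


def candidate_fraction(n_candidates, n_genes):
    """Fraction of all expressed genes that remain to be tested."""
    return n_candidates / n_genes


# Hypothetical shortened lists standing in for the three top 40 lists.
top_lists = [
    ["GeneA", "GeneB", "GeneC"],
    ["GeneB", "GeneC", "GeneD"],
    ["GeneC", "GeneE"],
]

print(genes_in_at_least(top_lists))
print(round(candidate_fraction(20, 7954), 4))
```

Wrapping each list in set() before counting ensures that a gene repeated within one list is not mistaken for agreement between methods.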

Conclusions
In this research, we developed an automated approach that identifies RNA editing sites and the clusters with a high frequency of RNA editing sites. The top ranked genes were selected according to several criteria, including the total number of editing sites, the count of editing sites, and the ratio of count to coverage. We found that the ratio of count to coverage can provide more accurate results. Based on the current results, most of the editing sites are "A-to-G" editing sites located in the 3' regions, followed by CDS and transcript regions and then exonic regions. Additionally, we have provided a spatial visualization of the editing sites within genes and chromosomes.
In the future, we aim to perform a similar study on different species and determine whether they share the same patterns or show species-specific patterns.

American Journal of Biomedical Science & Research
Copy@ Mohannad AL Saghir

Data Availability
All datasets generated or processed during this study are available upon reasonable request from the corresponding author.

Conflicts of Interest
The author(s) declare(s) that there is no conflict of interest.