Analysis of Privacy Protection Methods for DNA Motif Finding

DNA motif finding is the task of locating short, recurring sequence fragments in a given set of DNA sequences. Pinpointing these motifs is essential for understanding how gene expression is regulated, so motif finding is key to understanding the mechanism of transcriptional regulation. However, the security and privacy risks of motif finding are serious enough that they demand more attention. In this paper, we briefly review methods for protecting privacy in motif finding from three broad perspectives: controlled access, anonymity methods, and ε-differential privacy.


Introduction
The implementation of the genome project has made DNA sequence analysis a top priority for bioinformatics.
Sequence alignment and motif finding are the two main directions of biological sequence analysis. A DNA sequence motif is a short, recurring pattern in DNA sequences that is presumed to have a biological function [1][2][3]. In 1975, Pribnow used early multiple-sequence comparison methods to analyze the promoter region of yeast and found a TATA box, a highly conserved and consistent 10 bp pattern; this was the first motif to be discovered [4]. More recently, stochastic evolution methods have been applied to motif finding. Over more than 40 years of development, research on motif finding methods has grown exponentially. National scientific research projects involving motif finding in genetic research are growing at an annual rate of 30%; by publication year 2020, as many as 64,116 papers on motif finding in genetic research had been published; and thousands of DNA motif finding algorithms [5][6][7][8][9] and platforms [10][11][12] have been developed.
In fact, DNA sequence analysis can reveal a great deal about a person's characteristics, functions, illnesses, and personality disorders, as well as about his or her genetic relatives [13][14], all of which is highly private. Such private information is easily leaked when DNA sequences are mined.
In this paper, we first briefly examine the types of privacy leakage in DNA motif finding, as well as some of the methods used in its discovery process. We then review protection methods from three broad perspectives: controlled access, anonymity methods, and differential privacy. The paper is organized as follows. Section 2 summarizes the current privacy protection methods for DNA motif finding. Section 3 briefly reviews the main content of the paper.

Methods
With the explosive growth of DNA data, making full use of the data is the only way to increase its value. However, privacy protection has clearly become a bottleneck in the development of DNA sequence analysis, especially motif finding.

Controlled Access
Controlled access works in the same way as dbGaP file downloads in the NCBI database: a user may obtain and manipulate the specified data only after receiving approval, and only within the granted operating permissions. Also, Controlled access to protect

k-Anonymity Method
The main idea of this method is to generalize or suppress the target data so that each record in the published DNA data set is indistinguishable from at least k-1 other records on the quasi-identifier. The probability that an attacker can link an individual to his or her private information in the published data set is then at most 1/k, which effectively protects the personal privacy of the data owner. For example, [26] applied a k-anonymity-based method before performing DNA motif finding and successfully protected the privacy of the DNA data sharers. However, because of the particular nature of DNA sequence data, applying k-anonymity in this field easily over-generalizes the data, which makes the DNA data analysis lose its value.
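As a rough illustration of generalization on DNA data, the sketch below groups sequences and replaces each group with a single sequence written in IUPAC ambiguity codes, so that every published record is shared by at least k records. It is not the method of [26]. The helper names `generalize`, `k_anonymize`, and `is_k_anonymous` are hypothetical, and the sketch assumes equal-length sequences over A, C, G, T and treats the whole sequence as the quasi-identifier.

```python
from collections import Counter

# IUPAC ambiguity codes: each set of observed bases maps to one symbol.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def generalize(group):
    """Collapse equal-length sequences into one IUPAC-coded sequence."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*group))


def k_anonymize(sequences, k):
    """Hypothetical helper: sort, cut into groups of at least k, generalize.

    The output follows the sorted order of the input, not the original order.
    """
    if len(sequences) < k:
        raise ValueError("need at least k sequences")
    ordered = sorted(sequences)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Merge a short trailing group into the previous one so every group has >= k.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [generalize(group) for group in groups for _ in group]


def is_k_anonymous(published, k):
    """True if every published record occurs at least k times."""
    return all(count >= k for count in Counter(published).values())


sequences = ["ACGT", "ACGA", "TTGT", "TTGC"]
published = k_anonymize(sequences, k=2)
```

With k = 2, the four sequences are published as ACGW, ACGW, TTGY, TTGY. With k = 4, every record becomes WYGH, which shows how quickly generalization erodes the signal that motif finding depends on.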

ε-Differential Privacy
Although it is widely believed that improper use of DNA data can reveal personal information, it remains uncertain which types of privacy leakage result from which information, or what background knowledge [27] an attacker might use to launch an attack. ε-differential privacy addresses this uncertainty and is a powerful method currently applied to privacy protection in DNA motif finding. It requires that the result of any analysis not depend substantially on any single data record; in DNA motif finding, this means any single DNA sequence. For instance, the authors of [24] proposed a high-utility motif finding algorithm based on ε-differential privacy.
Their solution uses the closed frequent pattern set to reduce redundant motifs in the result set and obtain accurate motif results, and then applies ε-differential privacy to protect those results. Therefore, when motif finding results are shared, ε-differential privacy ensures that private information is not disclosed even if the attacker knows all of the data except a certain DNA sequence. However, the use of ε-differential privacy in DNA motif finding still suffers from problems such as high redundancy in the results.
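Formally, a randomized mechanism M is ε-differentially private if, for any two data sets D and D' that differ in one record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]. A common way to meet this bound for numeric results is the Laplace mechanism, sketched below for motif support counts. This is a generic illustration, not the algorithm of [24]. The function names, the example motifs, and the choice of sensitivity are assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_motif_counts(counts, epsilon, sensitivity, rng=None):
    """Hypothetical helper: add Laplace noise to every motif support count.

    `sensitivity` is the L1 sensitivity of the whole released vector, i.e.
    the most the counts can change in total when one DNA sequence is added
    or removed. If one sequence can change each of m support counts by at
    most 1, then m is a safe (assumed) upper bound.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return {motif: count + laplace_noise(scale, rng)
            for motif, count in counts.items()}


# Illustrative support counts: number of sequences containing each motif.
counts = {"TATAAT": 12, "TTGACA": 7}
noisy = private_motif_counts(counts, epsilon=1.0, sensitivity=len(counts))
```

A smaller ε gives stronger privacy but noisier counts. Because the noise scale grows with the number of motifs released, pruning redundant motifs before adding noise helps preserve utility.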

Conclusions
DNA sequence analysis will deepen our understanding of human health and disease, and it plays a major role in discovering the causes of disease and in enabling prevention, diagnosis, and personalized medical treatment. But the rich information contained in DNA sequences is easily leaked during motif finding.
In this article, we described techniques for protecting human genetic privacy from three perspectives: controlled access, anonymity methods, and differential privacy. Of course, none of these is a perfect method for privacy protection in DNA motif finding, because new problems keep arising.
In this context, the future direction is clear: efforts will focus more on effective solutions to the problem of genetic privacy and security. We believe it is necessary to seriously investigate and adopt a variety of methods for privacy protection in DNA motif finding.