Big Data Mining and Analytics  2018, Vol. 01 Issue (03): 191-210    DOI: 10.26599/BDMA.2018.9020018
Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning
Ning Yu, Zhihua Li, Zeng Yu*
Ning Yu is with the Department of Computing Sciences, College at Brockport, State University of New York, Brockport, NY 14422, USA. E-mail: nyu@brockport.edu.
Zhihua Li is with the Department of Computer Science and Technology at Jiangnan University, Wuxi 214122, China. E-mail: zhli@jiangnan.edu.cn.
Zeng Yu is with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China.

Abstract

Data-driven machine learning, especially deep learning, is becoming an important tool for handling big data problems in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. A similar conversion occurs in Genomic Signal Processing (GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is also called an encoding scheme. The choice of encoding scheme can greatly affect the performance of GSP applications and machine learning models. This paper aims to collect, analyze, discuss, and summarize the existing encoding schemes for genome sequences, particularly in GSP as well as in other genome analysis applications, to provide a comprehensive reference for genomic data representation and feature learning in machine learning.
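
As a concrete illustration of what converting a DNA sequence to numerical values can look like, the following minimal sketch shows two simple encodings: a one-hot (binary indicator) encoding and a plain integer mapping. The function names, the nucleotide order, the specific integer values, and the handling of unknown symbols are all assumptions made for this example; they are not taken from the paper.

```python
# Illustrative sketch only: names, nucleotide order, and integer values
# are assumptions for this example, not definitions from the survey.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Map each nucleotide to a 4-element binary indicator vector."""
    encoded = []
    for base in sequence.upper():
        # Symbols outside A/C/G/T (e.g. N) become an all-zero vector here.
        encoded.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return encoded


def integer_encode(sequence, mapping=None):
    """Map each nucleotide to a single integer; unknown symbols map to -1."""
    if mapping is None:
        mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [mapping.get(base, -1) for base in sequence.upper()]


if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
    print(integer_encode("ACGT"))
```

The one-hot form treats the four nucleotides as unordered categories, whereas the integer form imposes an arbitrary ordering on them; differences of this kind are one reason the choice of encoding can influence downstream models.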

Received: 21 January 2018      Published: 13 January 2020
Corresponding Authors: Zeng Yu