Big Data Mining and Analytics  2018, Vol. 01 Issue (03): 191-210    DOI: 10.26599/BDMA.2018.9020018
    
Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning
Ning Yu, Zhihua Li, Zeng Yu*
Ning Yu is with the Department of Computing Sciences, College at Brockport, State University of New York, Brockport, NY 14422, USA. E-mail: nyu@brockport.edu.
Zhihua Li is with the Department of Computer Science and Technology at Jiangnan University, Wuxi 214122, China. E-mail: zhli@jiangnan.edu.cn.
Zeng Yu is with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China.

Abstract  

Data-driven machine learning, especially deep learning technology, is becoming an important tool for handling big data issues in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. A similar conversion occurs in Genomic Signal Processing (GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is also called an encoding scheme. The choice of encoding scheme can greatly affect the performance of GSP applications and machine learning models. This paper aims to collect, analyze, discuss, and summarize the existing encoding schemes for genome sequences, particularly in GSP as well as in other genome analysis applications, to provide a comprehensive reference for genomic data representation and feature learning in machine learning.



Key words: encoding scheme; data representation; feature learning; deep learning; genomic signal processing; machine learning; genome analysis
Received: 21 January 2018      Published: 13 January 2020
Corresponding Authors: Zeng Yu   
Cite this article:

Ning Yu, Zhihua Li, Zeng Yu. Survey on Encoding Schemes for Genomic Data Representation and Feature Learning—From Signal Processing to Machine Learning. Big Data Mining and Analytics, 2018, 01(03): 191-210.

URL:

http://bigdata.tsinghuajournals.com/10.26599/BDMA.2018.9020018     OR     http://bigdata.tsinghuajournals.com/Y2018/V01/I03/191

Fig. 1 Summary of encoding schemes.
Fig. 2 Enthalpy values of thermodynamic interactions between two molecules. The unit of measurement is kcal/mol [29] (1 cal = 4.18 J).
Fig. 3 Difference of transition and transversion between molecules measured by Hamming distance and Euclidean distance [29].
Fig. 4 Dinucleotides placed in a unit circle.
Fig. 5 Six hexagons.
Fig. 6 Constellation for real number and complex number representations.
Category    Encoded initial position
CGR-RY      A(0, 0), T(1, 0), C(0, 1), G(1, 1)
CGR-MK      A(0, 0), T(1, 0), G(0, 1), C(1, 1)
CGR-WS      A(0, 0), G(1, 0), C(0, 1), T(1, 1)
Table 1 Encoded initial positions of CGR-walk.
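The corner assignments in Table 1 can be illustrated with a minimal sketch of the generic chaos game midpoint rule (as in Jeffrey [87]): each nucleotide moves the current point halfway toward its assigned corner. This is an assumption-laden illustration, not necessarily the CGR-walk model of [32]; the names `CORNERS` and `cgr_points` and the starting point (0.5, 0.5) are hypothetical choices made here.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Corner coordinates copied from Table 1 (one mapping per category).
CORNERS: Dict[str, Dict[str, Point]] = {
    "CGR-RY": {"A": (0, 0), "T": (1, 0), "C": (0, 1), "G": (1, 1)},
    "CGR-MK": {"A": (0, 0), "T": (1, 0), "G": (0, 1), "C": (1, 1)},
    "CGR-WS": {"A": (0, 0), "G": (1, 0), "C": (0, 1), "T": (1, 1)},
}


def cgr_points(seq: str, category: str = "CGR-RY",
               start: Point = (0.5, 0.5)) -> List[Point]:
    """Return the chaos game trajectory of a DNA sequence.

    Assumption: the standard midpoint rule is used, i.e. each new point
    lies halfway between the previous point and the corner of the
    current nucleotide. The start point (0.5, 0.5) is also an assumption.
    """
    corners = CORNERS[category]
    x, y = start
    points: List[Point] = []
    for base in seq.upper():
        cx, cy = corners[base]  # raises KeyError for symbols other than A, C, G, T
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ATGC", "CGR-RY"):
        print(point)
```

Swapping the `category` argument changes only which nucleotide sits at which corner, so the same sequence traces a different trajectory under each of the three assignments.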
Fig. 7 3-dimensional tetrahedron in a cube.
Fig. 8 Tetrahedron encoding scheme for codons.
Fig. 9 Encoding methods based on (a) a regular tetrahedron and (b) an irregular tetrahedron.
Fig. 10 Tetrahedron-based coordinate system in Z-curve.
Fig. 11 Flow chart on the position of encoding scheme in feature learning.
[1]   Sanger F., Air G. M., Barrell B. G., Brown N. L., Coulson A. R., Fiddes J. C., Hutchison III C. A., Slocombe P. M., and Smith M., Nucleotide sequence of bacteriophage ϕX174 DNA, Nature, vol. 265, no. 5596, pp. 687-695, 1977.
[2]   Yu N., Guo X., Gu F., and Pan Y., Signalign: An ontology of DNA as signal for comparative gene structure prediction using information-coding-and-processing techniques, IEEE Trans. NanoBioscience, vol. 15, no. 2, pp. 119-130, 2016.
[3]   Anastassiou D., Genomic signal processing, IEEE Signal Process. Mag., vol. 18, no. 4, pp. 8-20, 2001.
[4]   Holden T., Subramaniam R., Sullivan R., Cheung E., Schneider C., Tremberger G. Jr., Flamholz A., Lieberman D. H., and Cheung T. D., ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes, in Proc. Instruments, Methods, and Missions for Astrobiology X, San Diego, CA, USA, 2007, p. 669417.
[5]   Yu N., Yu Z., Gu F., and Pan Y., Evaluating the impact of encoding schemes on deep auto-encoders for DNA annotation, in Bioinformatics Research and Applications, Cai Z., Daescu O., and Li M., eds. Springer International Publishing, 2017, pp. 390-395.
[6]   Cristea P. D., Conversion of nucleotides sequences into genomic signals, J. Cell. Mol. Med., vol. 6, no. 2, pp. 279-303, 2002.
[7]   Voss R. F., Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett., vol. 68, no. 25, pp. 3805-3808, 1992.
[8]   Borrayo E., Mendizabal-Ruiz E. G., Vélez-Pérez H., Romo-Vázquez R., Mendizabal A. P., and Morales J. A., Genomic signal processing methods for computation of alignment-free distances from DNA sequences, PLoS One, vol. 9, no. 11, p. e110954, 2014.
[9]   Hutter B., Helms V., and Paulsen M., Tandem repeats in the CpG islands of imprinted genes, Genomics, vol. 88, no. 3, pp. 323-332, 2006.
[10]   Ning Z. M., Cox A. J., and Mullikin J. C., SSAHA: A fast search method for large DNA databases, Genome Res., vol. 11, no. 10, pp. 1725-1729, 2001.
[11]   Katoh K., Misawa K., Kuma K. I., and Miyata T., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res., vol. 30, no. 14, pp. 3059-3066, 2002.
[12]   King B. R., Aburdene M., Thompson A., and Warres Z., Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity, EURASIP J. Bioinform. Syst. Biol., vol. 2014, no. 1, p. 8, 2014.
[13]   Hoang T., Yin C. C., Zheng H., Yu C. L., He R. L., and Yau S. S. T., A new method to cluster DNA sequences using Fourier power spectrum, J. Theor. Biol., vol. 372, pp. 135-145, 2015.
[14]   Peng W., Wang J. X., Zhao B. H., and Wang L. S., Identification of protein complexes using weighted PageRank-nibble algorithm and core-attachment structure, IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 12, no. 1, pp. 179-192, 2015.
[15]   Cervantes-De la Torre F., González-Trejo J. I., Real-Ramírez C. A., and Hoyos-Reyes L. F., Fractal dimension algorithms and their application to time series associated with natural phenomena, J. Phys. Conf. Ser., vol. 475, no. 1, p. 012002, 2013.
[16]   Vinga S., Carvalho A. M., Francisco A. P., Russo L. M., and Almeida J. S., Pattern matching through chaos game representation: Bridging numerical and discrete data structures for biological sequence analysis, Algorithms Mol. Biol., vol. 7, no. 1, p. 10, 2012.
[17]   Kwan H. K. and Arniker S. B., Numerical representation of DNA sequences, in Proc. 2009 IEEE International Conf. Electro/Information Technology, Windsor, ON, Canada, 2009, pp. 307-310.
[18]   Arniker S. B. and Kwan H. K., Advanced numerical representation of DNA sequences, in Proc. 2012 Int. Conf. Bioscience, Biochemistry and Bioinformatics, Singapore, 2012, pp. 1-5.
[19]   Bielinska-Waz D., Graphical and numerical representations of DNA sequences: statistical aspects of similarity, J. Math. Chem., vol. 49, no. 10, pp. 2345-2407, 2011.
[20]   Roy A., Raychaudhury C., and Nandy A., Novel techniques of graphical representation and analysis of DNA sequences—A review, J. Biosci., vol. 23, no. 1, pp. 55-71, 1998.
[21]   Cosic I., Macromolecular bioactivity: Is it resonant interaction between macromolecules?—Theory and applications, IEEE Trans. Biomed. Eng., vol. 41, no. 12, pp. 1101-1114, 1994.
[22]   Pirogova E. and Cosic I., Examination of amino acid indexes within the resonant recognition model, in Proc. 2nd Conf. Victorian Chapter of the IEEE EMBS, Melbourne, Australia, 2001, pp. 1-4.
[23]   Ning J., Moore C. N., and Nelson J. C., Preliminary wavelet analysis of genomic sequences, in Proc. 2003 IEEE Bioinformatics Conf. Computational Systems Bioinformatics, Stanford, CA, USA, 2003, pp. 509-510.
[24]   Nair A. and Sreenadhan S. P., A coding measure scheme employing electron-ion interaction pseudopotential (EIIP), Bioinformation, vol. 1, no. 6, pp. 197-202, 2006.
[25]   Stanley H. E., Buldyrev S. V., Goldberger A. L., Goldberger Z. D., Havlin S., Mantegna R. N., Ossadnik S. M., Peng C. K., and Simons M., Statistical mechanics in biology: How ubiquitous are long-range correlations? Phys. A, vol. 205, nos. 1-3, pp. 214-253, 1994.
[26]   Li W. and Kaneko K., Long-range correlation and partial 1/f^α spectrum in a noncoding DNA sequence, EPL, vol. 17, no. 7, p. 655, 1992.
[27]   Bari A. T. M. G., Reaz M. R., Islam A. K. M. T., Choi H. J., and Jeong B. S., Effective encoding for DNA sequence visualization based on nucleotide’s ring structure, Evol. Bioinform., vol. 9, pp. 251-261, 2013.
[28]   Breslauer K. J., Frank R., Blöcker H., and Marky L. A., Predicting DNA duplex stability from the base sequence, Proc. Natl. Acad. Sci. USA, vol. 83, no. 11, pp. 3746-3750, 1986.
[29]   Yu N., Guo X., Gu F., and Pan Y., DNA AS X: An information-coding-based model to improve the sensitivity in comparative gene analysis, in Bioinformatics Research and Applications, Harrison R., Li Y. H., and Mandoiu I., eds. Springer International Publishing, 2015, pp. 366-377.
[30]   Garzon M. H. and Deaton R. J., Codeword design and information encoding in DNA ensembles, Nat. Comput., vol. 3, no. 3, pp. 253-292, 2004.
[31]   Deng W. and Luan Y. H., Analysis of similarity/dissimilarity of DNA sequences based on chaos game representation, Abstr. Appl. Anal., vol. 2013, p. 926519, 2013.
[32]   Gao J. and Xu Z. Y., Chaos game representation (CGR)-walk model for DNA sequences, Chin. Phys. B, vol. 18, no. 1, pp. 370-376, 2009.
[33]   Almeida J. S., Carriço J. A., Maretzek A., Noble P. A., and Fletcher M., Analysis of genomic sequences by chaos game representation, Bioinformatics, vol. 17, no. 5, pp. 429-437, 2001.
[34]   Faria L. C. B., Rocha A. S. L., Kleinschmidt J. H., Silva-Filho M. C., Bim E., Herai R. H., Yamagishi M. E. B., and Palazzo R. Jr., Is a genome a codeword of an error-correcting code? PLoS One, vol. 7, no. 5, p. e36644, 2012.
[35]   Liu X. and Geng X. L., A convolutional code-based sequence analysis model and its application, Int. J. Mol. Sci., vol. 14, no. 4, pp. 8393-8405, 2013.
[36]   Liu Z. B., Liao B., Zhu W., and Huang G. H., A 2D graphical representation of DNA sequence based on dual nucleotides and its application, Int. J. Quantum Chem., vol. 109, no. 5, pp. 948-958, 2009.
[37]   Nair A. S. S. and Mahalakshmi T., Visualization of genomic data using inter-nucleotide distance signals, in Proc. IEEE Genomic Signal Processing, Bucharest, Romania, 2005.
[38]   Hackenberg M., Previti C., Luque-Escamilla P. L., Carpena P., Martínez-Aroza J., and Oliver J. L., CpGcluster: A distance-based algorithm for CpG-island detection, BMC Bioinf., vol. 7, p. 446, 2006.
[39]   Yu N., Guo X., Zelikovsky A., and Pan Y., GaussianCpG: A Gaussian model for detection of human CpG island, in Proc. 5th Int. Conf. Computational Advances in Bio and Medical Sciences, Miami, FL, USA, 2015, p. 1.
[40]   Afreixo V., Bastos C. A. C., Pinho A. J., Garcia S. P., and Ferreira P. J. S. G., Genome analysis with inter-nucleotide distances, Bioinformatics, vol. 25, no. 23, pp. 3064-3070, 2009.
[41]   Zhou L. Q., Li R., and Han G. S., A method based on the improved inter-nucleotide distances of genomes to construct vertebrates phylogeny tree, in Proc. 7th Int. Conf. Biomedical Engineering and Informatics, Dalian, China, 2014, pp. 776-780.
[42]   Bastos C. A., Afreixo V., Pinho A. J., Garcia S. P., Rodrigues J. M., and Ferreira P. J., Inter-dinucleotide distances in the human genome: an analysis of the whole-genome and protein-coding distributions, J. Integr. Bioinform., vol. 8, no. 3, p. 172, 2011.
[43]   Mujiono, Wasito I., and Veritawati I., Fractal dimension approach for clustering of DNA sequences based on internucleotide distance, in Proc. 2013 Int. Conf. Information and Communication Technology, Bandung, Indonesia, 2013, pp. 82-87.
[44]   Bastos C. A. C., Afreixo V., Pinho A. J., Garcia S. P., Rodrigues J. M. O. S., and Ferreira P. J. S. G., Distances between dinucleotides in the human genome, in Proc. 5th Int. Conf. Practical Applications of Computational Biology & Bioinformatics, 2011, pp. 205-211.
[45]   Ding S. Y., Li Y., Yang X. W., and Wang T. M., A simple k-word interval method for phylogenetic analysis of DNA sequences, J. Theor. Biol., vol. 317, pp. 192-199, 2013.
[46]   Tang J., Hua K. R., Chen M. Y., Zhang R. M., and Xie X. L., A novel k-word relative measure for sequence comparison, Comput. Biol. Chem., vol. 53, pp. 331-338, 2014.
[47]   Xie X. H., Yu Z. G., Han G. S., Yang W. F., and Anh V., Whole-proteome based phylogenetic tree construction with inter-amino-acid distances and the conditional geometric distribution profiles, Mol. Phylogenet. Evol., vol. 89, pp. 37-45, 2015.
[48]   Zou S., Wang L., and Wang J. F., A 2D graphical representation of the sequences of DNA based on triplets and its application, EURASIP J. Bioinform. Syst. Biol., vol. 2014, no. 1, p. 1, 2014.
[49]   Akhtar M., Epps J., and Ambikairajah E., On DNA numerical representations for period-3 based exon prediction, in Proc. 2007 IEEE Int. Workshop on Genomic Signal Processing and Statistics, Tuusula, Finland, 2007, pp. 1-4.
[50]   Jabbari K. and Bernardi G., Cytosine methylation and CpG, TpG (CpA) and TpA frequencies, Gene, vol. 333, pp. 143-149, 2004.
[51]   Datta S. and Asif A., A fast DFT based gene prediction algorithm for identification of protein coding regions, in Proc. 2005 IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2005, pp. 653-656.
[52]   Motahari A. S., Bresler G., and Tse D. N. C., Information theory of DNA shotgun sequencing, IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6273-6289, 2013.
[53]   Simmen M. W., Genome-scale relationships between cytosine methylation and dinucleotide abundances in animals, Genomics, vol. 92, no. 1, pp. 33-40, 2008.
[54]   Tuqan J. and Rushdi A., A DSP approach for finding the codon bias in DNA sequences, IEEE J. Sel. Top. Signal Process., vol. 2, no. 3, pp. 343-356, 2008.
[55]   Galleani L. and Garello R., The minimum entropy mapping spectrum of a DNA sequence, IEEE Trans. Inf. Theory, vol. 56, no. 2, pp. 771-783, 2010.
[56]   Román-Roldán R., Bernaola-Galván P., and Oliver J., Application of information theory to DNA sequence analysis: A review, Pattern Recognition, vol. 29, no. 7, pp. 1187-1194, 1996.
[57]   Bernaola-Galván P., Grosse I., Carpena P., Oliver J. L., Román-Roldán R., and Stanley H. E., Finding borders between coding and noncoding DNA regions by an entropic segmentation method, Phys. Rev. Lett., vol. 85, no. 6, pp. 1342-1345, 2000.
[58]   Dan Cristea P., Genetic signal representation and analysis, in Proc. Functional Monitoring and Drug-Tissue Interaction, San Jose, CA, USA, 2002, pp. 77-84.
[59]   Cristea P., Genetic signal analysis, in Proc. 6th Int. Symp. Signal Processing and Its Applications, Kuala Lumpur, Malaysia, 2001, pp. 703-706.
[60]   Hebert P. D. N., Cywinska A., Ball S. L., and deWaard J. R., Biological identifications through DNA barcodes, Proc. Roy. Soc. B Biol. Sci., vol. 270, no. 1512, pp. 313-321, 2003.
[61]   Ratnasingham S. and Hebert P. D. N., Bold: The barcode of life data system, Mol. Ecol. Notes, vol. 7, no. 3, pp. 355-364, 2007.
[62]   Afreixo V., Bastos C. A. C., Pinho A. J., Garcia S. P., and Ferreira P. J. S. G., Genome analysis with distance to the nearest dissimilar nucleotide, J. Theor. Biol., vol. 275, no. 1, pp. 52-58, 2011.
[63]   Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M., and Haussler D., The human genome browser at UCSC, Genome Res., vol. 12, no. 6, pp. 996-1006, 2002.
[64]   Kauer G. and Blöcker H., Applying signal theory to the analysis of biomolecules, Bioinformatics, vol. 19, no. 16, pp. 2016-2021, 2003.
[65]   Cheever E. A., Searls D. B., Karunaratne W., and Overton G. C., Using signal processing techniques for DNA sequence comparison, in Proc. 15th Annu. Northeast Bioengineering Conference, Boston, MA, USA, 1989, pp. 173-174.
[66]   Kwan H. K., Kwan B. Y. M., and Kwan J. Y. Y., Novel methodologies for spectral classification of exon and intron sequences, EURASIP J. Adv. Signal Process., vol. 2012, no. 1, p. 50, 2012.
[67]   Berger J. A., Mitra S. K., Carli M., and Neri A., New Approaches to Genome Sequence Analysis Based on Digital Signal Processing. University of California, CA, USA, 2002.
[68]   Rao N. and Shepherd S. J., Detection of 3-periodicity for small genomic sequences based on AR technique, in Proc. 2004 Int. Conf. Communications, Circuits and Systems, Chengdu, China, 2004, pp. 1032-1036.
[69]   Chakravarthy N., Spanias A., Iasemidis L. D., and Tsakalis K., Autoregressive modeling and feature analysis of DNA sequences, EURASIP J. Appl. Signal Process., vol. 2004, p. 952689, 2004.
[70]   Yu Z. G., Anh V. V., Zhou Y., and Zhou L. Q., Numerical sequence representation of DNA sequences and methods to distinguish coding and non-coding sequences in a complete genome, in Proc. 11th World Multi-Conf. Systemics, Cybernetics and Informatics: WMSCI 2007, 2007, pp. 171-176.
[71]   Brodzik A. K. and Peters O., Symbol-balanced quaternionic periodicity transform for latent pattern detection in DNA sequences, in Proc. 2005 IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2005, pp. 373-376.
[72]   Rosen G., Examining coding structure and redundancy in DNA, IEEE Eng. Med. Biol. Mag., vol. 25, no. 1, pp. 62-68, 2006.
[73]   Rosen G. L. and Moore J. D., Investigation of coding structure in DNA, in Proc. 2003 IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Hong Kong, China, 2003, p. II-361-4.
[74]   Peng C. K., Buldyrev S. V., Goldberger A. L., Havlin S., Sciortino F., Simons M., and Stanley H. E., Long-range correlations in nucleotide sequences, Nature, vol. 356, no. 6365, pp. 168-170, 1992.
[75]   Berger J. A., Mitra S. K., Carli M., and Neri A., Visualization and analysis of DNA sequences using DNA walks, J. Franklin Inst., vol. 341, nos. 1&2, pp. 37-53, 2004.
[76]   Tiwari S., Ramachandran S., Bhattacharya A., Bhattacharya S., and Ramaswamy R., Prediction of probable genes by Fourier analysis of genomic sequences, Bioinformatics, vol. 13, no. 3, pp. 263-270, 1997.
[77]   Li W. T., Marr T. G., and Kaneko K., Understanding long-range correlations in DNA sequences, Phys. D Nonlinear Phenom., vol. 75, nos. 1-3, pp. 392-416, 1994.
[78]   Abbasi O., Rostami A., and Karimian G., Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform, BMC Bioinformatics, vol. 12, p. 430, 2011.
[79]   Deng S. P., Shi Y. X., Yuan L. Y., Li Y. X., and Ding G. H., Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics, BMC Genomics, vol. 13, no. , p. S19, 2012.
[80]   Bastos C. A. C., Afreixo V., Garcia S. P., and Pinho A. J., Inter-stop symbol distances for the identification of coding regions, J. Integr. Bioinform., vol. 10, no. 3, p. 230, 2013.
[81]   Rosen G. L., Signal processing for biologically-inspired gradient source localization and DNA sequence analysis, PhD dissertation, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA, 2006.
[82]   Limbachiya D., Rao B., and Gupta M. K., The art of DNA strings: Sixteen years of DNA coding theory, arXiv preprint arXiv: 1607.00266, 2016.
[83]   Faria L. C. B., Rocha A. S. L., Kleinschmidt J. H., Palazzo R., and Silva-Filho M. C., DNA sequences generated by BCH codes over GF(4), Electron. Lett., vol. 46, no. 3, pp. 203-204, 2010.
[84]   Zhang L., Tian F. C., Wang S. Y., and Liu X., A novel coding method for gene mutation correction during protein translation process, J. Theor. Biol., vol. 296, pp. 33-40, 2012.
[85]   Castro-Chavez F., A tetrahedral representation of the genetic code emphasizing aspects of symmetry, BIOcomplexity, vol. 2012, no. 2, pp. 1-6, 2012.
[86]   Castro-Chavez F., Defragged binary I Ching genetic code chromosomes compared to Nirenberg’s and transformed into rotating 2D circles and squares and into a 3D 100% symmetrical tetrahedron coupled to a functional one to discern start from non-start methionines through a Stella octangula, J. Proteome Sci. Comput. Biol., vol. 1, no. 1, p. 3, 2012.
[87]   Jeffrey H. J., Chaos game representation of gene structure, Nucleic Acids Res., vol. 18, no. 8, pp. 2163-2170, 1990.
[88]   Wang Y. W., Hill K., Singh S., and Kari L., The spectrum of genomic signatures: From dinucleotides to chaos game representation, Gene, vol. 346, pp. 173-185, 2005.
[89]   Joseph J. and Sasikumar R., Chaos game representation for comparison of whole genomes, BMC Bioinformatics, vol. 7, p. 243, 2006.
[90]   Dutta C. and Das J., Mathematical characterization of chaos game representation: New algorithms for nucleotide sequence analysis, J. Mol. Biol., vol. 228, no. 3, pp. 715-719, 1992.
[91]   Goldman N., Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences, Nucleic Acids Res., vol. 21, no. 10, pp. 2487-2491, 1993.
[92]   Castro-Chavez F., Most used codons per amino acid and per genome in the code of man compared to other organisms according to the rotating circular genetic code, Neuroquantology, vol. 9, no. 4, p. 500, 2011.
[93]   Delgado S., Morán F., Mora A., Merelo J. J., and Briones C., A novel representation of genomic sequences for taxonomic clustering and visualization by means of self-organizing maps, Bioinformatics, vol. 31, no. 5, pp. 736-744, 2015.
[94]   Yu Z. G. and Anh V., Time series model based on global structure of complete genome, Chaos, Solitons & Fractals, vol. 12, no. 10, pp. 1827-1834, 2001.
[95]   Chang H. T., Lo N. W., Lu W. C., and Kuo C. J., Visualization and comparison of DNA sequences by use of three-dimensional trajectories, in Proc. 1st Asia-Pacific Bioinformatics Conf. Bioinformatics 2003, Adelaide, Australia, 2003, pp. 81-85.
[96]   Kohonen T., Self-organized formation of topologically correct feature maps, Biol. Cybern., vol. 43, no. 1, pp. 59-69, 1982.
[97]   Kohonen T. and Somervuo P., How to make large self-organizing maps for nonvectorial data, Neural Netw., vol. 15, nos. 8&9, pp. 945-952, 2002.
[98]   Boyle A. P., Araya C. L., Brdlik C., Cayting P., Cheng C., Cheng Y., Gardner K., Hillier L. W., Janette J., Jiang L. X., Kasper D., et al., Comparative analysis of regulatory information and circuits across distant species, Nature, vol. 512, no. 7515, pp. 453-456, 2014.
[99]   Hamori E. and Ruskin J., H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, J. Biol. Chem., vol. 258, no. 2, pp. 1318-1327, 1983.
[100]   Gates M. A., Simpler DNA sequence representations, Nature, vol. 316, no. 6025, p. 219, 1985.
[101]   Yau S. S. T., Wang J. S., Niknejad A., Lu C. X., Jin N., and Ho Y. K., DNA sequence representation without degeneracy, Nucleic Acids Res., vol. 31, no. 12, pp. 3078-3080, 2003.
[102]   Zhang R. and Zhang C. T., Z curves, an intutive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struct. Dyn., vol. 11, no. 4, pp. 767-782, 1994.
[103]   Kwan H. K., Atwal R., and Kwan B. Y. M., Wavelet analysis of DNA sequences, in Proc. 2008 Int. Conf. Communications, Circuits and Systems, Fujian, China, 2008, pp. 816-820.
[104]   Yu C. L., Deng M., Zheng L., He R. L., Yang J., and Yau S. S. T., DFA7, a new method to distinguish between intron-containing and intronless genes, PLoS One, vol. 9, no. 7, p. e101363, 2014.
[105]   Akhtar M., Epps J., and Ambikairajah E., Signal processing in sequence analysis: Advances in eukaryotic gene prediction, IEEE J. Sel. Top. Signal Process., vol. 2, no. 3, pp. 310-321, 2008.
[106]   Mendizabal-Ruiz G., Román-Godínez I., Torres-Ramos S., Salido-Ruiz R. A., and Morales J. A., On DNA numerical representations for genomic similarity computation, PLoS One, vol. 12, no. 3, p. e0173288, 2017.
[107]   Ranawana R. and Palade V., A neural network based multi-classifier system for gene identification in DNA sequences, Neural Comput. Appl., vol. 14, no. 2, pp. 122-131, 2005.
[108]   Arniker S. B., Kwan H. K., Law N. F., and Lun D. P. K., DNA numerical representation and neural network based human promoter prediction system, in Proc. 2011 Annu. IEEE India Conf., Hyderabad, India, 2011, pp. 1-4.
[109]   Xie X., Wu S., Lam K. M., and Yan H., Promoterexplorer: An effective promoter identification method based on the AdaBoost algorithm, Bioinformatics, vol. 22, no. 22, pp. 2722-2728, 2006.
[110]   Deng L. and Yu D., Deep learning: Methods and applications, Tech. Rep. MSR-TR-2014-21, 2014.
[111]   Bengio Y., Courville A., and Vincent P., Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798-1828, 2013.
[112]   Reese M. G., Eeckman F. H., Kulp D., and Haussler D., Improved splice site detection in genie, J. Comput. Biol., vol. 4, no. 3, pp. 311-323, 1997.
[113]   Yu N., Yu Z., and Pan Y., A deep learning method for lincRNA detection using auto-encoder algorithm, BMC Bioinformatics, vol. 18, no. , p. 511, 2017.
[114]   Orr G. B. and Müller K. R., Neural Networks: Tricks of the Trade. Springer, 1998, p. 1524.
[115]   Wiesler S., Richard A., Schluter R., and Ney H., Mean-normalized stochastic gradient for large-scale deep learning, in Proc. 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Florence, Italy, 2014, pp. 180-184.
[116]   Raiko T., Valpola H., and LeCun Y., Deep learning made easier by linear transformations in perceptrons, in Proc. 15th Int. Conf. Artificial Intelligence and Statistics, La Palma, Canary Islands, 2012, pp. 924-932.
[117]   Ioffe S. and Szegedy C., Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv: 1502.03167, 2015.
[118]   Danihelka I., Wayne G., Uria B., Kalchbrenner N., and Graves A., Associative long short-term memory, arXiv preprint arXiv: 1602.03032, 2016.
[119]   Jose C., Cisse M., and Fleuret F., Kronecker recurrent units, arXiv preprint arXiv: 1705.10142, 2017.
[120]   Jing L., Gülçehre Ç., Peurifoy J., Shen Y. C., Tegmark M., Soljacic M., and Bengio Y., Gated orthogonal recurrent units: On learning to forget, arXiv preprint arXiv: 1706.02761, 2017.
[121]   Arjovsky M., Shah A., and Bengio Y., Unitary evolution recurrent neural networks, arXiv preprint arXiv: 1511.06464, 2015.
[122]   Trabelsi C., Bilaniuk O., Zhang Y., Serdyuk D., Subramanian S., Santos J. F., Mehri S., Rostamzadeh N., Bengio Y., and Pal C. J., Deep complex networks, arXiv preprint arXiv: 1705.09792, 2017.
[123]   Mescheder L., Nowozin S., and Geiger A., The numerics of GANs, arXiv preprint arXiv: 1705.10461, 2017.