Big Data Mining and Analytics  2020, Vol. 03 Issue (01): 13-28    DOI: 10.26599/BDMA.2019.9020020
    
Sparse Deep Nonnegative Matrix Factorization
Zhenxing Guo, Shihua Zhang*
Zhenxing Guo is with the Academy of Mathematics and Systems Science, Chinese Academy of Sciences (CAS), Beijing 100190, China. E-mail: gzx911023@163.com.
Shihua Zhang is with NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.

Abstract  

Nonnegative Matrix Factorization (NMF) is a powerful technique for dimension reduction and pattern recognition through single-layer data representation learning. However, deep learning networks, with their carefully designed hierarchical structure, can combine hidden features to form more representative features for pattern recognition. In this paper, we propose sparse deep NMF models to analyze complex data for more accurate classification and better feature interpretation. These models learn localized features or generate more discriminative representations for samples in distinct classes by imposing an L1-norm penalty on the columns of certain factors. By extending a one-layer model into a multilayer model with sparsity, we provide a hierarchical way to analyze big data and, thanks to nonnegativity, to extract hidden features intuitively. We adopt Nesterov's accelerated gradient algorithm to speed up the computation and analyze the computational complexity of our frameworks to demonstrate their efficiency. To better handle linearly inseparable data, we also incorporate popular nonlinear functions into these frameworks and explore their performance. We applied our models to two benchmark image datasets; the results show that they achieve competitive or better classification performance and produce more intuitive interpretations than the typical NMF and competing multilayer models.
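To make the optimization concrete, below is a minimal NumPy sketch of a Nesterov-accelerated projected-gradient update for one nonnegative factor of a sparse two-layer factorization X ≈ W1 W2 H2 with an L1 penalty on H2. The penalty weight, inner iteration count, and matrix shapes are illustrative assumptions, and the routine is written in the spirit of NeNMF rather than as the authors' exact implementation.

import numpy as np

def nesterov_nnls_step(X, W, H, l1=0.1, n_inner=50):
    # Approximately solve min_{H >= 0} 0.5*||X - W H||_F^2 + l1*||H||_1
    # with Nesterov-accelerated projected gradient (illustrative sketch).
    WtW = W.T @ W
    WtX = W.T @ X
    L = np.linalg.norm(WtW, 2)                  # Lipschitz constant of the gradient
    Y, t = H.copy(), 1.0                        # extrapolation point and momentum scalar
    for _ in range(n_inner):
        grad = WtW @ Y - WtX + l1               # gradient; +l1 covers the L1 term since H >= 0
        H_new = np.maximum(Y - grad / L, 0.0)   # projected gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        Y = H_new + ((t - 1.0) / t_new) * (H_new - H)   # Nesterov extrapolation
        H, t = H_new, t_new
    return H

# Hypothetical two-layer factorization X ≈ W1 W2 H2 on random data.
rng = np.random.default_rng(0)
X = rng.random((1024, 200))                     # e.g., 200 vectorized images
W1, W2 = rng.random((1024, 300)), rng.random((300, 40))
H2 = nesterov_nnls_step(X, W1 @ W2, rng.random((40, 200)))   # update top-layer coefficients

In the full models, each factor W_l and the top-layer coefficient matrix H_L would be updated in turn with analogous subproblems until convergence.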



Key words: sparse Nonnegative Matrix Factorization (NMF); deep learning; Nesterov's accelerated gradient algorithm
Received: 27 April 2019      Published: 13 January 2020
Corresponding Authors: Shihua Zhang   
Cite this article:

Zhenxing Guo, Shihua Zhang. Sparse Deep Nonnegative Matrix Factorization. Big Data Mining and Analytics, 2020, 03(01): 13-28.

URL:

http://bigdata.tsinghuajournals.com/10.26599/BDMA.2019.9020020     OR     http://bigdata.tsinghuajournals.com/Y2020/V03/I01/13

Fig. 1 Comparison of two-layer models with layer size = [k1, 160] in terms of (a) NMI and (b) ER. Each sub-figure shows the results of deep semi-NMF and our models for comparison.
Fig. 2 Comparison of PgNMF, NeNMF, deep semi-NMF, DNMF, and SDNMF/RL2 in terms of NMI under different k2. Under each k2, we chose the best k1 among 100, 200, 300, 400, 500, 600, 700, and 800 for each deep model.
Fig. 3 Classification precision of our linear models on the PIE-pose_27.0 dataset in terms of NP with different layer numbers (L = 2, 3, and 4). The number of sub-bases in the hidden layers was randomly drawn from certain exponential distributions and then ranked to satisfy k(lower layer) > k(higher layer). For example, for SDNMF/L with L = 3 and layer size = [k1, k2, k3], the values of k1 and k2 were drawn from exponential distributions with k1 > k2, and k3 was fixed to 40. The experiment for SDNMF/L under each layer size was repeated 50 times, generating 50 NMI values. The experiments for the other models were performed in the same way. A small sketch of this sampling scheme follows the caption.
Fig. 4 k1 selection of our proposed linear deep models on the PIE-pose_27.0 dataset. Each linear model was tested under four conditions: layer size = [300, 40], [300, 80], [600, 40], and [600, 80].
Fig. 5 k2 selection of our proposed linear deep models on the PIE-pose_27.0 dataset. Each linear model was tested under five conditions: layer size = [600, 40], [600, 80], [600, 120], [600, 160], and [600, 200].
Fig. 6 Corresponding comparison between two-layer and three-layer linear deep models on the PIE-pose_27.0 dataset. Specifically, the layer sizes for the three-layer models are [600, 160, 40], [600, 160, 80], and [600, 160, 120]. The layer size for our two-layer models is [600, 160].
Fig. 7 Comparison between linear and nonlinear models. Square1 means that the nonlinear function was applied only to H1, whereas Square2 means that both H1 and H2 were projected by the nonlinear function. The classification precision in terms of NMI of DNMF, SDNMF/R, SDNMF/L, and SDNMF/RL2 is shown in (a) and (b), respectively. The results of SDNMF/RL1 in terms of NP, NMI, and ER are shown in (c).
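A schematic sketch of how the two nonlinear variants can be read, assuming (as the labels Square1/Square2 suggest) that the nonlinearity g is the elementwise square; this is only an illustration, not the authors' exact formulation.

import numpy as np

def reconstruct(W1, W2, H2, mode="linear"):
    # Reconstruct X from a two-layer model; g is applied to H1 only (square1)
    # or to both H1 and H2 (square2).
    g = np.square
    H2p = g(H2) if mode == "square2" else H2
    H1 = W2 @ H2p                               # first-layer representation
    H1p = g(H1) if mode in ("square1", "square2") else H1
    return W1 @ H1p                             # approximation of X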
Method                      NMI      NP       ER
PgNMF                       0.884    0.786    6.691
NeNMF                       0.914    0.868    6.729
Deep Semi-NMF               0.945    0.910    5.772
Multi-layer NMF (linear)    0.905    0.845    6.616
Multi-layer NMF (square1)   0.909    0.851    6.332
Multi-layer NMF (square2)   0.888    0.813    6.510
SDNMF/L (linear)            0.925    0.922    8.271
SDNMF/L (square1)           0.950    0.946    7.195
SDNMF/L (square2)           0.968    0.951    5.234
SDNMF/RL2 (linear)          0.901    0.837    6.706
SDNMF/RL2 (square1)         0.936    0.910    6.320
SDNMF/RL2 (square2)         0.937    0.902    5.882
SDNMF/R (linear)            0.875    0.805    7.353
SDNMF/R (square1)           0.958    0.942    5.857
SDNMF/R (square2)           0.837    0.722    7.20
DNMF (linear)               0.750    0.543    7.786
DNMF (square1)              0.954    0.943    6.485
DNMF (square2)              0.940    0.903    5.779
SDNMF/RL1 (linear)          0.890    0.836    7.390
SDNMF/RL1 (square1)         0.814    0.690    7.674
SDNMF/RL1 (square2)         0.789    0.636    7.717
Table 2 Comparison of NMF-related models on PIE data.
Fig. 8 Hierarchical feature interpretation by a linear SDNMF/L model with layer size = [193, 141, 40]. (a) 10 samples of class 36 and the representation matrix of class 36. Selected bases and the corresponding coefficients in the first, second, and third layers are shown in (b), (c), and (d), respectively. The horizontal and vertical coordinates indicate the indices of each matrix (or image).
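A brief sketch of how such a hierarchical interpretation can be produced, assuming the usual deep NMF reading in which the effective bases at layer l are the columns of the product W1 W2 ... Wl; the image dimension and the random factor matrices below are hypothetical placeholders.

import numpy as np

def layer_bases(Ws):
    # Effective bases per layer: columns of W1, W1 W2, W1 W2 W3, ...
    bases, P = [], None
    for W in Ws:
        P = W if P is None else P @ W
        bases.append(P)                         # each column can be reshaped into an image
    return bases

rng = np.random.default_rng(1)
Ws = [rng.random((1024, 193)), rng.random((193, 141)), rng.random((141, 40))]
for l, B in enumerate(layer_bases(Ws), start=1):
    print(f"layer {l} bases: {B.shape}")        # (1024, 193), (1024, 141), (1024, 40)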