30 Most Downloaded Articles
A Brief Review of Network Embedding
Yaojing Wang, Yuan Yao, Hanghang Tong, Feng Xu, Jian Lu
Big Data Mining and Analytics 2019, 2 (1): 35-47. DOI: 10.26599/BDMA.2018.9020029
Abstract （446）   HTML （39）   PDF （795KB）（744）
Learning the representations of nodes in a network can benefit various analysis tasks such as node classification, link prediction, clustering, and anomaly detection. Such a representation learning problem is referred to as network embedding, and it has attracted significant attention in recent years. In this article, we briefly review the existing network embedding methods by two taxonomies. The technical taxonomy focuses on the specific techniques used and divides the existing network embedding methods into two stages, i.e., context construction and objective design. The non-technical taxonomy focuses on the problem setting aspect and categorizes existing work based on whether to preserve special network properties, to consider special network types, or to incorporate additional inputs. Finally, we summarize the main findings based on the two taxonomies, analyze their usefulness, and discuss future directions in this area.
Applications of Deep Learning to MRI Images: A Survey
Jin Liu, Yi Pan, Min Li, Ziyue Chen, Lu Tang, Chengqian Lu, Jianxin Wang
Big Data Mining and Analytics 2018, 1 (1): 1-18. DOI: 10.26599/BDMA.2018.9020001 Accepted: 18 December 2017
Abstract （558）   HTML （1）   PDF （876KB）（567）
Deep learning provides exciting solutions in many fields, such as image analysis, natural language processing, and expert systems, and is seen as a key method for various future applications. Owing to its non-invasive nature and good soft-tissue contrast, Magnetic Resonance Imaging (MRI) has been attracting increasing attention in recent years. With the development of deep learning, many innovative deep learning methods have been proposed to improve MRI image processing and analysis performance. The purpose of this article is to provide a comprehensive overview of deep learning-based MRI image processing and analysis. First, a brief introduction of deep learning and imaging modalities of MRI images is given. Then, common deep learning architectures are introduced. Next, deep learning applications of MRI images, such as image detection, image registration, image segmentation, and image classification, are discussed. Subsequently, the advantages and weaknesses of several common tools are discussed, and several deep learning tools in the applications of MRI images are presented. Finally, an objective assessment of deep learning in MRI applications is presented, and future developments and trends with regard to deep learning for MRI images are addressed.
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
Bo Zhao, Hucheng Zhou, Guoqiang Li, Yihua Huang
Big Data Mining and Analytics 2018, 1 (1): 57-74. DOI: 10.26599/BDMA.2018.9020006
Abstract （75）   HTML （1）   PDF （2214KB）（495）
Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized design from top to bottom with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for the distributed data-parallel platform. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm to a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently to reduce computation complexity; (3) it proposes a distributed LDA training framework, which represents the corpus as a directed graph with the parameters annotated as corresponding vertices and implements ZenLDA and other well-known inference methods based on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS, while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods compared. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA could achieve computing performance comparable to, and even better than, that of state-of-the-art dedicated systems.
Online Internet Traffic Monitoring System Using Spark Streaming
Baojun Zhou, Jie Li, Xiaoyan Wang, Yu Gu, Li Xu, Yongqiang Hu, Lihua Zhu
Big Data Mining and Analytics 2018, 1 (1): 47-56. DOI: 10.26599/BDMA.2018.9020005
Abstract （77）   HTML （0）   PDF （1926KB）（467）
Owing to the explosive growth of Internet traffic, network operators must be able to monitor the entire network situation and efficiently manage their network resources. Traditional network analysis methods that usually work on a single machine are no longer suitable for huge traffic data owing to their poor processing ability. Big data frameworks, such as Hadoop and Spark, can handle such analysis jobs even for a large amount of network traffic. However, Hadoop and Spark are inherently designed for offline data analysis. To cope with streaming data, various stream-processing-based frameworks have been proposed, such as Storm, Flink, and Spark Streaming. In this study, we propose an online Internet traffic monitoring system based on Spark Streaming. The system comprises three parts, namely, the collector, messaging system, and stream processor. We use TCP performance monitoring as a use case to show how network monitoring can be performed with our proposed system. We conducted typical experiments with a cluster in standalone mode, which showed that our system performs well for large Internet traffic measurement and monitoring.
Mining Conditional Functional Dependency Rules on Big Data
Mingda Li, Hongzhi Wang, Jianzhong Li
Big Data Mining and Analytics 2020, 3 (1): 68-84. DOI: 10.26599/BDMA.2019.9020019
Abstract （191）   HTML （2）   PDF （965KB）（371）
Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply to large and low-quality datasets. To handle the volume issue of big data, we develop sampling algorithms to obtain a small representative training set. We design fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose a parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.
Multi-view Clustering: A Survey
Yan Yang, Hao Wang
Big Data Mining and Analytics 2018, 1 (2): 83-107. DOI: 10.26599/BDMA.2018.9020003 Accepted: 18 December 2017
Abstract （450）   HTML （4）   PDF （2463KB）（335）
In the big data era, data are generated from different sources or observed from different views. These data are referred to as multi-view data. Unleashing the power of knowledge in multi-view data is very important in big data mining and analysis. This calls for advanced techniques that consider the diversity of different views while fusing these data. Multi-view Clustering (MvC) has attracted increasing attention in recent years by aiming to exploit complementary and consensus information across multiple views. This paper summarizes a large number of multi-view clustering algorithms, provides a taxonomy according to the mechanisms and principles involved, and classifies these algorithms into five categories, namely, co-training style algorithms, multi-kernel learning, multi-view graph clustering, multi-view subspace clustering, and multi-task multi-view clustering. Therein, multi-view graph clustering is further categorized as graph-based, network-based, and spectral-based methods. Multi-view subspace clustering is further divided into subspace learning-based and non-negative matrix factorization-based methods. This paper not only introduces the mechanisms for each category of methods, but also gives a few examples of how these techniques are used. In addition, it lists some publicly available multi-view datasets. Overall, this paper serves as an introductory text and survey for multi-view clustering.
Feature Selection with Graph Mining Technology
Thosini Bamunu Mudiyanselage, Yanqing Zhang
Big Data Mining and Analytics 2019, 2 (2): 73-82. DOI: 10.26599/BDMA.2018.9020032
Abstract （145）   HTML （0）   PDF （1027KB）（292）
Many real-world applications have problems with high dimensionality, which existing algorithms cannot overcome. A critical data preprocessing problem is feature selection, whereby its non-scalability negatively influences both the efficiency and performance of big data applications. In this research, we developed a new algorithm to reduce the dimensionality of a problem using graph-based analysis, which retains the physical meaning of the original high-dimensional feature space. Most existing feature-selection methods are based on a strong assumption that features are independent of each other. However, if the feature-selection algorithm does not take into consideration the interdependencies of the feature space, the selected data fail to correctly represent the original data. We developed a new feature-selection method to address this challenge. Our aim in this research was to examine the dependencies between features and select the optimal feature set with respect to the original data structure. Another important factor in our proposed method is that it can perform even in the absence of class labels. This is a more difficult problem that many feature-selection algorithms fail to address. In this case, they only use wrapper techniques that require a learning algorithm to select features. It is important to note that our experimental results indicate that the proposed simple ranking method performs better than other methods, independent of any particular learning algorithm used.
Towards Understanding the Security of Modern Image Captchas and Underground Captcha-Solving Services
Haiqin Weng, Binbin Zhao, Shouling Ji, Jianhai Chen, Ting Wang, Qinming He, Raheem Beyah
Big Data Mining and Analytics 2019, 2 (2): 118-144. DOI: 10.26599/BDMA.2019.9020001
Abstract （103）   HTML （1）   PDF （5160KB）（279）
Image captchas have recently become very popular and are widely deployed across the Internet to defend against abusive programs. However, the ever-advancing capabilities of computer vision have gradually diminished the security of image captchas and made them vulnerable to attack. In this paper, we first classify the currently popular image captchas into three categories: selection-based captchas, slide-based captchas, and click-based captchas. Second, we propose simple yet powerful attack frameworks against each of these categories of image captchas. Third, we systematically evaluate our attack frameworks against 10 popular real-world image captchas, including captchas from tencent.com, google.com, and 12306.cn. Fourth, we compare our attacks against nine online image recognition services and against human labor from eight underground captcha-solving services. Our evaluation results show that (1) each of the popular image captchas that we study is vulnerable to our attacks; (2) our attacks yield the highest captcha-breaking success rate compared with state-of-the-art methods in almost all scenarios; and (3) our attacks achieve almost as high a success rate as human labor while being much faster. Based on our evaluation, we identify some design flaws in these popular schemes, along with some best practices and design principles for more secure captchas. We also examine the underground market for captcha-solving services, identifying 152 such services. We then seek to measure this underground market with data from these services. Our findings shed light on understanding the scale, impact, and commercial landscape of the underground market for captcha solving.
Analysis of Protein-Ligand Interactions of SARS-CoV-2 Against Selective Drug Using Deep Neural Networks
Natarajan Yuvaraj, Kannan Srihari, Selvaraj Chandragandhi, Rajan Arshath Raja, Gaurav Dhiman, Amandeep Kaur
Big Data Mining and Analytics 2021, 4 (2): 76-83. DOI: 10.26599/BDMA.2020.9020007 Accepted: 09 July 2020 Online available: 09 July 2020
Abstract （246）   HTML （14）   PDF （1660KB）（265）
In recent times, data analysis using machine learning has accelerated optimized solutions in clinical healthcare systems. Machine learning methods offer efficient prediction ability in diagnosis systems as an alternative to clinicians. Most of these systems operate on features extracted from patients, and most of the predicted cases are accurate. However, the recent prevalence of COVID-19 has urged the global healthcare industry to find a new drug that suppresses the pandemic outbreak. In this paper, we design a Deep Neural Network (DNN) model that accurately finds the protein-ligand interactions with the drug used. The DNN senses the response of protein-ligand interactions for a specific drug and identifies which drug produces an interaction that effectively combats the virus. Using the limited genome sequences of Indian patients submitted to the GISAID database, we find that the DNN system is effective in identifying the protein-ligand interactions for a specific drug.
A Semi-Supervised Deep Network Embedding Approach Based on the Neighborhood Structure
Wenmao Wu, Zhizhou Yu, Jieyue He
Big Data Mining and Analytics 2019, 2 (3): 205-216. DOI: 10.26599/BDMA.2019.9020004
Abstract （93）   HTML （0）   PDF （515KB）（221）
Network embedding is a very important task that represents a high-dimensional network in a low-dimensional vector space and aims to capture and preserve the network structure. Most existing network embedding methods are based on shallow models. However, actual network structures are complicated, which means shallow models cannot capture the high-dimensional nonlinear features of the network well. The recently proposed unsupervised deep learning models ignore label information. To address these challenges, in this paper, we propose an effective network embedding method, Structural Labeled Locally Deep Nonlinear Embedding (SLLDNE). SLLDNE is designed to obtain highly nonlinear features by using a deep neural network while preserving the label information of the nodes through a semi-supervised classifier component that improves its discriminative ability. Moreover, we exploit linear reconstruction of neighborhood nodes to enable the model to capture more structural information. The experimental results of vertex classification on two real-world network datasets demonstrate that SLLDNE outperforms the other state-of-the-art methods.
Sparse Deep Nonnegative Matrix Factorization
Zhenxing Guo, Shihua Zhang
Big Data Mining and Analytics 2020, 3 (1): 13-28. DOI: 10.26599/BDMA.2019.9020020
Abstract （159）   HTML （1）   PDF （859KB）（189）
Nonnegative Matrix Factorization (NMF) is a powerful technique to perform dimension reduction and pattern recognition through single-layer data representation learning. However, deep learning networks, with their carefully designed hierarchical structure, can combine hidden features to form more representative features for pattern recognition. In this paper, we proposed sparse deep NMF models to analyze complex data for more accurate classification and better feature interpretation. Such models are designed to learn localized features or generate more discriminative representations for samples in distinct classes by imposing an $L_1$-norm penalty on the columns of certain factors. By extending a one-layer model into a multilayer model with sparsity, we provided a hierarchical way to analyze big data and intuitively extract hidden features due to nonnegativity. We adopted Nesterov's accelerated gradient algorithm to accelerate the computing process. We also analyzed the computing complexity of our frameworks to demonstrate their efficiency. To improve the performance of dealing with linearly inseparable data, we also considered incorporating popular nonlinear functions into these frameworks and explored their performance. We applied our models to two benchmark image datasets, and the results showed that our models can achieve competitive or better classification performance and produce intuitive interpretations compared with the typical NMF and competing multilayer models.
Model Error Correction in Data Assimilation by Integrating Neural Networks
Jiangcheng Zhu, Shuang Hu, Rossella Arcucci, Chao Xu, Jihong Zhu, Yi-ke Guo
Big Data Mining and Analytics 2019, 2 (2): 83-91. DOI: 10.26599/BDMA.2018.9020033
Abstract （124）   HTML （1）   PDF （2485KB）（164）
In this paper, we suggest a new methodology that integrates Neural Networks (NNs) into Data Assimilation (DA). Focusing on structural model uncertainty, we propose a framework for integrating NNs with physical models through DA algorithms, to improve both the assimilation process and the forecasting results. The NNs are iteratively trained as observational data is updated. The main DA models used here are the Kalman filter and the variational approaches. The effectiveness of the proposed algorithm is validated by examples and by a sensitivity study.
Classification on Grade, Price, and Region with Multi-Label and Multi-Target Methods in Wineinformatics
James Palmer, Victor S. Sheng, Travis Atkison, Bernard Chen
Big Data Mining and Analytics 2020, 3 (1): 1-12. DOI: 10.26599/BDMA.2019.9020014
Abstract （154）   HTML （4）   PDF （4563KB）（126）
Classifying wines according to their grade, price, and region of origin is a multi-label and multi-target problem in wineinformatics. Using wine reviews as the attributes, we compare several different multi-label/multi-target methods to the single-label method where each label is treated independently. We explore both single-label and multi-label approaches for a two-class problem for each of the labels and we explore both single-label and multi-target approaches for a four-class problem on two of the three labels, with the third label remaining a two-class problem. In terms of per-label accuracy, the single-label method has the best performance, although some multi-label methods approach the performance of single-label. However, in terms of multi-label/multi-target metrics, the multi-label/multi-target approaches do exceed the performance of the single-label method.
A Survey of Matrix Completion Methods for Recommendation Systems
Andy Ramlatchan, Mengyun Yang, Quan Liu, Min Li, Jianxin Wang, Yaohang Li
Big Data Mining and Analytics 2018, 1 (4): 308-323. DOI: 10.26599/BDMA.2018.9020008
Abstract （56）   HTML （0）   PDF （354KB）（109）
In recent years, recommendation systems have become increasingly popular and have been used in a broad variety of applications. Here, we investigate matrix completion techniques for recommendation systems that are based on collaborative filtering. The collaborative filtering problem can be viewed as predicting the favorability of a user with respect to new items of commodities. When a rating matrix is constructed with users as rows, items as columns, and entries as ratings, the collaborative filtering problem can then be modeled as a matrix completion problem by filling in the unknown elements in the rating matrix. This article presents a comprehensive survey of the matrix completion methods used in recommendation systems. We focus on the mathematical models for matrix completion and the corresponding computational algorithms as well as their characteristics and potential issues. Several applications other than the traditional user-item association prediction are also discussed.
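To make the matrix completion formulation in the abstract above concrete, the following is a minimal sketch, not a method taken from the survey: it fills in missing entries of a small user-by-item rating matrix using regularized low-rank factorization trained by stochastic gradient descent, which is one common family of matrix completion approaches. The function names (`factorize`, `predict`, `rmse`), the toy ratings, and all hyperparameters are assumptions chosen for illustration.

```python
import random

def predict(P, Q, u, i):
    """Predicted rating: dot product of user factors P[u] and item factors Q[i]."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def factorize(ratings, n_users, n_items, rank=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Fit low-rank factors to the observed ratings only (hypothetical helper)."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - predict(P, Q, u, i)
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-square error over the observed entries."""
    sq = [(r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items()]
    return (sum(sq) / len(sq)) ** 0.5

# Toy rating matrix: keys are (user row, item column); absent keys are unknown.
ratings = {(0, 0): 5, (0, 1): 4, (0, 3): 2, (1, 0): 4, (1, 2): 1,
           (2, 1): 2, (2, 2): 5, (2, 3): 4, (3, 0): 1, (3, 3): 4}

P, Q = factorize(ratings, n_users=4, n_items=4)
missing = predict(P, Q, 1, 1)  # estimate for an entry that was never observed
```

Only observed entries contribute to the loss; the unknown entries are then read off the product of the learned factors, which is the "filling in" step the abstract describes.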
Network Representation Based on the Joint Learning of Three Feature Views
Zhonglin Ye, Haixing Zhao, Ke Zhang, Zhaoyang Wang, Yu Zhu
Big Data Mining and Analytics 2019, 2 (4): 248-260. DOI: 10.26599/BDMA.2019.9020009
Abstract （104）   HTML （0）   PDF （2979KB）（104）
Network representation learning plays an important role in the field of network data mining. By embedding network structures and other features into a low-dimensional representation vector space, network representation learning algorithms can provide high-quality feature input for subsequent tasks, such as network link prediction, network vertex classification, and network visualization. The existing network representation learning algorithms can be trained based on structural features, vertex texts, vertex tags, community information, etc. However, there is a lack of algorithms that use the future evolution results of networks to guide network representation learning. Therefore, this paper aims at modeling the future evolution results of networks based on a link prediction algorithm, introducing the future link probabilities between vertices without edges into the network representation learning tasks. In order to make the network representation vectors contain more feature factors, the text features of the vertices are also embedded into the network representation vectors. Based on the above two optimization approaches, we propose a novel network representation learning algorithm, Network Representation learning algorithm based on the joint optimization of Three Features (TFNR). Based on Inductive Matrix Completion (IMC), the TFNR algorithm introduces the future link probabilities between vertices without edges and text features into the procedure of modeling network structures, which avoids the problem of sparse network structure. Experimental results show that the proposed TFNR algorithm performs well in network vertex classification and visualization tasks on three real citation network datasets.
Distributed Storage System for Electric Power Data Based on HBase
Jiahui Jin, Aibo Song, Huan Gong, Yingying Xue, Mingyang Du, Fang Dong, Junzhou Luo
Big Data Mining and Analytics 2018, 1 (4): 324-334. DOI: 10.26599/BDMA.2018.9020026
Abstract （66）   HTML （0）   PDF （2035KB）（98）
Managing massive electric power data is a typical big data application because electric power systems generate millions or billions of status, debugging, and error records every single day. To guarantee the safety and sustainability of electric power systems, massive electric power data need to be processed and analyzed quickly to make real-time decisions. Traditional solutions typically use relational databases to manage electric power data. However, relational databases cannot efficiently process and analyze massive electric power data when the data size increases significantly. In this paper, we show how electric power data can be managed by using HBase, a distributed database maintained by Apache. Our system consists of clients, an HBase database, status monitors, data migration modules, and data fragmentation modules. We evaluate the performance of our system through a series of experiments. We also show how HBase’s parameters can be tuned to improve the efficiency of our system.
Applying Big Data Based Deep Learning System to Intrusion Detection
Wei Zhong, Ning Yu, Chunyu Ai
Big Data Mining and Analytics 2020, 3 (3): 181-195. DOI: 10.26599/BDMA.2020.9020003
Abstract （162）   HTML （2）   PDF （5676KB）（96）
With vast amounts of data being generated daily and the ever increasing interconnectivity of the world’s internet infrastructures, machine learning based Intrusion Detection Systems (IDSs) have become a vital component in protecting our economic and national security. Previous shallow learning and deep learning strategies adopt the single learning model approach for intrusion detection. The single learning model approach may have difficulty understanding the increasingly complicated data distribution of intrusion patterns. In particular, a single deep learning model may not be effective at capturing unique patterns from intrusive attacks that have a small number of samples. In order to further enhance the performance of machine learning based IDSs, we propose the Big Data based Hierarchical Deep Learning System (BDHDLS). BDHDLS utilizes behavioral features and content features to understand both network traffic characteristics and information stored in the payload. Each deep learning model in the BDHDLS concentrates its efforts on learning the unique data distribution in one cluster. This strategy can increase the detection rate of intrusive attacks as compared to the previous single learning model approaches. Based on a parallel training strategy and big data techniques, the model construction time of BDHDLS is reduced substantially when multiple machines are deployed.
Disseminating Authorized Content via Data Analysis in Opportunistic Social Networks
Chenguang Kong, Guangchun Luo, Ling Tian, Xiaojun Cao
Big Data Mining and Analytics 2019, 2 (1): 12-24. DOI: 10.26599/BDMA.2018.9020028
Abstract （111）   HTML （0）   PDF （1261KB）（84）
Authorized content is a type of content that can be generated only by a certain Content Provider (CP). The content copies delivered to a user may bring rewards to the CP if the content is adopted by the user. The overall reward obtained by the CP depends on the user’s degree of interest in the content and the user’s role in disseminating the content copies. Thus, to maximize the reward, the content provider is motivated to disseminate the authorized content to the most interested users. In this paper, we study how to effectively disseminate the authorized content in Interest-centric Opportunistic Social Networks (IOSNs) such that the reward is maximized. We first derive Social Connection Pattern (SCP) data to handle the challenging opportunistic connections in IOSNs and statistically analyze the interest distribution of the users contacted or connected. The SCP is used to predict the interests of possible contactors and connectors. Then, we propose our SCP-based Dissemination (SCPD) algorithm to calculate the optimum number of content copies to disseminate when two users meet. Our dataset-based simulation shows that our SCPD algorithm is effective and efficient in disseminating the authorized content in IOSNs.
Multi-Class Sentiment Analysis on Twitter: Classification Performance and Challenges
Mondher Bouazizi, Tomoaki Ohtsuki
Big Data Mining and Analytics 2019, 2 (3): 181-194. DOI: 10.26599/BDMA.2019.9020002
Abstract （127）   HTML （0）   PDF （716KB）（83）
Sentiment analysis refers to the automatic collection, aggregation, and classification of data collected online into different emotion classes. While most of the work related to sentiment analysis of texts focuses on the binary and ternary classification of these data, the task of multi-class classification has received less attention. Multi-class classification has always been a challenging task given the complexity of natural languages and the difficulty of understanding and mathematically "quantifying" how humans express their feelings. In this paper, we study the task of multi-class classification of online posts of Twitter users, and show how far it is possible to go with the classification, and the limitations and difficulties of this task. The proposed approach of multi-class classification achieves an accuracy of 60.2% for 7 different sentiment classes which, compared to an accuracy of 81.3% for binary classification, emphasizes the effect of having multiple classes on the classification performance. Nonetheless, we propose a novel model to represent the different sentiments and show how this model helps to understand how sentiments are related. The model is then used to analyze the challenges that multi-class classification presents and to highlight possible future enhancements to multi-class classification accuracy.
Comparative Study of Statistical Features to Detect the Target Event During Disaster
Madichetty Sreenivasulu, M. Sridevi
Big Data Mining and Analytics 2020, 3 (2): 121-130. DOI: 10.26599/BDMA.2019.9020021
Abstract （51）   HTML （1）   PDF （620KB）（82）
Microblogs, such as Facebook and Twitter, have attracted much attention from users and organizations. Nowadays, Twitter is more popular because of its real-time nature. People often interact with real-time events such as earthquakes and floods through Twitter. During a disaster, the number of posts or tweets on Twitter increases drastically. At the time of the disaster, detecting a target event is a challenging task. In this paper, a framework is proposed for observing the tweets and detecting the target event. For detecting the target event, a classifier is devised based on different combinations of statistical features such as the position of the keyword in a tweet, length of a tweet, the frequency of hashtag, and frequency of user mentions and the URL. From the result, it is evident that the combination of the frequency of hashtag and position of keyword features provides better classification results than the other combinations of features. Hence, the use of two features, namely, frequency of hashtag and position of the earthquake keyword, reduces the event’s detection time. These two features are also helpful for detecting the sub-events, which are used for filtering the tweets related to the disaster. Additionally, different classifiers such as Artificial Neural Networks (ANN), decision tree, and K-Nearest Neighbor (KNN) are compared by using these two features. However, Support Vector Machine (SVM) with linear kernel using the combination of position of earthquake keyword and frequency of hashtag outperforms state-of-the-art methods. Therefore, SVM (linear kernel) with the proposed features is applied for detecting the earthquake during a disaster. The proposed algorithm is tested on the Nepal earthquake and landslide datasets, 2015.
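The statistical features named in the abstract above (keyword position, tweet length, and the frequencies of hashtags, user mentions, and URLs) can be illustrated with a short sketch. The paper's exact feature definitions are not given in this listing, so the whitespace tokenization, the 1-based position convention, and the function name `tweet_features` are all assumptions made for illustration.

```python
import re

def tweet_features(tweet, keyword="earthquake"):
    """Compute simple per-tweet statistics (hypothetical definitions)."""
    tokens = tweet.lower().split()
    # 1-based index of the first token containing the keyword; 0 if absent.
    position = next((i + 1 for i, t in enumerate(tokens) if keyword in t), 0)
    return {
        "keyword_position": position,
        "length": len(tokens),  # tweet length measured in tokens
        "hashtag_count": sum(t.startswith("#") for t in tokens),
        "mention_count": sum(t.startswith("@") for t in tokens),
        "url_count": len(re.findall(r"https?://\S+", tweet)),
    }

example = "Strong #earthquake felt in Kathmandu @redcross http://example.com #Nepal"
features = tweet_features(example)
```

A feature dictionary like this could be converted to a numeric vector and passed to any classifier, such as the linear-kernel SVM the abstract reports as performing best.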
Clinical Big Data and Deep Learning: Applications, Challenges, and Future Outlooks
Ying Yu, Min Li, Liangliang Liu, Yaohang Li, Jianxin Wang
Big Data Mining and Analytics 2019, 2 (4): 288-305. DOI: 10.26599/BDMA.2019.9020007
Abstract （332）   HTML （0）   PDF （994KB）（78）
The explosion of digital healthcare data has led to a surge of data-driven medical research based on machine learning. In recent years, as a powerful technique for big data, deep learning has gained a central position in machine learning circles for its great advantages in feature representation and pattern recognition. This article presents a comprehensive overview of studies that employ deep learning methods to deal with clinical data. Firstly, based on the analysis of the characteristics of clinical data, various types of clinical data (e.g., medical images, clinical notes, lab results, vital signs, and demographic informatics) are discussed, and details of some public clinical datasets are provided. Secondly, a brief review of common deep learning models and their characteristics is conducted. Then, considering the wide range of clinical research and the diversity of data types, several deep learning applications for clinical data are illustrated: auxiliary diagnosis, prognosis, early warning, and other tasks. Although there are challenges involved in applying deep learning techniques to clinical data, it is still worthwhile to look forward to a promising future for deep learning applications in clinical big data in the direction of precision medicine.
 Select Auxo: A Temporal Graph Management System Wentao Han, Kaiwei Li, Shimin Chen, Wenguang Chen Big Data Mining and Analytics   2019, 2 (1): 58-71.   DOI: 10.26599/BDMA.2018.9020030 Abstract （82）   HTML （0）    PDF （753KB）（74）       As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph. We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the abovementioned goals. First, by carefully choosing the time split policy, Auxo achieves linear complexity in both space usage and query time. Second, graph splitting further improves the worst-case query time, and reduces the performance variance introduced by splitting operations. Third, Auxo optimizes the data layout inside chunks, thereby significantly improving the performance of traverse-based graph queries. Experimental evaluation showed that Auxo achieved $2.9×$ to $12.1×$ improvement for global queries, and $1.7×$ to $2.7×$ improvement for local queries, as compared with state-of-the-art open-source solutions.
 Select An Advanced Uncertainty Measure Using Fuzzy Soft Sets: Application to Decision-Making Problems Nitin Bhardwaj, Pallvi Sharma Big Data Mining and Analytics   2021, 4 (2): 94-103.   DOI: 10.26599/BDMA.2020.9020020 Abstract （51）   HTML （0）    PDF （2613KB）（70）       In this paper, uncertainty is measured in the form of fuzziness, which arises from the imprecise boundaries of fuzzy sets. Uncertainty caused by human cognition can be decreased by the use of fuzzy soft sets. There are different approaches to the measurement of uncertainty. The proposed method uses fuzzified evidence theory to calculate the total degree of fuzziness of the parameters. It consists of four main parts. The first part is to measure the uncertainties of parameters using fuzzy soft sets and then to modulate the calculated uncertainties. Afterward, the appropriate basic probability assignments with respect to each parameter are produced. Finally, we use Dempster’s rule of combination to fuse independent parameters into an integrated one. To validate the proposed method, we perform an experiment and compare our outputs with the grey relational analysis method. In addition, a medical diagnosis application relating to COVID-19 is given to show the effectiveness of the advanced method by comparison with another method.
 Select A Novel Clustering Technique for Efficient Clustering of Big Data in Hadoop Ecosystem Sunil Kumar, Maninder Singh Big Data Mining and Analytics   2019, 2 (4): 240-247.   DOI: 10.26599/BDMA.2018.9020037 Abstract （182）   HTML （0）    PDF （32747KB）（69）       Big data analytics and data mining are techniques used to analyze data and to extract hidden information. Traditional approaches to analysis and extraction do not work well for big data because this data is complex and of very high volume. A major data mining technique known as data clustering groups the data into clusters and makes it easy to extract information from these clusters. However, existing clustering algorithms, such as $k$-means and hierarchical clustering, are not efficient, as the quality of the clusters they produce is compromised. Therefore, there is a need to design an efficient and highly scalable clustering algorithm. In this paper, we put forward a new clustering algorithm called hybrid clustering in order to overcome the disadvantages of existing clustering algorithms. We compare the new hybrid algorithm with existing algorithms on the basis of precision, recall, F-measure, execution time, and accuracy of results. From the experimental results, it is clear that the proposed hybrid clustering algorithm is more accurate, and has better precision, recall, and F-measure values.
 Select Statistical Learning for Semantic Parsing: A Survey Qile Zhu, Xiyao Ma, Xiaolin Li Big Data Mining and Analytics   2019, 2 (4): 217-239.   DOI: 10.26599/BDMA.2019.9020011 Abstract （121）   HTML （1）    PDF （1740KB）（68）       A long-term goal of Artificial Intelligence (AI) is to provide machines with the capability of understanding natural language. Understanding natural language can be described as the system producing a correct response to the received input order. This response can be a robot move, an answer to a question, etc. One way to achieve this goal is semantic parsing. It parses utterances into semantic representations called logical forms, which represent many important linguistic phenomena and can be understood by machines. Semantic parsing is a fundamental problem in the natural language understanding area. In recent years, researchers have made tremendous progress in this field. In this paper, we review recent algorithms for semantic parsing, including both conventional machine learning approaches and deep learning approaches. We first give an overview of a semantic parsing system, and then we summarize a general way to do semantic parsing in statistical learning. With the rise of deep learning, we pay more attention to deep learning based semantic parsing, especially for the application of Knowledge Base Question Answering (KBQA). Finally, we survey several benchmarks for KBQA.
 Select On Quantum Methods for Machine Learning Problems Part I: Quantum Tools Farid Ablayev, Marat Ablayev, Joshua Zhexue Huang, Kamil Khadiev, Nailya Salikhova, Dingming Wu Big Data Mining and Analytics   2020, 3 (1): 41-55.   DOI: 10.26599/BDMA.2019.9020016 Abstract （164）   HTML （1）    PDF （18325KB）（67）       This is a review of quantum methods for machine learning problems that consists of two parts. The first part, "quantum tools", presents the fundamentals of qubits, quantum registers, and quantum states, introduces important quantum tools based on known quantum search algorithms and SWAP-test, and discusses the basic quantum procedures used for quantum search methods. The second part, "quantum classification algorithms", introduces several classification problems that can be accelerated by using quantum subroutines and discusses the quantum methods used for classification.
 Select Tweetluenza: Predicting Flu Trends from Twitter Data Balsam Alkouz, Zaher Al Aghbari, Jemal Hussien Abawajy Big Data Mining and Analytics   2019, 2 (4): 273-287.   DOI: 10.26599/BDMA.2019.9020012 Abstract （116）   HTML （0）    PDF （1040KB）（64）       Health authorities worldwide strive to detect Influenza prevalence as early as possible in order to prepare for it and minimize its impacts. To this end, we address the Influenza prevalence surveillance and prediction problem. In this paper, we develop a new Influenza prevalence prediction model, called Tweetluenza, to predict the spread of the Influenza in real time using cross-lingual data harvested from Twitter data streams with emphases on the United Arab Emirates (UAE). Based on the features of tweets, Tweetluenza filters the Influenza tweets and classifies them into two classes, reporting and non-reporting. To monitor the growth of Influenza, the reporting tweets were employed. Furthermore, a linear regression model leverages the reporting tweets to predict the Influenza-related hospital visits in the future. We evaluated Tweetluenza empirically to study its feasibility and compared the results with the actual hospital visits recorded by the UAE Ministry of Health. The results of our experiments demonstrate the practicality of Tweetluenza, which was verified by the high correlation between the Influenza-related Twitter data and hospital visits due to Influenza. Furthermore, the evaluation of the analysis and prediction of Influenza shows that combining English and Arabic tweets improves the correlation results.
 Select Location Prediction on Trajectory Data: A Review Ruizhi Wu, Guangchun Luo, Junming Shao, Ling Tian, Chengzong Peng Big Data Mining and Analytics   2018, 1 (2): 108-127.   DOI: 10.26599/BDMA.2018.9020010 Abstract （117）   HTML （0）    PDF （5482KB）（64）       Location prediction is the key technique in many location based services including route navigation, dining location recommendations, and traffic planning and control, to mention a few. This survey provides a comprehensive overview of location prediction, including basic definitions and concepts, algorithms, and applications. First, we introduce the types of trajectory data and related basic concepts. Then, we review existing location-prediction methods, ranging from temporal-pattern-based prediction to spatiotemporal-pattern-based prediction. We also discuss and analyze the advantages and disadvantages of these algorithms and briefly summarize current applications of location prediction in diverse fields. Finally, we identify the potential challenges and future research directions in location prediction.
 Select Selective Ensemble Learning Method for Belief-Rule-Base Classification System Based on PAES Wanling Liu, Weikun Wu, Yingming Wang, Yanggeng Fu, Yanqing Lin Big Data Mining and Analytics   2019, 2 (4): 306-318.   DOI: 10.26599/BDMA.2019.9020008 Abstract （95）   HTML （0）    PDF （2967KB）（63）       Traditional Belief-Rule-Based (BRB) ensemble learning methods integrate all of the trained sub-BRB systems to obtain better results than a single belief-rule-based system. However, as the number of BRB systems participating in ensemble learning increases, a large amount of redundant sub-BRB systems are generated because of the diminishing difference between subsystems. This drastically decreases the prediction speed and increases the storage requirements for BRB systems. In order to solve these problems, this paper proposes BRBCS-PAES: a selective ensemble learning approach for BRB Classification Systems (BRBCS) based on Pareto-Archived Evolutionary Strategy (PAES) multi-objective optimization. This system employs the improved Bagging algorithm to train the base classifier. For the purpose of increasing the degree of difference in the integration of the base classifier, the training set is constructed by the repeated sampling of data. In the base classifier selection stage, the trained base classifier is binary coded, and the number of base classifiers participating in integration and generalization error of the base classifier is used as the objective function for multi-objective optimization. Finally, the elite retention strategy and the adaptive mesh algorithm are adopted to produce the PAES optimal solution set. Three experimental studies on classification problems are performed to verify the effectiveness of the proposed method. The comparison results demonstrate that the proposed method can effectively reduce the number of base classifiers participating in the integration and improve the accuracy of BRBCS.
 Select New Enhanced Authentication Protocol for Internet of Things Mourade Azrour, Jamal Mabrouki, Azedine Guezzaz, Yousef Farhaoui Big Data Mining and Analytics   2021, 4 (1): 1-9.   DOI: 10.26599/BDMA.2020.9020010 Abstract （78）   HTML （0）    PDF （1097KB）（61）       Internet of Things (IoT) refers to a new extended network that enables any object to be linked to the Internet in order to exchange data and to be controlled remotely. Nowadays, due to its multiple advantages, the IoT is useful in many areas like environment, water monitoring, industry, public security, medicine, and so on. To cover all spaces and operate correctly, the IoT benefits from the advantages of other recent technologies, like radio frequency identification, wireless sensor networks, big data, and mobile networks. However, despite the integration of various things in one network and the exchange of data among heterogeneous sources, the security of users’ data is a central question. For this reason, the authentication of interconnected objects is regarded as highly important. In 2012, Ye et al. suggested a new authentication and key exchanging protocol for Internet of Things devices. However, we have proved that their protocol cannot resist various attacks. In this paper, we propose an enhanced authentication protocol for IoT. Furthermore, we present the comparative results between our proposed scheme and other related ones.
 Select Survey on Lie Group Machine Learning Mei Lu,Fanzhang Li Big Data Mining and Analytics   2020, 3 (4): 235-258.   DOI: 10.26599/BDMA.2020.9020011 Abstract （105）   HTML （2）    PDF （1364KB）（61）       Lie group machine learning is recognized as the theoretical basis of brain intelligence, brain learning, higher machine learning, and higher artificial intelligence. Sample sets of Lie group matrices are widely available in practical applications. Lie group learning is a vibrant field of increasing importance and extraordinary potential and thus needs to be developed further. This study aims to provide a comprehensive survey on recent advances in Lie group machine learning. We introduce Lie group machine learning techniques in three major categories: supervised Lie group machine learning, semisupervised Lie group machine learning, and unsupervised Lie group machine learning. In addition, we introduce the special application of Lie group machine learning in image processing. This work covers the following techniques: Lie group machine learning model, Lie group subspace orbit generation learning, symplectic group learning, quantum group learning, Lie group fiber bundle learning, Lie group cover learning, Lie group deep structure learning, Lie group semisupervised learning, Lie group kernel learning, tensor learning, frame bundle connection learning, spectral estimation learning, Finsler geometric learning, homology boundary learning, category representation learning, and neuromorphic synergy learning. Overall, this survey aims to provide an insightful overview of state-of-the-art development in the field of Lie group machine learning. It will enable researchers to comprehensively understand the state of the field, identify the most appropriate tools for particular applications, and identify directions for future research.
 Select CircRNA-Disease Associations Prediction Based on Metapath2vec++ and Matrix Factorization Yuchen Zhang, Xiujuan Lei, Zengqiang Fang, Yi Pan Big Data Mining and Analytics   2020, 3 (4): 280-291.   DOI: 10.26599/BDMA.2020.9020025 Abstract （91）   HTML （0）    PDF （4531KB）（61）       Circular RNA (circRNA) is a novel class of non-coding endogenous RNAs. Evidence has shown that circRNAs are related to many biological processes and play essential roles in different biological functions. Although increasing numbers of circRNAs are discovered using high-throughput sequencing technologies, these techniques are still time-consuming and costly. In this study, we propose a computational method to predict circRNA-disease associations, which is based on metapath2vec++ and matrix factorization with integrated multiple data (called PCD_MVMF). To construct more reliable networks, various aspects are considered. Firstly, circRNA annotation, sequence, and functional similarity networks are established, and disease-related genes and semantics are adopted to construct disease functional and semantic similarity networks. Secondly, metapath2vec++ is applied on an integrated heterogeneous network to learn the embedded features and initial prediction score. Finally, we use matrix factorization, take similarity as a constraint, and optimize it to obtain the final prediction results. Leave-one-out cross-validation, five-fold cross-validation, and f-measure are adopted to evaluate the performance of PCD_MVMF. These evaluation metrics verify that PCD_MVMF has better prediction performance than other methods. To further illustrate the performance of PCD_MVMF, case studies of common diseases are conducted. Therefore, PCD_MVMF can be regarded as a reliable and useful circRNA-disease association prediction tool.
 Select QoE-Driven Big Data Management in Pervasive Edge Computing Environment Qianyu Meng, Kun Wang, Xiaoming He, Minyi Guo Big Data Mining and Analytics   2018, 1 (3): 222-233.   DOI: 10.26599/BDMA.2018.9020020 Abstract （48）   HTML （0）    PDF （2260KB）（61）       In the age of big data, services in the pervasive edge environment are expected to offer end-users better Quality-of-Experience (QoE) than that in a normal edge environment. However, the combined impact of the storage, delivery, and sensors used in various types of edge devices in this environment is producing volumes of high-dimensional big data that are increasingly pervasive and redundant. Therefore, enhancing the QoE has become a major challenge in high-dimensional big data in the pervasive edge computing environment. In this paper, to achieve high QoE, we propose a QoE model for evaluating the qualities of services in the pervasive edge computing environment. The QoE is related to the accuracy of high-dimensional big data and the transmission rate of this accurate data. To realize high accuracy of high-dimensional big data and the transmission of accurate data throughout the pervasive edge computing environment, we focus on the following two aspects. First, we formulate the issue as a high-dimensional big data management problem and test different transmission rates to acquire the best QoE. Then, with respect to accuracy, we propose a Tensor-Fast Convolutional Neural Network (TF-CNN) algorithm based on deep learning, which is suitable for high-dimensional big data analysis in the pervasive edge computing environment. Our simulation results reveal that our proposed algorithm can achieve high QoE performance.
 Select Efficient Preference Clustering via Random Fourier Features Jingshu Liu, Li Wang, Jinglei Liu Big Data Mining and Analytics   2019, 2 (3): 195-204.   DOI: 10.26599/BDMA.2019.9020003 Abstract （78）   HTML （0）    PDF （3012KB）（60）       Approximations based on random Fourier features have recently emerged as an efficient and elegant method for designing large-scale machine learning tasks. Unlike approaches using the Nyström method, which randomly samples the training examples, we make use of random Fourier features, whose basis functions (i.e., cosine and sine) are sampled from a distribution independent from the training sample set, to cluster preference data which appears extensively in recommender systems. Firstly, we propose a two-stage preference clustering framework. In this framework, we make use of random Fourier features to map the preference matrix into the feature matrix, and then utilize the traditional $k$-means approach to cluster preference data in the transformed feature space. Compared with traditional preference clustering, our method solves the problem of insufficient memory and greatly improves the efficiency of the operation. Experiments on movie data sets containing 100 000 ratings show that the proposed method is more effective in clustering accuracy than the Nyström and $k$-means, while also achieving better performance than these clustering approaches.
 Select Multi-Attention Fusion Modeling for Sentiment Analysis of Educational Big Data Guanlin Zhai, Yan Yang, Heng Wang, Shengdong Du Big Data Mining and Analytics   2020, 3 (4): 311-319.   DOI: 10.26599/BDMA.2020.9020024 Abstract （73）   HTML （0）    PDF （937KB）（60）       As an important branch of natural language processing, sentiment analysis has received increasing attention. In teaching evaluation, sentiment analysis can help educators discover the true feelings of students about the course in a timely manner and adjust the teaching plan accurately and promptly to improve the quality of education and teaching. To address the inefficiency and heavy workload of college curriculum evaluation methods, a Multi-Attention Fusion Modeling (Multi-AFM) approach is proposed, which integrates global attention and local attention through gating unit control to generate a reasonable contextual representation and achieve improved classification results. Experimental results show that the Multi-AFM model performs better than the existing methods in the application of education and other fields.
 Select Improvement in Automated Diagnosis of Soft Tissues Tumors Using Machine Learning El Arbi Abdellaoui Alaoui, Stéphane Cédric Koumetio Tekouabou, Sri Hartini, Zuherman Rustam, Hassan Silkan, Said Agoujil Big Data Mining and Analytics   2021, 4 (1): 33-46.   DOI: 10.26599/BDMA.2020.9020023 Abstract （80）   HTML （0）    PDF （5099KB）（58）       Soft Tissue Tumors (STT) are a form of sarcoma found in tissues that connect, support, and surround body structures. Because of their shallow frequency in the body and their great diversity, they appear to be heterogeneous when observed through Magnetic Resonance Imaging (MRI). They are easily confused with other diseases such as fibroadenoma mammae, lymphadenopathy, and struma nodosa, and these diagnostic errors have a considerable detrimental effect on the medical treatment process of patients. Researchers have proposed several machine learning models to classify tumors, but none have adequately addressed this misdiagnosis problem. Also, similar studies that have proposed models for evaluation of such tumors mostly do not consider the heterogeneity and the size of the data. Therefore, we propose a machine learning-based approach that combines a new technique of preprocessing the data for feature transformation, resampling techniques to eliminate the bias and the deviation of instability, and classifier tests based on the Support Vector Machine (SVM) and Decision Tree (DT) algorithms. The tests carried out on a dataset collected in Nur Hidayah Hospital of Yogyakarta in Indonesia show a great improvement compared to previous studies. These results confirm that machine learning methods could provide efficient and effective tools to reinforce the automatic decision-making processes of STT diagnostics.
 Select Big Data Analytics for Healthcare Industry: Impact, Applications, and Tools Sunil Kumar, Maninder Singh Big Data Mining and Analytics   2019, 2 (1): 48-57.   DOI: 10.26599/BDMA.2018.9020031 Abstract （303）   HTML （0）    PDF （1057KB）（56）       In recent years, huge amounts of structured, unstructured, and semi-structured data have been generated by various institutions around the world and, collectively, this heterogeneous data is referred to as big data. The health industry sector has been confronted by the need to manage the big data being produced by various sources, which are well known for producing high volumes of heterogeneous data. Various big-data analytics tools and techniques have been developed for handling these massive amounts of data, in the healthcare sector. In this paper, we discuss the impact of big data in healthcare, and various tools available in the Hadoop ecosystem for handling it. We also explore the conceptual architecture of big data analytics for healthcare which involves the data gathering history of different branches, the genome database, electronic health records, text/imagery, and clinical decisions support system.
 Select Spreading Social Influence with both Positive and Negative Opinions in Online Networks Jing (Selena) He, Meng Han, Shouling Ji, Tianyu Du, Zhao Li Big Data Mining and Analytics   2019, 2 (2): 100-117.   DOI: 10.26599/BDMA.2018.9020034 Abstract （114）   HTML （0）    PDF （2038KB）（56）       Social networks are important media for spreading information, ideas, and influence among individuals. Most existing research focuses on understanding the characteristics of social networks, investigating how information is spread through the "word-of-mouth" effect of social networks, or exploring social influences among individuals and groups. However, most studies ignore negative influences among individuals and groups. Motivated by the goal of alleviating social problems, such as drinking, smoking, and gambling, and influence-spreading problems, such as promoting new products, we consider positive and negative influences, and propose a new optimization problem called the Minimum-sized Positive Influential Node Set (MPINS) selection problem to identify the minimum set of influential nodes such that every node in the network can be positively influenced by these selected nodes with no less than a threshold of $θ$. Our contributions are threefold. First, we prove that, under the independent cascade model considering positive and negative influences, MPINS is APX-hard. Subsequently, we present a greedy approximation algorithm to address the MPINS selection problem. Finally, to validate the proposed greedy algorithm, we conduct extensive simulations and experiments on random graphs and seven different real-world data sets that represent small-, medium-, and large-scale networks.
 Select Intelligent Monitoring System for Biogas Detection Based on the Internet of Things: Mohammedia, Morocco City Landfill Case Jamal Mabrouki, Mourade Azrour, Ghizlane Fattah, Driss Dhiba, Souad El Hajjaji Big Data Mining and Analytics   2021, 4 (1): 10-17.   DOI: 10.26599/BDMA.2020.9020017 Abstract （91）   HTML （4）    PDF （1847KB）（56）       Mechanization is a depollution activity, because it provides an energetic and ecological response to the problem of organic waste treatment. Through burning, biogas from mechanization reduces gas pollution from fermentation by a factor of 20. This study aims to better understand the influence of the seasons on the emitted biogas in the landfill of the city of Mohammedia. The composition of the biogas that naturally emanates from the landfill has been continuously analyzed by our intelligent system, from different wells drilled in recent and old waste repositories. During the rainy season, the average production of methane, carbon dioxide, and oxygen and nitrogen are currently 56%, 32%, and 1%, respectively, compared to 51%, 31%, and 0.8%, respectively, for old waste. Hazard levels and potential fire and explosion risks associated with biogas are lower than those of natural gases in most cases. For this reason, a system is proposed to measure and monitor the biogas production of the landfill site remotely. Measurement results carried out at various sites of the landfill in the city of Mohammedia by the system show that the biogas contents present dangers and sanitary risks which are of another order.