In recent years, machine-learning-based data analysis has accelerated the development of optimized solutions in clinical healthcare systems. Machine learning methods offer efficient prediction in diagnosis systems, complementing the work of clinicians. Most such systems operate on features extracted from patients, and most of the predicted cases are accurate. Recently, however, the prevalence of COVID-19 has pushed the global healthcare industry to find a new drug that suppresses the pandemic outbreak. In this paper, we design a Deep Neural Network (DNN) model that accurately identifies protein-ligand interactions with the drug used. The DNN senses the response of protein-ligand interactions for a specific drug and identifies which drug produces the interaction that combats the virus most effectively. Using the limited genome sequences of Indian patients submitted to the GISAID database, we find that the DNN system is effective in identifying the protein-ligand interactions for a specific drug.
Lie group machine learning is recognized as the theoretical basis of brain intelligence, brain learning, higher machine learning, and higher artificial intelligence. Sample sets of Lie group matrices are widely available in practical applications. Lie group learning is a vibrant field of increasing importance and extraordinary potential and thus needs to be developed further. This study aims to provide a comprehensive survey on recent advances in Lie group machine learning. We introduce Lie group machine learning techniques in three major categories: supervised Lie group machine learning, semisupervised Lie group machine learning, and unsupervised Lie group machine learning. In addition, we introduce the special application of Lie group machine learning in image processing. This work covers the following techniques: Lie group machine learning model, Lie group subspace orbit generation learning, symplectic group learning, quantum group learning, Lie group fiber bundle learning, Lie group cover learning, Lie group deep structure learning, Lie group semisupervised learning, Lie group kernel learning, tensor learning, frame bundle connection learning, spectral estimation learning, Finsler geometric learning, homology boundary learning, category representation learning, and neuromorphic synergy learning. Overall, this survey aims to provide an insightful overview of state-of-the-art development in the field of Lie group machine learning. It will enable researchers to comprehensively understand the state of the field, identify the most appropriate tools for particular applications, and identify directions for future research.
Social media has more than three billion users sharing events, comments, and feelings throughout the world. It serves as a critical information source characterized by large volume, high velocity, and a wide variety of data. Previous studies on information spreading, relationship analysis, individual modeling, and related topics have extensively explored the tremendous social and commercial value of social media data. This survey reviews the previous literature and existing applications from a practical perspective. We outline a pipeline commonly used in building social media-based applications and focus on available analysis techniques, such as topic analysis, time series analysis, sentiment analysis, and network analysis. We then present the impact of such applications in three areas: disaster management, healthcare, and business. Finally, we list existing challenges and suggest promising future research directions in terms of data privacy, 5G wireless networks, and multilingual support.
Insect pest control is a significant factor in the yield of commercial crops. Thus, to avoid economic losses, a valid method for insect pest recognition is needed. In this paper, we propose a feature fusion residual block for the insect pest recognition task. Building on the original residual block, we fuse the features from a previous layer between two 1×1 convolution layers in the residual signal branch to improve the capacity of the block. Furthermore, we explore the contribution of each residual group to model performance. We find that adding residual blocks to earlier residual groups promotes model performance significantly and improves the model's generalization capacity. By stacking the feature fusion residual block, we construct the Deep Feature Fusion Residual Network (DFF-ResNet). To prove the validity and adaptivity of our approach, we build it on two common residual networks (Pre-ResNet and the Wide Residual Network (WRN)) and validate these models on the Canadian Institute For Advanced Research (CIFAR) and Street View House Number (SVHN) benchmark datasets. The experimental results indicate that our models achieve lower test error than the baseline models. We then apply our models to insect pest recognition and validate them on the IP102 benchmark dataset. The experimental results show that our models outperform the original ResNet and other state-of-the-art methods.
Circular RNA (circRNA) is a novel class of non-coding endogenous RNA. Evidence has shown that circRNAs are related to many biological processes and play essential roles in different biological functions. Although increasing numbers of circRNAs are discovered using high-throughput sequencing technologies, these techniques are still time-consuming and costly. In this study, we propose a computational method to predict circRNA-disease associations based on metapath2vec++ and matrix factorization with integrated multiple data sources (called PCD_MVMF). To construct more reliable networks, various aspects are considered. First, circRNA annotation, sequence, and functional similarity networks are established, and disease-related genes and semantics are adopted to construct disease functional and semantic similarity networks. Second, metapath2vec++ is applied to an integrated heterogeneous network to learn the embedded features and an initial prediction score. Finally, we use matrix factorization, with similarity as a constraint, and optimize it to obtain the final prediction results. Leave-one-out cross-validation, five-fold cross-validation, and the F-measure are adopted to evaluate the performance of PCD_MVMF. These evaluation metrics verify that PCD_MVMF has better prediction performance than other methods. To further illustrate the performance of PCD_MVMF, case studies of common diseases are conducted. Therefore, PCD_MVMF can be regarded as a reliable and useful circRNA-disease association prediction tool.
Methanization is a depollution activity because it provides an energetic and ecological response to the problem of organic waste treatment. Through burning, biogas from methanization reduces gas pollution from fermentation by a factor of 20. This study aims to better understand the influence of the seasons on the biogas emitted in the landfill of the city of Mohammedia. The composition of the biogas that naturally emanates from the landfill has been continuously analyzed by our intelligent system from different wells drilled in recent and old waste deposits. During the rainy season, the average production of methane, carbon dioxide, and oxygen and nitrogen in recent waste is 56%, 32%, and 1%, respectively, compared with 51%, 31%, and 0.8%, respectively, for old waste. The hazard levels and the potential fire and explosion risks associated with biogas are lower than those of natural gas in most cases. For this reason, a system is proposed to measure and monitor the biogas production of the landfill site remotely. Measurement results obtained by the system at various sites of the landfill of the city of Mohammedia show that the biogas contents present dangers and sanitary risks of a different order.
The novel coronavirus outbreak was first reported in late December 2019; more than 7 million people have been infected with this disease, and over 0.40 million worldwide have lost their lives. The first case in India was diagnosed on 30 January 2020, and the figure crossed 0.24 million as of 6 June 2020. This paper presents a detailed study of recently developed forecasting models and predicts the numbers of confirmed, recovered, and death cases in India caused by COVID-19. Correlation coefficients and multiple linear regression are applied for prediction, and autocorrelation and autoregression are used to improve the accuracy. The predicted numbers of cases show good agreement with the actual values, with an R-squared score of 0.9992. The findings suggest that lockdown and social distancing are two important factors that can help suppress the increasing spread rate of COVID-19.
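The abstract gives no implementation details; a minimal sketch of the general idea, a linear model with an autoregressive lag feature fitted by least squares on synthetic data (not the authors' actual model), could look like:

```python
import numpy as np

def fit_ar_linear(series, lag=1):
    """Fit y_t = a + b * y_{t-lag} by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([np.ones(len(y) - lag), y[:-lag]])
    coef, *_ = np.linalg.lstsq(X, y[lag:], rcond=None)
    return coef[0], coef[1]

def forecast(series, steps, lag=1):
    """Iteratively roll the fitted model forward `steps` periods."""
    a, b = fit_ar_linear(series, lag)
    hist = list(series)
    preds = []
    for _ in range(steps):
        nxt = a + b * hist[-lag]
        preds.append(nxt)
        hist.append(nxt)
    return preds

# Synthetic cumulative-case curve (purely illustrative data).
cases = [100 * 1.1 ** t for t in range(30)]
print(forecast(cases, 3))
```

On real case counts, such a model would be refitted as new daily figures arrive, and additional regressors (e.g., mobility indicators) would enter as extra columns of `X`.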
In recent years, monitoring systems have come to play a significant role in our lives. In this paper, we propose an automatic weather monitoring system that provides dynamic, real-time climate data for a given area. The proposed system is based on Internet of Things technology and embedded systems, and it also includes electronic devices, sensors, and wireless technology. The main objective of the system is to sense climate parameters, such as temperature, humidity, and the presence of certain gases, using the sensors. The captured values can then be sent to remote applications or databases. Afterwards, the stored data can be visualized in graphical and tabular form.
Soft Tissue Tumors (STT) are a form of sarcoma found in tissues that connect, support, and surround body structures. Because of their low frequency in the body and their great diversity, they appear heterogeneous when observed through Magnetic Resonance Imaging (MRI). They are easily confused with other diseases such as fibroadenoma mammae, lymphadenopathy, and struma nodosa, and these diagnostic errors have a considerable detrimental effect on the medical treatment of patients. Researchers have proposed several machine learning models to classify tumors, but none has adequately addressed this misdiagnosis problem. Moreover, similar studies that propose models for the evaluation of such tumors mostly fail to consider the heterogeneity and the size of the data. Therefore, we propose a machine learning-based approach that combines a new data preprocessing technique for feature transformation, resampling techniques to eliminate bias and the deviation of instability, and classifier tests based on the Support Vector Machine (SVM) and Decision Tree (DT) algorithms. Tests carried out on a dataset collected at Nur Hidayah Hospital of Yogyakarta in Indonesia show a great improvement compared with previous studies. These results confirm that machine learning methods can provide efficient and effective tools to reinforce the automatic decision-making processes of STT diagnostics.
The Internet of Things (IoT) refers to a new extended network that enables any object to be linked to the Internet in order to exchange data and be controlled remotely. Nowadays, owing to its multiple advantages, the IoT is useful in many areas, such as environmental and water monitoring, industry, public security, and medicine. To cover all spaces and operate correctly, the IoT benefits from other recent technologies, such as radio frequency identification, wireless sensor networks, big data, and mobile networks. However, despite the integration of various things in one network and the exchange of data among heterogeneous sources, the security of users' data remains a central question. For this reason, the authentication of interconnected objects has received particular attention. In 2012, Ye et al. suggested a new authentication and key exchange protocol for Internet of Things devices. However, we prove that their protocol cannot resist various attacks. In this paper, we propose an enhanced authentication protocol for the IoT. Furthermore, we present comparative results between our proposed scheme and other related ones.
As an important branch of natural language processing, sentiment analysis has received increasing attention. In teaching evaluation, sentiment analysis can help educators discover students' true feelings about a course in a timely manner and adjust the teaching plan accurately and promptly to improve the quality of education and teaching. To address the inefficiency and heavy workload of college curriculum evaluation methods, a Multi-Attention Fusion Modeling (Multi-AFM) approach is proposed, which integrates global attention and local attention through gating-unit control to generate a reasonable contextual representation and achieve improved classification results. Experimental results show that the Multi-AFM model performs better than existing methods in the application of education and other fields.
Speed forecasting has numerous applications in the design and control of intelligent transport systems, especially for safety and road efficiency applications. In the field of electromobility, it represents the most dynamic parameter for efficient online in-vehicle energy management. However, vehicle speed forecasting is a challenging task, because its estimation is closely related to various features, which can be classified into two categories: endogenous and exogenous. Endogenous features represent the electric vehicle's characteristics, whereas exogenous ones represent its surrounding context, such as traffic, weather, and road conditions. In this paper, a speed forecasting method based on Long Short-Term Memory (LSTM) is introduced. The LSTM model is trained on a dataset collected from a traffic simulator based on real-world data representing urban itineraries. The proposed models are generated for univariate and multivariate scenarios and are assessed in terms of speed forecasting accuracy. Simulation results show that the multivariate model outperforms the univariate model for both short- and long-term forecasting.
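As an illustration of the data preparation such sequence models typically need (not taken from the paper; the feature layout and window size are assumptions), sliding windows over a speed series can be built as follows:

```python
def make_windows(series, lookback, horizon=1):
    """Slice a (multi)variate series into (input window, target) pairs.

    `series` is a list of feature vectors per time step; the target is
    the first feature (assumed here to be speed) `horizon` steps after
    the end of each window.
    """
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback + horizon - 1][0])
    return X, y

# Univariate case: each time step holds only the speed value.
uni = [[v] for v in [30, 32, 35, 33, 31, 29]]
X, y = make_windows(uni, lookback=3)
print(len(X), y)  # 3 windows; targets 33, 31, 29
```

In the multivariate scenario, each time step vector would also carry exogenous features (traffic, weather, road condition codes) alongside the speed, with no change to the windowing logic.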
The novel coronavirus (SARS-CoV-2) causes an unusual viral pneumonia in patients; it was first found in late December 2019 and was later declared a pandemic by the World Health Organization because of its fatal effects on public health. At present, COVID-19 cases are increasing exponentially day by day across the whole world. Here, we analyze COVID-19 cases, i.e., confirmed, death, and cured cases, in India only, based on the cases occurring in different states of India in chronological order. Because our dataset contains multiple classes, we perform multi-class classification. On this dataset, we first perform data cleansing and feature selection, and then forecast all classes using random forest, linear model, support vector machine, decision tree, and neural network models. The random forest model outperforms the others and is therefore used for the prediction and analysis of all results. K-fold cross-validation is performed to measure the consistency of the model.
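K-fold cross-validation, as used here to measure model consistency, can be sketched in a few lines (a generic implementation, not the authors' code):

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds.

    Every sample appears in exactly one test fold, so k models are
    trained and evaluated on disjoint held-out sets.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Ten samples, five folds: each fold holds out two samples.
for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))
```

Averaging the per-fold scores (and inspecting their spread) is what indicates whether the chosen model, such as the random forest here, behaves consistently across data subsets.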
With the development of the Internet, technology, and means of communication, the production of tourism data has multiplied at all levels (hotels, restaurants, transport, heritage, tourist events, activities, etc.), especially with the development of Online Travel Agencies (OTAs). However, the list of possibilities these Web search engines (or even specialized tourist sites) offer tourists can be overwhelming, and relevant results are usually drowned in informational "noise", which prevents, or at least slows down, the selection process. To assist tourists in trip planning and help them find the information they are looking for, many recommender systems have been developed. In this article, we present an overview of the various recommendation approaches used in the field of tourism. From this study, an architecture and a conceptual framework for a tourism recommender system are proposed, based on a hybrid recommendation approach. The proposed system goes beyond recommending a list of tourist attractions tailored to tourist preferences. It can be seen as a trip planner that designs a detailed program, including heterogeneous tourism resources, for a specific visit duration. The ultimate goal is to develop a recommender system based on big data technologies, artificial intelligence, and operational research to promote tourism in Morocco, specifically in the Daraa-Tafilalet region.
Modeling an efficient classifier is a fundamental issue in automatic training involving a large volume of representative data. Automatic classification is thus a major task that entails training methods capable of assigning classes to data objects from the input activities presented during learning; new elements can then be recognized on the basis of the predefined classes. Intrusion detection systems suffer from numerous vulnerabilities during the analysis and classification of data activities. To overcome this problem, new analysis methods should be derived to implement a relevant system for monitoring circulating traffic. The main objective of this study is to model and validate a heterogeneous traffic classifier capable of categorizing events collected within networks. The new model is based on a proposed machine learning algorithm comprising an input layer, a hidden layer, and an output layer. A reliable training algorithm is proposed to optimize the weights, and a recognition algorithm is used to validate the model. Preprocessing is applied to the collected traffic prior to the analysis step. This work describes the mathematical validation of a new machine learning classifier for heterogeneous traffic and anomaly detection.
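A minimal sketch of a classifier with one input, one hidden, and one output layer, trained by gradient descent on toy data (a generic illustration, not the authors' algorithm), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, W1, b1, W2, b2):
    """Input -> hidden (tanh) -> output (sigmoid) forward pass."""
    h = np.tanh(X @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return h, out

# Toy non-linearly-separable "traffic events" (XOR pattern).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

lr = 0.5
for _ in range(5000):  # plain full-batch gradient descent on squared error
    h, out = forward(X, W1, b1, W2, b2)
    g_out = (out - y) * out * (1 - out)      # output-layer error signal
    g_h = (g_out @ W2.T) * (1 - h ** 2)      # backpropagated to hidden layer
    W2 -= lr * h.T @ g_out; b2 -= lr * g_out.sum(0)
    W1 -= lr * X.T @ g_h;   b1 -= lr * g_h.sum(0)

_, out = forward(X, W1, b1, W2, b2)
print((out > 0.5).astype(int).ravel())
```

A real traffic classifier would replace the toy inputs with preprocessed flow features and use a multi-class output layer, but the weight-optimization loop has the same shape.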
E-learning is one of the most promising ventures in the world. During the COVID-19 lockdown, e-learning has successfully provided essential information to students and researchers. In developing nations like India, with limited resources, e-learning tools and platforms provide a chance to make education available to middle- and low-income households. This paper gives insights into three different online services, namely Google Classroom, Zoom, and Microsoft Teams, used by three different educational institutions. We aim to analyze the efficiency and acceptability of e-learning tools among Indian students during the COVID-19 lockdown. The paper also evaluates the impact of e-learning on the environment and public health during the lockdown. We find that e-learning has the potential to reduce carbon emissions, which benefits the environment. However, mental health is affected, as e-learning may lead to self-isolation and a reduction in academic achievement, which in turn may lead to anxiety and depression. The use of electronic devices for learning may also strain the eyes and neck muscles, with deleterious effects on physical health.
A travel recommendation system based on social media activity provides customized places of interest to accommodate user-specific needs and preferences. In general, a user's inclination towards travel destinations changes over time. In this work, we analyze users' Twitter data, together with that of their friends and followers, in a timely fashion to understand their recent travel interests. A machine learning classifier identifies tweets relevant to travel, and the travel tweets are then used to obtain personalized travel recommendations. Unlike most personalized recommendation systems, our proposed model takes a user's most recent interests into account by incorporating a time-sensitive recency weight into the model. Our proposed model outperforms an existing personalized place-of-interest recommendation model, with an overall accuracy of 75.23%.
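One common way to realize a time-sensitive recency weight is exponential decay; the paper does not specify its exact form, so the half-life and the topic/age data below are purely illustrative:

```python
def recency_weight(age_days, half_life=30.0):
    """Exponential decay: a tweet `half_life` days old counts half as much."""
    return 0.5 ** (age_days / half_life)

def weighted_interest(tweets, half_life=30.0):
    """Aggregate per-topic interest scores, discounting older tweets.

    `tweets` is a list of (topic, age_in_days) pairs.
    """
    scores = {}
    for topic, age in tweets:
        scores[topic] = scores.get(topic, 0.0) + recency_weight(age, half_life)
    return scores

# A year-old "beach" tweet contributes almost nothing; recent ones dominate.
s = weighted_interest([("beach", 2), ("beach", 400), ("hiking", 5)])
print(max(s, key=s.get))
```

The decayed scores would then rank candidate destinations, so that a burst of recent tweets about a topic outweighs a larger but stale history.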
With the rapid development of human society, the urbanization of the world's population is also progressing rapidly. Urbanization has brought many challenges and problems to the development of cities: the urban population is under excessive pressure, natural resources and energy are increasingly scarce, and environmental pollution is increasing. The original urban model therefore has to change to enable people to live in greener and more sustainable cities that provide a more convenient and comfortable living environment. The new urban framework, the smart city, provides excellent opportunities to meet these challenges while solving urban problems at the same time. At this stage, many countries are actively responding with smart city development plans. This paper investigates the current stage of the smart city. First, it introduces the background of smart city development and gives a brief definition of the smart city concept. Second, it describes the framework of a smart city in accordance with the given definition. Finally, various intelligent algorithms to make cities smarter are discussed and analyzed, along with specific examples.
In this paper, uncertainty is measured in the form of fuzziness, which arises from the imprecise boundaries of fuzzy sets. Uncertainty caused by human cognition can be decreased by the use of fuzzy soft sets. There are different approaches to measuring uncertainty. The proposed method uses fuzzified evidence theory to calculate the total degree of fuzziness of the parameters. It consists of four main parts. The first is to measure the uncertainties of the parameters using fuzzy soft sets and then to modulate the calculated uncertainties. Afterward, appropriate basic probability assignments with respect to each parameter are produced. Finally, we use Dempster's rule of combination to fuse the independent parameters into an integrated one. To validate the proposed method, we perform an experiment and compare our outputs with the grey relational analysis method. A medical diagnosis application concerning COVID-19 is also given to show the effectiveness of the proposed method in comparison with another method.
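Dempster's rule of combination, used in the fusion step above, can be sketched generically (the hypothesis sets and masses below are illustrative, not taken from the paper):

```python
def dempster_combine(m1, m2):
    """Combine two basic probability assignments with Dempster's rule.

    Each BPA maps a frozenset of hypotheses to a mass; mass assigned to
    pairs with empty intersection (conflict) is renormalized away.
    """
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    k = 1.0 - conflict
    return {s: v / k for s, v in combined.items()}

# Two illustrative evidence sources over hypotheses {covid, flu}:
m1 = {frozenset({"covid"}): 0.6, frozenset({"covid", "flu"}): 0.4}
m2 = {frozenset({"covid"}): 0.5, frozenset({"flu"}): 0.3,
      frozenset({"covid", "flu"}): 0.2}
print(dempster_combine(m1, m2))
```

Combining the two sources concentrates mass on the hypothesis they jointly support, which is exactly the effect used to fuse independent parameters into one integrated assessment.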
Coronavirus disease 2019, also known as COVID-19, has become a pandemic. The disease is caused by a beta coronavirus called Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). Its severity can be understood from the massive numbers of deaths and affected patients globally. If diagnosis is fast-paced, the disease can be controlled in a better manner. Laboratory tests are available for diagnosis, but they are limited by the available testing kits and by time. Radiological examinations, including Computed Tomography (CT), can also be used for diagnosis; in particular, chest X-ray images can be analysed to identify the presence of COVID-19 in a patient. In this paper, an automated method for the diagnosis of COVID-19 from chest X-ray images is proposed. The method presents an improved depthwise convolutional neural network for analysing the chest X-ray images. Wavelet decomposition is applied to integrate multiresolution analysis into the network: the frequency sub-bands obtained from the input images are fed to the network to identify the disease. The network is designed to predict the class of the input image as normal, viral pneumonia, or COVID-19, and the predicted output from the model is combined with Grad-CAM visualization for diagnosis. A comparative study with existing methods is also performed, with metrics such as accuracy, sensitivity, and F1-measure calculated for performance evaluation. The performance of the proposed method is better than that of the existing methodologies, and thus it can be used for the effective diagnosis of the disease.
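One level of 2-D Haar wavelet decomposition is the simplest instance of the multiresolution analysis described above; the sketch below uses an average-based normalization, and the paper's actual wavelet family and normalization may differ:

```python
import numpy as np

def haar2d(img):
    """One level of 2-D Haar decomposition into LL, LH, HL, HH sub-bands.

    Rows and columns must have even length; averages and differences of
    2x2 blocks give the four half-resolution frequency sub-bands that
    would be fed to the network.
    """
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # approximation (low-low)
    lh = (a + b - c - d) / 4.0   # horizontal detail
    hl = (a - b + c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

img = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar2d(img)
print(ll.shape)  # each sub-band is half the resolution: (2, 2)
```

Repeating the decomposition on the LL band yields progressively coarser scales, which is how multiple resolutions of a chest X-ray can be presented to a single network.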
Density-based clustering is an important category among clustering algorithms. In real applications, many datasets suffer from incompleteness. Traditional imputation technologies and other techniques for handling missing values are not suitable for density-based clustering and decrease the quality of clustering results. To avoid these problems, we develop a novel density-based clustering approach for incomplete data based on Bayesian theory, which conducts imputation and clustering concurrently and makes use of intermediate clustering results. To avoid the impact of low-density areas inside non-convex clusters, we introduce a local imputation clustering algorithm, which aims to impute points to high-density local areas. The performance of the proposed algorithms is evaluated using ten synthetic datasets and five real-world datasets with induced missing values. The experimental results show the effectiveness of the proposed algorithms.
Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Existing methods handle the two subtasks one by one in a pipeline manner, which causes problems in both performance and real application. This study investigates end-to-end ABSA and proposes a novel multitask multiview network (MTMVN) architecture. Specifically, the architecture takes unified ABSA as the main task, with the two subtasks as auxiliary tasks. Meanwhile, the representation obtained from the branch network of the main task is regarded as the global view, whereas the representations of the two subtasks are considered two local views with different emphases. Through multitask learning, the main task can be facilitated by additional accurate aspect boundary information and sentiment polarity information. By enhancing the correlations between the views under the idea of multiview learning, the representation of the global view can be optimized to improve the overall performance of the model. Experimental results on three benchmark datasets show that the proposed method exceeds existing pipeline methods and end-to-end methods, proving the superiority of our MTMVN architecture.
Live streaming has grown rapidly in recent years, attracting ever more participants. Because the number of online anchors is large, it is difficult for viewers to find the anchors they are interested in; a personalized recommendation system is therefore important for live streaming platforms. On these platforms, viewers' and anchors' preferences change dynamically over time. How to capture a user's preference changes has been extensively studied in the literature, but how to model both the viewer's and the anchor's preference changes, and how to learn their representations based on their preference matching, are less studied. Taking these issues into consideration, in this paper we propose a deep sequential model for live streaming recommendation. We develop a component named the multi-head related-unit to capture the preference matching between anchor and viewer and to extract related features for their representations. To evaluate the performance of our proposed model, we conduct experiments on real datasets; the results show that our model outperforms state-of-the-art recommendation models.
The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect AI computing power and input/output performance without a representative workload, while current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose AIPerf, an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios but also scales auto-adaptively to various sizes of machine. We implement the algorithms in a highly parallel and flexible way to ensure efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), measured in an analytical and systematic approach, as the major metric to quantify AI performance. We perform evaluations on various systems, from 4 nodes with 32 NVIDIA Tesla T4 GPUs (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 processors (194.53 Peta-OPS measured), to ensure the benchmark's stability and scalability, and the results show near-linear weak scalability. With a flexible workload and a single metric, AIPerf can easily scale across and rank AI-HPC systems, providing a powerful benchmark suite for the coming supercomputing era.
Identity-recognition technologies usually require assistive equipment, yet they suffer from poor recognition accuracy and high cost. To overcome these deficiencies, this paper proposes several gait feature identification algorithms. First, the gait information collected from individuals via the triaxial accelerometers of smartphones is preprocessed, and multimodal fusion with existing standard datasets yields a multimodal synthetic dataset. Then, based on the multimodal characteristics of the collected biological gait information, a Convolutional Neural Network based Gait Recognition (CNN-GR) model and a related scheme for the multimodal features are developed. Finally, based on the proposed CNN-GR model and scheme, a unimodal (single-gait) feature identification algorithm and a multimodal gait feature fusion identification algorithm are proposed. Experimental results show that the proposed algorithms perform well in terms of recognition accuracy, the confusion matrix, and the kappa statistic, and they have better recognition scores and robustness than the compared algorithms; thus, the proposed algorithms hold prominent promise in practice.
Time series forecasting has attracted wide attention in recent decades. However, some time series are imbalanced, showing different patterns between special and normal periods, which degrades prediction accuracy for the special periods. In this paper, we aim to develop a unified model that alleviates this imbalance and thus improves prediction accuracy for special periods. The task is challenging for two reasons: (1) the temporal dependency of the series, and (2) the tradeoff between mining similar patterns and distinguishing different distributions between different periods. To tackle these issues, we propose a self-attention-based time-varying prediction model with a two-stage training strategy. First, we use an encoder-decoder module with a multi-head self-attention mechanism to extract common patterns of the time series. Then, we propose a time-varying optimization module to optimize the results for special periods and eliminate the imbalance. Moreover, we propose reverse distance attention in place of traditional dot-product attention to highlight the importance of similar historical values to the forecast results. Finally, extensive experiments show that our model outperforms other baselines in terms of mean absolute error and mean absolute percentage error.
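The abstract does not define reverse distance attention precisely; one plausible reading (weights from a softmax over negative distances, so that keys similar to the query dominate) can be sketched as follows, with all data illustrative:

```python
import numpy as np

def reverse_distance_attention(q, K, V):
    """Attention weights from negative distance instead of dot product.

    Keys closer to the query (smaller |q - k|) receive larger weights,
    so similar historical values dominate the output. This is one
    reading of "reverse distance attention"; the paper's exact form
    may differ.
    """
    dist = np.abs(K - q).sum(axis=-1)   # L1 distance from query to each key
    w = np.exp(-dist)
    w = w / w.sum()                     # softmax over negative distances
    return w @ V, w

K = np.array([[1.0], [5.0], [1.1]])     # historical values (keys)
V = np.array([10.0, 99.0, 12.0])        # associated targets (values)
q = np.array([1.0])                     # current value (query)
out, w = reverse_distance_attention(q, K, V)
print(w.argmax())  # the nearest key gets the largest weight
```

With dot-product attention, the key with the largest magnitude can dominate even when it is far from the query; distance-based weighting instead favors historically similar values, which matches the stated motivation.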
As powerful tools for learning embedding representations of graph-structured data, Graph Neural Networks (GNNs), which were built for homogeneous networks, have been widely used in various data mining tasks. Applying a GNN to embed a Heterogeneous Information Network (HIN) is a substantial challenge, mainly because a HIN contains many different types of nodes and many different types of relationships between nodes. A HIN carries rich semantic and structural information, which requires a specially designed graph neural network. However, existing HIN-based graph neural network models rarely consider the interactive information hidden between the meta-paths of the HIN, resulting in poor node embeddings. In this paper, we propose an Attention-aware Heterogeneous graph Neural Network (AHNN) model to effectively extract useful information from the HIN and use it to learn the embedding representation of nodes. Specifically, we first use node-level attention to aggregate and update the embedding representations of nodes, and then concatenate the embedding representations of the nodes on different meta-paths. Finally, a semantic-level neural network is proposed to extract the feature interaction relationships on different meta-paths and learn the final embedding of nodes. Experimental results on three widely used datasets show that the AHNN model significantly outperforms state-of-the-art models.
Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and such data are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system that combines convolutional and Long Short-Term Memory (LSTM) networks: the You Only Look Once (YOLO) algorithm enables automated web page detection, and the Tesseract LSTM extracts product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine and can thus adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained on an input dataset of 45 objects or images.
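The detection precision figures reported above are typically computed by matching predicted boxes to ground-truth boxes at an IoU threshold. A minimal sketch of that standard metric (the authors' exact evaluation protocol is not specified in the abstract):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_precision(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of predicted boxes that match some ground-truth box at
    IoU >= thresh (one common definition of detection precision)."""
    hits = sum(any(iou(p, g) >= thresh for g in gt_boxes)
               for p in pred_boxes)
    return hits / len(pred_boxes) if pred_boxes else 0.0
```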
In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL), providing convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from Spark's inefficiencies arising from the Java Virtual Machine and unnecessary data serialization and deserialization. Adopting a native language such as C++ can help avoid such bottlenecks. Benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging effort. In this work, we present LotusSQL, an engine that provides SQL support for the dataset abstraction of a native backend, Lotus. We employ a convenient SQL processing framework to handle frontend jobs, and add advanced query optimization techniques to improve the quality of execution plans. On top of the storage design and user interface of the compute engine, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× on certain queries and outperforms SparkSQL on a standard query benchmark by more than 2× on average.
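To make "query optimization to improve the quality of execution plans" concrete, here is a toy sketch of one classic rewrite rule such optimizers apply: predicate pushdown, which moves a filter below a projection so the predicate is evaluated earlier. The plan node classes and the rule are illustrative only and do not reflect LotusSQL's internal design.

```python
# Toy logical-plan nodes (hypothetical, for illustration only).
class Scan:
    def __init__(self, table):
        self.table = table

class Filter:
    def __init__(self, pred, child):
        self.pred, self.child = pred, child

class Project:
    def __init__(self, cols, child):
        self.cols, self.child = cols, child

def push_down_filter(plan):
    """Predicate pushdown: rewrite Filter(Project(x)) as
    Project(Filter(x)) so the predicate runs closer to the scan."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(proj.cols, Filter(plan.pred, proj.child))
    return plan
```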
The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related issues have been pouring into social platforms such as Twitter. Many public officials and governments use Twitter to make policy announcements, while people keep close track of the related information and express their concerns about the policies on Twitter. It is beneficial yet challenging to derive important information or knowledge from such Twitter data. In this paper, we propose a Tripartite Graph Clustering for Pandemic Data Analysis (TGC-PDA) framework that builds on three components: (1) tripartite graph representation, (2) non-negative matrix factorization with regularization, and (3) sentiment analysis. We collect tweets containing a set of keywords related to the coronavirus pandemic as ground truth data. Our framework can detect communities of Twitter users and analyze the topics discussed within them. Extensive experiments show that the TGC-PDA framework can effectively and efficiently identify the topics and correlations within Twitter data for monitoring and understanding public opinion, providing policy makers with useful information and statistics for decision making.
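The second component, non-negative matrix factorization with regularization, can be sketched with standard multiplicative updates. This is a generic L2-regularized NMF, not the authors' exact objective; the regularization weight `lam` is a hypothetical parameter.

```python
import numpy as np

def nmf(X, k, iters=200, lam=0.01, seed=0):
    """Sketch of NMF with L2 regularization via multiplicative updates:
    factor a non-negative matrix X (m x n) into W (m x k) and H (k x n),
    both non-negative, shrinking the factors by `lam` each step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-9                       # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + lam * H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + lam * W + eps)
    return W, H
```

In a tripartite setting, factorizations like this one link users, tweets, and keywords through shared latent factors, which is what enables joint community and topic detection.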