{"title":"IoTDQ: An Industrial IoT Data Analysis Library for Apache IoTDB","authors":"Pengyu Chen;Wendi He;Wenxuan Ma;Xiangdong Huang;Chen Wang","doi":"10.26599/BDMA.2023.9020010","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020010","url":null,"abstract":"There is a growing demand for time series data analysis in industry areas. Apache IoTDB is a time series database designed for the Internet of Things (IoT) with enhanced storage and I/O performance. With User-Defined Functions (UDF) provided, computation for time series can be executed on Apache IoTDB directly. To satisfy most of the common requirements in industrial time series analysis, we create a UDF library, IoTDQ, on Apache IoTDB. This library integrates stream computation functions on data quality analysis, data profiling, anomaly detection, data repairing, etc. IoTDQ enables users to conduct a wide range of analyses, such as monitoring, error diagnosis, equipment reliability analysis. It provides a framework for users to examine IoT time series with data quality problems. Experiments show that IoTDQ keeps the same level of performance compared to mainstream alternatives, and shortens I/O consumption for Apache IoTDB users.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"29-41"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10372952","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Call for Papers: Special Issue on Challenges and Opportunities in Biomedical Big Data Analysis: From Large Language Models to Clinical Applications","authors":"","doi":"10.26599/BDMA.2023.9020026","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020026","url":null,"abstract":"","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"244-244"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10372958","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Molecular Generation and Optimization of Molecular Properties Using a Transformer Model","authors":"Zhongyin Xu;Xiujuan Lei;Mei Ma;Yi Pan","doi":"10.26599/BDMA.2023.9020009","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020009","url":null,"abstract":"Generating novel molecules to satisfy specific properties is a challenging task in modern drug discovery, which requires the optimization of a specific objective based on satisfying chemical rules. Herein, we aim to optimize the properties of a specific molecule to satisfy the specific properties of the generated molecule. The Matched Molecular Pairs (MMPs), which contain the source and target molecules, are used herein, and logD and solubility are selected as the optimization properties. The main innovative work lies in the calculation related to a specific transformer from the perspective of a matrix dimension. Threshold intervals and state changes are then used to encode logD and solubility for subsequent tests. During the experiments, we screen the data based on the proportion of heavy atoms to all atoms in the groups and select 12 365, 1503, and 1570 MMPs as the training, validation, and test sets, respectively. Transformer models are compared with the baseline models with respect to their abilities to generate molecules with specific properties. Results show that the transformer model can accurately optimize the source molecules to satisfy specific properties.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"142-155"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10373001","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Data Stream Classification with Adaptive Multi-Task Multi-View Learning","authors":"Jun Wang;Maiwang Shi;Xiao Zhang;Yan Li;Yunsheng Yuan;Chenglei Yang;Dongxiao Yu","doi":"10.26599/BDMA.2023.9020006","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020006","url":null,"abstract":"With the enhancement of data collection capabilities, massive streaming data have been accumulated in numerous application scenarios. Specifically, the issue of classifying data streams based on mobile sensors can be formalized as a multi-task multi-view learning problem with a specific task comprising multiple views with shared features collected from multiple sensors. Existing incremental learning methods are often single-task single-view, which cannot learn shared representations between relevant tasks and views. An adaptive multi-task multi-view incremental learning framework for data stream classification called MTMVIS is proposed to address the above challenges, utilizing the idea of multi-task multi-view learning. Specifically, the attention mechanism is first used to align different sensor data of different views. In addition, MTMVIS uses adaptive Fisher regularization from the perspective of multi-task multi-view learning to overcome catastrophic forgetting in incremental learning. Results reveal that the proposed framework outperforms state-of-the-art methods based on the experiments on two different datasets with other baselines.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"87-106"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10373002","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminatively Constrained Semi-Supervised Multi-View Nonnegative Matrix Factorization with Graph Regularization","authors":"Guosheng Cui;Ye Li;Jianzhong Li;Jianping Fan","doi":"10.26599/BDMA.2023.9020004","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020004","url":null,"abstract":"Nonnegative Matrix Factorization (NMF) is one of the most popular feature learning technologies in the field of machine learning and pattern recognition. It has been widely used and studied in the multi-view clustering tasks because of its effectiveness. This study proposes a general semi-supervised multi-view nonnegative matrix factorization algorithm. This algorithm incorporates discriminative and geometric information on data to learn a better-fused representation, and adopts a feature normalizing strategy to align the different views. Two specific implementations of this algorithm are developed to validate the effectiveness of the proposed framework: Graph regularization based Discriminatively Constrained Multi-View Nonnegative Matrix Factorization (GDCMVNMF) and Extended Multi-View Constrained Nonnegative Matrix Factorization (ExMVCNMF). The intrinsic connection between these two specific implementations is discussed, and the optimization based on multiply update rules is presented. Experiments on six datasets show that the effectiveness of GDCMVNMF and ExMVCNMF outperforms several representative unsupervised and semi-supervised multi-view NMF approaches.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"55-74"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10372950","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QAR Data Imputation Using Generative Adversarial Network with Self-Attention Mechanism","authors":"Jingqi Zhao;Chuitian Rong;Xin Dang;Huabo Sun","doi":"10.26599/BDMA.2023.9020001","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020001","url":null,"abstract":"Quick Access Recorder (QAR), an important device for storing data from various flight parameters, contains a large amount of valuable data and comprehensively records the real state of the airline flight. However, the recorded data have certain missing values due to factors, such as weather and equipment anomalies. These missing values seriously affect the analysis of QAR data by aeronautical engineers, such as airline flight scenario reproduction and airline flight safety status assessment. Therefore, imputing missing values in the QAR data, which can further guarantee the flight safety of airlines, is crucial. QAR data also have multivariate, multiprocess, and temporal features. Therefore, we innovatively propose the imputation models A-AEGAN (“A” denotes attention mechanism, “AE” denotes autoencoder, and “GAN” denotes generative adversarial network) and SA-AEGAN (“SA” denotes self-attentive mechanism) for missing values of QAR data, which can be effectively applied to QAR data. Specifically, we apply an innovative generative adversarial network to impute missing values from QAR data. The improved gated recurrent unit is then introduced as the neural unit of GAN, which can successfully capture the temporal relationships in QAR data. In addition, we modify the basic structure of GAN by using an autoencoder as the generator and a recurrent neural network as the discriminator. The missing values in the QAR data are imputed by using the adversarial relationship between generator and discriminator. We introduce an attention mechanism in the autoencoder to further improve the capability of the proposed model to capture the features of QAR data. Attention mechanisms can maintain the correlation among QAR data and improve the capability of the model to impute missing data. Furthermore, we improve the proposed model by integrating a self-attention mechanism to further capture the relationship between different parameters within the QAR data. Experimental results on real datasets demonstrate that the model can reasonably impute the missing values in QAR data with excellent results.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"12-28"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10372953","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Smart Meter Data Encryption Scheme Based on Distributed Differential Privacy","authors":"Renwu Yan;Yang Zheng;Ning Yu;Cen Liang","doi":"10.26599/BDMA.2023.9020008","DOIUrl":"https://doi.org/10.26599/BDMA.2023.9020008","url":null,"abstract":"Under the general trend of the rapid development of smart grids, data security and privacy are facing serious challenges; protecting the privacy data of single users under the premise of obtaining user-aggregated data has attracted widespread attention. In this study, we propose an encryption scheme on the basis of differential privacy for the problem of user privacy leakage when aggregating data from multiple smart meters. First, we use an improved homomorphic encryption method to realize the encryption aggregation of users' data. Second, we propose a double-blind noise addition protocol to generate distributed noise through interaction between users and a cloud platform to prevent semi-honest participants from stealing data by colluding with one another. Finally, the simulation results show that the proposed scheme can encrypt the transmission of multi-intelligent meter data under the premise of satisfying the differential privacy mechanism. Even if an attacker has enough background knowledge, the security of the electricity information of one another can be ensured.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"7 1","pages":"131-141"},"PeriodicalIF":0.0,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10372998","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diagnosis and Detection of Alzheimer's Disease Using Learning Algorithm","authors":"G. Shukla, Santosh Kumar, S. Pandey, Rohit Agarwal, Neeraj Varshney, Ankit Kumar","doi":"10.26599/bdma.2022.9020049","DOIUrl":"https://doi.org/10.26599/bdma.2022.9020049","url":null,"abstract":"","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"1 1","pages":""},"PeriodicalIF":13.6,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"69029476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data","authors":"Ankit Kumar;Neeraj Varshney;Surbhi Bhatiya;Kamred Udham Singh","doi":"10.26599/BDMA.2022.9020026","DOIUrl":"10.26599/BDMA.2022.9020026","url":null,"abstract":"We live in an age where everything around us is being created. Data generation rates are so scary, creating pressure to implement costly and straightforward data storage and recovery processes. MapReduce model functionality is used for creating a cluster parallel, distributed algorithm, and large datasets. The MapReduce strategy from Hadoop helps develop a community of non-commercial use to offer a new algorithm for resolving such problems for commercial applications as expected from this working algorithm with insights as a result of disproportionate or discriminatory Hadoop cluster results. Expected results are obtained in the work and the exam conducted under this job; many of them are scheduled to set schedules, match matrices' data positions, clustering before determining to click, and accurate mapping and internal reliability to be closed together to avoid running and execution times. Mapper output and proponents have been implemented, and the map has been used to reduce the function. The execution input key/value pair and output key/value pair have been set. This paper focuses on evaluating this technique for the efficient retrieval of large volumes of data. The technique allows for capabilities to inform a massive database of information, from storage and indexing techniques to the distribution of queries, scalability, and performance in heterogeneous environments. The results show that the proposed work reduces the data processing time by 30%.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"465-477"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233249.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49356278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Clinical Data Analysis Based Diagnostic Systems for Heart Disease Prediction Using Ensemble Method","authors":"Ankit Kumar;Kamred Udham Singh;Manish Kumar","doi":"10.26599/BDMA.2022.9020052","DOIUrl":"10.26599/BDMA.2022.9020052","url":null,"abstract":"The correct diagnosis of heart disease can save lives, while the incorrect diagnosis can be lethal. The UCI machine learning heart disease dataset compares the results and analyses of various machine learning approaches, including deep learning. We used a dataset with 13 primary characteristics to carry out the research. Support vector machine and logistic regression algorithms are used to process the datasets, and the latter displays the highest accuracy in predicting coronary disease. Python programming is used to process the datasets. Multiple research initiatives have used machine learning to speed up the healthcare sector. We also used conventional machine learning approaches in our investigation to uncover the links between the numerous features available in the dataset and then used them effectively in anticipation of heart infection risks. Using the accuracy and confusion matrix has resulted in some favorable outcomes. To get the best results, the dataset contains certain unnecessary features that are dealt with using isolation logistic regression and Support Vector Machine (SVM) classification.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 4","pages":"513-525"},"PeriodicalIF":13.6,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10233239/10233243.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42487577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}