Title: De-occlusion and recognition of frontal face images: a comparative study of multiple imputation methods
Authors: Joseph Agyapong Mensah, Ezekiel N. N. Nortey, Eric Ocran, Samuel Iddi, Louis Asiedu
Journal of Big Data, published 2024-04-29. DOI: 10.1186/s40537-024-00925-6

Abstract: Automatic face recognition algorithms have become increasingly necessary with the development and extensive use of face recognition technology, particularly in the era of machine learning and artificial intelligence. However, unconstrained environmental conditions degrade the quality of acquired face images and can deteriorate the performance of many classical face recognition algorithms. Against this backdrop, many researchers have given considerable attention to image restoration and enhancement mechanisms, but with minimal focus on occlusion-related and multiple-constraint problems. Although occlusion-robust face recognition via sparse representation has been explored, it requires a large number of features to achieve correct computations and to maximize robustness to occlusion, so it may fall short under random occlusions of even moderate magnitude. This study assesses the robustness of a face recognition module combining Principal Component Analysis and Singular Value Decomposition, with Discrete Wavelet Transformation for preprocessing and city-block distance for classification (DWT-PCA/SVD-L1), against image degradation from random occlusions of varying magnitudes (10% and 20%) in test images acquired under varying expressions. Numerical evaluation showed that using de-occluded faces for recognition significantly enhanced the module's performance at each occlusion level (10% and 20%). The algorithm attained its highest recognition rates, 85.94% at 10% occlusion and 78.65% at 20% occlusion, when the MICE de-occluded face images were used for recognition. Except for entropy, where the MICE de-occluded images attained the highest average value, MICE and RegEM produced images of similar quality as measured by absolute mean brightness error (AMBE) and peak signal-to-noise ratio (PSNR). The study therefore recommends MICE as a suitable imputation mechanism for de-occlusion of face images acquired under varying expressions.

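The MICE mechanism recommended above treats occluded pixels as missing values and fills them by chained regressions on the observed pixels. A minimal single-imputation sketch of that idea (an illustration under simplifying assumptions, not the authors' implementation; the toy data and function name are invented):

```python
import numpy as np

def mice_impute(X, n_iter=10):
    """MICE-style imputation: initialize missing entries with column
    means, then iteratively regress each incomplete column on the
    other columns and refresh the imputed values."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = miss[:, j]
            if not rows.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            # fit on rows where column j was actually observed
            beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ beta
    return X
```

For face de-occlusion, each row would be a vectorized face image and occluded pixel positions would be set to NaN before imputation.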
Title: Profitability trend prediction in crypto financial markets using Fibonacci technical indicator and hybrid CNN model
Authors: Bilal Hassan Ahmed Khattak, Imran Shafi, Chaudhary Hamza Rashid, Mejdl Safran, Sultan Alfarhood, Imran Ashraf
Journal of Big Data, published 2024-04-28. DOI: 10.1186/s40537-024-00908-7

Abstract: Cryptocurrency has become a popular trading asset due to its security, anonymity, and decentralization. However, predicting the direction of the financial market can be challenging, leading to difficult financial decisions and potential losses. This study examines the impact of the Fibonacci technical indicator (TI) and of multi-class classification based on trend direction and price strength (trend-strength) on the performance and profitability of artificial intelligence (AI) models, particularly a hybrid convolutional neural network (CNN) incorporating long short-term memory (LSTM), and modifies that model to reduce its complexity. The main contributions are the introduction of the Fibonacci TI and a demonstration of its impact on financial prediction, and the incorporation of a multi-class classification technique focused on trend strength, enhancing the depth and accuracy of predictions; a profitability analysis then quantifies the tangible benefits of both. The profitability analysis rests on a hybrid investment strategy (direction and strength) implemented through a six-stage predictive system: data collection, preprocessing, sampling, training and prediction, investment simulation, and evaluation. Empirical findings show that the Fibonacci TI improved the performance (in 44% of configurations) and profitability (in 68% of configurations) of the AI models. Hybrid CNNs showed the largest performance improvements, particularly the C-LSTM model for trend (binary: 0.0023) and trend-strength (4-class: 0.0020; 6-class: 0.0099) prediction, and the modified C-LSTM improved both profitability and performance. Trend-strength prediction showed the greatest improvements in long-strategy ROI (6.89%) and in average ROI for the long-short strategy. Among the hybrid CNNs, the modified C-LSTM is a viable option for 4-class and 6-class trend-strength prediction due to its better performance and profitability.

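The abstract does not specify exactly how the Fibonacci TI is constructed; the most common form is the set of retracement levels of a recent price swing, which can then be fed to a model as features. A minimal sketch under that assumption (the function name is illustrative):

```python
def fibonacci_retracement(swing_high, swing_low):
    """Price levels at the standard Fibonacci retracement ratios of
    the swing from swing_low up to swing_high."""
    ratios = (0.0, 0.236, 0.382, 0.5, 0.618, 0.786, 1.0)
    span = swing_high - swing_low
    return {r: swing_high - r * span for r in ratios}
```

In a pipeline like the one described, these levels would be computed over a rolling window and appended to the price features used to train the hybrid CNN.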
Title: Big data resolving using Apache Spark for load forecasting and demand response in smart grid: a case study of Low Carbon London Project
Authors: Hussien Ali El-Sayed Ali, M. H. Alham, Doaa Khalil Ibrahim
Journal of Big Data, published 2024-04-28. DOI: 10.1186/s40537-024-00909-6

Abstract: Recent information and communication technologies for monitoring and management are driving a revolution in the smart grid. These technologies generate massive data that can only be processed with big data tools. This paper emphasizes the role of big data in load forecasting, renewable energy source integration, and demand response as significant aspects of smart grids. Smart meter data from the Low Carbon London Project is investigated as a case study. Because of the immense stream of meter readings and the exogenous data added to load forecasting models, the problem is addressed in a big data context. Descriptive analytics are developed using Spark SQL to gain insights into household energy consumption. Spark MLlib is used for predictive analytics, building scalable machine learning models that accommodate streams of meter data. Multivariate polynomial regression and decision tree models are preferred here from a big data point of view and because the literature confirms they are accurate and interpretable. The results confirm that descriptive analytics and data visualization provide valuable insights, guide the feature selection process, and enhance the accuracy of load forecasting models. Accordingly, the achieved load forecasting results support proper evaluation of demand response programs and integration of renewable energy resources.

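The multivariate polynomial regression the authors fit with Spark MLlib can be illustrated on a single machine. A hedged numpy stand-in (the choice of temperature and hour-of-day as exogenous drivers is illustrative, not taken from the paper):

```python
import numpy as np

def poly_features(temps, hours, degree=2):
    """Polynomial feature matrix for two exogenous drivers of load."""
    cols = [np.ones_like(temps)]
    for d in range(1, degree + 1):
        cols += [temps ** d, hours ** d]
    return np.column_stack(cols)

def fit_load_model(temps, hours, loads, degree=2):
    """Ordinary least squares fit of load on polynomial features
    (a single-machine stand-in for the distributed MLlib pipeline)."""
    coef, *_ = np.linalg.lstsq(poly_features(temps, hours, degree),
                               loads, rcond=None)
    return coef

def predict_load(coef, temps, hours, degree=2):
    return poly_features(temps, hours, degree) @ coef
```

At Low Carbon London scale, the same model form would be expressed as a Spark ML pipeline so that fitting distributes over the cluster.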
Title: Green and sustainable AI research: an integrated thematic and topic modeling analysis
Authors: Raghu Raman, Debidutta Pattnaik, Hiran H. Lathabai, Chandan Kumar, Kannan Govindan, Prema Nedungadi
Journal of Big Data, published 2024-04-22. DOI: 10.1186/s40537-024-00920-x

Abstract: This investigation examines the Green AI and Sustainable AI literature through a dual-analytical approach, combining thematic analysis with BERTopic modeling to reveal both broad thematic clusters and nuanced emerging topics. It identifies three major thematic clusters: (1) Responsible AI for Sustainable Development, focusing on integrating sustainability and ethics within AI technologies; (2) Advancements in Green AI for Energy Optimization, centering on energy efficiency; and (3) Big Data-Driven Computational Advances, emphasizing AI's influence on socio-economic and environmental aspects. Concurrently, BERTopic modeling uncovers five emerging topics: Ethical Eco-Intelligence, Sustainable Neural Computing, Ethical Healthcare Intelligence, AI Learning Quest, and Cognitive AI Innovation, indicating a trend toward embedding ethical and sustainability considerations into AI research. The study reveals novel intersections between Sustainable and Ethical AI and Green Computing, and identifies Ethical Healthcare Intelligence and AI Learning Quest as evolving areas within AI's socio-economic and societal impacts. It advocates a unified approach to AI innovation that promotes environmental sustainability and ethical integrity, in line with the Sustainable Development Goals and their emphasis on ecological balance, societal welfare, and responsible innovation. This focus underscores the critical need to integrate ethical and environmental considerations into the AI development lifecycle, offering insights for future research directions and policy interventions.

Title: An improved deep hashing model for image retrieval with binary code similarities
Authors: Huawen Liu, Zongda Wu, Minghao Yin, Donghua Yu, Xinzhong Zhu, Jungang Lou
Journal of Big Data, published 2024-04-18. DOI: 10.1186/s40537-024-00919-4

Abstract: The exponential growth of data raises an unprecedented challenge in data analysis: how to retrieve interesting information from such large-scale data. Hash learning is a promising solution, because projecting high-dimensional data to compact binary codes brings potential advantages such as extremely high efficiency and low storage cost. However, traditional hash learning algorithms often suffer from semantic inconsistency: images with similar semantic features may receive different binary codes. This paper proposes a novel end-to-end deep hashing method based on the similarities of binary codes, dubbed CSDH (Code Similarity-based Deep Hashing), for image retrieval. It extracts deep features from images with a pre-trained deep convolutional neural network to capture semantic information, and attaches a hidden, fully connected layer at the end of the network to derive hash bits through an activation function. To preserve semantic consistency, a loss function is introduced that takes both label similarities and Hamming embedding distances into account. In this way, CSDH learns compact and powerful hash codes that preserve semantic similarity and keep Hamming distances between similar images small. CSDH is evaluated on two public benchmark image collections, CIFAR-10 and NUS-WIDE, against five classic shallow hashing models and six popular deep hashing ones. The experimental results show that CSDH achieves performance competitive with the popular deep hashing algorithms.

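The retrieval mechanics CSDH relies on, binarizing real-valued network outputs into hash bits and ranking candidates by Hamming distance, can be sketched independently of the network itself (a toy illustration, not the paper's code):

```python
import numpy as np

def binarize(real_outputs):
    """Sign-threshold real-valued network outputs into +/-1 hash bits."""
    return np.where(np.asarray(real_outputs) >= 0, 1, -1)

def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def rank_by_hamming(query_code, database_codes):
    """Indices of database codes, nearest first in Hamming distance."""
    dists = [hamming(query_code, c) for c in database_codes]
    return sorted(range(len(dists)), key=dists.__getitem__)
```

The loss the paper describes pushes semantically similar images toward codes with distance 0 here, so that this cheap integer comparison stands in for expensive float similarity search.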
Title: Revisiting the potential value of vital signs in the real-time prediction of mortality risk in intensive care unit patients
Authors: Pan Pan, Yue Wang, Chang Liu, Yanhui Tu, Haibo Cheng, Qingyun Yang, Fei Xie, Yuan Li, Lixin Xie, Yuhong Liu
Journal of Big Data, published 2024-04-18. DOI: 10.1186/s40537-024-00896-8

Abstract:
Background: Predicting mortality risk facilitates early intervention in intensive care unit (ICU) patients at greater risk of disease progression. This study applies machine learning methods to multidimensional clinical data to dynamically predict mortality risk in ICU patients.
Methods: A total of 33,798 patients from the MIMIC-III database were included. An integrated model, NIMRF (Network Integrating Memory Module and Random Forest), based on multidimensional variables such as vital signs and laboratory values, was developed to predict the risk of death for ICU patients in four non-overlapping time windows: 0–1 h, 1–3 h, 3–6 h, and 6–12 h. The model was externally validated on data from 889 patients in the respiratory critical care unit of the Chinese PLA General Hospital and compared with LSTM, random forest, and time-dependent Cox regression (survival analysis) models. We also interpret the developed model to identify the most important predictors of mortality risk across time windows. The code is available at https://github.com/wyuexiao/NIMRF.
Results: The NIMRF model can predict the risk of death in the four non-overlapping windows (0–1 h, 1–3 h, 3–6 h, 6–12 h) after any time point in ICU patients. In internal validation it was more accurate than the LSTM, random forest, and time-dependent Cox regression models (area under the receiver operating characteristic curve, AUC: 0–1 h, 0.8015 [95% CI 0.7725–0.8304] vs. 0.7144 [95% CI 0.6824–0.7464] vs. 0.7606 [95% CI 0.7300–0.7913] vs. 0.3867 [95% CI 0.3573–0.4161]; 1–3 h, 0.7100 [95% CI 0.6777–0.7423] vs. 0.6389 [95% CI 0.6055–0.6723] vs. 0.6992 [95% CI 0.6667–0.7318] vs. 0.3854 [95% CI 0.3559–0.4150]; 3–6 h, 0.6760 [95% CI 0.6425–0.7097] vs. 0.5964 [95% CI 0.5622–0.6306] vs. 0.6760 [95% CI 0.6427–0.7099] vs. 0.3967 [95% CI 0.3662–0.4271]; 6–12 h, 0.6380 [95% CI 0.6031–0.6729] vs. 0.6032 [95% CI 0.5705–0.6406] vs. 0.6055 [95% CI 0.5682–0.6383] vs. 0.4023 [95% CI 0.3709–0.4337]). In external validation on the Chinese PLA General Hospital data, NIMRF remained the best model, with an AUC of 0.9366 [95% CI 0.9157–0.9575] for predicting death risk in 0–1 h; the corresponding AUCs of the LSTM, random forest, and time-dependent Cox regression models were 0.9263 [95% CI 0.9039–0.9486], 0.7437 [95% CI 0.7083–0.7791], and 0.2447 [95% CI 0.2202–0.2692], respectively. Interpretation of the model revealed that vital signs (systolic blood pressure, heart rate, diastolic blood pressure, respiratory rate, and body temperature) were highly correlated with death events.

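The four prediction windows in the Methods imply one binary label per window for each prediction time. A sketch of one plausible labeling scheme (the abstract does not publish this detail; see the linked repository for the authors' actual code):

```python
WINDOWS = [(0, 1), (1, 3), (3, 6), (6, 12)]  # hours after prediction time

def window_labels(pred_time_h, death_time_h):
    """One binary label per window: does death fall inside that window,
    measured from the prediction time?  death_time_h=None for survivors."""
    labels = []
    for start, end in WINDOWS:
        if death_time_h is None:
            labels.append(0)
        else:
            dt = death_time_h - pred_time_h
            labels.append(1 if start <= dt < end else 0)
    return labels
```

Because the windows are non-overlapping, at most one label is 1 for any prediction time, which matches the per-window AUCs reported in the Results.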
Title: Enhancing academic performance prediction with temporal graph networks for massive open online courses
Authors: Qionghao Huang, Jili Chen
Journal of Big Data, published 2024-04-13. DOI: 10.1186/s40537-024-00918-5

Abstract: Educational big data significantly impacts education, and Massive Open Online Courses (MOOCs), a crucial learning approach, have become more intelligent with these technologies. Deep neural networks have significantly advanced a crucial task within MOOCs: predicting student academic performance. However, most deep learning-based methods ignore the temporal information and interaction behaviors in learning activities, which could effectively enhance predictive accuracy. To address this, we formulate the learning processes of e-learning students as dynamic temporal graphs that encode the temporal information and interaction behaviors of their studying, and propose a novel academic performance prediction model (APP-TGN) based on temporal graph neural networks. In APP-TGN, a dynamic graph is constructed from online learning activity logs, and a temporal graph network with low-high filters learns the potential academic performance variations encoded in the dynamic graphs. A global sampling module mitigates the problem of false correlations in deep learning-based models, and multi-head attention is used to predict academic outcomes. Extensive experiments on a well-known public dataset indicate that APP-TGN significantly surpasses existing methods and shows excellent potential for automated feedback and personalized learning.

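The dynamic-graph construction step, turning activity logs into a time-ordered edge stream, is the part of APP-TGN that can be sketched without the neural components (the log format and names below are illustrative assumptions):

```python
from collections import defaultdict

def build_temporal_graph(logs):
    """Turn (student, resource, timestamp) activity logs into a
    time-ordered edge stream plus per-student adjacency, the usual
    input form for a temporal graph network."""
    edges = sorted(logs, key=lambda e: e[2])  # chronological order
    neighbors = defaultdict(list)
    for src, dst, t in edges:
        neighbors[src].append((dst, t))
    return edges, neighbors
```

A temporal GNN then consumes `edges` in order, updating each student node's memory as interactions arrive, rather than seeing one static graph.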
Title: The differences in gastric cancer epidemiological data between SEER and GBD: a joinpoint and age-period-cohort analysis
Authors: Zenghong Wu, Kun Zhang, Weijun Wang, Mengke Fan, Rong Lin
Journal of Big Data, published 2024-04-13. DOI: 10.1186/s40537-024-00907-8

Abstract:
Background: The worldwide burden of gastric cancer (GC) needs further clarification to help us understand the current situation of the disease.
Methods: We estimated disability-adjusted life-years (DALYs) and mortality rates attributable to several major GC risk factors, including smoking, dietary risk, and behavioral risk. In addition, we evaluated the incidence rate and trends of incidence-based mortality (IBM) due to GC in the United States (US) during 1992–2018.
Results: Globally, GC incidence increased from 883,395 cases in 1990 to 1,269,805 in 2019, while GC-associated mortality increased from 788,316 deaths in 1990 to 957,185 in 2019. In 2019, the age-standardized rate (ASR) of GC varied around the world, with Mongolia having the highest observed ASR (43.7 per 100,000), followed by Bolivia (34 per 100,000) and China (30.6 per 100,000). A negative association was found between the estimated annual percentage change (EAPC) and the ASR (age-standardized incidence rate (ASIR): r = −0.28, p < 0.001; age-standardized death rate (ASDR): r = −0.19, p = 0.005). In the US, 74,966 GC cases and 69,374 GC-related deaths were recorded between 1992 and 2018. A significant decrease in GC incidence and a decreasing trend in GC IBM were first detected in 1994: IBM increased significantly at a rate of 35%/year from 1992 to 1994 (95% CI 21.2% to 50.4%/year) and then decreased at −1.4%/year from 1994 to 2018 (95% CI −1.6% to −1.2%/year).
Conclusion: These findings mirror the global disease burden of GC and are important for developing targeted prevention strategies.

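The annual-percent-change figures quoted above come from the standard log-linear model used in joinpoint analysis: fit ln(rate) = a + b * year by least squares within a segment, then report 100 * (exp(b) - 1) as the percent change per year. A self-contained sketch:

```python
import math

def eapc(years, rates):
    """Estimated annual percentage change: least-squares fit of
    ln(rate) = a + b*year, returning 100 * (exp(b) - 1)."""
    n = len(years)
    ys = [math.log(r) for r in rates]
    mx = sum(years) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(years, ys))
         / sum((x - mx) ** 2 for x in years))
    return 100.0 * (math.exp(b) - 1.0)
```

Joinpoint software additionally searches for the change-points (such as 1994 above) that split the series into segments before fitting each segment this way.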
Title: DAPS diagrams for defining Data Science projects
Authors: Jeroen de Mast, Joran Lokkerbol
Journal of Big Data, published 2024-04-12. DOI: 10.1186/s40537-024-00916-7

Abstract:
Background: Models for structuring big-data and data-analytics projects typically start with a definition of the project's goals and the business value they are expected to create. The literature identifies proper project definition as crucial to a project's success, and also recognizes that translating business objectives into data-analytic problems is difficult. Unfortunately, common project structures such as CRISP-DM provide little guidance for this crucial stage compared with subsequent stages such as data preparation and modeling.
Contribution: This paper adds structure to the project-definition stage of data-analytic projects by proposing the Data-Analytic Problem Structure (DAPS). The diagrammatic technique facilitates the collaborative development of a consistent and precise definition of a data-analytic problem, and the articulation of how it contributes to the organization's goals. The technique also helps identify important assumptions and break down large ambitions into manageable subprojects.
Methods: The semi-formal specification technique took other models for problem structuring, common in fields such as operations research and business analytics, as a point of departure. Following a design-science approach, the proposed technique was applied in 47 real data-analytic projects and refined based on the results.

Title: B-CAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows
Authors: Muhammad Aidiel Rachman Putra, Tohari Ahmad, Dandy Pramana Hostiadi
Journal of Big Data, published 2024-04-10. DOI: 10.1186/s40537-024-00900-1

Abstract: Threats on computer networks have been increasing rapidly, and irresponsible parties are always trying to exploit network vulnerabilities to do various dangerous things. One way to exploit vulnerabilities is to employ malware. Botnets are a type of malware that infects and attacks targets in groups, and they develop quickly: attacks that were initially sporadic have become periodic and simultaneous. This rapid development shows that botnets are advanced and require attention and proper handling. Many studies have introduced models for detecting botnet attack activity on computer networks, and beyond detecting the presence of attacks, they have explored botnet characteristics such as attack intensity, relationships between activities, and time segments. However, no prior research detects those characteristics explicitly, even though each characteristic requires different handling, and recognizing them helps network administrators make appropriate decisions. For these reasons, this research builds a detection model that recognizes botnet characteristics using sequential traffic mining and similarity analysis. The proposed method consists of two main processes: training, which builds a knowledge base, and testing, which detects botnet activity and attack characteristics. It uses dynamic thresholds to improve the model's sensitivity in recognizing attack characteristics through similarity analysis. The novelty lies in developing and combining sequential traffic mining, similarity analysis, and dynamic thresholds to explicitly detect and recognize the characteristics of botnet attacks from actual behavior in network traffic. Extensive experiments on three different datasets show better performance than other approaches.

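The similarity analysis with a dynamic threshold can be sketched in isolation. This toy version flags a flow when its best match against knowledge-base attack patterns stands out from the overall similarity distribution; the mean-plus-k-standard-deviations rule is an illustrative assumption, not B-CAT's published formula:

```python
import math

def cosine(u, v):
    """Cosine similarity between two traffic-flow feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def flag_botnet_like(flow, knowledge_base, k=1.0):
    """Flag a flow if its best similarity to a known attack pattern
    exceeds a data-driven threshold (mean + k*std of all similarities)
    rather than a fixed cutoff."""
    sims = [cosine(flow, pattern) for pattern in knowledge_base]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    return max(sims) >= mean + k * std  # dynamic, not fixed, threshold
```

Adapting the threshold to each flow's similarity distribution is one way to keep sensitivity stable as attack behavior drifts, which is the motivation the abstract gives for dynamic thresholds.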