{"title":"Detection of anomalies in the HDFS dataset","authors":"Marwa Chnib, Wafa Gabsi","doi":"10.1109/SERA57763.2023.10197797","DOIUrl":null,"url":null,"abstract":"Big data systems are stable enough to store and process large volumes of quickly changing data. However, these systems are composed of massive hardware resources, which can easily cause their subcomponents to fail. Fault tolerance is a key attribute of such systems as they maintain availability, reliability and constant performance during failures. Implementing efficient fault-tolerant solutions in big data presents a challenge because fault tolerance has to satisfy some constraints related to system performance and resource consumption. To protect online computer systems from malicious attacks or malfunctions, log anomaly detection is crucial. This paper provides a new approach to identify anomalous log sequences in the HDFS (Hadoop Distributed File System) log dataset using three algorithms: Logbert, DeepLog and LOF. Then, it assess performance of all algorithms in terms of accuracy, recall, and F1-score.","PeriodicalId":211080,"journal":{"name":"2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SERA57763.2023.10197797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Big data systems are stable enough to store and process large volumes of rapidly changing data. However, these systems rely on massive hardware resources, whose subcomponents can easily fail. Fault tolerance is a key attribute of such systems, as it maintains availability, reliability, and consistent performance during failures. Implementing efficient fault-tolerant solutions for big data is challenging because fault tolerance must satisfy constraints related to system performance and resource consumption. Log anomaly detection is crucial for protecting online computer systems from malicious attacks and malfunctions. This paper proposes a new approach to identifying anomalous log sequences in the HDFS (Hadoop Distributed File System) log dataset using three algorithms: LogBERT, DeepLog, and LOF (Local Outlier Factor). It then assesses the performance of all three algorithms in terms of accuracy, recall, and F1-score.
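As a rough illustration of the LOF-based part of such a pipeline (not the paper's exact implementation), the sketch below uses scikit-learn's LocalOutlierFactor to score per-block event-count vectors and then reports detection metrics. The feature extraction step is assumed to have already turned raw HDFS logs into fixed-length count vectors (e.g., via a log parser such as Drain); here those vectors are replaced by synthetic placeholders, and the number of event templates (29) and all thresholds are illustrative assumptions.

```python
# Minimal sketch, assuming scikit-learn and synthetic placeholder features.
# Rows represent HDFS log blocks; columns are counts of log-event templates.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)

# Placeholder data: in a real run these would be event-count vectors parsed
# from the HDFS log dataset, with labels taken from its anomaly annotations.
X_train = rng.poisson(lam=3.0, size=(1000, 29))        # assumed-normal blocks
X_test_normal = rng.poisson(lam=3.0, size=(200, 29))
X_test_anom = rng.poisson(lam=8.0, size=(20, 29))      # synthetic anomalies
X_test = np.vstack([X_test_normal, X_test_anom])
y_test = np.concatenate([np.zeros(200), np.ones(20)])  # 1 = anomalous block

# novelty=True lets LOF fit on (assumed) normal data and score unseen blocks.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=0.05)
lof.fit(X_train)

# predict() returns +1 for inliers and -1 for outliers; map to 0/1 labels.
y_pred = (lof.predict(X_test) == -1).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same labels and helper (precision_recall_fscore_support) can be reused to compare LOF against sequence-model detectors such as DeepLog or LogBERT, since all three ultimately produce a binary normal/anomalous decision per log block.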