A Topic Modeling for ALICE'S Log Messages using Latent Dirichlet Allocation
Pattapon Prayurahong, P. Phunchongharn, V. C. Barroso
2022 IEEE 5th International Conference on Knowledge Innovation and Invention (ICKII), published 2022-07-22
DOI: 10.1109/ICKII55100.2022.9983522 (https://doi.org/10.1109/ICKII55100.2022.9983522)
Abstract
Modern software systems can generate a massive volume of log messages every second. Like other data, logs can provide in-depth insight into a system, given enough time and resources. However, not every system keeps its logs organized, and unorganized logs are messy and difficult to navigate. Organizing log messages poses several challenges. The volume of generated log data is too large to be handled by human labor alone. Log messages are also not ordinary human communication: thoroughly understanding their content requires assistance from specialists in the particular system. These problems are widespread, and high-performance computing systems such as those used in the ALICE experiment at CERN are no exception. In this paper, we propose a topic modeling approach for ALICE’s log messages based on the Latent Dirichlet Allocation (LDA) algorithm. The objective is to convert the messy log messages into categorized ones. We preprocessed the log messages using a bag-of-words representation, then performed hyperparameter tuning to find a suitable number of topics, using topic coherence as the evaluation measure. We also applied the same method to an HDFS log dataset to check the validity of the model. Finally, the outputs were handed to CERN domain experts for the final evaluation. Based on the results, we were able to create a practical topic modeling framework for ALICE’s log messages in a real scenario.
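The pipeline described in the abstract (bag-of-words preprocessing, LDA, and selecting the number of topics by topic coherence) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the sample log lines are invented, the tokenizer is deliberately crude, LDA is fitted by collapsed Gibbs sampling, and UMass coherence stands in for the coherence measure, which the abstract does not specify.

```python
import math
import random
import re


def tokenize(message):
    """Lowercase a log line and keep alphabetic tokens (numbers and IDs are dropped)."""
    return re.findall(r"[a-z]+", message.lower())


def bag_of_words(messages):
    """Return (vocab, docs); each doc is a list of vocabulary indices."""
    tokenized = [tokenize(m) for m in messages]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    return vocab, [[index[w] for w in doc] for doc in tokenized]


def fit_lda(docs, n_vocab, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the topic-word count matrix."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, 1)
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * n_vocab)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                update(d, w, z, 1)
    return topic_word


def top_words(topic_word, n=3):
    """Indices of the n highest-count words for each topic."""
    return [sorted(range(len(row)), key=lambda w: -row[w])[:n] for row in topic_word]


def umass_coherence(topics, docs):
    """Mean UMass coherence: sum over word pairs of log((D(wi, wj) + 1) / D(wj))."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for words in topics:
        score = 0.0
        for i in range(1, len(words)):
            for j in range(i):
                score += math.log(
                    (doc_freq(words[i], words[j]) + 1) / doc_freq(words[j])
                )
        scores.append(score)
    return sum(scores) / len(scores)


# Invented sample log lines (not from the ALICE or HDFS datasets).
messages = [
    "Connection to database failed after 3 retries",
    "Database connection timeout on host 12",
    "Run 4521 started by shifter",
    "Run 4522 stopped by shifter",
    "Detector readout buffer overflow",
    "Readout buffer overflow on detector link 7",
]

vocab, docs = bag_of_words(messages)
scores = {}
for k in (2, 3, 4):  # candidate numbers of topics
    topic_word = fit_lda(docs, len(vocab), k)
    scores[k] = umass_coherence(top_words(topic_word), docs)
best_k = max(scores, key=scores.get)
print(scores, "-> selected number of topics:", best_k)
```

On a real corpus, a maintained library implementation of LDA and coherence scoring would normally replace these hand-written functions, and the selected topics would still need review by domain experts, as the abstract describes.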