Investigating and improving log parsing in practice
Ying Fu, Meng Yan, Jian Xu, Jianguo Li, Zhongxin Liu, Xiaohong Zhang, Dan Yang
DOI: 10.1145/3540250.3558947 (https://doi.org/10.1145/3540250.3558947)
Published 2022-11-07 in 软件产业与工程
Citations: 7
Abstract
Logs are widely used to diagnose system behavior through automatic log mining. Log parsing is an important preprocessing step that converts semi-structured log messages into structured data, which serves as the feature input for log mining. Many studies propose new log parsers; however, to the best of our knowledge, no previous study has comprehensively investigated the effectiveness of log parsers in industrial practice. In this paper, we conduct an empirical study of six state-of-the-art log parsers on 10 microservice applications of Ant Group. Our empirical results highlight two challenges for log parsing in practice: 1) Various separators. A single log message can contain several different separators, and the separators also differ across event templates and across applications. Current log parsers perform poorly because they do not account for this variety. 2) Various lengths due to nested objects. Log messages belonging to the same event template may have different lengths because of nested objects. Log messages in 6 of the 10 microservice applications at Ant Group have varying lengths due to nested objects, and 4 of the 6 state-of-the-art log parsers cannot handle them. We propose an improved log parser named Drain+, based on the state-of-the-art log parser Drain. Drain+ includes two innovative components to address these two challenges: a statistics-based separator generation component, which automatically generates separators for splitting log messages, and a candidate event template merging component, which merges candidate event templates using a template similarity method. We evaluate Drain+ on 10 microservice applications of Ant Group and 16 public datasets. The results show that Drain+ outperforms the six state-of-the-art log parsers on both industrial applications and public datasets. Finally, we summarize our observations on the road ahead for log parsing to inspire other researchers and practitioners.
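The two challenges and the two components named in the abstract can be illustrated with a small Python sketch. The abstract does not specify which statistics Drain+ uses to generate separators or how it measures template similarity, so everything below is an assumption made for illustration: the frequency heuristic in `infer_separators`, the use of `difflib.SequenceMatcher` as the similarity measure, the thresholds, and all function names are hypothetical and are not the authors' implementation.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

WILDCARD = "<*>"  # placeholder for a variable part of a template


def infer_separators(messages, candidates=",;|=:", min_ratio=0.6):
    """Assumed heuristic: keep candidate characters that appear in at
    least min_ratio of the messages. Not the statistic used by Drain+."""
    counts = Counter()
    for msg in messages:
        for ch in set(msg) & set(candidates):
            counts[ch] += 1
    return sorted(ch for ch, n in counts.items() if n / len(messages) >= min_ratio)


def tokenize(message, separators):
    """Split a message on whitespace plus the inferred separators."""
    pattern = "[" + re.escape("".join(separators)) + r"\s]+"
    return [tok for tok in re.split(pattern, message) if tok]


def similarity(a, b):
    """Token-level similarity in [0, 1]; works for unequal lengths."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def _push(tokens, tok):
    # Avoid emitting two wildcards in a row.
    if tok == WILDCARD and tokens and tokens[-1] == WILDCARD:
        return
    tokens.append(tok)


def merge_pair(a, b):
    """Keep tokens shared by both sequences; replace each differing
    run (of any length) with a single wildcard."""
    merged = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            for tok in a[i1:i2]:
                _push(merged, tok)
        else:
            _push(merged, WILDCARD)
    return merged


def merge_templates(templates, threshold=0.7):
    """Greedily fold each candidate template into the first existing
    template that is at least `threshold` similar to it."""
    merged = []
    for tpl in templates:
        for i, existing in enumerate(merged):
            if similarity(existing, tpl) >= threshold:
                merged[i] = merge_pair(existing, tpl)
                break
        else:
            merged.append(list(tpl))
    return merged


if __name__ == "__main__":
    logs = [
        "order=1001,status=PAID,items=[a]",
        "order=1002,status=PAID,items=[a,b,c]",  # nested list adds tokens
        "order=1003,status=FAILED,items=[b]",
    ]
    seps = infer_separators(logs)
    for line in logs:
        print(tokenize(line, seps))
```

In this made-up example, the second message yields more tokens than the first because the nested list contains the same separator, which is the "various lengths" problem in miniature. A parser that groups messages strictly by token count would place the two in different groups, and a merging step based on a length-tolerant similarity measure is one way to bring the resulting candidate templates back together.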