J. J
International Journal of Data Mining Techniques and Applications, vol. 48, no. 1
Published: 26 June 2017 (Journal Article). DOI: 10.20894/IJDMTA.102.006.001.004
Silhouette Threshold Based Text Clustering for Log Analysis
Automated log analysis has been a dominant subject of interest to both industry and academia. The heterogeneous nature of system logs, the disparate sources they come from (infrastructure, networks, databases and applications), and their varied underlying structures and formats make the challenge harder. In this paper I present less frequently used document clustering techniques to dynamically organize real-time log events (e.g., errors, warnings) into specific categories that are pre-built from a corpus of log archives. This kind of syntactic log categorization can be exploited for automatic log monitoring, priority flagging and dynamic solution recommendation systems. I propose practical strategies to cluster and correlate high-volume log archives and high-velocity real-time log events, addressing both solution quality and computational efficiency. First, I compare two traditional partitional document clustering approaches for categorizing a high-dimensional log corpus; to select a suitable model for the problem, entropy, purity and the silhouette index are used to evaluate these different learning approaches. Next, I propose computationally efficient approaches to generate a vector space model for the real-time log events. To dynamically relate these events to the categories from the corpus, I suggest combining a critical-distance measure with a least-distance assignment approach. In addition, I introduce and evaluate three different critical-distance measures to ascertain whether a real-time event belongs to an entirely new category that is unobserved in the corpus.
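A minimal sketch, not taken from the paper, of how the pipeline described in the abstract might look in Python with scikit-learn: a log corpus is turned into a TF-IDF vector space model, partitioned with k-means, evaluated with the silhouette index, and a real-time event is then assigned to its least-distance category unless it exceeds a critical-distance cutoff, in which case it is flagged as a new category. The toy corpus, the single fixed cosine threshold (standing in for the paper's three critical-distance measures) and the `assign` helper are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_distances

# Small illustrative corpus of archived log events.
corpus = [
    "ERROR database connection timeout",
    "ERROR database connection refused",
    "WARN disk usage above 90 percent",
    "WARN disk usage above 95 percent",
    "ERROR null pointer in request handler",
    "ERROR null pointer in session handler",
]

# Build the vector space model for the corpus.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Partitional clustering to pre-build the categories.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Silhouette index evaluates how well-separated the clusters are.
print("silhouette index:", silhouette_score(X, km.labels_))

def assign(event, critical_distance=0.5):
    """Least-distance assignment of a real-time event to a category.

    If even the nearest centroid lies beyond the critical distance,
    return -1 to flag a potentially new, unobserved category.
    """
    v = vectorizer.transform([event])
    d = cosine_distances(v, km.cluster_centers_)[0]
    nearest = int(np.argmin(d))
    return nearest if d[nearest] <= critical_distance else -1

print(assign("ERROR database connection timeout"))  # matches a known category
print(assign("FATAL kernel panic on node seven"))   # no shared terms: flagged -1
```

In this sketch the novelty test reduces to a single cosine-distance threshold; the paper's contribution is precisely in choosing and evaluating more principled critical-distance measures for that step.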