Silhouette Threshold Based Text Clustering for Log Analysis

J. J
Journal: International Journal of Data Mining Techniques and Applications
DOI: 10.20894/IJDMTA.102.006.001.004
Published: 2017-06-26
Citations: 2

Abstract

Automated log analysis has been a dominant area of interest to both industry and academia. The heterogeneous nature of system logs, the disparate sources of logs (infrastructure, networks, databases, and applications), and their underlying structures and formats make the challenge harder. In this paper I present less frequently used document clustering techniques to dynamically organize real-time log events (e.g., errors, warnings) into specific categories that are pre-built from a corpus of log archives. This kind of syntactic log categorization can be exploited for automatic log monitoring, priority flagging, and dynamic solution-recommendation systems. I propose practical strategies to cluster and correlate high-volume log archives and high-velocity real-time log events, in terms of both solution quality and computational efficiency. First, I compare two traditional partitional document clustering approaches for categorizing a high-dimensional log corpus. To select a suitable model for this problem, entropy, purity, and the silhouette index are used to evaluate these different learning approaches. I then propose computationally efficient approaches to generate a vector space model for the real-time log events. To dynamically relate these events to the categories from the corpus, I suggest combining a critical-distance measure with a least-distance approach. In addition, I introduce and evaluate three different critical-distance measures to ascertain whether a real-time event belongs to an entirely new category that is unobserved in the corpus.
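The assignment scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the centroid vectors, category names, and threshold value below are hypothetical stand-ins, and the silhouette function shows only the per-point coefficient used in cluster evaluation.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def silhouette(a, b):
    """Silhouette coefficient for one point: a = mean intra-cluster
    distance, b = mean distance to the nearest other cluster."""
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

# Hypothetical centroids for categories pre-built from the log corpus
# (components stand in for TF-IDF weights of log terms).
centroids = {
    "disk_error": [0.9, 0.1, 0.0],
    "auth_warning": [0.1, 0.8, 0.1],
}

# Assumed threshold; the paper evaluates three such critical-distance measures.
CRITICAL_DISTANCE = 0.5

def categorize(event_vec):
    # Least-distance approach: pick the nearest pre-built category centroid.
    label, dist = min(
        ((name, cosine_distance(event_vec, c)) for name, c in centroids.items()),
        key=lambda t: t[1],
    )
    # Critical-distance check: an event too far from every centroid is
    # flagged as belonging to a category unobserved in the corpus.
    return label if dist <= CRITICAL_DISTANCE else "NEW_CATEGORY"
```

An event vector close to a centroid is assigned its category; one far from all centroids is flagged as new, which is the novelty-detection role the critical-distance measures play in the paper.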