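Assessing the impact of bag-of-words versus word-to-vector embedding methods and dimension reduction on anomaly detection from log files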

Impact factor: 1.5; Zone 4 (Computer Science); JCR Q3 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Ziyu Qiu, Zhilei Zhou, Bradley Niblett, Andrew Johnston, Jeffrey Schwartzentruber, Nur Zincir-Heywood, Malcolm I. Heywood
Journal: International Journal of Network Management, vol. 34, no. 1
Published: 2023-10-27
DOI: 10.1002/nem.2251
Article page: https://onlinelibrary.wiley.com/doi/10.1002/nem.2251
PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/nem.2251
Citations: 0

Abstract

Assessing the impact of bag-of-words versus word-to-vector embedding methods and dimension reduction on anomaly detection from log files

In cyber security, log files are a rich source of information about the state of a computer service or system. Automating the summarization of log file content is an important aid to decision-making, especially given the 24/7 nature of network and service operations. We benchmark eight distinct log files to assess the impact of (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high-dimensional semantic space, and (3) using different unsupervised learning algorithms to provide a visual summary of the service state. The benchmarking demonstrates that (1) word-to-vector embeddings produced by Bidirectional Encoder Representations from Transformers (BERT) without fine-tuning are sufficient to match the performance of bag-of-words embeddings produced by term frequency-inverse document frequency (TF-IDF) and (2) the self-organizing map without dimension reduction provides the most effective anomaly detector.
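
The kind of pipeline the abstract describes can be illustrated with a minimal sketch: log lines are embedded as TF-IDF bag-of-words vectors, a small self-organizing map (SOM) is fitted to them, and each line is scored by its distance to the nearest map unit. This is not the authors' implementation. The sample log lines, the 2x2 map size, the training schedule, the choice to fit the map only on lines assumed to be normal, and the use of quantization error as the anomaly score are all assumptions made for illustration. No dimension reduction step is applied.

```python
import math
import random
from collections import Counter


def tfidf_vectors(lines):
    """Embed each log line as an L2-normalised TF-IDF bag-of-words vector."""
    docs = [line.lower().split() for line in lines]
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: in how many lines each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(data, rows=2, cols=2, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Fit a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Initialise every unit as a copy of a randomly chosen training vector.
    units = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.1)  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            # Best-matching unit: the unit whose weights are closest to x.
            bmu = min(units, key=lambda u: sq_dist(units[u], x))
            for u, w in units.items():
                grid_d2 = (u[0] - bmu[0]) ** 2 + (u[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
    return units


def quantization_error(units, x):
    """Distance from x to its best-matching unit; larger means more unusual."""
    return math.sqrt(min(sq_dist(w, x) for w in units.values()))


# Hypothetical log lines: four routine entries and one unusual entry.
logs = [
    "sshd accepted password for alice",
    "sshd accepted password for bob",
    "sshd accepted password for carol",
    "sshd accepted password for dave",
    "kernel panic not syncing fatal exception",
]
vectors = tfidf_vectors(logs)
som = train_som(vectors[:4])  # fit on the lines assumed to be normal
scores = [quantization_error(som, v) for v in vectors]
for line, score in zip(logs, scores):
    print(f"{score:.3f}  {line}")
```

In this toy data the last line shares no tokens with the routine lines, so its vector lies far from every map unit and it receives the highest score. Real log files need tokenization that handles timestamps, identifiers, and variable fields, which this sketch ignores.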

Source journal
International Journal of Network Management
Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; TELECOMMUNICATIONS
CiteScore: 5.10
Self-citation rate: 6.70%
Articles published: 25
Review time: >12 weeks
About the journal: Modern computer networks and communication systems are increasing in size, scope, and heterogeneity. The promise of a single end-to-end technology has not been realized and likely never will be. The decreasing cost of bandwidth is extending the possible applications of computer networks and communication systems to entirely new domains. Problems in integrating heterogeneous wired and wireless technologies, ensuring security and quality of service, and reliably operating large-scale systems, including those that incorporate cloud computing, have all emerged as important topics. The one constant is the need for network management. Challenges in network management have never been greater than they are today. The International Journal of Network Management is the forum for researchers, developers, and practitioners in network management to present their work to an international audience. The journal is dedicated to the dissemination of information that will enable improved management, operation, and maintenance of computer networks and communication systems. The journal is peer reviewed and publishes original papers (both theoretical and experimental) by leading researchers, practitioners, and consultants from universities, research laboratories, and companies around the world. Issues with thematic or guest-edited special topics typically occur several times per year. Topic areas for the journal are largely defined by the taxonomy for network and service management developed by IFIP WG6.6, together with IEEE-CNOM, the IRTF-NMRG and the Emanics Network of Excellence.