Ziyu Qiu, Zhilei Zhou, Bradley Niblett, Andrew Johnston, Jeffrey Schwartzentruber, Nur Zincir-Heywood, Malcolm I. Heywood
{"title":"评估词袋嵌入法和词到向量嵌入法及降维对日志文件异常检测的影响","authors":"Ziyu Qiu, Zhilei Zhou, Bradley Niblett, Andrew Johnston, Jeffrey Schwartzentruber, Nur Zincir-Heywood, Malcolm I. Heywood","doi":"10.1002/nem.2251","DOIUrl":null,"url":null,"abstract":"<p>In terms of cyber security, log files represent a rich source of information regarding the state of a computer service/system. Automating the process of summarizing log file content represents an important aid for decision-making, especially given the 24/7 nature of network/service operations. We perform benchmarking over eight distinct log files in order to assess the impact of the following: (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high-dimensional semantic space, and (3) assessing the impact of using different unsupervised learning algorithms for providing a visual summary of the service state. Benchmarking demonstrates that (1) word-to-vector embeddings identified by bidirectional encoder representation from transformers (BERT) without “fine-tuning” are sufficient to match the performance of Bag-or-Words embeddings provided by term frequency-inverse document frequency (TF-IDF) and (2) the self-organizing map without dimension reduction provides the most effective anomaly detector.</p>","PeriodicalId":14154,"journal":{"name":"International Journal of Network Management","volume":"34 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2023-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/nem.2251","citationCount":"0","resultStr":"{\"title\":\"Assessing the impact of bag-of-words versus word-to-vector embedding methods and dimension reduction on anomaly detection from log files\",\"authors\":\"Ziyu Qiu, Zhilei Zhou, Bradley Niblett, Andrew Johnston, Jeffrey Schwartzentruber, Nur Zincir-Heywood, Malcolm I. Heywood\",\"doi\":\"10.1002/nem.2251\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>In terms of cyber security, log files represent a rich source of information regarding the state of a computer service/system. Automating the process of summarizing log file content represents an important aid for decision-making, especially given the 24/7 nature of network/service operations. We perform benchmarking over eight distinct log files in order to assess the impact of the following: (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high-dimensional semantic space, and (3) assessing the impact of using different unsupervised learning algorithms for providing a visual summary of the service state. Benchmarking demonstrates that (1) word-to-vector embeddings identified by bidirectional encoder representation from transformers (BERT) without “fine-tuning” are sufficient to match the performance of Bag-or-Words embeddings provided by term frequency-inverse document frequency (TF-IDF) and (2) the self-organizing map without dimension reduction provides the most effective anomaly detector.</p>\",\"PeriodicalId\":14154,\"journal\":{\"name\":\"International Journal of Network Management\",\"volume\":\"34 1\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2023-10-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/nem.2251\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Network Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/nem.2251\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Network Management","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/nem.2251","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Assessing the impact of bag-of-words versus word-to-vector embedding methods and dimension reduction on anomaly detection from log files
In terms of cyber security, log files represent a rich source of information regarding the state of a computer service/system. Automating the process of summarizing log file content represents an important aid for decision-making, especially given the 24/7 nature of network/service operations. We perform benchmarking over eight distinct log files in order to assess the impact of the following: (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high-dimensional semantic space, and (3) assessing the impact of using different unsupervised learning algorithms for providing a visual summary of the service state. Benchmarking demonstrates that (1) word-to-vector embeddings identified by bidirectional encoder representation from transformers (BERT) without “fine-tuning” are sufficient to match the performance of Bag-or-Words embeddings provided by term frequency-inverse document frequency (TF-IDF) and (2) the self-organizing map without dimension reduction provides the most effective anomaly detector.
期刊介绍:
Modern computer networks and communication systems are increasing in size, scope, and heterogeneity. The promise of a single end-to-end technology has not been realized and likely never will occur. The decreasing cost of bandwidth is increasing the possible applications of computer networks and communication systems to entirely new domains. Problems in integrating heterogeneous wired and wireless technologies, ensuring security and quality of service, and reliably operating large-scale systems including the inclusion of cloud computing have all emerged as important topics. The one constant is the need for network management. Challenges in network management have never been greater than they are today. The International Journal of Network Management is the forum for researchers, developers, and practitioners in network management to present their work to an international audience. The journal is dedicated to the dissemination of information, which will enable improved management, operation, and maintenance of computer networks and communication systems. The journal is peer reviewed and publishes original papers (both theoretical and experimental) by leading researchers, practitioners, and consultants from universities, research laboratories, and companies around the world. Issues with thematic or guest-edited special topics typically occur several times per year. Topic areas for the journal are largely defined by the taxonomy for network and service management developed by IFIP WG6.6, together with IEEE-CNOM, the IRTF-NMRG and the Emanics Network of Excellence.