Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors

R. A. Bridges, Kelly M. T. Huffer, Corinne L. Jones, Michael D. Iannacone, J. Goodall
2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 437-442. Published 2017-12-01. DOI: 10.1109/ICMLA.2017.0-122 (https://doi.org/10.1109/ICMLA.2017.0-122)
Citations: 17

Abstract

We address a crucial element of applied information extraction, the accurate identification of basic security entities in text, by evaluating previous methods and presenting new labelers. Our survey reveals that previous efforts have not been tested on documents similar to the targeted sources (news articles, blogs, tweets, etc.) and that no sufficiently large, publicly available annotated corpus of these documents exists. By assembling a representative test corpus, we perform a quantitative evaluation of previous methods in a realistic setting, revealing an overall lack of recall and giving insight into the models' beneficial and inhibiting elements. In particular, our results show that many previous efforts overfit to the non-representative test corpora in this domain. Informed by this evaluation, we present three novel cyber entity extractors, which seek to leverage the available labeled data but remain useful on the more diverse documents encountered in the wild. Each new model advances the state of the art in recall, with maximal or near-maximal F1 score. Our results establish that the state of the art in cyber entity tagging is characterized by F1 = 0.61.
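The abstract compares extractors by recall and F1. As a reminder of how these scores relate, here is a minimal Python sketch of entity-level scoring. It assumes exact-match scoring, where a predicted entity counts only if its span and label both match a gold annotation; the paper's actual scoring protocol may differ. The function name `precision_recall_f1`, the label names, and the toy annotations are all hypothetical illustrations, not data from the paper.

```python
from typing import Set, Tuple

# An entity is (start_token, end_token, label). Exact-match scoring is an
# assumption of this sketch, not necessarily the protocol used in the paper.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match entity tagging."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one sentence (labels are illustrative only).
gold = {(0, 1, "VENDOR"), (1, 2, "PRODUCT"), (5, 6, "VERSION"), (9, 10, "VULNERABILITY")}
predicted = {(0, 1, "VENDOR"), (1, 2, "PRODUCT"), (7, 8, "PRODUCT")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case the tagger finds only two of the four gold entities, so recall is lower than precision. That is the kind of shortfall the evaluation reports for earlier methods.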