CDTier: A Chinese Dataset of Threat Intelligence Entity Relationships

Impact factor: 3.0 | Zone 3 (Computer Science) | JCR Q2 (Computer Science, Hardware & Architecture)
Yinghai Zhou, Yitong Ren, Ming Yi, Yanjun Xiao, Zhiyuan Tan, Nour Moustafa, Zhihong Tian
IEEE Transactions on Sustainable Computing, vol. 8, no. 4, pp. 627-638. Published 2023-01-30. DOI: 10.1109/TSUSC.2023.3240411
https://ieeexplore.ieee.org/document/10029930/
Citations: 2

Abstract

Cyber Threat Intelligence (CTI), which is knowledge of cyberspace threats gathered from security data, is critical in defending against cyberattacks. However, there is no open-source CTI dataset that lets security researchers effectively apply the enormous volume of CTI information to security analysis, particularly for Chinese threat intelligence. To support network security research and development, this article constructs a Chinese CTI entity relationship dataset, CDTier, which includes: 1) a threat entity extraction dataset composed of 100 CTI reports, 3744 threat sentences and 4259 threat knowledge objects; 2) an entity relation extraction dataset composed of 100 CTI reports, 2598 threat sentences and 2562 knowledge object relations. CDTier is, as far as we know, the first CTI dataset. On CDTier, we trained 4 models for threat entity extraction and relation extraction using well-established and widely used deep learning methods in NLP. The results showed that the models trained on CDTier extract knowledge objects and their relationships described in threat intelligence more accurately. This significantly reduces the work of threat intelligence analysts when assessing threat intelligence.
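The abstract does not describe how CDTier stores its annotations, so the following is only an illustrative sketch of how one annotated threat sentence could feed both tasks it mentions: entity extraction (as character-level BIO tags) and relation extraction (as head/relation/tail triples). The class names, the labels `Attacker`, `Malware` and `uses`, and the example sentence are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset into the sentence
    end: int    # exclusive character offset
    label: str  # hypothetical entity type, not CDTier's actual schema


@dataclass(frozen=True)
class Relation:
    head: int   # index of the head entity in the entity list
    tail: int   # index of the tail entity in the entity list
    label: str  # hypothetical relation type


def to_bio(text, entities):
    """Return one BIO tag per character (Chinese text has no whitespace tokens)."""
    tags = ["O"] * len(text)
    for ent in entities:
        tags[ent.start] = f"B-{ent.label}"
        for i in range(ent.start + 1, ent.end):
            tags[i] = f"I-{ent.label}"
    return tags


def to_triples(text, entities, relations):
    """Return (head text, relation label, tail text) for each relation."""
    def span(ent):
        return text[ent.start:ent.end]
    return [
        (span(entities[r.head]), r.label, span(entities[r.tail]))
        for r in relations
    ]


# Made-up sentence: "A certain group uses malware X".
text = "某组织使用恶意软件X"
entities = [Entity(0, 3, "Attacker"), Entity(5, 10, "Malware")]
relations = [Relation(head=0, tail=1, label="uses")]

bio_tags = to_bio(text, entities)
triples = to_triples(text, entities, relations)
print(list(zip(text, bio_tags)))
print(triples)
```

Keeping entities as character spans and deriving the BIO tags from them means one annotation can serve both a sequence-labelling model and a relation classifier; whether CDTier itself is organised this way is an assumption.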
Source journal
IEEE Transactions on Sustainable Computing (Mathematics-Control and Optimization)
CiteScore: 7.70
Self-citation rate: 2.60%
Articles published: 54