Infer the missing facts of D3FEND using knowledge graph representation learning

IF 2.5 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

International Journal of Web Information Systems Pub Date : 2023-08-16 DOI:10.1108/ijwis-03-2023-0042

A. Khobragade, S. Ghumbre, V. Pachghare

Citations: 0

Abstract

Purpose
MITRE and the National Security Agency cooperatively developed and maintain the D3FEND knowledge graph (KG). It represents concepts from the cybersecurity countermeasure domain, such as dynamic, emulated and file analysis, as entities. These entities are linked by relationships such as analyze, may_contains and encrypt. A fundamental challenge for collaborative designers is to encode knowledge and efficiently interrelate the cyber-domain facts generated daily. At present, however, the designers manually update the graph with new or missing facts to enrich the knowledge. This paper proposes an automated approach that predicts the missing facts through a link prediction task, using embeddings as the learned representation.

Design/methodology/approach
D3FEND is available in the resource description framework (RDF) format. In the preprocessing step, the RDF facts are converted to subject–predicate–object triples, which contain 5,967 entities and 98 relationship types. Progressive distance-based, bilinear and convolutional embedding models are then applied to learn embeddings of the entities and relations. The study uses the learned embeddings in a link prediction task to infer missing facts.

Findings
Experimental results show that the translational model performs well on high-rank results, whereas the bilinear model is superior in capturing the latent semantics of complex relationship types. The convolutional model, however, outperforms 44% of the true facts and achieves a 3% improvement in results over the other models.

Research limitations/implications
Although embedding models succeed in enriching D3FEND through link prediction in a supervised learning setup, the approach has limitations, such as not capturing the diversity and hierarchies of relations. The average node degree of the D3FEND KG is 16.85, and 12% of entities have a node degree below 2; in particular, many entities or relations have few or no observed links. This results in sparsity and data imbalance, which affect model performance even after the embedding vector size is increased. Moreover, KG embedding models consider only existing entities and relations and may not incorporate external or contextual information, such as textual descriptions, temporal dynamics or domain knowledge, that could enhance link prediction performance.

Practical implications
Link prediction in the D3FEND KG can benefit cybersecurity countermeasure strategies in several ways. It can help to identify gaps or weaknesses in existing defensive methods and suggest ways to improve or augment them; to compare and contrast different defensive methods and understand their trade-offs and synergies; to discover novel or emerging defensive methods by inferring new relations from existing data or external sources; and to generate recommendations or guidance for selecting or deploying appropriate defensive methods based on the characteristics and objectives of the system or network.

Originality/value
The representation learning approach helps to reduce the incompleteness of D3FEND: link prediction infers possible missing facts from the graph's existing entities and relations.
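As an illustration of the link prediction task described in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes TransE as a stand-in for the distance-based (translational) family and DistMult for the bilinear family; the abstract does not name the specific models. The triples, entity names and 2-d embedding values are hypothetical and hand-picked, whereas a real model would learn the embeddings from the D3FEND triples.

```python
import math

# Hypothetical toy triples in (subject, predicate, object) form. The names are
# illustrative only and are not taken from the D3FEND release.
triples = [
    ("FileAnalysis", "analyzes", "File"),
    ("DynamicAnalysis", "analyzes", "Process"),
    ("FileEncryption", "encrypts", "File"),
]

# Hand-picked 2-d embeddings (assumption: a trained model would learn these).
entity_emb = {
    "FileAnalysis": [0.0, 0.0],
    "DynamicAnalysis": [0.0, 2.0],
    "FileEncryption": [2.0, 1.0],
    "File": [1.0, 0.0],
    "Process": [1.0, 2.0],
}
relation_emb = {
    "analyzes": [1.0, 0.0],
    "encrypts": [-1.0, -1.0],
}


def transe_score(h, r, t):
    """Distance-based score: negative L2 norm of h + r - t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """Bilinear (diagonal) score: sum of h_i * r_i * t_i (higher is better)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rank_tails(head, relation, score_fn=transe_score):
    """Return every entity, ordered from most to least plausible tail."""
    h, r = entity_emb[head], relation_emb[relation]
    scored = [(score_fn(h, r, entity_emb[e]), e) for e in entity_emb]
    # Sort by descending score; ties are broken by name for determinism.
    return [e for _, e in sorted(scored, key=lambda p: (-p[0], p[1]))]


def evaluate(test_triples, k=1):
    """Unfiltered mean reciprocal rank and Hits@k over tail predictions."""
    ranks = [rank_tails(h, r).index(t) + 1 for h, r, t in test_triples]
    mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
    hits_at_k = sum(rank <= k for rank in ranks) / len(ranks)
    return mrr, hits_at_k


if __name__ == "__main__":
    # Query (FileAnalysis, analyzes, ?) and list candidate tails by plausibility.
    print(rank_tails("FileAnalysis", "analyzes"))
    print(evaluate(triples, k=1))
```

Ranking every entity as a candidate tail and reporting rank-based metrics is the usual way such models are compared. This sketch uses the unfiltered setting and only predicts tails; a fuller evaluation would typically also predict heads and filter out other known true triples.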

Source journal

International Journal of Web Information Systems (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore

4.60

Self-citation rate

0.00%

Articles published

Journal description: The Global Information Infrastructure is a daily reality. In spite of the many applications in all domains of our societies (e-business, e-commerce, e-learning, e-science and e-government, for instance), and in spite of the tremendous advances by engineers and scientists, the seamless development of Web information systems and services remains a major challenge. The journal examines how the current shared vision for the future is one of semantically rich information and service-oriented architecture for global information systems. This vision is at the convergence of progress in technologies such as XML, Web services, RDF and OWL; in multimedia, multimodal and multilingual information retrieval; and in distributed, mobile and ubiquitous computing.

Topicality: While the International Journal of Web Information Systems covers a broad range of topics, the journal welcomes papers that provide a perspective on all aspects of Web information systems: Web semantics and Web dynamics, Web mining and searching, Web databases and Web data integration, Web-based commerce and e-business, Web collaboration and distributed computing, Internet computing and networks, performance of Web applications, and Web multimedia services and Web-based education.