SciND: a new triplet-based dataset for scientific novelty detection via knowledge graphs

IF 1.7 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries. Pub Date: 2024-01-08. DOI: 10.1007/s00799-023-00386-x

Komal Gupta, Ammaar Ahmad, Tirthankar Ghosal, Asif Ekbal


Abstract

Detecting texts that contain semantic-level new information is not straightforward, and the problem becomes more challenging for research articles. Over the years, many datasets and techniques have been developed for automatic novelty detection. However, most existing work on textual novelty detection targets general domains such as newswire, and no comprehensive dataset for scientific novelty detection is available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triplets: (i) triplets for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from papers published in 2021. For the non-novel articles, we use blog post summaries of the research articles. Our knowledge graph is domain-specific and covers seven NLP domains. We further use a feature-based novelty detection scheme for research articles as a baseline and use it to show the applicability of the proposed dataset. This baseline algorithm yields an F1 score of 72%. We present an analysis and discuss the future scope of the proposed dataset. To the best of our knowledge, this is the first dataset for scientific novelty detection via a knowledge graph. We make our code and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection.
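To make the triplet-based setup concrete, the following is a minimal sketch of how a candidate triplet might be checked against a knowledge graph and how an F1 score is computed over the resulting labels. It is an illustration only, not the feature-based baseline described in the paper: the `Triplet` type, the helper names, the exact-match novelty rule, and the toy triplets are all assumptions introduced here, and the actual SciND file format may differ.

```python
from typing import Iterable, NamedTuple


class Triplet(NamedTuple):
    """A (subject, predicate, object) triplet; field names are hypothetical."""
    subject: str
    predicate: str
    obj: str


def normalize(t: Triplet) -> Triplet:
    # Lowercase and collapse whitespace so trivial surface variants match.
    return Triplet(*(" ".join(part.lower().split()) for part in t))


def build_graph(triplets: Iterable[Triplet]) -> set[Triplet]:
    # Assumption: the knowledge graph is treated as a plain set of triplets.
    return {normalize(t) for t in triplets}


def is_novel(t: Triplet, graph: set[Triplet]) -> bool:
    # Assumption: a triplet is "novel" if it has no exact match in the graph.
    return normalize(t) not in graph


def f1_score(gold: list[bool], predicted: list[bool]) -> float:
    # Standard F1 for the positive ("novel") class.
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration; not taken from SciND.
graph = build_graph([
    Triplet("BERT", "used-for", "question answering"),
    Triplet("LSTM", "used-for", "machine translation"),
])
candidates = [
    (Triplet("bert", "used-for", "Question  Answering"), False),
    (Triplet("BERT", "used-for", "novelty detection"), True),
    (Triplet("LSTM", "used-for", "machine translation"), False),
    (Triplet("GRU", "used-for", "summarization"), True),
]
gold = [label for _, label in candidates]
predicted = [is_novel(t, graph) for t, _ in candidates]
score = f1_score(gold, predicted)
print(predicted, score)
```

Exact matching is the simplest possible rule and would miss paraphrased triplets; a realistic detector would likely need softer matching, which is presumably one reason the paper relies on a feature-based scheme instead.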

