Yassir Lairgi, Ludovic Moncla, Rémy Cazabet, Khalid Benabdeslem, Pierre Cléau
{"title":"iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models","authors":"Yassir Lairgi, Ludovic Moncla, Rémy Cazabet, Khalid Benabdeslem, Pierre Cléau","doi":"arxiv-2409.03284","DOIUrl":null,"url":null,"abstract":"Most available data is unstructured, making it challenging to access valuable\ninformation. Automatically building Knowledge Graphs (KGs) is crucial for\nstructuring data and making it accessible, allowing users to search for\ninformation effectively. KGs also facilitate insights, inference, and\nreasoning. Traditional NLP methods, such as named entity recognition and\nrelation extraction, are key in information retrieval but face limitations,\nincluding the use of predefined entity types and the need for supervised\nlearning. Current research leverages large language models' capabilities, such\nas zero- or few-shot learning. However, unresolved and semantically duplicated\nentities and relations still pose challenges, leading to inconsistent graphs\nand requiring extensive post-processing. Additionally, most approaches are\ntopic-dependent. In this paper, we propose iText2KG, a method for incremental,\ntopic-independent KG construction without post-processing. This plug-and-play,\nzero-shot method is applicable across a wide range of KG construction scenarios\nand comprises four modules: Document Distiller, Incremental Entity Extractor,\nIncremental Relation Extractor, and Graph Integrator and Visualization. Our\nmethod demonstrates superior performance compared to baseline methods across\nthree scenarios: converting scientific papers to graphs, websites to graphs,\nand CVs to graphs.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03284","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Most available data is unstructured, making it challenging to access valuable
information. Automatically building Knowledge Graphs (KGs) is crucial for
structuring data and making it accessible, allowing users to search for
information effectively. KGs also facilitate insights, inference, and
reasoning. Traditional NLP methods, such as named entity recognition and
relation extraction, are key in information retrieval but face limitations,
including the use of predefined entity types and the need for supervised
learning. Current research leverages large language models' capabilities, such
as zero- or few-shot learning. However, unresolved and semantically duplicated
entities and relations still pose challenges, leading to inconsistent graphs
and requiring extensive post-processing. Additionally, most approaches are
topic-dependent. In this paper, we propose iText2KG, a method for incremental,
topic-independent KG construction without post-processing. This plug-and-play,
zero-shot method is applicable across a wide range of KG construction scenarios
and comprises four modules: Document Distiller, Incremental Entity Extractor,
Incremental Relation Extractor, and Graph Integrator and Visualization. Our
method demonstrates superior performance compared to baseline methods across
three scenarios: converting scientific papers to graphs, websites to graphs,
and CVs to graphs.
大多数可用数据都是非结构化的,因此获取有价值的信息具有挑战性。自动构建知识图谱(KG)对于构建数据并使其易于访问、让用户有效搜索信息至关重要。知识图谱还有助于洞察、推理和推理。传统的 NLP 方法(如命名实体识别和关联提取)是信息检索的关键,但也存在局限性,包括使用预定义的实体类型和需要监督学习。目前的研究利用了大型语言模型的能力,如零学习或少量学习。然而,未解决的和语义重复的实体和关系仍然构成挑战,导致图不一致,需要大量的后处理。此外,大多数方法都依赖于主题。在本文中,我们提出了 iText2KG,这是一种无需后处理的增量、与主题无关的 KG 构建方法。这种即插即用、零误差的方法适用于各种 KG 构建场景,由四个模块组成:它包括四个模块:文档蒸馏器、增量实体提取器、增量关系提取器以及图集成器和可视化。与基线方法相比,我们的方法在将科学论文转换为图形、将网站转换为图形以及将简历转换为图形这三种场景中都表现出了卓越的性能。