Ontology-conformal recognition of materials entities using language models.

IF 3.9, Zone 2 (Comprehensive Journals), JCR Q1 MULTIDISCIPLINARY SCIENCES

Scientific Reports Pub Date : 2025-05-28 DOI:10.1038/s41598-025-03619-y

Sai Teja Potu, Rachana Niranjan Murthy, Akhil Thomas, Lokesh Mishra, Natalie Prange, Ali Riza Durmaz

Scientific Reports 15(1), 18597 (2025). Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12116928/pdf/

Citations: 0

Abstract

Extracting structured and semantically annotated materials information from unstructured scientific literature is a crucial step toward constructing machine-interpretable knowledge graphs and accelerating data-driven materials research. This is especially important in materials science, which is adversely affected by data scarcity. Data scarcity further motivates the use of foundation language models for information extraction, since they can in principle address several subtasks of the information extraction problem across a range of domains without the need to generate costly large-scale annotated datasets for each downstream task. However, foundation language models struggle with tasks like Named Entity Recognition (NER) due to domain-specific terminologies, fine-grained entities, and semantic ambiguity. The issue is even more pronounced when entities must map directly to pre-existing domain ontologies. This work aims to assess whether foundation large language models (LLMs) can successfully perform ontology-conformal NER in the materials mechanics and fatigue domain. Specifically, we present a comparative evaluation of in-context learning (ICL) with foundation models such as GPT-4 against fine-tuned task-specific language models, including MatSciBERT and DeBERTa. The study is performed on two materials fatigue datasets, which contain annotations at a comparatively fine-grained level adhering to the class definitions of a formal ontology to ensure semantic alignment and cross-dataset interoperability. The two datasets cover adjacent domains, which makes it possible to assess how well both NER methodologies generalize under typical domain shifts. Task-specific models are shown to significantly outperform general foundation models on ontology-constrained NER. Our findings reveal that the ability of ICL to handle domain shift depends strongly on the quality of the few-shot demonstrations. The study also highlights the significance of domain-specific pre-training by comparing task-specific models that differ primarily in their pre-training corpus.


Source journal

Scientific Reports (Natural Science Disciplines)

CiteScore

7.50

Self-citation rate

4.30%

Articles published

19567

Review time

3.9 months

About the journal: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor of 4.380 (2021), and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).

• Engineering: Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.
• Physical sciences: Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature, often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.
• Earth and environmental sciences: Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. They also consider the interactions between humans and these systems.
• Biological sciences: Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms to animals and plants.
• Health sciences: The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.