Extracting problem and method sentence from scientific papers: a context-enhanced transformer using formulaic expression desensitization

IF 3.5 · CAS Tier 3 (Management Science) · Q2, Computer Science, Interdisciplinary Applications
Yingyi Zhang, Chengzhi Zhang
DOI: 10.1007/s11192-024-05048-6 (https://doi.org/10.1007/s11192-024-05048-6)
Journal: Scientometrics
Publication date: 2024-05-27
Publication type: Journal Article
Citations: 0

Abstract

The vast and growing number of scientific papers creates a need to identify the essential parts of this massive body of text. Scientific research is an activity that moves from posing problems to applying methods. To capture the main ideas of scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific surface forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data and reduce models' reliance on FEs. For the third idea, we propose a context-enhanced transformer that utilizes context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments using large language model (LLM)-based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F1 score than the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM-based ICL methods are found to be unsuitable for the task of problem and method extraction.
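The abstract's core idea of FE desensitization can be sketched in code. The FE lists, mask token, and function names below are illustrative assumptions, not the authors' actual implementation: the augmenter generates label-preserving synthetic sentences by swapping one formulaic expression for another of the same class, while desensitization masks the FE so a classifier cannot latch onto its surface form.

```python
import re

# Illustrative formulaic expressions (FEs); the paper's real FE inventory
# is not reproduced here.
PROBLEM_FES = ["this paper addresses", "we focus on the problem of"]
METHOD_FES = ["we propose", "we introduce", "our approach uses"]

def desensitize(sentence, fes, mask="[FE]"):
    """Mask every known FE so a model cannot rely on its surface form."""
    out = sentence
    for fe in fes:
        out = re.sub(re.escape(fe), mask, out, flags=re.IGNORECASE)
    return out

def augment(sentence, fes):
    """Create synthetic, label-preserving variants by swapping a matched FE
    for each other FE of the same class (problem or method)."""
    variants = []
    for fe in fes:
        if fe in sentence.lower():
            for alt in fes:
                if alt != fe:
                    variants.append(
                        re.sub(re.escape(fe), alt, sentence, flags=re.IGNORECASE)
                    )
    return variants
```

For example, `desensitize("We propose a context-enhanced transformer.", METHOD_FES)` yields `"[FE] a context-enhanced transformer."`, and `augment` on the same sentence produces one variant per remaining method FE.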


Source journal: Scientometrics (Management Science: Computer Science, Interdisciplinary Applications)
CiteScore: 7.20
Self-citation rate: 17.90%
Articles per year: 351
Review time: 1.5 months
About the journal: Scientometrics aims at publishing original studies, short communications, preliminary reports, review papers, letters to the editor, and book reviews on scientometrics. The topics covered are results of research concerned with the quantitative features and characteristics of science. Emphasis is placed on investigations in which the development and mechanism of science are studied by means of (statistical) mathematical methods. The journal also provides readers with important up-to-date information about international meetings and events in scientometrics and related fields. Appropriate bibliographic compilations are published as a separate section.

Due to its fully interdisciplinary character, Scientometrics is indispensable to research workers and research administrators throughout the world, and it provides valuable assistance to librarians and documentalists in central scientific agencies, ministries, research institutes, and laboratories.

Scientometrics includes the Journal of Research Communication Studies; its aims and scope therefore cover those of the latter, namely, to bring the results of research investigations together in one place, in such a form that they will be of use not only to the investigators themselves but also to the entrepreneurs and research workers who form the object of these studies.