Shengzhi Huang , Wei Lu , Zhenzhen Xu , Qikai Cheng , Jinqing Yang , Yong Huang
{"title":"通过基于比较权力的大型模型识别潜在的破坏性研究","authors":"Shengzhi Huang , Wei Lu , Zhenzhen Xu , Qikai Cheng , Jinqing Yang , Yong Huang","doi":"10.1016/j.ipm.2025.104207","DOIUrl":null,"url":null,"abstract":"<div><div>Timely identification of potentially disruptive research is a significant research issue, since disruptive innovation in science transforms the existing paradigm and/or opens a new paradigm. This study proposes a comparative power-based large model that can promptly and accurately identify potentially disruptive research via comparative analysis of semantically-related papers. To this end, a self-constructed dataset was built by treating accumulated disruptive and consolidating citations as crowdsourced annotation data. We employed a range of machine learning models (MLs), deep learning models (DLs), and large language models (LLMs) to build classifiers. Our optimal model, Mistral-7B<sup>+*</sup>, attains an impressive F1 score of 0.8210 and outperforms the best-performing ML and DL models by approximately 27.05 % and 14.03 %, respectively. Testing on 275 recently published biomedical papers further verifies its effectiveness. Additionally, we conduct comprehensive experiments to scrutinize the comparative power of the large model as well as the impact of the number and quality of comparative papers and distinct functional paragraphs within abstracts on identification performance. Our findings show that an appropriate number and quality of comparative papers can promote identification performance. Moreover, result-based paragraphs are the most important for identifying disruptive research, while method-based paragraphs are least important.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 6","pages":"Article 104207"},"PeriodicalIF":6.9000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Identifying potentially disruptive research via a comparative power-based large model\",\"authors\":\"Shengzhi Huang , Wei Lu , Zhenzhen Xu , Qikai Cheng , Jinqing Yang , Yong Huang\",\"doi\":\"10.1016/j.ipm.2025.104207\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Timely identification of potentially disruptive research is a significant research issue, since disruptive innovation in science transforms the existing paradigm and/or opens a new paradigm. This study proposes a comparative power-based large model that can promptly and accurately identify potentially disruptive research via comparative analysis of semantically-related papers. To this end, a self-constructed dataset was built by treating accumulated disruptive and consolidating citations as crowdsourced annotation data. We employed a range of machine learning models (MLs), deep learning models (DLs), and large language models (LLMs) to build classifiers. Our optimal model, Mistral-7B<sup>+*</sup>, attains an impressive F1 score of 0.8210 and outperforms the best-performing ML and DL models by approximately 27.05 % and 14.03 %, respectively. Testing on 275 recently published biomedical papers further verifies its effectiveness. Additionally, we conduct comprehensive experiments to scrutinize the comparative power of the large model as well as the impact of the number and quality of comparative papers and distinct functional paragraphs within abstracts on identification performance. 
Our findings show that an appropriate number and quality of comparative papers can promote identification performance. Moreover, result-based paragraphs are the most important for identifying disruptive research, while method-based paragraphs are least important.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 6\",\"pages\":\"Article 104207\"},\"PeriodicalIF\":6.9000,\"publicationDate\":\"2025-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457325001487\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325001487","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Identifying potentially disruptive research via a comparative power-based large model
Timely identification of potentially disruptive research is a significant research issue, since disruptive innovation in science transforms the existing paradigm and/or opens a new one. This study proposes a comparative power-based large model that promptly and accurately identifies potentially disruptive research through comparative analysis of semantically related papers. To this end, we built a self-constructed dataset by treating accumulated disruptive and consolidating citations as crowdsourced annotation data. We employed a range of machine learning (ML) models, deep learning (DL) models, and large language models (LLMs) to build classifiers. Our optimal model, Mistral-7B+*, attains an F1 score of 0.8210, outperforming the best-performing ML and DL models by approximately 27.05% and 14.03%, respectively. Testing on 275 recently published biomedical papers further verifies its effectiveness. Additionally, we conduct comprehensive experiments to scrutinize the comparative power of the large model, as well as the impact on identification performance of the number and quality of comparative papers and of the distinct functional paragraphs within abstracts. Our findings show that an appropriate number and quality of comparative papers improve identification performance. Moreover, result-based paragraphs are the most important for identifying disruptive research, while method-based paragraphs are the least important.
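To make the comparative setup concrete, here is a minimal, hypothetical Python sketch of the general idea: retrieve the k most related papers for a focal abstract and frame them as a comparative classification query. TF-IDF cosine similarity is used here as a lightweight stand-in for the semantic relatedness in the paper, and the function name, prompt wording, and k are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: TF-IDF similarity stands in for semantic
# embeddings, and the prompt format is an assumption, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_comparative_prompt(focal_abstract, candidate_abstracts, k=3):
    """Select the k candidates most similar to the focal abstract and
    assemble a comparative classification prompt for an LLM classifier."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Row 0 is the focal abstract; the rest are candidate comparative papers.
    matrix = vectorizer.fit_transform([focal_abstract] + candidate_abstracts)
    sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    top_k = sims.argsort()[::-1][:k]
    comparisons = "\n\n".join(
        f"Comparative paper {i + 1}: {candidate_abstracts[j]}"
        for i, j in enumerate(top_k)
    )
    return (
        "Focal paper: " + focal_abstract + "\n\n" + comparisons +
        "\n\nRelative to the comparative papers, is the focal paper "
        "disruptive or consolidating? Answer with one word."
    )

# The returned prompt would then be passed to a fine-tuned instruction
# model (e.g., a Mistral-7B-class classifier), whose one-word output is
# mapped to a binary disruptive/consolidating label.
```

Under this reading, varying k and the similarity threshold corresponds to the paper's experiments on the number and quality of comparative papers.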
Journal overview:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.