Yubo Chen, Baoli Zhang, Sirui Li, Zhuoran Jin, Zhengyuan Cai, Yingzheng Wang, Delai Qiu, ShengPing Liu, Jun Zhao
Information Processing & Management, Volume 62, Issue 5, Article 104189. Published 2025-05-12. DOI: 10.1016/j.ipm.2025.104189
Impact factor: 6.9. JCR: Q1 (Computer Science, Information Systems).
https://www.sciencedirect.com/science/article/pii/S030645732500130X
Prompt robust large language model for Chinese medical named entity recognition
Medical named entity recognition (NER) is crucial for constructing healthcare knowledge graphs and enhancing intelligent medical systems, yet it faces three challenges: data scarcity, low recall in nested entity annotation, and the high prompt sensitivity of generative NER models. In this paper, we aim to address all three challenges simultaneously. First, we construct a Multi-Scenario Medical NER dataset, the largest medical NER dataset, comprising over 40,000 samples and over 3,400 entity types across eight major scenarios, including medical web data, online consultations, and medical books. Second, we propose a data annotation and selection method based on decomposed question answering, which improves the F1 score by 6% compared to direct annotation. Third, to enhance the robustness of large models to diverse prompts in real-world scenarios, we construct diverse prompt templates and implement a dynamic prompt strategy during the training phase. Finally, we conduct a comprehensive set of experiments, and the results demonstrate the effectiveness of our annotation method and robustness training approach. Notably, the proposed framework achieves a 5% performance improvement on the test set compared to conventional methods. Moreover, our method enables a 7B-parameter model to surpass a 32B-parameter model, highlighting its superior efficiency and capability.
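The abstract does not spell out how the dynamic prompt strategy is implemented. One plausible reading is that each training sample is paired with a prompt template drawn at random, so the model sees varied phrasings of the same task. The Python sketch below illustrates that reading only; the template wording, the `build_training_example` and `dynamic_prompt_epoch` helpers, and the sample fields are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical templates: the paper's actual prompt templates are not
# given in the abstract, so these are placeholders for illustration.
TEMPLATES = [
    "Extract all {entity_type} entities from the following text:\n{text}",
    "Text: {text}\nList every mention of type {entity_type}.",
    "Which spans in the text below are {entity_type}?\n{text}",
]


def build_training_example(text, entity_type, entities, rng):
    """Pair one annotated sample with a randomly chosen prompt template."""
    template = rng.choice(TEMPLATES)
    prompt = template.format(entity_type=entity_type, text=text)
    # Assumed target format: entity strings joined by "; ", or "none".
    target = "; ".join(entities) if entities else "none"
    return {"prompt": prompt, "target": target}


def dynamic_prompt_epoch(samples, seed):
    """Re-sample templates for every sample; call once per epoch with a new seed."""
    rng = random.Random(seed)
    return [
        build_training_example(s["text"], s["entity_type"], s["entities"], rng)
        for s in samples
    ]


# Hypothetical annotated sample (not from the paper's dataset).
samples = [
    {"text": "患者服用阿司匹林后头痛缓解", "entity_type": "drug", "entities": ["阿司匹林"]},
]
epoch_0 = dynamic_prompt_epoch(samples, seed=0)
epoch_1 = dynamic_prompt_epoch(samples, seed=1)
```

Passing a different seed per epoch means the same sample can appear under different phrasings over the course of training, while the target stays fixed. Whether the paper samples per epoch, per batch, or per sample is not stated in the abstract.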
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.