Authors: Yan Hu, Yong Chen, Hua Xu
DOI: 10.1007/s41666-023-00141-6
Journal: Journal of Healthcare Informatics Research (IF 5.9, JCR Q1, Computer Science)
Published: 2023-08-30 (eCollection 2023/12/1), Journal Article
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620359/pdf/
Towards More Generalizable and Accurate Sentence Classification in Medical Abstracts with Less Data.
With the unprecedented growth of biomedical publications, structured abstracts in bibliographic databases (e.g., PubMed) are important for facilitating information retrieval and knowledge synthesis for researchers. Here, we propose a few-shot prompt learning-based approach to classify sentences in medical abstracts of randomized clinical trials (RCT) and observational studies (OS) into the subsections Introduction, Background, Methods, Results, and Conclusion, using an existing corpus of RCTs (PubMed 200k/20k RCT) and a newly built corpus of OS (PubMed 20k OS). Five manually designed templates, combined with 4 BERT model variants, were tested and compared against a previous hierarchical sequential labeling network (HSLN) architecture and a traditional BERT-based sentence classification method. On the PubMed 200k and 20k RCT datasets, we achieved overall F1 scores of 0.9508 and 0.9401, respectively. Under few-shot settings, we demonstrated that only 20% of the training data is sufficient to match the F1 score of the HSLN model (0.9266 for our method vs. 0.9263 for HSLN). When trained on the RCT dataset, our method achieved an F1 score of 0.9065 on the OS dataset; when trained on the OS dataset, it achieved 0.9203 on the RCT dataset. We show that the prompt learning-based method outperformed the existing method even when fewer training samples were used. Moreover, the proposed method generalizes better across the two types of medical publications than the existing approach. We make the datasets and code publicly available at: https://github.com/YanHu-or-SawyerHu/prompt-learning-based-sentence-classifier-in-medical-abstracts.
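The core idea of prompt learning described in the abstract can be sketched as follows: instead of attaching a classification head to BERT, each abstract sentence is wrapped in a cloze-style template, and a masked language model predicts a label word at the mask position, which a "verbalizer" maps back to a section label. The template wording and label words below are illustrative assumptions for exposition, not the paper's five actual templates.

```python
# Illustrative sketch of prompt-based sentence classification.
# Template text and label words are hypothetical, not the paper's exact design.

# Section labels used in the paper's datasets.
SECTION_LABELS = ["introduction", "background", "methods", "results", "conclusion"]


def build_prompt(sentence: str, mask_token: str = "[MASK]") -> str:
    """Wrap an abstract sentence in a cloze-style template; a masked LM
    would then predict a label word at the mask position."""
    return f"{sentence} This sentence belongs to the {mask_token} section."


def verbalize(predicted_word: str) -> str:
    """Map the word predicted at the mask back to a section label
    (the 'verbalizer' step in prompt learning)."""
    word = predicted_word.lower()
    return word if word in SECTION_LABELS else "unknown"


prompt = build_prompt("We enrolled 120 patients in a randomized trial.")
# The prompt, with "[MASK]" in place, is what a BERT-style masked LM scores;
# few-shot training only tunes how well label words fill that slot.
```

Because the task is recast as filling in a word the pretrained model already knows, far fewer labeled examples are needed than for training a fresh classification head, which is consistent with the 20%-of-training-data result reported above.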
Journal introduction:
Journal of Healthcare Informatics Research serves as a publication venue for innovative technical contributions highlighting analytics, systems, and human factors research in healthcare informatics. The journal is concerned with the application of computer science principles, information science, information technology, and communication technology to problems in healthcare and everyday wellness, and it highlights the most cutting-edge technical contributions in computing-oriented healthcare informatics. The journal covers three major tracks: (1) analytics, focusing on data analytics, knowledge discovery, and predictive modeling; (2) systems, focusing on building healthcare informatics systems (e.g., architecture, framework, design, engineering, and application); (3) human factors, focusing on understanding users or context, interface design, health behavior, and user studies of healthcare informatics applications.
Topics include but are not limited to:
· healthcare software architecture, framework, design, and engineering
· electronic health records
· medical data mining
· predictive modeling
· medical information retrieval
· medical natural language processing
· healthcare information systems
· smart health and connected health
· social media analytics
· mobile healthcare
· medical signal processing
· human factors in healthcare
· usability studies in healthcare
· user-interface design for medical devices and healthcare software
· health service delivery
· health games
· security and privacy in healthcare
· medical recommender systems
· healthcare workflow management
· disease profiling and personalized treatment
· visualization of medical data
· intelligent medical devices and sensors
· RFID solutions for healthcare
· healthcare decision analytics and support systems
· epidemiological surveillance systems and intervention modeling
· consumer and clinician health information needs, seeking, sharing, and use
· semantic Web, linked data, and ontology
· collaboration technologies for healthcare
· assistive and adaptive ubiquitous computing technologies
· statistics and quality of medical data
· healthcare delivery in developing countries
· health systems modeling and simulation
· computer-aided diagnosis