Characterizing pituitary adenomas in clinical notes: Corpus construction and its application in LLMs.

IF 2.3 · CAS Zone 3 (Medicine) · JCR Q2 · HEALTH CARE SCIENCES & SERVICES

Health Informatics Journal Pub Date : 2024-10-01 DOI:10.1177/14604582241291442

Jiahui Hu, Jin Fu, Wanqing Zhao, Pei Lou, Ming Feng, Huiling Ren, Shanshan Feng, Yansheng Li, An Fang

Health Informatics Journal, vol. 30, no. 4, article 14604582241291442. Publication type: Journal Article.

Citation count: 0

Abstract

Objective: Differential diagnosis of pituitary adenomas is challenging because of their complex clinical manifestations and high pathological heterogeneity. This study aims to construct a high-quality annotated corpus that characterizes pituitary adenomas in clinical notes containing rich diagnosis and treatment information. Methods: A dataset was retrospectively collected from a pituitary adenoma neurosurgery treatment center of a tertiary first-class hospital in China. A semi-automatic corpus construction framework was designed. A total of 2000 documents containing 9430 sentences and 524,232 words were annotated, and the text corpus of pituitary adenomas (TCPA) was constructed and analyzed. Its potential application in large language models (LLMs) was explored through fine-tuning and prompting experiments. Results: TCPA had 4782 medical entities and 28,998 tokens, and achieved good quality, with inter-annotator agreement values of 0.862-0.986. The LLM experiments showed that TCPA can be used to automatically identify clinical information in free text, and that introducing instances with clinical characteristics can effectively reduce the need for training data, thereby reducing labor costs. Conclusion: This study characterized pituitary adenomas in clinical notes, and the proposed method can serve as a reference for related research in medical natural language scenarios with highly specialized language structure and terminology.


Source journal

Health Informatics Journal HEALTH CARE SCIENCES & SERVICES-MEDICAL INFORMATICS

CiteScore

7.80

Self-citation rate

6.70%

Review time

6 months

About the journal: Health Informatics Journal is an international peer-reviewed journal. All papers submitted to Health Informatics Journal are subject to peer review by members of a carefully appointed editorial board. The journal operates a conventional single-blind reviewing policy in which the reviewer’s name is always concealed from the submitting author.