Markus Kreuzthaler, Bastian Pfeifer, Stefan Schulz
{"title":"Secondary Use of Clinical Problem List Descriptions for Bi-Encoder Based ICD-10 Classification.","authors":"Markus Kreuzthaler, Bastian Pfeifer, Stefan Schulz","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Annotated language resources are essential for supervised machine learning methods. In the clinical domain, such data sets can boost use-case specific natural language processing services. In this work, we have analyzed a clinical problem list table consisting of millions of ICD-10 codes assigned to short problem list descriptions in German. We have investigated whether the given data forms a valuable resource within a secondary use case scenario for coding support. Our proposed methodology exploits an embedding-based k-NN classifier, which was evaluated based on its coding performance, leveraging the multilingual BERT based language model SapBERT-UMLS in comparison with medBERT.de, which is specifically tailored to medical and clinical language resources in German. Our approach reached a weighted F1-measure of 0.87 using SapBERT-UMLS and an F1-measure of 0.86 for medBERT.de. The approach revealed promising coding results when reusing annotated language resources out of clinical routine documentation.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. AMIA Symposium","volume":"2024 ","pages":"620-627"},"PeriodicalIF":0.0000,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099355/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AMIA ... Annual Symposium proceedings. AMIA Symposium","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Annotated language resources are essential for supervised machine learning methods. In the clinical domain, such data sets can boost use-case specific natural language processing services. In this work, we have analyzed a clinical problem list table consisting of millions of ICD-10 codes assigned to short problem list descriptions in German. We have investigated whether the given data forms a valuable resource within a secondary use case scenario for coding support. Our proposed methodology exploits an embedding-based k-NN classifier, which was evaluated based on its coding performance, leveraging the multilingual BERT based language model SapBERT-UMLS in comparison with medBERT.de, which is specifically tailored to medical and clinical language resources in German. Our approach reached a weighted F1-measure of 0.87 using SapBERT-UMLS and an F1-measure of 0.86 for medBERT.de. The approach revealed promising coding results when reusing annotated language resources out of clinical routine documentation.