Harnessing the Power of BERT in the Turkish Clinical Domain: Pretraining Approaches for Limited Data Scenarios

Hazal Türkmen, Oguz Dikenelli, C. Eraslan, Mehmet Cem Çalli, S. Özbek

Clinical Natural Language Processing Workshop, 2023-05-05. DOI: 10.48550/arXiv.2305.03788 (https://doi.org/10.48550/arXiv.2305.03788)
Citations: 1
Abstract
Recent advances in natural language processing (NLP) have been driven by large language models (LLMs), which have revolutionized the field. Our study investigates the impact of diverse pre-training strategies on the performance of Turkish clinical language models in a multi-label classification task on radiology reports, with a focus on overcoming language resource limitations. Additionally, for the first time, we evaluate a simultaneous pre-training approach that utilizes limited clinical task data. We developed four models: TurkRadBERT-task v1, TurkRadBERT-task v2, TurkRadBERT-sim v1, and TurkRadBERT-sim v2. Our results show superior performance from BERTurk and TurkRadBERT-task v1, both of which leverage a broad general-domain corpus. Although task-adaptive pre-training can identify domain-specific patterns, it may be prone to overfitting because of the constraints of the task-specific corpus. Our findings highlight the importance of domain-specific vocabulary during pre-training for improving performance. They also confirm that combining general-domain knowledge with task-specific fine-tuning is crucial for optimal performance across categories. This study offers key insights for future research on pre-training techniques in the clinical domain, particularly for low-resource languages.
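As a concrete illustration (not the authors' code), the sketch below shows the kind of multi-label fine-tuning setup the abstract describes: a publicly available Turkish BERT checkpoint (BERTurk, `dbmdz/bert-base-turkish-cased`) adapted to multi-label classification of radiology reports with Hugging Face Transformers. The label count, label assignments, and example report are illustrative assumptions, not values from the paper.

```python
# Minimal sketch, assuming a BERTurk checkpoint and a hypothetical 5-label
# radiology finding scheme; fine-tuning details in the paper may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dbmdz/bert-base-turkish-cased"  # BERTurk (assumed checkpoint)
NUM_LABELS = 5  # hypothetical number of radiology finding categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # BCE loss, sigmoid per label
)

# Toy Turkish radiology sentence ("Consolidation seen in the right lower zone").
report = "Akciğer grafisinde sağ alt zonda konsolidasyon izlendi."
inputs = tokenizer(report, truncation=True, return_tensors="pt")

# Multi-hot float targets: several findings can be positive at once.
labels = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0]])

outputs = model(**inputs, labels=labels)
probs = torch.sigmoid(outputs.logits)  # independent per-label probabilities
print(outputs.loss.item(), probs)
```

In the task-adaptive variants the abstract mentions (TurkRadBERT-task), a continued masked-language-modeling phase on the clinical task corpus would precede this fine-tuning step; the classification head and loss above would stay the same.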