BioBERTurk: Exploring Turkish Biomedical Language Model Development Strategies in Low-Resource Setting

Hazal Türkmen, Oğuz Dikenelli, Cenk Eraslan, Mehmet Cem Çallı, Süha Süreyya Özbek

Journal of Healthcare Informatics Research, 7(4), 433-446 (2023). DOI: 10.1007/s41666-023-00140-7. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620363/pdf/
Abstract
Pretrained language models augmented with in-domain corpora show impressive results in biomedical and clinical Natural Language Processing (NLP) tasks in English. However, there has been minimal work in low-resource languages. Although some pioneering studies have shown promising results, many scenarios still need to be explored to engineer effective pretrained language models for biomedicine in low-resource settings. This study introduces the BioBERTurk family, consisting of four pretrained language models in Turkish for biomedicine. To evaluate the models, we also introduce a labeled dataset for classifying radiology reports of head CT examinations. The two parts of each report, impressions and findings, are evaluated separately to observe model performance on longer and less informative text. We compared the models with Turkish BERT (BERTurk) pretrained on general-domain text, multilingual BERT (mBERT), and an LSTM+attention-based baseline. The first model, initialized from BERTurk and further pretrained on a biomedical corpus, performs statistically better than BERTurk, mBERT, and the baseline on both datasets. The second model continues pretraining BERTurk using only radiology Ph.D. theses to test the effect of task-related text; it slightly outperformed all models on the impression dataset, showing that continual pretraining on radiology-related data alone can be effective. The third model continues pretraining on the biomedical corpus with the radiology theses added, but does not show a statistically meaningful difference on either dataset. The final model combines the radiology and biomedical corpora with the corpus of BERTurk and pretrains a BERT model from scratch; it is the worst-performing model of the BioBERTurk family, performing even worse than BERTurk and mBERT.
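As a rough illustration of the continual-pretraining strategy described above (initializing from BERTurk and further pretraining with a biomedical corpus via masked language modeling), the following is a minimal sketch using the Hugging Face Transformers library. The checkpoint name, corpus file, and training hyperparameters are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch: continual (domain-adaptive) pretraining of a BERTurk-style
# checkpoint with masked language modeling on a biomedical text corpus.
# Checkpoint, corpus path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed Turkish BERT checkpoint to continue pretraining from.
checkpoint = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical plain-text biomedical corpus, one document per line.
corpus = load_dataset("text", data_files={"train": "biomedical_corpus_tr.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% token masking for MLM-based continual pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bioberturk-continual",
    per_device_train_batch_size=16,
    num_train_epochs=1,          # illustrative; the paper's schedule may differ
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint could then be fine-tuned on a labeled report-classification dataset (e.g., the head CT impression or findings sets) with a standard sequence-classification head; that step is omitted here.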