Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models

Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis
{"title":"Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models","authors":"Anastasios Lamproudis, Aron Henriksson, H. Dalianis","doi":"10.5220/0010893800003123","DOIUrl":null,"url":null,"abstract":": Research has shown that using generic language models – specifically, BERT models – in specialized domains may be sub-optimal due to domain differences in language use and vocabulary. There are several techniques for developing domain-specific language models that leverage the use of existing generic language models, including continued and domain-adaptive pretraining with in-domain data. Here, we investigate a strategy based on using a domain-specific vocabulary, while leveraging a generic language model for initialization. The results demonstrate that domain-adaptive pretraining, in combination with a domain-specific vocabulary – as opposed to a general-domain vocabulary – yields improvements on two downstream clinical NLP tasks for Swedish. The results highlight the value of domain-adaptive pretraining when developing specialized language models and indicate that it is beneficial to adapt the vocabulary of the language model to the target domain prior to continued, domain-adaptive pretraining of a generic language model.","PeriodicalId":20676,"journal":{"name":"Proceedings of the International Conference on Health Informatics and Medical Application Technology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on Health Informatics and Medical Application Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5220/0010893800003123","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Research has shown that using generic language models, specifically BERT models, in specialized domains may be sub-optimal due to domain differences in language use and vocabulary. There are several techniques for developing domain-specific language models that leverage existing generic language models, including continued, domain-adaptive pretraining with in-domain data. Here, we investigate a strategy based on using a domain-specific vocabulary while leveraging a generic language model for initialization. The results demonstrate that domain-adaptive pretraining in combination with a domain-specific vocabulary, as opposed to a general-domain vocabulary, yields improvements on two downstream clinical NLP tasks for Swedish. The results highlight the value of domain-adaptive pretraining when developing specialized language models and indicate that it is beneficial to adapt the vocabulary of the language model to the target domain prior to continued, domain-adaptive pretraining of a generic language model.
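The vocabulary-replacement strategy the abstract describes can be sketched with the Hugging Face transformers library: train a domain-specific subword vocabulary on clinical text, then initialize its embedding matrix from the generic model before continued pretraining. This is a minimal sketch, not the authors' released implementation; the checkpoint name, the corpus file `clinical_corpus.txt`, and the subword-averaging initialization for novel tokens are illustrative assumptions.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

generic_ckpt = "KB/bert-base-swedish-cased"  # assumed generic Swedish BERT
old_tok = AutoTokenizer.from_pretrained(generic_ckpt)
model = AutoModelForMaskedLM.from_pretrained(generic_ckpt)

# Keep a copy of the generic embedding matrix before swapping vocabularies.
old_emb = model.get_input_embeddings().weight.data.clone()
old_vocab = old_tok.get_vocab()

# 1) Train a domain-specific WordPiece vocabulary on in-domain clinical text.
#    "clinical_corpus.txt" is a hypothetical stand-in for the clinical corpus.
def lines():
    with open("clinical_corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

new_tok = old_tok.train_new_from_iterator(lines(), vocab_size=len(old_vocab))

# 2) Resize the model to the new vocabulary; this also resizes the tied
#    masked-language-modeling output head.
model.resize_token_embeddings(len(new_tok))
new_emb = model.get_input_embeddings().weight.data

# 3) Initialize the new embeddings from the generic model: copy the vector of
#    any token shared with the generic vocabulary; for novel tokens, average
#    the generic embeddings of the token's old-tokenizer subword pieces
#    (a common heuristic, not necessarily the exact scheme used in the paper).
with torch.no_grad():
    for token, new_id in new_tok.get_vocab().items():
        if token in old_vocab:
            new_emb[new_id] = old_emb[old_vocab[token]]
        else:
            surface = token[2:] if token.startswith("##") else token
            piece_ids = old_tok.encode(surface, add_special_tokens=False)
            if piece_ids:
                new_emb[new_id] = old_emb[piece_ids].mean(dim=0)

model.save_pretrained("clinical-bert-init")
new_tok.save_pretrained("clinical-bert-init")
# 4) Continue masked-language-model pretraining on the clinical corpus,
#    e.g. with transformers' Trainer and DataCollatorForLanguageModeling.
```

The design choice here mirrors the paper's premise: rather than pretraining from scratch, the generic model supplies the transformer weights and as much of the embedding matrix as the vocabularies share, so domain-adaptive pretraining only has to adapt, not relearn, the representation.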