Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity

Workshop on Biomedical Natural Language Processing, Pub Date: 2020-07-01, DOI: 10.18653/v1/2020.bionlp-1.11

Yuxia Wang, Fei Liu, Karin M. Verspoor, Timothy Baldwin

Citations: 19

Abstract

In this paper, we apply pre-trained language models to the Semantic Textual Similarity (STS) task, with a specific focus on the clinical domain. In the low-resource setting of clinical STS, these large models tend to be impractical and prone to overfitting. Building on BERT, we study the impact of a number of model design choices, namely different fine-tuning and pooling strategies. We observe that the impact of domain-specific fine-tuning on clinical STS is much smaller than in the general domain, likely due to the concept richness of the domain. Based on this, we propose two data augmentation techniques. Experimental results on N2C2-STS demonstrate substantial improvements, validating the utility of the proposed methods.
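The abstract refers to pooling strategies without naming them. As a rough illustration only, the sketch below shows three poolings commonly used to turn per-token encoder outputs into one sentence vector ([CLS], mean, and max), followed by cosine similarity between two sentences. The choice of these three strategies, the function names, and the toy vectors are all assumptions made for illustration; they are not taken from the paper, and real token vectors would come from a BERT encoder.

```python
import math

def cls_pool(token_vecs):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_vecs[0])

def mean_pool(token_vecs):
    """Average each dimension over all tokens."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def max_pool(token_vecs):
    """Take the per-dimension maximum over all tokens."""
    return [max(col) for col in zip(*token_vecs)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-dimensional token vectors standing in for encoder outputs.
sent_a = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
sent_b = [[1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]

for name, pool in [("cls", cls_pool), ("mean", mean_pool), ("max", max_pool)]:
    print(name, round(cosine(pool(sent_a), pool(sent_b)), 3))
```

Even on toy inputs the three poolings give different similarity scores for the same sentence pair, which is why the pooling choice is worth evaluating as a design decision.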
