How different is different? Systematically identifying distribution shifts and their impacts in NER datasets

IF 1.7 | CAS Tier 3 (Computer Science) | JCR Q3 (Computer Science, Interdisciplinary Applications)
Xue Li, Paul Groth
{"title":"How different is different? Systematically identifying distribution shifts and their impacts in NER datasets","authors":"Xue Li, Paul Groth","doi":"10.1007/s10579-024-09754-8","DOIUrl":null,"url":null,"abstract":"<p>When processing natural language, we are frequently confronted with the problem of distribution shift. For example, using a model trained on a news corpus to subsequently process legal text exhibits reduced performance. While this problem is well-known, to this point, there has not been a systematic study of detecting shifts and investigating the impact shifts have on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-points drop in F1 performance. Overall, our results indicate that the measurement of distribution shift can provide guidance to the amount of data needed for fine-tuning and whether or not a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"39 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09754-8","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus exhibits reduced performance when subsequently applied to legal text. While this problem is well-known, to this point, there has not been a systematic study of detecting shifts and investigating the impact shifts have on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1 performance. Overall, our results indicate that the measurement of distribution shift can provide guidance on the amount of data needed for fine-tuning and whether or not a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.
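The paper's concrete shift measures are not reproduced on this page, but the idea of quantifying label shift can be illustrated compactly. The sketch below is an assumption, not the authors' implementation: it treats label shift as the Jensen-Shannon divergence between the entity-label distributions of two NER corpora. The corpora, label set, and helper names are hypothetical.

```python
# Minimal sketch (not the paper's exact method): quantifying label shift
# between two NER datasets as the Jensen-Shannon divergence of their
# entity-label distributions. Corpora and labels below are illustrative.
from collections import Counter
import math

def label_distribution(tagged_corpus, label_set):
    """Relative frequency of each entity label in a corpus of (token, tag) pairs."""
    counts = Counter(tag for sentence in tagged_corpus for _, tag in sentence
                     if tag in label_set)
    total = sum(counts.values()) or 1
    return {label: counts[label] / total for label in label_set}

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions over the same labels."""
    def kl(a, b):
        return sum(a[k] * math.log((a[k] + eps) / (b[k] + eps)) for k in a if a[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical usage with two tiny corpora sharing the labels PER/ORG/LOC:
labels = {"PER", "ORG", "LOC"}
source = [[("Alice", "PER"), ("visited", "O"), ("Paris", "LOC")]]
target = [[("ACME", "ORG"), ("emailed", "O"), ("Bob", "PER")]]
shift = js_divergence(label_distribution(source, labels),
                      label_distribution(target, labels))
print(f"Label shift (JSD): {shift:.3f}")  # higher values suggest more shift
```

Input shift could be estimated analogously by comparing distributions over tokens or sentence embeddings rather than labels; under the paper's framing, a large divergence between a source corpus such as OntoNotes and a target corpus such as CEREC would signal that off-the-shelf use is risky and more fine-tuning data is needed.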

Source Journal
Language Resources and Evaluation
Category: Engineering & Technology / Computer Science: Interdisciplinary Applications
CiteScore: 6.50
Self-citation rate: 3.70%
Articles per year: 55
Review time: >12 weeks
About the journal: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.