ABB-BERT: A BERT model for disambiguating abbreviations and contractions

Q3 Arts and Humanities

Icon · Pub Date: 2022-07-08 · DOI: 10.48550/arXiv.2207.04008

Prateek Kacker, Andi Cupallari, Aswin Giridhar Subramanian, Nimit Jain

Journal: Icon, volume 1 1, pages 289-297. Publication type: Journal Article. Source: Semantic Scholar.

Citations: 0

Abstract

Abbreviations and contractions are common in text across many domains. For example, doctors' notes contain many contractions, which individual writers often personalize. Existing spelling-correction models handle expansion poorly because abbreviations drop many characters from the original words. In this work, we propose ABB-BERT, a BERT-based model for ambiguous language that contains abbreviations and contractions. ABB-BERT ranks candidate expansions from thousands of options and is designed for scale. It is trained on Wikipedia text, and the algorithm allows it to be fine-tuned with little compute for better performance on a specific domain or person. We are publicly releasing the training dataset for abbreviations and contractions, derived from Wikipedia.


Source journal

Icon Arts and Humanities-History and Philosophy of Science

CiteScore

0.30

Self-citation rate

0.00%

Articles published