Subword-Delimited Downsampling for Better Character-Level Translation

Lukas Edman, Antonio Toral, Gertjan van Noord
{"title":"子词分隔的下采样,以获得更好的字符级翻译","authors":"Lukas Edman, Antonio Toral, Gertjan van Noord","doi":"10.48550/arXiv.2212.01304","DOIUrl":null,"url":null,"abstract":"Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"58 1","pages":"981-992"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Subword-Delimited Downsampling for Better Character-Level Translation\",\"authors\":\"Lukas Edman, Antonio Toral, Gertjan van Noord\",\"doi\":\"10.48550/arXiv.2212.01304\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.\",\"PeriodicalId\":74540,\"journal\":{\"name\":\"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing\",\"volume\":\"58 1\",\"pages\":\"981-992\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Conference on Empirical Methods in Natural Language Processing. 
Conference on Empirical Methods in Natural Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2212.01304\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2212.01304","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4

Abstract

Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.
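The abstract does not spell out the mechanism, but the core idea of subword-delimited downsampling can be illustrated with a short sketch: segment the input with a subword tokenizer, then pool the character embeddings within each subword's span into a single downsampled vector, so that each pooled vector corresponds to one linguistically meaningful unit rather than a fixed-size block of characters. Everything below (the `pool_by_subwords` helper, the mean-pooling choice, the toy segmentation of "unhappiness") is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch of subword-delimited downsampling (not the paper's code).
# Characters are embedded individually; downsampling pools them within
# boundaries supplied by a subword tokenizer instead of fixed-size blocks.
import torch

def pool_by_subwords(char_embeddings: torch.Tensor, subwords: list[str]) -> torch.Tensor:
    """Mean-pool character embeddings within each subword span.

    char_embeddings: (num_chars, dim), one embedding per character.
    subwords: subword strings whose lengths sum to num_chars.
    Returns: (num_subwords, dim) downsampled sequence.
    """
    pooled, start = [], 0
    for sw in subwords:
        end = start + len(sw)
        # One vector per subword: average the characters it spans.
        pooled.append(char_embeddings[start:end].mean(dim=0))
        start = end
    assert start == char_embeddings.size(0), "subword lengths must cover all characters"
    return torch.stack(pooled)

# Toy usage: "unhappiness" with an assumed segmentation ["un", "happi", "ness"].
chars = torch.randn(11, 512)                          # 11 characters, 512-dim embeddings
out = pool_by_subwords(chars, ["un", "happi", "ness"])
print(out.shape)                                      # torch.Size([3, 512])
```

Under this reading, the downstream encoder runs over 3 vectors instead of 11, which is where the time and computation savings come from, while the boundaries remain linguistically informed rather than arbitrary.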