Title: A Formal Perspective on Byte-Pair Encoding
Authors: Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-29. DOI: 10.48550/arXiv.2306.16837
Abstract: Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(NM)$ to $\mathcal{O}(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.

Title: Tokenization and the Noiseless Channel
Authors: Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-29. DOI: 10.48550/arXiv.2306.16842
Abstract: Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82, in comparison to just -0.30 for compressed length.

Title: Benchmarking Large Language Model Capabilities for Conditional Generation
Authors: Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-29. DOI: 10.48550/arXiv.2306.16793
Abstract: Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, and associated techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks, while useful for comparing systems at a high level, relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, input and output language. Our results show that PLMs differ in their applicability to different data regimes and their generalization to multiple languages. They further inform practitioners as to which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.
{"title":"VisText: A Benchmark for Semantically Rich Chart Captioning","authors":"Benny J. Tang, Angie Boggust","doi":"10.48550/arXiv.2307.05356","DOIUrl":"https://doi.org/10.48550/arXiv.2307.05356","url":null,"abstract":"Captions that describe or explain charts help improve recall and comprehension of the depicted data and provide a more accessible medium for people with visual disabilities. However, current approaches for automatically generating such captions struggle to articulate the perceptual or cognitive features that are the hallmark of charts (e.g., complex trends and patterns). In response, we introduce VisText: a dataset of 12,441 pairs of charts and captions that describe the charts’ construction, report key statistics, and identify perceptual and cognitive phenomena. In VisText, a chart is available as three representations: a rasterized image, a backing data table, and a scene graph—a hierarchical representation of a chart’s visual elements akin to a web page’s Document Object Model (DOM). To evaluate the impact of VisText, we fine-tune state-of-the-art language models on our chart captioning task and apply prefix-tuning to produce captions that vary the semantic content they convey. Our models generate coherent, semantically rich captions and perform on par with state-of-the-art chart captioning models across machine translation and text generation metrics. Through qualitative analysis, we identify six broad categories of errors that our models make that can inform future work.","PeriodicalId":352845,"journal":{"name":"Annual Meeting of the Association for Computational Linguistics","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116985100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization
Authors: Songyang Gao, Shihan Dou, Yan Liu, Xiao Wang, Qi Zhang, Zhongyu Wei, Jin Ma, Yingchun Shan
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-27. DOI: 10.48550/arXiv.2306.15164
Abstract: Adversarial training is one of the best-performing methods for improving the robustness of deep language models. However, robust models come at the cost of high time consumption, as they require multi-step gradient ascents or word substitutions to obtain adversarial samples. In addition, these generated samples are deficient in grammatical quality and semantic consistency, which impairs the effectiveness of adversarial training. To address these problems, we introduce a novel, effective procedure for adversarial training that uses only clean data. Our procedure, distribution shift risk minimization (DSRM), estimates the adversarial loss by perturbing the input data's probability distribution rather than their embeddings. This formulation results in a robust model that minimizes the expected global loss under adversarial attacks. Our approach requires zero adversarial samples for training and reduces time consumption by up to 70% compared to current best-performing adversarial training methods. Experiments demonstrate that DSRM considerably improves BERT's resistance to textual adversarial attacks and achieves state-of-the-art robust accuracy on various benchmarks.
{"title":"IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning","authors":"Zihang Xu, Ziqing Yang, Yiming Cui, Shijin Wang","doi":"10.48550/arXiv.2306.15273","DOIUrl":"https://doi.org/10.48550/arXiv.2306.15273","url":null,"abstract":"In the field of machine reading comprehension (MRC), existing systems have surpassed the average performance of human beings in many tasks like SQuAD. However, there is still a long way to go when it comes to logical reasoning. Although some methods for it have been put forward, they either are designed in a quite complicated way or rely too much on external structures. In this paper, we proposed IDOL (InDicator-Oriented Logic Pre-training), an easy-to-understand but highly effective further pre-training task which logically strengthens the pre-trained models with the help of 6 types of logical indicators and a logically rich dataset LGP (LoGic Pre-training). IDOL achieves state-of-the-art performance on ReClor and LogiQA, the two most representative benchmarks in logical reasoning MRC, and is proven to be capable of generalizing to different pre-trained models and other types of MRC benchmarks like RACE and SQuAD 2.0 while keeping competitive general language understanding ability through testing on tasks in GLUE. Besides, at the beginning of the era of large language models, we take several of them like ChatGPT into comparison and find that IDOL still shows its advantage.","PeriodicalId":352845,"journal":{"name":"Annual Meeting of the Association for Computational Linguistics","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127734572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constructing Multilingual Code Search Dataset Using Neural Machine Translation","authors":"Ryo Sekizawa, Nan Duan, Shuai Lu, Hitomi Yanaka","doi":"10.48550/arXiv.2306.15604","DOIUrl":"https://doi.org/10.48550/arXiv.2306.15604","url":null,"abstract":"We create a multilingual code search dataset by translating the existing English dataset with a machine translation model and conduct a baseline experiment on a code search task.","PeriodicalId":352845,"journal":{"name":"Annual Meeting of the Association for Computational Linguistics","volume":"228 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127441621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Automatic Annotation of Direct Speech in Written French Narratives
Authors: Noé Durandard, Viet Tan, Gaspard Michel, Elena V. Epure
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-27. DOI: 10.48550/arXiv.2306.15634
Abstract: The automatic annotation of direct speech (AADS) in written text has often been used in computational narrative understanding. Methods based on either rules or deep neural networks have been explored, in particular for English or German. Yet for French, our target language, few works exist. Our goal is to create a unified framework to design and evaluate AADS models in French. For this, we consolidated the largest-to-date French narrative dataset annotated with direct speech per word; we adapted various baselines for sequence labelling or from AADS in other languages; and we designed and conducted an extensive evaluation focused on generalisation. Results show that the task still requires substantial effort and emphasise the characteristics of each baseline. Although this framework could be improved, it is a further step toward encouraging more research on the topic.

Title: Learn over Past, Evolve for Future: Forecasting Temporal Trends for Fake News Detection
Authors: Beizhe Hu, Qiang Sheng, Juan Cao, Yongchun Zhu, Danding Wang, Zhengjia Wang, Zhiwei Jin
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-26. DOI: 10.48550/arXiv.2306.14728
Abstract: Fake news detection has been a critical task for maintaining the health of the online news ecosystem. However, very few existing works consider the temporal shift issue caused by the rapidly evolving nature of news data in practice, resulting in significant performance degradation when training on past data and testing on future data. In this paper, we observe that the appearances of news events on the same topic may display discernible patterns over time, and posit that such patterns can assist in selecting training instances that help the model adapt better to future data. Specifically, we design an effective framework, FTT (Forecasting Temporal Trends), which forecasts the temporal distribution patterns of news data and then guides the detector to adapt quickly to the future distribution. Experiments on a real-world, temporally split dataset demonstrate the superiority of our proposed framework.

Title: Understanding In-Context Learning via Supportive Pretraining Data
Authors: Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang
Venue: Annual Meeting of the Association for Computational Linguistics. Published 2023-06-26. DOI: 10.48550/arXiv.2306.15091
Abstract: In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL by investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset contrastively with random subsets of pretraining data and discover: (1) The supportive pretraining data for ICL do not have a higher domain relevance to downstream tasks. (2) The supportive pretraining data have a higher mass of rarely occurring, long-tail tokens. (3) The supportive pretraining data are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL. Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data. Our insights have the potential to enhance the ICL ability of language models by actively guiding the construction of pretraining data in the future.