{"title":"On Synthetic Data for Back Translation","authors":"Jiahao Xu, Yubin Ruan, Wei Bi, Guoping Huang, Shuming Shi, Lihui Chen, Lemao Liu","doi":"10.18653/v1/2022.naacl-main.32","DOIUrl":"https://doi.org/10.18653/v1/2022.naacl-main.32","url":null,"abstract":"Back translation (BT) is one of the most significant technologies in NMT research fields. Existing attempts on BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model but seldom work studies the role of synthetic data in the performance of BT. This motivates us to ask a fundamental question: what kind of synthetic data contributes to BT performance?Through both theoretical and empirical studies, we identify two key factors on synthetic data controlling the back-translation NMT performance, which are quality and importance. Furthermore, based on our findings, we propose a simple yet effective method to generate synthetic data to better trade off both factors so as to yield the better performance for BT. We run extensive experiments on WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam and sampling based methods for data generation), which proves the effectiveness of our proposed methods.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129227045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Clues from Incomplete Utterance: A Query-enhanced Network for Incomplete Utterance Rewriting","authors":"Shuzheng Si, Shuang Zeng, Baobao Chang","doi":"10.18653/v1/2022.naacl-main.356","DOIUrl":"https://doi.org/10.18653/v1/2022.naacl-main.356","url":null,"abstract":"Incomplete utterance rewriting has recently raised wide attention. However, previous works do not consider the semantic structural information between incomplete utterance and rewritten utterance or model the semantic structure implicitly and insufficiently. To address this problem, we propose a QUEry-Enhanced Network(QUEEN) to solve this problem. Firstly, our proposed query template explicitly brings guided semantic structural knowledge between the incomplete utterance and the rewritten utterance making model perceive where to refer back to or recover omitted tokens. Then, we adopt a fast and effective edit operation scoring network to model the relation between two tokens. Benefiting from extra information and the well-designed network, QUEEN achieves state-of-the-art performance on several public datasets.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123971754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Paraphrases to Study Properties of Contextual Embeddings","authors":"Laura Burdick, Jonathan K. Kummerfeld, Rada Mihalcea","doi":"10.48550/arXiv.2207.05553","DOIUrl":"https://doi.org/10.48550/arXiv.2207.05553","url":null,"abstract":"We use paraphrases as a unique source of data to analyze contextualized embeddings, with a particular focus on BERT. Because paraphrases naturally encode consistent word and phrase semantics, they provide a unique lens for investigating properties of embeddings. Using the Paraphrase Database’s alignments, we study words within paraphrases as well as phrase representations. We find that contextual embeddings effectively handle polysemous words, but give synonyms surprisingly different representations in many cases. We confirm previous findings that BERT is sensitive to word order, but find slightly different patterns than prior work in terms of the level of contextualization across BERT’s layers.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114559308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GMN: Generative Multi-modal Network for Practical Document Information Extraction","authors":"H. Cao, Jiefeng Ma, Antai Guo, Yiqing Hu, Hao Liu, Deqiang Jiang, Yinsong Liu, Bo Ren","doi":"10.48550/arXiv.2207.04713","DOIUrl":"https://doi.org/10.48550/arXiv.2207.04713","url":null,"abstract":"Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world. Although recent literature has already achieved competitive results, these approaches usually fail when dealing with complex documents with noisy OCR results or mutative layouts. This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to address these problems, which is a robust multi-modal generation method without predefined label categories. With the carefully designed spatial encoder and modal-aware mask module, GMN can deal with complex documents that are hard to serialized into sequential order. Moreover, GMN tolerates errors in OCR results and requires no character-level annotation, which is vital because fine-grained annotation of numerous documents is laborious and even requires annotators with specialized domain knowledge. Extensive experiments show that GMN achieves new state-of-the-art performance on several public DIE datasets and surpasses other methods by a large margin, especially in realistic scenes.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130216053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain Confused Contrastive Learning for Unsupervised Domain Adaptation","authors":"Quanyu Long, Tianze Luo, Wenya Wang, Sinno Jialin Pan","doi":"10.48550/arXiv.2207.04564","DOIUrl":"https://doi.org/10.48550/arXiv.2207.04564","url":null,"abstract":"In this work, we study Unsupervised Domain Adaptation (UDA) in a challenging self-supervised approach. One of the difficulties is how to learn task discrimination in the absence of target labels. Unlike previous literature which directly aligns cross-domain distributions or leverages reverse gradient, we propose Domain Confused Contrastive Learning (DCCL), which can bridge the source and target domains via domain puzzles, and retain discriminative representations after adaptation. Technically, DCCL searches for a most domain-challenging direction and exquisitely crafts domain confused augmentations as positive pairs, then it contrastively encourages the model to pull representations towards the other domain, thus learning more stable and effective domain invariances. We also investigate whether contrastive learning necessarily helps with UDA when performing other data augmentations. Extensive experiments demonstrate that DCCL significantly outperforms baselines, further ablation study and analysis also show the effectiveness and availability of DCCL.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116148577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study of Syntactic Multi-Modality in Non-Autoregressive Machine Translation","authors":"Kexun Zhang, Rui Wang, Xu Tan, Junliang Guo, Yi Ren, Tao Qin, Tie-Yan Liu","doi":"10.48550/arXiv.2207.04206","DOIUrl":"https://doi.org/10.48550/arXiv.2207.04206","url":null,"abstract":"It is difficult for non-autoregressive translation (NAT) models to capture the multi-modal distribution of target translations due to their conditional independence assumption, which is known as the “multi-modality problem”, including the lexical multi-modality and the syntactic multi-modality. While the first one has been well studied, the syntactic multi-modality brings severe challenges to the standard cross entropy (XE) loss in NAT and is understudied. In this paper, we conduct a systematic study on the syntactic multi-modality problem. Specifically, we decompose it into short- and long-range syntactic multi-modalities and evaluate several recent NAT algorithms with advanced loss functions on both carefully designed synthesized datasets and real datasets. We find that the Connectionist Temporal Classification (CTC) loss and the Order-Agnostic Cross Entropy (OAXE) loss can better handle short- and long-range syntactic multi-modalities respectively. Furthermore, we take the best of both and design a new loss function to better handle the complicated syntactic multi-modality in real-world datasets. To facilitate practical usage, we provide a guide to using different loss functions for different kinds of syntactic multi-modality.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121954709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination","authors":"Hyounghun Kim, Abhaysinh Zala, Mohit Bansal","doi":"10.48550/arXiv.2207.03961","DOIUrl":"https://doi.org/10.48550/arXiv.2207.03961","url":null,"abstract":"As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm) which is designed to evaluate the ability of AI systems to reason about scene change imagination. To be specific, in this multimodal task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, environment change, etc.) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging, counterfactual multimodal task.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130300682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering","authors":"Zhengbao Jiang, Yi Mao, Pengcheng He, Graham Neubig, Weizhu Chen","doi":"10.48550/arXiv.2207.03637","DOIUrl":"https://doi.org/10.48550/arXiv.2207.03637","url":null,"abstract":"The information in tables can be an important complement to text, making table-based question answering (QA) systems of great value. The intrinsic complexity of handling tables often adds an extra burden to both model design and data annotation. In this paper, we aim to develop a simple table-based QA model with minimal annotation effort. Motivated by the fact that table-based QA requires both alignment between questions and tables and the ability to perform complicated reasoning over multiple table elements, we propose an omnivorous pretraining approach that consumes both natural and synthetic data to endow models with these respective abilities. Specifically, given freely available tables, we leverage retrieval to pair them with relevant natural sentences for mask-based pretraining, and synthesize NL questions by converting SQL sampled from tables for pretraining with a QA loss. We perform extensive experiments in both few-shot and full settings, and the results clearly demonstrate the superiority of our model OmniTab, with the best multitasking approach achieving an absolute gain of 16.2% and 2.7% in 128-shot and full settings respectively, also establishing a new state-of-the-art on WikiTableQuestions. Detailed ablations and analyses reveal different characteristics of natural and synthetic data, shedding light on future directions in omnivorous pretraining.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"284 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122962578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compositional Generalization in Grounded Language Learning via Induced Model Sparsity","authors":"Sam Spilsbury, A. Ilin","doi":"10.48550/arXiv.2207.02518","DOIUrl":"https://doi.org/10.48550/arXiv.2207.02518","url":null,"abstract":"We provide a study of how induced model sparsity can help achieve compositional generalization and better sample efficiency in grounded language learning problems. We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations. We show that standard neural architectures do not always yield compositional generalization. To address this, we design an agent that contains a goal identification module that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal. The output of the goal identification module is the input to a value iteration network planner. Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations. We examine the internal representations of our agent and find the correct correspondences between words in its dictionary and attributes in the environment.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131196995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia","authors":"Samee Ibraheem, G. Zhou, John DeNero","doi":"10.48550/arXiv.2207.02253","DOIUrl":"https://doi.org/10.48550/arXiv.2207.02253","url":null,"abstract":"While neural networks demonstrate a remarkable ability to model linguistic content, capturing contextual information related to a speaker’s conversational role is an open area of research. In this work, we analyze the effect of speaker role on language use through the game of Mafia, in which participants are assigned either an honest or a deceptive role. In addition to building a framework to collect a dataset of Mafia game records, we demonstrate that there are differences in the language produced by players with different roles. We confirm that classification models are able to rank deceptive players as more suspicious than honest ones based only on their use of language. Furthermore, we show that training models on two auxiliary tasks outperforms a standard BERT-based text classification approach. We also present methods for using our trained models to identify features that distinguish between player roles, which could be used to assist players during the Mafia game.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127743239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}