Read, spot and translate
Lucia Specia, Josiah Wang, Sun Jae Lee, Alissa Ostapenko, Pranava Madhyastha
Machine Translation 35(2), 145–165 (2021). Epub 2021-04-04. DOI: 10.1007/s10590-021-09259-z
Citations: 1
Abstract
We propose multimodal machine translation (MMT) approaches that exploit the correspondences between words and image regions. In contrast to existing work, our referential grounding method considers objects as the visual unit for grounding, rather than whole images or abstract image regions, and performs visual grounding in the source language, rather than at the decoding stage via attention. We explore two referential grounding approaches: (i) implicit grounding, where the model jointly learns how to ground the source language in the visual representation and to translate; and (ii) explicit grounding, where grounding is performed independent of the translation model, and is subsequently used to guide machine translation. We performed experiments on the Multi30K dataset for three language pairs: English-German, English-French and English-Czech. Our referential grounding models outperform existing MMT models according to automatic and human evaluation metrics.
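The abstract only describes the two grounding strategies at a high level. As a rough illustration of the implicit variant (not the authors' actual architecture: the function name, dimensions, and the use of plain dot-product attention are all assumptions for this sketch), one can picture source-word vectors soft-attending over detected object features, with the attended visual feature fused into each word representation before translation:

```python
import numpy as np

def ground_words_to_objects(word_vecs, object_vecs):
    """Soft-align each source word to detected object regions via
    dot-product attention, then return grounded word vectors: each
    word vector concatenated with its attended object feature."""
    # Attention scores: one row per source word, one column per object.
    scores = word_vecs @ object_vecs.T                 # (n_words, n_objects)
    # Numerically stable softmax over the object axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each word's visual context is a weighted mix of object features.
    attended = weights @ object_vecs                   # (n_words, dim)
    # Fuse textual and visual features for a downstream translation encoder.
    return np.concatenate([word_vecs, attended], axis=1)

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))     # e.g. 5 source-word embeddings
objects = rng.normal(size=(3, 16))   # e.g. 3 detected object features
grounded = ground_words_to_objects(words, objects)
print(grounded.shape)                # (5, 32)
```

In the paper's implicit setting this alignment would be learned jointly with translation; in the explicit setting the word-to-object alignment would instead come from a separately trained grounding model.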
Journal description:
Machine Translation covers all branches of computational linguistics and language engineering, wherever they incorporate a multilingual aspect. It features papers that cover the theoretical, descriptive or computational aspects of any of the following topics:
• machine translation and machine-aided translation
• human translation theory and practice
• multilingual text composition and generation
• multilingual information retrieval
• multilingual natural language interfaces
• multilingual dialogue systems
• multilingual message understanding systems