西班牙低资源语言的免费/开源机器翻译(特邀演讲)

M. Forcada
{"title":"西班牙低资源语言的免费/开源机器翻译(特邀演讲)","authors":"M. Forcada","doi":"10.4230/OASIcs.LDK.2021.3","DOIUrl":null,"url":null,"abstract":"While machine translation has historically been rule-based, that is, based on dictionaries and rules written by experts, most present-day machine translation is corpus-based. In the last few years, statistical machine translation, the dominant corpus-based approach, has been displaced by neural machine translation in most applications, in view of the better results reported, particularly for languages with very different syntax. But both statistical and neural machine translation need to be trained on large amounts of parallel data, that is, sentences in one language carefully paired with their translations in their other language, and this is a resource that may not be available for some low-resource languages. While some of the languages of Spain may be considered to be reasonably endowed with parallel corpora connecting them to Spanish or even to English - Basque, Catalan, Galician -, and are well-served with machine translation systems, there are many other languages which cannot afford them such as Aranese Occitan, Aragonese, or Asturian/Leonese. Fortunately, languages in this last group belong to the Romance language family, as Spanish does, and this makes translation from and into Spanish under a rule-based paradigm the only feasible approach. After describing briefly the main machine translation paradigms, I will describe the Apertium free/open-source rule-based machine translation platform, which has been used to build machine translation systems for these low-resource languages of Spain, indeed, sometimes the only ones available. The free/open-source setting has made linguistic data for these languages available for anyone in their linguistic communities to build other linguistic technologies for these low-resourced languages. For example, the Apertium family of bilingual and monolingual data has been converted into RDF and they have been made accessible on the Web as linked data.","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Free/Open-Source Machine Translation for the Low-Resource Languages of Spain (Invited Talk)\",\"authors\":\"M. Forcada\",\"doi\":\"10.4230/OASIcs.LDK.2021.3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While machine translation has historically been rule-based, that is, based on dictionaries and rules written by experts, most present-day machine translation is corpus-based. In the last few years, statistical machine translation, the dominant corpus-based approach, has been displaced by neural machine translation in most applications, in view of the better results reported, particularly for languages with very different syntax. But both statistical and neural machine translation need to be trained on large amounts of parallel data, that is, sentences in one language carefully paired with their translations in their other language, and this is a resource that may not be available for some low-resource languages. While some of the languages of Spain may be considered to be reasonably endowed with parallel corpora connecting them to Spanish or even to English - Basque, Catalan, Galician -, and are well-served with machine translation systems, there are many other languages which cannot afford them such as Aranese Occitan, Aragonese, or Asturian/Leonese. Fortunately, languages in this last group belong to the Romance language family, as Spanish does, and this makes translation from and into Spanish under a rule-based paradigm the only feasible approach. After describing briefly the main machine translation paradigms, I will describe the Apertium free/open-source rule-based machine translation platform, which has been used to build machine translation systems for these low-resource languages of Spain, indeed, sometimes the only ones available. The free/open-source setting has made linguistic data for these languages available for anyone in their linguistic communities to build other linguistic technologies for these low-resourced languages. For example, the Apertium family of bilingual and monolingual data has been converted into RDF and they have been made accessible on the Web as linked data.\",\"PeriodicalId\":377119,\"journal\":{\"name\":\"International Conference on Language, Data, and Knowledge\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Language, Data, and Knowledge\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4230/OASIcs.LDK.2021.3\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Language, Data, and Knowledge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4230/OASIcs.LDK.2021.3","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

虽然机器翻译历来是基于规则的,即基于专家编写的字典和规则,但大多数当今的机器翻译是基于语料库的。在过去的几年里,统计机器翻译,主要的基于语料库的方法,在大多数应用中已经被神经机器翻译所取代,考虑到更好的结果,特别是对于语法非常不同的语言。但是统计机器翻译和神经机器翻译都需要在大量的并行数据上进行训练,也就是说,一种语言中的句子与他们在另一种语言中的翻译仔细配对,这是一些低资源语言可能无法获得的资源。虽然西班牙的一些语言可能被认为是合理地赋予了与西班牙语甚至英语相连接的平行语料库-巴斯克语,加泰罗尼亚语,加利西亚语-并且有很好的机器翻译系统,但还有许多其他语言无法提供它们,如阿兰语,欧西坦语,阿拉贡语或阿斯图里亚语/莱昂语。幸运的是,最后一类语言和西班牙语一样都属于罗曼语系,这使得在基于规则的范式下进行西班牙语的翻译成为唯一可行的方法。在简要介绍了主要的机器翻译范例之后,我将介绍Apertium免费/开源的基于规则的机器翻译平台,该平台已被用于为西班牙的这些低资源语言构建机器翻译系统,实际上,有时是唯一可用的。免费/开源的设置使得这些语言的语言数据可供语言社区中的任何人使用,从而为这些资源匮乏的语言构建其他语言技术。例如,Apertium系列的双语和单语数据已经被转换成RDF,并且可以作为链接数据在Web上访问。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Free/Open-Source Machine Translation for the Low-Resource Languages of Spain (Invited Talk)
While machine translation has historically been rule-based, that is, based on dictionaries and rules written by experts, most present-day machine translation is corpus-based. In the last few years, statistical machine translation, the dominant corpus-based approach, has been displaced by neural machine translation in most applications, in view of the better results reported, particularly for languages with very different syntax. But both statistical and neural machine translation need to be trained on large amounts of parallel data, that is, sentences in one language carefully paired with their translations in their other language, and this is a resource that may not be available for some low-resource languages. While some of the languages of Spain may be considered to be reasonably endowed with parallel corpora connecting them to Spanish or even to English - Basque, Catalan, Galician -, and are well-served with machine translation systems, there are many other languages which cannot afford them such as Aranese Occitan, Aragonese, or Asturian/Leonese. Fortunately, languages in this last group belong to the Romance language family, as Spanish does, and this makes translation from and into Spanish under a rule-based paradigm the only feasible approach. After describing briefly the main machine translation paradigms, I will describe the Apertium free/open-source rule-based machine translation platform, which has been used to build machine translation systems for these low-resource languages of Spain, indeed, sometimes the only ones available. The free/open-source setting has made linguistic data for these languages available for anyone in their linguistic communities to build other linguistic technologies for these low-resourced languages. For example, the Apertium family of bilingual and monolingual data has been converted into RDF and they have been made accessible on the Web as linked data.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信