A Review on Building Bilingual Comparable Corpora for Resource-limited Languages

Nurul Amelina Nasharuddin, M. T. Abdullah, A. Azman, R. A. Kadir
Published in: 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP), 2018-09-13
DOI: 10.1109/INFRKM.2018.8464798 (https://doi.org/10.1109/INFRKM.2018.8464798)
Citation count: 1

Abstract

Information retrieval for certain Asian languages is hampered by limited knowledge resources such as bilingual and multilingual dictionaries and corpora. Thus, there is a need to create multilingual resources for these languages. One approach is to align documents automatically by estimating the likelihood that two documents, not necessarily in the same language, are related to each other. Multilingual corpora can then be built automatically from the aligned documents. Numerous approaches to document alignment have been developed to date. In this paper, we give an overview of progress in bilingual and multilingual document alignment over the last five years. We also discuss current progress in developing bilingual comparable corpora, especially for Malay, one of the resource-limited languages in Asia.
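To make the idea of scoring document relatedness across languages concrete, here is a minimal Python sketch. It is not a method taken from the paper: it assumes one simple alignment strategy (word-by-word dictionary translation followed by cosine similarity over word counts), and the dictionary `TOY_DICT`, the function names, the sample documents, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy Malay-to-English dictionary (illustrative only, not from the paper).
TOY_DICT = {
    "kucing": "cat",
    "makan": "eat",
    "ikan": "fish",
    "kerajaan": "government",
    "bajet": "budget",
    "baharu": "new",
}


def translate_tokens(text, dictionary):
    """Translate word by word, dropping words missing from the dictionary."""
    return [dictionary[t] for t in text.lower().split() if t in dictionary]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def align(source_docs, target_docs, dictionary, threshold=0.3):
    """Pair each source document with its most similar target document.

    Returns (source_index, target_index, score) tuples for pairs whose
    score reaches the (assumed) threshold.
    """
    target_vecs = [Counter(d.lower().split()) for d in target_docs]
    pairs = []
    for i, src in enumerate(source_docs):
        src_vec = Counter(translate_tokens(src, dictionary))
        scores = [cosine(src_vec, tv) for tv in target_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


malay_docs = ["kucing makan ikan", "kerajaan bajet baharu"]
english_docs = ["government new budget today", "the cat eat fish"]
pairs = align(malay_docs, english_docs, TOY_DICT)
for i, j, score in pairs:
    print(f"Malay doc {i} <-> English doc {j} (score {score:.3f})")
```

The aligned pairs returned by `align` would form the entries of a small comparable corpus. Approaches of the kind the review covers would typically add further evidence such as publication dates, named entities, or document structure, none of which is modelled in this sketch.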