A Crowd-Source Based Corpus on Bangla to English Translation

2018 21st International Conference of Computer and Information Technology (ICCIT). Pub Date: 2018-12-01. DOI: 10.1109/ICCITECHN.2018.8631947

Nafisa Nowshin, Zakia Sultana Ritu, Sabir Ismail


Citations: 2

Abstract

In this paper, we present a crowd-sourced Bangla-to-English parallel corpus and evaluate its accuracy. A complete and informative corpus is necessary for developing any language through automated processes. A Bangla-to-English parallel corpus is important for various multilingual applications and NLP research, yet a complete one is still scarce. We propose a large-scale crowd-sourcing method for constructing such a corpus. We chose crowd-sourcing to explore a new approach to corpus construction and to evaluate human behavior patterns in the process. The translations were collected from undergraduate university students to ensure strong language knowledge. A Bangla-to-English parallel corpus will also help in comparing the linguistic features of the two languages. We present an initial dataset prepared via crowd-sourcing, which will serve as a baseline for further analysis of crowd-sourced corpora. Our primary dataset consists of 517 Bangla sentences; for each Bangla sentence we collected about 4 English sentences on average, for 2143 English sentences in total. The data was collected from 62 users over a period of 2 months. Finally, we analyze the dataset and offer some concluding ideas for further research.
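The dataset figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical and do not come from the paper, and the per-contributor ratio assumes that all 62 users submitted translations, which the abstract does not state explicitly.

```python
# Figures as reported in the abstract.
NUM_BANGLA_SENTENCES = 517
NUM_ENGLISH_TRANSLATIONS = 2143
NUM_CONTRIBUTORS = 62


def corpus_ratios(sources: int, translations: int, contributors: int) -> dict:
    """Return simple averages derived from the corpus counts.

    Hypothetical helper for illustration; not part of the authors' tooling.
    """
    return {
        # Average number of English translations per Bangla sentence.
        "translations_per_sentence": translations / sources,
        # Average number of translations per user (assumes every user contributed).
        "translations_per_contributor": translations / contributors,
    }


ratios = corpus_ratios(
    NUM_BANGLA_SENTENCES, NUM_ENGLISH_TRANSLATIONS, NUM_CONTRIBUTORS
)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

The first ratio comes out slightly above 4, consistent with the abstract's "4 English sentences on an average"; the second suggests roughly 35 translations per user under the stated assumption.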
