Parallel Text Identification Using Lexical and Corpus Features for the English-Maori Language Pair
Mahsa Mohaghegh, A. Sarrafzadeh
2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), December 2016
DOI: 10.1109/ICMLA.2016.0163
Citations: 2
Abstract
Comparable corpora contain significant quantities of useful data for Natural Language Processing tasks, especially in the area of Machine Translation. In particular, they are a primary source of parallel text fragments. This paper investigates how to effectively extract bilingual texts from comparable corpora using only a small parallel training corpus. We propose a new technique to filter out non-parallel Wikipedia article pairs based on their Zipfian frequency distributions. We also use an SVM classifier to identify parallel chunks of text within a candidate comparable document. In our approach, a parallel corpus is used to generate the features required for the training step. Evaluation of the generated bilingual texts shows promising results.
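The abstract does not specify how the Zipfian filter works, but the underlying idea can be sketched: if two articles are translations of each other, their word-frequency distributions should follow similarly shaped Zipfian curves, so a large mismatch in the fitted Zipf slope suggests a non-parallel pair. The sketch below is purely illustrative; the tokenisation, slope fit, and threshold are assumptions, not the paper's actual method.

```python
# Illustrative sketch of Zipfian-distribution comparison for filtering
# candidate article pairs (the threshold and fitting method are assumed).
import math
from collections import Counter

def zipf_slope(tokens):
    """Fit the slope of log(frequency) vs. log(rank) by least squares.

    For text obeying Zipf's law this slope is close to -1.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs) or 1.0  # guard: single rank
    return num / den

def likely_parallel(text_a, text_b, max_slope_gap=0.5):
    """Keep an article pair only if their Zipfian curves are similar.

    max_slope_gap is a hypothetical tuning parameter.
    """
    slope_a = zipf_slope(text_a.lower().split())
    slope_b = zipf_slope(text_b.lower().split())
    return abs(slope_a - slope_b) <= max_slope_gap
```

A pair passing this cheap distributional check would then be handed to the SVM stage, which classifies individual text chunks as parallel or not using features derived from the small parallel training corpus.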