{"title":"从可比语料库中提取平行短语","authors":"Jiexin Zhang, Hailong Cao, T. Zhao","doi":"10.1109/IALP.2014.6973501","DOIUrl":null,"url":null,"abstract":"The state-of-the-art statistical machine translation models are trained with the parallel corpora. However, the traditional SMT loses its power when it comes to language pairs with few bilingual resources. This paper proposes a novel method that treats the phrase extraction as a classification task. We first automatically generate the training and testing phrase pairs for the classifier. Then, we train a SVM classifier which can determine the phrase pairs are either parallel or non-parallel. The proposed approach is evaluated on the translation task of Chinese-English. Experimental results show that the precision of the classifier on test sets is above 70% and the accuracy is above 98% The quality of the extracted data is also evaluated by measuring the impact on the performance of a state-of-the-art SMT system, which is built with a small parallel corpus. It shows better results over the baseline system.","PeriodicalId":117334,"journal":{"name":"2014 International Conference on Asian Language Processing (IALP)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Extracting parallel phrases from comparable corpora\",\"authors\":\"Jiexin Zhang, Hailong Cao, T. Zhao\",\"doi\":\"10.1109/IALP.2014.6973501\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The state-of-the-art statistical machine translation models are trained with the parallel corpora. However, the traditional SMT loses its power when it comes to language pairs with few bilingual resources. This paper proposes a novel method that treats the phrase extraction as a classification task. We first automatically generate the training and testing phrase pairs for the classifier. Then, we train a SVM classifier which can determine the phrase pairs are either parallel or non-parallel. The proposed approach is evaluated on the translation task of Chinese-English. Experimental results show that the precision of the classifier on test sets is above 70% and the accuracy is above 98% The quality of the extracted data is also evaluated by measuring the impact on the performance of a state-of-the-art SMT system, which is built with a small parallel corpus. It shows better results over the baseline system.\",\"PeriodicalId\":117334,\"journal\":{\"name\":\"2014 International Conference on Asian Language Processing (IALP)\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 International Conference on Asian Language Processing (IALP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IALP.2014.6973501\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Asian Language Processing (IALP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IALP.2014.6973501","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Extracting parallel phrases from comparable corpora
The state-of-the-art statistical machine translation models are trained with the parallel corpora. However, the traditional SMT loses its power when it comes to language pairs with few bilingual resources. This paper proposes a novel method that treats the phrase extraction as a classification task. We first automatically generate the training and testing phrase pairs for the classifier. Then, we train a SVM classifier which can determine the phrase pairs are either parallel or non-parallel. The proposed approach is evaluated on the translation task of Chinese-English. Experimental results show that the precision of the classifier on test sets is above 70% and the accuracy is above 98% The quality of the extracted data is also evaluated by measuring the impact on the performance of a state-of-the-art SMT system, which is built with a small parallel corpus. It shows better results over the baseline system.