{"title":"统计机器翻译的数据选择","authors":"Peng Liu, Yu Zhou, Chengqing Zong","doi":"10.1109/NLPKE.2010.5587827","DOIUrl":null,"url":null,"abstract":"The bilingual language corpus has a great effect on the performance of a statistical machine translation system. More data will lead to better performance. However, more data also increase the computational load. In this paper, we propose methods to estimate the sentence weight and select more informative sentences from the training corpus and the development corpus based on the sentence weight. The translation system is built and tuned on the compact corpus. The experimental results show that we can obtain a competitive performance with much less data.","PeriodicalId":259975,"journal":{"name":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","volume":"392 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Data selection for statistical machine translation\",\"authors\":\"Peng Liu, Yu Zhou, Chengqing Zong\",\"doi\":\"10.1109/NLPKE.2010.5587827\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The bilingual language corpus has a great effect on the performance of a statistical machine translation system. More data will lead to better performance. However, more data also increase the computational load. In this paper, we propose methods to estimate the sentence weight and select more informative sentences from the training corpus and the development corpus based on the sentence weight. The translation system is built and tuned on the compact corpus. The experimental results show that we can obtain a competitive performance with much less data.\",\"PeriodicalId\":259975,\"journal\":{\"name\":\"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)\",\"volume\":\"392 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-09-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/NLPKE.2010.5587827\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering(NLPKE-2010)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NLPKE.2010.5587827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Data selection for statistical machine translation
The bilingual language corpus has a great effect on the performance of a statistical machine translation system. More data will lead to better performance. However, more data also increase the computational load. In this paper, we propose methods to estimate the sentence weight and select more informative sentences from the training corpus and the development corpus based on the sentence weight. The translation system is built and tuned on the compact corpus. The experimental results show that we can obtain a competitive performance with much less data.