Extraction of Indonesian and English parallel sentences from movie subtitles
Boon Hong Yeo, AiTi Aw, Xuancong Wang
2017 International Conference on Asian Language Processing (IALP), 2017-12-05
DOI: 10.1109/IALP.2017.8300602
A parallel corpus is an essential resource for developing a machine-learning-based statistical translation engine. The size and coverage of the parallel corpus available for training directly affect the translation accuracy of the engine. To make more training data available for developing a translation engine in the conversational domain, we propose a method for extracting parallel data from movie subtitles using dynamic time warping, cosine similarity, and a beam search algorithm. The proposed method is capable of extracting 30% of the sentences from a set of Indonesian-English movie subtitles as parallel sentences, with a precision of 98%.
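To make two of the named building blocks concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each subtitle line carries a start time in seconds, that dynamic time warping is run over those start times, and that cosine similarity is computed on bag-of-words token counts after the Indonesian tokens have been mapped into English by some bilingual lexicon (not shown). The function names `cosine_similarity` and `dtw_align` are hypothetical, and the beam search step is omitted.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words token lists (assumed same language)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # An empty token list has zero norm; treat it as having no similarity.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dtw_align(src_times, tgt_times):
    """Align two lists of subtitle start times (seconds) with dynamic time warping.

    Returns the warping path as a list of (src_index, tgt_index) pairs.
    The local cost is the absolute time difference, which is an assumption.
    """
    n, m = len(src_times), len(tgt_times)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning the first i and first j items.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(src_times[i - 1] - tgt_times[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path
```

In such a sketch, candidate pairs on the warping path could then be filtered by a cosine-similarity threshold so that only high-confidence pairs are kept. Both the threshold and this way of combining the two scores are assumptions, not details given in the abstract.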