Miguel A. Martínez-Prieto, J. Adiego, F. Sánchez-Martínez, P. Fuente, Rafael C. Carrasco
{"title":"利用字对齐增强文本压缩","authors":"Miguel A. Martínez-Prieto, J. Adiego, F. Sánchez-Martínez, P. Fuente, Rafael C. Carrasco","doi":"10.1109/DCC.2009.22","DOIUrl":null,"url":null,"abstract":"This paper describes a novel approach for bilingual parallel corpora (bitexts) compression. The approach takes advantage of the fact that the two texts that form a bitext are mutual translations. First, the two texts are aligned both at the sentence and the word level. Then, word alignments are used to define biwords, that is, pairs of two words, each one from a different text, that are mutual translations. Finally, a biword-based PPM compressor is applied. The results obtained compressing the two texts of the bitext together improve the compression ratios achieved when both texts are independently compressed through a word-based PPM compressor; thus, saving storage and transmission costs.","PeriodicalId":377880,"journal":{"name":"2009 Data Compression Conference","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"On the Use of Word Alignments to Enhance Bitext Compression\",\"authors\":\"Miguel A. Martínez-Prieto, J. Adiego, F. Sánchez-Martínez, P. Fuente, Rafael C. Carrasco\",\"doi\":\"10.1109/DCC.2009.22\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes a novel approach for bilingual parallel corpora (bitexts) compression. The approach takes advantage of the fact that the two texts that form a bitext are mutual translations. First, the two texts are aligned both at the sentence and the word level. Then, word alignments are used to define biwords, that is, pairs of two words, each one from a different text, that are mutual translations. Finally, a biword-based PPM compressor is applied. The results obtained compressing the two texts of the bitext together improve the compression ratios achieved when both texts are independently compressed through a word-based PPM compressor; thus, saving storage and transmission costs.\",\"PeriodicalId\":377880,\"journal\":{\"name\":\"2009 Data Compression Conference\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-03-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 Data Compression Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.2009.22\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2009.22","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
On the Use of Word Alignments to Enhance Bitext Compression
This paper describes a novel approach for bilingual parallel corpora (bitexts) compression. The approach takes advantage of the fact that the two texts that form a bitext are mutual translations. First, the two texts are aligned both at the sentence and the word level. Then, word alignments are used to define biwords, that is, pairs of two words, each one from a different text, that are mutual translations. Finally, a biword-based PPM compressor is applied. The results obtained compressing the two texts of the bitext together improve the compression ratios achieved when both texts are independently compressed through a word-based PPM compressor; thus, saving storage and transmission costs.