On taking advantage of similarities between parameters in lossless sequential coding
J. Åberg
Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096), 1999-03-29
DOI: 10.1109/DCC.1999.785670 (https://doi.org/10.1109/DCC.1999.785670)
Summary form only given. Sequential lossless data compression algorithms often transform the data stream into short subsequences that are modeled as memoryless. It is then desirable to use any information that each subsequence provides about the behaviour of other subsequences expected to have similar properties. We examine one such situation. Using arithmetic coding with a sequential estimator, we want to encode an $M$-ary memoryless source with unknown parameters $\theta$, from which a sequence $x^n$ has already been encoded. In addition, both the encoder and the decoder have observed a sequence $y^n$ generated independently by another source with unknown parameters $\tilde{\theta}$, which are known to be "similar" to $\theta$ in terms of a pseudodistance $\delta(\theta, \tilde{\theta})$ that is approximately equal to the relative entropy. Both sides also know a number $d$ such that $\delta(\theta, \tilde{\theta}) \le d$. For a stand-alone memoryless source, the worst-case average redundancy of the $(n+1)$-th encoding is lower bounded by $0.5(M-1)/n + O(1/n^2)$, and the Dirichlet estimator is close to optimal for this case. We show that this bound also holds with the side information described above, meaning that side information can improve, at best, the $O(1/n^2)$ term. We define a frequency weighted estimator for this purpose. Applying the frequency weighted estimator to the PPM algorithm (Bell et al., 1989) by weighting order-4 statistics into an order-5 model, with $d$ estimated during encoding, yields improvements consistent with the bounds above: in practice, performance improves by about 0.5 bits per active state of the model, a gain of approximately 20000 bits on the Calgary Corpus.
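As an illustration of the sequential estimation setting described in the abstract, the following minimal Python sketch computes the ideal arithmetic-coding length under an add-alpha (Dirichlet) estimator, assuming the common choice alpha = 1/2. The abstract does not define the frequency weighted estimator, so `side_info_prob` is only a hypothetical stand-in that mixes in counts from the side sequence with a fixed weight `w`; it is not the estimator from the paper. All function and parameter names are assumptions made for this example.

```python
import math
from collections import Counter


def dirichlet_prob(symbol, counts, n, M, alpha=0.5):
    """Add-alpha (Dirichlet) estimate of P(symbol | n past symbols), M-ary alphabet."""
    return (counts[symbol] + alpha) / (n + alpha * M)


def sequential_code_length(seq, M, alpha=0.5):
    """Ideal code length in bits when seq is coded one symbol at a time."""
    counts = Counter()
    bits = 0.0
    for n, s in enumerate(seq):
        # Code the symbol with the current estimate, then update the counts.
        bits -= math.log2(dirichlet_prob(s, counts, n, M, alpha))
        counts[s] += 1
    return bits


def side_info_prob(symbol, x_counts, n, y_counts, m, M, w, alpha=0.5):
    """Hypothetical illustration only (NOT the paper's estimator):
    add the side-sequence counts, scaled by a weight w in [0, 1]."""
    num = x_counts[symbol] + w * y_counts[symbol] + alpha
    den = n + w * m + alpha * M
    return num / den
```

With `w = 0` the side-information variant reduces to the plain Dirichlet estimate, and for any `w` the probabilities over the `M` symbols sum to one, so it remains a valid coding distribution. How the weight should depend on the bound $d$ is left open here, since the abstract gives no formula for it.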