Towards a calibrated corpus for compression testing
M. Titchener, P. Fenwick, M. C. Chen
Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096), 1999-03-29
DOI: 10.1109/DCC.1999.785711 (https://doi.org/10.1109/DCC.1999.785711)
Citations: 6
Abstract
Summary form only given. A mini-corpus of twelve 'calibrated' binary-data files has been produced for the systematic evaluation of compression algorithms. The files are generated within the framework of a deterministic theory of string complexity. Here the T-complexity of a string $x$ (measured in taugs) is defined as $C_T(x)=\sum_i \log_2(k_i+1)$, where the positive integers $k_i$ are the T-expansion parameters for the corresponding string production process. $C_T(x)$ is observed to be the logarithmic integral of the total information content $I_x$ of $x$ (measured in nats), i.e., $C_T(x)=\mathrm{li}(I_x)$. The average entropy is $\tilde{H}_x=I_x/|x|$, i.e., the total information content divided by the length of $x$. Thus $C_T(x)=\mathrm{li}(\tilde{H}_x\times|x|)$. Alternatively, the information rate along a string may be described by an entropy function $H_x(n)$, $0\le n\le|x|$. Assuming that $H_x(n)$ is continuously integrable along the length of $x$, then $I_x=\int_0^{|x|}H_x(n)\,\delta n$. Thus $C_T(x)=\mathrm{li}\left(\int_0^{|x|}H_x(n)\,\delta n\right)$. Solving for $H_x(n)$, that is, differentiating both sides and rearranging, we get $H_x(n)=\frac{\delta C_T(x|_n)}{\delta n}\times\log_e\left(\mathrm{li}^{-1}(C_T(x|_n))\right)$. Since $x$ is in fact discrete, and the T-complexity function is computed in terms of the discrete T-augmentation steps, we may accordingly re-express the equation in terms of the T-prefix increments: $\delta n\approx\Delta_i|x|=k_i|p_i|$; and, from the definition of $C_T(x)$, $\delta C_T(x)$ is replaced by $\Delta_i C_T(x)=\log_2(k_i+1)$. The average slope over the $i$-th T-prefix increment $p_i$ is then simply $\frac{\Delta_i C_T(x)}{\Delta_i|x|}=\frac{\log_2(k_i+1)}{k_i|p_i|}$. The entropy function is now replaced by a discrete approximation.
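The relations in the abstract can be sketched numerically. The Python below is a minimal illustration and not the authors' implementation: the function names (`t_complexity`, `li`, `li_inverse`, `discrete_entropy`) are hypothetical, the logarithmic integral is evaluated with its standard series for arguments greater than 1, and the final entropy estimate assumes that the discrete slope is simply substituted into the continuous formula using the cumulative T-complexity after each step, which the abstract does not spell out. The sample parameters are arbitrary and do not come from a real T-decomposition.

```python
import math

EULER_GAMMA = 0.5772156649015329


def t_complexity(ks):
    """C_T(x) = sum_i log2(k_i + 1), in taugs, from T-expansion parameters."""
    return sum(math.log2(k + 1) for k in ks)


def li(x):
    """Logarithmic integral for x > 1 via the series
    li(x) = gamma + ln(ln x) + sum_{k>=1} (ln x)^k / (k * k!).
    Suitable for moderate x only; very large x would overflow the terms."""
    if x <= 1.0:
        raise ValueError("this sketch only handles x > 1")
    u = math.log(x)
    total = EULER_GAMMA + math.log(u)
    term = 1.0  # holds u^k / k!
    for k in range(1, 500):
        term *= u / k
        total += term / k
        if term / k < 1e-16 * abs(total):
            break
    return total


def li_inverse(c, tol=1e-12):
    """Solve li(x) = c for x > 1 by bisection (li is increasing there).
    Intended for positive c, as produced by t_complexity."""
    lo, hi = 1.0 + 1e-12, 2.0
    while li(hi) < c:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if li(mid) < c:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)


def discrete_entropy(ks, prefix_lens):
    """Per-step entropy estimates (assumed form):
    H_i ~ [log2(k_i + 1) / (k_i * |p_i|)] * ln(li^{-1}(C_T after step i))."""
    estimates = []
    c_total = 0.0
    for k, p_len in zip(ks, prefix_lens):
        delta_c = math.log2(k + 1)   # Delta_i C_T(x)
        delta_n = k * p_len          # Delta_i |x| = k_i |p_i|
        c_total += delta_c
        slope = delta_c / delta_n    # average slope over the i-th increment
        estimates.append(slope * math.log(li_inverse(c_total)))
    return estimates


if __name__ == "__main__":
    ks = [1, 2, 1, 3]           # arbitrary illustrative parameters
    prefix_lens = [1, 2, 3, 5]  # arbitrary illustrative |p_i| values
    c = t_complexity(ks)
    print("C_T =", c, "taugs; I_x =", li_inverse(c), "nats")
    print("per-step entropy estimates:", discrete_entropy(ks, prefix_lens))
```

Bisection is used for the inverse because li is monotonically increasing for arguments above 1, so a bracketing search is simple and robust. A production implementation would likely use a library exponential-integral routine instead of the hand-written series.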