Compression of Unicode files
P. Fenwick, S. Brierley
Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225), 1998-03-30
DOI: 10.1109/DCC.1998.672274 (https://doi.org/10.1109/DCC.1998.672274)
Citations: 11
Abstract
Summary form only given. The increasing importance of Unicode for text files, for example with Java and in some modern operating systems, implies a possible increase in data storage space and data transmission time, with a corresponding need for data compression. However, data compressors designed for traditional 8-bit byte data are not necessarily well matched to the peculiarities of Unicode data. Different "standard" text compression methods behave in different ways on Unicode data, compared with their known performance on ASCII or other 8-bit data. A small corpus of Unicode files has been compressed with several widely available text compressors of various types, confirming that Unicode files have different compression characteristics from those known for 8-bit data. Tests with a simple LZ-77 compressor designed to operate in both 8-bit and 16-bit modes indicate that it may be useful to design compressors specifically for Unicode data.
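To illustrate what operating an LZ-77 compressor in 8-bit versus 16-bit mode could mean, here is a minimal sketch. It is not the authors' compressor: the function names (`to_symbols`, `lz77_tokens`, `lz77_decode`), the window size, the match-length limits, and the choice of big-endian 16-bit code units are all assumptions made for illustration. The only idea taken from the abstract is that the same LZ-77 parse can treat its input either as bytes or as 16-bit units.

```python
def to_symbols(data, width):
    """Split raw bytes into fixed-width symbols.

    width=1 gives 8-bit mode (one symbol per byte); width=2 gives
    16-bit mode (one symbol per big-endian 16-bit unit, an assumption).
    """
    if len(data) % width:
        raise ValueError("data length must be a multiple of the symbol width")
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


def lz77_tokens(symbols, window=4096, min_match=3, max_match=255):
    """Greedy LZ-77 parse over a symbol list (brute-force search, for clarity).

    Returns tokens of the form ('lit', symbol) or ('match', distance, length).
    The window and match limits are illustrative defaults.
    """
    tokens = []
    i, n = 0, len(symbols)
    while i < n:
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy).
            while (length < max_match and i + length < n
                   and symbols[j + length] == symbols[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("match", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", symbols[i]))
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the symbol list from the tokens produced by lz77_tokens."""
    out = []
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, dist, length = token
            for _ in range(length):
                out.append(out[-dist])  # symbol-by-symbol copy handles overlap
    return out


if __name__ == "__main__":
    raw = "abcabcabcabc".encode("utf-16-be")
    for width in (1, 2):
        syms = to_symbols(raw, width)
        toks = lz77_tokens(syms)
        assert lz77_decode(toks) == syms
        print(f"{8 * width}-bit mode: {len(syms)} symbols, {len(toks)} tokens")
```

In 8-bit mode every 16-bit character is seen as two separate symbols, so matches, distances, and literals are all counted in bytes. In 16-bit mode each character is a single symbol. This sketch only produces tokens and does not model how they would be coded into bits, so it says nothing about actual compression ratios.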