Phrase hierarchy inference and compression in bounded space

Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225), Pub Date: 1998-03-30, DOI: 10.1109/DCC.1998.672146

C. Nevill-Manning, I. Witten


Citations: 14

Abstract

Text compression by inferring a phrase hierarchy from the input shows promise both as a compression scheme and as a machine learning method that extracts a comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images. We focus on the memory requirements of the method. Since the algorithm operates in linear time, the space it consumes is at most linear in the input size. The space consumed does in fact grow linearly with the size of the inferred hierarchy, and this makes operation on very large files infeasible. We describe two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded-space algorithm. We begin with a review of the hierarchy inference procedure embodied in the SEQUITUR program. We then consider its performance on quite large files, and show how compression performance improves as the file size increases.
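
To make the idea of a phrase hierarchy concrete, the following is a minimal sketch in Python. It is not SEQUITUR and not the bounded-space method the abstract describes: it is a simplified, offline greedy scheme that repeatedly replaces the most frequent repeated pair of adjacent symbols with a new rule. The names `count_digrams`, `infer_hierarchy`, and `expand` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_start = {}
    for i, pair in enumerate(zip(seq, seq[1:])):
        # Skip an occurrence that overlaps the previous one (e.g. "aaa").
        if i - last_start.get(pair, -2) >= 2:
            counts[pair] += 1
            last_start[pair] = i
    return counts


def infer_hierarchy(text):
    """Greedy offline pair replacement (an assumption, not SEQUITUR itself).

    Terminals are one-character strings; rules are named by integers.
    Returns the compressed top-level sequence and the rule table.
    """
    seq = list(text)
    rules = {}  # rule id -> (left symbol, right symbol)
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so no further rule is worthwhile
        rule_id = len(rules)
        rules[rule_id] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(rule_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the text it stands for."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    text = "abcdbcabcd"
    seq, rules = infer_hierarchy(text)
    print(seq, rules)
    # The hierarchy is lossless: expanding the top level restores the input.
    assert "".join(expand(s, rules) for s in seq) == text
```

Note that the rule table in this sketch grows with the number of distinct repeated phrases, which is the kind of memory growth with hierarchy size that the abstract identifies as the obstacle for very large files.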
