Variable-to-Fixed-Length Encoding for Large Texts Using Re-Pair Algorithm with Shared Dictionaries

Kei Sekine, Hirohito Sasakawa, S. Yoshida, T. Kida
{"title":"Variable-to-Fixed-Length Encoding for Large Texts Using Re-Pair Algorithm with Shared Dictionaries","authors":"Kei Sekine, Hirohito Sasakawa, S. Yoshida, T. Kida","doi":"10.1109/DCC.2013.97","DOIUrl":null,"url":null,"abstract":"The Re-Pair algorithm proposed by Larsson and Moffat in 1999 is a simple grammar-based compression method that achieves an extremely high compression ratio. However, Re-Pair is an offline and very space consuming algorithm. Thus, to apply it to a very large text, we need to divide the text into smaller blocks. Consequently, if we share a part of the dictionary among all blocks, we expect that the compression speed and ratio of the algorithm will improve. In this paper, we implemented our method with exploiting variable-to-fixed-length codes, and empirically show how the compression speed and ratio of the method vary by adjusting three parameters: block size, dictionary size, and size of shared dictionary. Finally, we discuss the tendencies of compression speed and ratio with respect to the three parameters.","PeriodicalId":388717,"journal":{"name":"2013 Data Compression Conference","volume":"101 7","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2013.97","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The Re-Pair algorithm proposed by Larsson and Moffat in 1999 is a simple grammar-based compression method that achieves an extremely high compression ratio. However, Re-Pair is an offline and very space consuming algorithm. Thus, to apply it to a very large text, we need to divide the text into smaller blocks. Consequently, if we share a part of the dictionary among all blocks, we expect that the compression speed and ratio of the algorithm will improve. In this paper, we implemented our method with exploiting variable-to-fixed-length codes, and empirically show how the compression speed and ratio of the method vary by adjusting three parameters: block size, dictionary size, and size of shared dictionary. Finally, we discuss the tendencies of compression speed and ratio with respect to the three parameters.
基于共享字典的重对算法的大文本变到定长编码
1999年Larsson和Moffat提出的Re-Pair算法是一种简单的基于语法的压缩方法,实现了极高的压缩比。然而,Re-Pair是一种离线且非常占用空间的算法。因此,要将其应用于非常大的文本,我们需要将文本分成更小的块。因此,如果我们在所有块中共享字典的一部分,我们预计算法的压缩速度和压缩比将会提高。在本文中,我们利用可变长度到固定长度的代码实现了我们的方法,并通过调整三个参数:块大小、字典大小和共享字典大小,实证地展示了该方法的压缩速度和比率是如何变化的。最后,我们讨论了压缩速度和压缩比相对于这三个参数的趋势。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信