Permutation coding using divide-and-conquer strategy

2023 Data Compression Conference (DCC) · Pub Date: 2023-03-01 · DOI: 10.1109/DCC55655.2023.00046

Kun Tu, D. Puchala



Abstract

In computer science, permutations are used in tasks such as pattern searching, duplicate document detection, and data compression [1], [2]. Reducing redundancy to obtain a succinct representation of permutations is therefore of great importance. In this paper, we introduce a novel method for the succinct representation of permutations in which the average number of bits per element required to encode a permutation is $\log_{2}n-1.269$, which is close to the theoretical limit. Furthermore, precise expressions can be formulated for the average value and for the lower and upper bounds of the number of bits required by the method. Let $n$ be an integer power of 2. The proposed method can then be described as follows: (i) the method follows the ‘divide-and-conquer’ strategy, and at each stage the permutation under consideration is divided into two equal halves (bins); (ii) binary encoding is used to describe the assignment of elements to bins (‘0’ for the first bin, ‘1’ for the second); (iii) depending on the permutation, some bits can be omitted, which leads to a succinct representation. For instance, let $\pi_{2}=(0,2,1,3,7,6,4,5)$. We start with the identity permutation $\pi_{1}=(0,1,2,3,4,5,6,7)$. At the first stage, $\pi_{1}$ is split between two bins in relation to $\pi_{2}$ as $\pi_{1}=(0,1,2,3|4,5,6,7)$, which is encoded with the bits ‘0000’. At the second stage we repeat the same operations, leading to $\pi_{1}=(0,2|1,3|6,7|4,5)$, and formulate the coding bits ‘01011’. Finally, at the last stage, we get $\pi_{1}=\pi_{2}=(0|2|1|3|7|6|4|5)$, encoded as ‘0010’. The concatenated bits give the unique code $C=0000010110010$ for $\pi_{2}$. The lower and upper bounds for the per-element code length $\frac{1}{n}|C|$ are $G^{\min}(n)=\frac{1}{2}\log_{2}n$ and $G^{\max}(n)=\log_{2}n-\left(1-\frac{1}{n}\right)$. The average number of bits per element required to encode permutations can be calculated as:
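Steps (i) to (iii) and the worked example for $\pi_{2}$ can be illustrated with a minimal Python sketch. It is a reconstruction from the abstract only, not the authors' implementation, and it rests on two assumptions that the abstract does not state explicitly: elements keep their relative order when moved into a bin, and a bit is omitted exactly when one of the two bins is already full, so the remaining assignments are implied. The names `encode`, `decode`, and `code_length_range` are hypothetical.

```python
from itertools import permutations


def encode(perm):
    """Encode a permutation of 0..n-1 (n a power of two) as a bit string."""
    n = len(perm)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if sorted(perm) != list(range(n)):
        raise ValueError("input must be a permutation of 0..n-1")
    pos = {value: index for index, value in enumerate(perm)}  # target positions
    blocks = [list(range(n))]  # start from the identity permutation
    bits = []
    size = n
    while size > 1:
        half = size // 2
        next_blocks = []
        for k, block in enumerate(blocks):
            mid = k * size + half  # first target position of the second bin
            first, second = [], []
            for element in block:
                if len(first) == half:
                    second.append(element)  # assumed rule: implied '1', no bit
                elif len(second) == half:
                    first.append(element)  # assumed rule: implied '0', no bit
                else:
                    bit = "0" if pos[element] < mid else "1"
                    bits.append(bit)
                    (first if bit == "0" else second).append(element)
            next_blocks += [first, second]
        blocks = next_blocks
        size = half
    return "".join(bits)


def decode(code, n):
    """Rebuild the permutation of 0..n-1 from a bit string made by encode."""
    stream = iter(code)
    blocks = [list(range(n))]
    size = n
    while size > 1:
        half = size // 2
        next_blocks = []
        for block in blocks:
            first, second = [], []
            for element in block:
                if len(first) == half:
                    second.append(element)  # bit was omitted by the encoder
                elif len(second) == half:
                    first.append(element)  # bit was omitted by the encoder
                elif next(stream) == "0":
                    first.append(element)
                else:
                    second.append(element)
            next_blocks += [first, second]
        blocks = next_blocks
        size = half
    return [block[0] for block in blocks]


def code_length_range(n):
    """Return (shortest, longest) code length over all permutations of size n."""
    lengths = [len(encode(p)) for p in permutations(range(n))]
    return min(lengths), max(lengths)
```

Under these assumptions, `encode([0, 2, 1, 3, 7, 6, 4, 5])` returns `'0000010110010'`, the code $C$ from the example, and `decode('0000010110010', 8)` recovers $\pi_{2}$. For $n=8$, `code_length_range(8)` returns `(12, 17)`, that is, $1.5$ and $2.125$ bits per element, which agrees with $G^{\min}(8)=\frac{1}{2}\log_{2}8$ and $G^{\max}(8)=\log_{2}8-\left(1-\frac{1}{8}\right)$.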

