Lempel-Ziv (LZ77) Factorization in Sublinear Time
Dominik Kempa, Tomasz Kociumaka
arXiv - CS - Data Structures and Algorithms (DOI: arxiv-2409.12146), published 2024-09-18
Citation count: 0
Abstract
Lempel-Ziv (LZ77) factorization is a fundamental problem in string
processing: Greedily partition a given string $T$ from left to right into
blocks (called phrases) so that each phrase is either the leftmost occurrence
of a letter or the longest prefix of the unprocessed suffix that has another
occurrence earlier in $T$. Due to numerous applications, LZ77 factorization is
one of the most studied problems on strings. In the 47 years since its
inception, several algorithms were developed for different models of
computation, including parallel, GPU, external-memory, and quantum. Remarkably,
however, the complexity of the most basic variant is still not settled: All
existing algorithms in the RAM model run in $\Omega(n)$ time, which is a
$\Theta(\log n)$ factor away from the lower bound of $\Omega(n/\log n)$
(following from the necessity to read the input, which takes $\Theta(n/\log n)$
space for $T\in\{0,1\}^{n}$).

We present the first $o(n)$-time algorithm for LZ77 factorization, breaking
the linear-time barrier present for nearly 50 years. For $T\in\{0,1\}^{n}$, our
algorithm runs in $\mathcal{O}(n/\sqrt{\log n})=o(n)$ time and uses the optimal
$\mathcal{O}(n/\log n)$ working space. Our algorithm generalizes to
$\Sigma=[0..\sigma)$, where $\sigma=n^{\mathcal{O}(1)}$. The runtime and
working space then become $\mathcal{O}((n\log\sigma)/\sqrt{\log n})$ and
$\mathcal{O}(n/\log_{\sigma} n)$. To obtain our algorithm, we prove a more
general result: For any constant $\epsilon\in(0,1)$ and $T\in[0..\sigma)^{n}$,
in $\mathcal{O}((n\log\sigma)/\sqrt{\log n})$ time and using
$\mathcal{O}(n/\log_{\sigma}n)$ space, we can construct an
$\mathcal{O}(n/\log_{\sigma}n)$-size index that, given any $P=T[j..j+\ell)$
(represented as $(j,\ell)$), computes the leftmost occurrence of $P$ in $T$ in
$\mathcal{O}(\log^{\epsilon}n)$ time. In other words, we solve the
indexing/online variant of the LZ77 problem.
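
To make the definition and the reduction to leftmost-occurrence queries concrete, here is a minimal Python sketch. It is not the paper's algorithm: `leftmost_occurrence` is a hypothetical, naive stand-in for the index query (a plain substring scan rather than an $\mathcal{O}(\log^{\epsilon}n)$-time structure), and `lz77_factorize` is an illustrative name. The sketch takes quadratic time or worse and is meant only to show what the factorization computes.

```python
def leftmost_occurrence(T: str, j: int, length: int) -> int:
    """Naive stand-in for the index query (assumption, not the paper's index):
    return the leftmost starting position of P = T[j:j+length] in T."""
    return T.find(T[j:j + length])


def lz77_factorize(T: str) -> list[tuple[int, int]]:
    """Greedy left-to-right LZ77 factorization.

    Returns phrases as (start, length) pairs. A phrase is either the
    leftmost occurrence of a letter, or the longest prefix of the
    unprocessed suffix that also occurs starting at an earlier position
    (the earlier occurrence may overlap the phrase).
    """
    phrases = []
    n = len(T)
    i = 0
    while i < n:
        length = 0
        # Extend the phrase while the longer prefix still has an
        # occurrence starting strictly before position i.
        while i + length < n and leftmost_occurrence(T, i, length + 1) < i:
            length += 1
        if length == 0:
            length = 1  # first occurrence of this letter in T
        phrases.append((i, length))
        i += length
    return phrases


def phrases_as_strings(T: str) -> list[str]:
    """Convenience helper: the phrases as substrings of T."""
    return [T[s:s + l] for s, l in lz77_factorize(T)]
```

For example, `phrases_as_strings("abababa")` yields `["a", "b", "ababa"]`: the third phrase copies from position 0 and overlaps itself. The inner loop grows the phrase one letter at a time; because the property "has an occurrence starting before $i$" is monotone in the phrase length, that linear scan could be replaced by a search over the length, which is where a fast leftmost-occurrence index would pay off.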