{"title":"On k-Mer-Based and Maximum Likelihood Estimation Algorithms for Trace Reconstruction","authors":"Kuan Cheng;Elena Grigorescu;Xin Li;Madhu Sudan;Minshen Zhu","doi":"10.1109/TIT.2025.3541375","DOIUrl":null,"url":null,"abstract":"The goal of the trace reconstruction problem is to recover a string <inline-formula> <tex-math>$\\mathbf {x}\\in \\{0,1\\}^{n}$ </tex-math></inline-formula> given many independent <italic>traces</i> of <bold>x</b>, where a trace is a subsequence obtained from deleting bits of <bold>x</b> independently with some given probability <inline-formula> <tex-math>$p\\in [0,1$ </tex-math></inline-formula>). A recent result of Chase (STOC 2021) shows how <bold>x</b> can be determined (in exponential time) from <inline-formula> <tex-math>$\\exp ({O}(n^{1/5})\\log ^{5} n)$ </tex-math></inline-formula> traces. This is the state-of-the-art result on the sample complexity of trace reconstruction. In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the bound of Chase, which is based on statistics of arbitrary length-<italic>k</i> subsequences, can also be obtained by considering the “<italic>k</i>-mer statistics”, i.e., statistics regarding occurrences of <italic>contiguous k</i>-bit strings (a.k.a, <italic>k-mers</i>) in the initial string <bold>x</b>, for <inline-formula> <tex-math>$k = 2n^{1/5}$ </tex-math></inline-formula>. Mazooji and Shomorony (arXiv.2210.10917) show that such statistics (called <italic>k</i>-mer density map) can be estimated within <inline-formula> <tex-math>$\\varepsilon $ </tex-math></inline-formula> accuracy from <inline-formula> <tex-math>$ {\\mathrm {poly}} (n, 2^{k}, 1/ {\\varepsilon })$ </tex-math></inline-formula> traces. We call an algorithm to be <italic>k-mer-based</i> if it reconstructs <bold>x</b> given estimates of the <italic>k</i>-mer density map. 
Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem we know of so far. Our first, and technically more involved, result shows that any <italic>k</i>-mer-based algorithm for trace reconstruction must use <inline-formula> <tex-math>$\\exp (\\Omega (n^{1/5} \\sqrt {\\log n}))$ </tex-math></inline-formula> traces, thus establishing the optimality of this number of traces. The analysis of this result also shows that the analysis technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. This result is shown by considering an appropriate class of real polynomials, that have been previously studied in the context of trace estimation (De, O’Donnell, Servedio. Annals of Probability 2019; Nazarov, Peres. STOC 2017), and proving that two of these polynomials are very close to each other on an arc in the complex plane. Our proof of the proximity of such polynomials uses new technical ingredients that allow us to focus on just a few coefficients of these polynomials. Our second, simple, result considers the performance of the Maximum Likelihood Estimator (MLE), which specifically picks the source string that has the maximum likelihood to generate the samples (traces). 
We show that the MLE algorithm uses a nearly optimal number of traces, i.e., up to a factor of <italic>n</i> in the number of samples needed for an optimal algorithm, and show that this factor of <italic>n</i> loss may be necessary under general “model estimation” settings.","PeriodicalId":13494,"journal":{"name":"IEEE Transactions on Information Theory","volume":"71 4","pages":"2591-2603"},"PeriodicalIF":2.2000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Theory","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10884609/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The goal of the trace reconstruction problem is to recover a string $\mathbf {x}\in \{0,1\}^{n}$ given many independent traces of x, where a trace is a subsequence obtained by deleting each bit of x independently with some given probability $p\in [0,1)$. A recent result of Chase (STOC 2021) shows how x can be determined (in exponential time) from $\exp ({O}(n^{1/5})\log ^{5} n)$ traces. This is the state-of-the-art result on the sample complexity of trace reconstruction. In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the bound of Chase, which is based on statistics of arbitrary length-k subsequences, can also be obtained by considering “k-mer statistics”, i.e., statistics of occurrences of contiguous k-bit strings (a.k.a. k-mers) in the initial string x, for $k = 2n^{1/5}$. Mazooji and Shomorony (arXiv:2210.10917) show that such statistics (called the k-mer density map) can be estimated within $\varepsilon$ accuracy from ${\mathrm {poly}}(n, 2^{k}, 1/\varepsilon)$ traces. We call an algorithm k-mer-based if it reconstructs x given estimates of the k-mer density map. Such algorithms essentially capture all the analyses of the worst-case and smoothed-complexity models of the trace reconstruction problem known so far. Our first, and technically more involved, result shows that any k-mer-based algorithm for trace reconstruction must use $\exp (\Omega (n^{1/5} \sqrt {\log n}))$ traces, establishing the optimality of this number of traces for k-mer-based algorithms. The analysis also shows that the technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed to improve the worst-case upper bound. This result is shown by considering an appropriate class of real polynomials that have been previously studied in the context of trace estimation (De, O’Donnell, Servedio, Annals of Probability 2019; Nazarov, Peres, STOC 2017), and proving that two of these polynomials are very close to each other on an arc in the complex plane. Our proof of the proximity of such polynomials uses new technical ingredients that allow us to focus on just a few coefficients of these polynomials. Our second, simpler, result considers the performance of the Maximum Likelihood Estimator (MLE), which outputs the source string most likely to have generated the observed traces. We show that the MLE uses a nearly optimal number of traces, i.e., within a factor of n of the number of samples needed by an optimal algorithm, and we show that this factor-of-n loss may be necessary in general “model estimation” settings.
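To make the setup concrete, here is an illustrative sketch (not from the paper) of the deletion channel that produces traces, and of the k-mer counts underlying the k-mer density map. The function names `sample_trace` and `kmer_counts` are my own labels for these standard notions.

```python
import random

def sample_trace(x, p, rng=random):
    """Pass the bit-string x through a deletion channel:
    each bit is deleted independently with probability p."""
    return "".join(b for b in x if rng.random() >= p)

def kmer_counts(x, k):
    """Count occurrences of each contiguous k-bit substring (k-mer) in x.
    These counts are the raw ingredients of the k-mer statistics."""
    counts = {}
    for i in range(len(x) - k + 1):
        kmer = x[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

For example, the string 0101 contains the 2-mer 01 twice and 10 once; a k-mer-based algorithm sees only (estimates of) such occurrence statistics, not the traces themselves.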
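The MLE discussed above can be sketched as follows (an illustrative brute-force version, not the paper's analysis): the probability that the deletion channel turns x into exactly a trace t factors into the number of ways to choose the surviving positions times $p^{\#\text{deleted}}(1-p)^{\#\text{kept}}$, and the embedding count is the classic distinct-subsequences dynamic program. The helper names `trace_likelihood` and `mle_bruteforce` are hypothetical.

```python
import math
from itertools import product

def trace_likelihood(x, t, p):
    """P[deletion channel with deletion probability p maps x to exactly t]
    = (#ways to choose surviving positions spelling t) * p^(n-m) * (1-p)^m."""
    n, m = len(x), len(t)
    if m > n:
        return 0.0
    # dp[j] = number of ways to embed t[:j] into the prefix of x read so far
    dp = [0] * (m + 1)
    dp[0] = 1
    for c in x:
        for j in range(m, 0, -1):  # reverse order so each bit of x is used once
            if c == t[j - 1]:
                dp[j] += dp[j - 1]
    return dp[m] * (p ** (n - m)) * ((1 - p) ** m)

def mle_bruteforce(traces, n, p):
    """Return the length-n string maximizing the likelihood of the traces.
    Exhaustive over all 2^n candidates, so only feasible for tiny n."""
    best, best_ll = None, float("-inf")
    for bits in product("01", repeat=n):
        x = "".join(bits)
        # Sum of log-likelihoods; clamp zero likelihoods to avoid log(0).
        ll = sum(math.log(trace_likelihood(x, t, p) or 1e-300) for t in traces)
        if ll > best_ll:
            best, best_ll = x, ll
    return best
```

For instance, with x = 00 and p = 1/2, the trace 0 arises in two ways (either bit survives), giving likelihood 2 · (1/2) · (1/2) = 1/2.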
About the Journal
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.