On k-Mer-Based and Maximum Likelihood Estimation Algorithms for Trace Reconstruction

IF 2.2 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS
Kuan Cheng;Elena Grigorescu;Xin Li;Madhu Sudan;Minshen Zhu
DOI: 10.1109/TIT.2025.3541375
Journal: IEEE Transactions on Information Theory, vol. 71, no. 4, pp. 2591-2603
Published: 2025-02-13
URL: https://ieeexplore.ieee.org/document/10884609/
Citations: 0

Abstract

The goal of the trace reconstruction problem is to recover a string $\mathbf{x} \in \{0,1\}^{n}$ given many independent traces of $\mathbf{x}$, where a trace is a subsequence obtained by deleting each bit of $\mathbf{x}$ independently with some given probability $p \in [0,1)$. A recent result of Chase (STOC 2021) shows how $\mathbf{x}$ can be determined (in exponential time) from $\exp(O(n^{1/5})\log^{5} n)$ traces. This is the state-of-the-art result on the sample complexity of trace reconstruction.

In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the bound of Chase, which is based on statistics of arbitrary length-$k$ subsequences, can also be obtained by considering "$k$-mer statistics", i.e., statistics regarding occurrences of contiguous $k$-bit strings (a.k.a. $k$-mers) in the initial string $\mathbf{x}$, for $k = 2n^{1/5}$. Mazooji and Shomorony (arXiv:2210.10917) show that such statistics (called the $k$-mer density map) can be estimated within accuracy $\varepsilon$ from $\mathrm{poly}(n, 2^{k}, 1/\varepsilon)$ traces. We call an algorithm $k$-mer-based if it reconstructs $\mathbf{x}$ given estimates of the $k$-mer density map. Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem known so far.

Our first, and technically more involved, result shows that any $k$-mer-based algorithm for trace reconstruction must use $\exp(\Omega(n^{1/5}\sqrt{\log n}))$ traces, thus establishing the optimality of this number of traces. The analysis also shows that the technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed to improve the worst-case upper bound. This result is shown by considering an appropriate class of real polynomials that has been previously studied in the context of trace reconstruction (De, O'Donnell, Servedio, Annals of Probability 2019; Nazarov, Peres, STOC 2017), and proving that two of these polynomials are very close to each other on an arc in the complex plane. Our proof of the proximity of such polynomials uses new technical ingredients that allow us to focus on just a few coefficients of these polynomials.

Our second, simpler, result considers the performance of the Maximum Likelihood Estimator (MLE), which picks the source string with the maximum likelihood of generating the observed samples (traces). We show that the MLE uses a nearly optimal number of traces, i.e., within a factor of $n$ of the number of samples needed by an optimal algorithm, and that this factor-of-$n$ loss may be necessary in general "model estimation" settings.
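As a toy illustration of the two objects the abstract starts from, the sketch below simulates the deletion channel that produces a trace, and computes plain occurrence counts of every contiguous $k$-bit substring of $\mathbf{x}$. Note this is a simplification: the $k$-mer density map of Mazooji and Shomorony is position-weighted and is estimated from traces, not read off the source string; the function and variable names here are ours, not the paper's.

```python
import random
from collections import Counter

def trace(x: str, p: float, rng: random.Random) -> str:
    """Pass x through a deletion channel: each bit is deleted
    independently with probability p; the survivors keep their order."""
    return "".join(b for b in x if rng.random() >= p)

def kmer_counts(x: str, k: int) -> Counter:
    """Occurrence counts of every contiguous k-bit substring (k-mer) of x.
    A simplified stand-in for the (position-weighted) k-mer density map."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

rng = random.Random(0)
x = "1011001110"
t = trace(x, 0.3, rng)          # a random subsequence of x
print(t, kmer_counts(x, 3))
```

With deletion probability $p = 0$ every trace equals $\mathbf{x}$ itself, and with $p = 1$ every trace is empty, matching the endpoints of $p \in [0,1)$ in the problem statement.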
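The MLE discussed in the second result can be sketched by brute force for tiny $n$: under the deletion channel, the probability that source $\mathbf{x}$ (of length $n$) yields a particular trace $t$ of length $m$ is $N(\mathbf{x},t)\,p^{n-m}(1-p)^{m}$, where $N(\mathbf{x},t)$ counts the index subsets of $\mathbf{x}$ that spell $t$. This is a standard formula for i.i.d. deletions, not code from the paper, and it assumes $0 < p < 1$; the exhaustive search over $2^{n}$ candidates is only meant to make the definition concrete.

```python
import itertools
import math

def num_embeddings(x: str, t: str) -> int:
    """Count index subsets of x that spell t, i.e. distinct subsequence
    embeddings, via the standard distinct-subsequence DP."""
    dp = [0] * (len(t) + 1)
    dp[0] = 1
    for c in x:
        for j in range(len(t), 0, -1):  # backwards so each c is used once
            if t[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[len(t)]

def log_likelihood(x: str, traces: list, p: float) -> float:
    """log P(traces | x) for i.i.d. deletion probability p (0 < p < 1)."""
    total = 0.0
    for t in traces:
        n_emb = num_embeddings(x, t)
        if n_emb == 0:           # t is not a subsequence of x
            return float("-inf")
        total += (math.log(n_emb)
                  + (len(x) - len(t)) * math.log(p)
                  + len(t) * math.log(1 - p))
    return total

def mle(traces: list, n: int, p: float) -> str:
    """Exhaustive MLE over all 2**n candidate source strings (tiny n only)."""
    return max(("".join(bits) for bits in itertools.product("01", repeat=n)),
               key=lambda x: log_likelihood(x, traces, p))
```

For example, `mle(["10", "110", "10"], 3, 0.3)` must return `"110"`, since any other length-3 candidate cannot contain the length-3 trace `"110"` as a subsequence and so has likelihood zero. The paper's point is about how many traces this estimator needs, not about its (exponential) running time.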
Source journal: IEEE Transactions on Information Theory (Engineering: Electrical & Electronic)
CiteScore: 5.70
Self-citation rate: 20.00%
Articles per year: 514
Review time: 12 months
Journal description: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.