A New Algebraic Approach for String Reconstruction From Substring Compositions

IF 2.2; CAS Zone 3 (Computer Science); JCR Q3 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

IEEE Transactions on Information Theory Pub Date : 2024-11-15 DOI:10.1109/TIT.2024.3493762

Utkarsh Gupta;Hessam Mahdavifar

IEEE Transactions on Information Theory, vol. 71, no. 1, pp. 125-137. https://ieeexplore.ieee.org/document/10754998/

Citations: 0

Abstract

In this paper, we propose a new algorithm for the problem of reconstructing a string from its substring composition multiset. Motivated by applications in polymer-based data storage, where strings are recovered from tandem mass-spectrometry sequencing, the proposed algorithm leverages the equivalent polynomial formulation of the problem, which facilitates efficient parallel implementation. The computational complexity of the proposed reconstruction algorithm is upper bounded by $6.5n^{2}$ finite field operations, where the field size is upper bounded by $10n$, implying that the computational complexity is upper bounded by $6.5n^{2}(3.22+\log n)$ binary operations. Furthermore, it allows parallelization leading to $O(n \log n)$ reconstruction latency. We characterize sufficient conditions on a length-$n$ binary string that guarantee that its reconstruction time complexity is polynomially bounded. Moreover, these sufficient conditions are more general than the conditions for the algorithm by Acharya et al. This is used to construct new codebooks of reconstruction codes that have efficient encoding procedures and are larger, by at least a linear factor in size, than the previously best known construction by Pattabiraman et al. (2023).
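To make the central object concrete, here is a minimal sketch of how the substring composition multiset of a binary string could be computed. It is not the reconstruction algorithm from the paper, only an illustration of its input. It assumes the usual definition from the prior literature (the composition of a substring records how many 0s and 1s it contains, not their order); the function name `composition_multiset` and the `(zeros, ones)` tuple encoding are choices made here for illustration.

```python
from collections import Counter


def composition_multiset(s: str) -> Counter:
    """Return the multiset of (zeros, ones) counts over all contiguous,
    non-empty substrings of the binary string s (assumed to contain only
    the characters '0' and '1')."""
    n = len(s)
    # prefix_ones[i] = number of '1' characters in s[:i]
    prefix_ones = [0]
    for ch in s:
        prefix_ones.append(prefix_ones[-1] + (ch == "1"))

    multiset = Counter()
    for i in range(n):
        for j in range(i + 1, n + 1):
            ones = prefix_ones[j] - prefix_ones[i]
            zeros = (j - i) - ones
            multiset[(zeros, ones)] += 1
    return multiset


if __name__ == "__main__":
    s = "1011"
    print(sorted(composition_multiset(s).items()))
    # A string and its reversal always share the same composition multiset,
    # so reconstruction can be unique at best up to reversal.
    print(composition_multiset(s) == composition_multiset(s[::-1]))
```

A string of length n has n(n+1)/2 non-empty substrings, so the multiset has that many elements counted with multiplicity. The comparison at the end shows why unique reconstruction is only possible up to reversal.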


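Source journal

IEEE Transactions on Information Theory (Engineering and Technology: Electrical and Electronic Engineering)

CiteScore: 5.70

Self-citation rate: 20.00%

Articles published: 514

Review time: 12 months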

About the journal: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.