Large and Small Deviations for Statistical Sequence Matching

IF 2.2, Zone 3 (Computer Science), JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Information Theory, vol. 70, no. 11, pp. 7532-7562. Pub Date: 2024-09-25. DOI: 10.1109/TIT.2024.3464586

Lin Zhou;Qianyun Wang;Jingjing Wang;Lin Bai;Alfred O. Hero

Citations: 0

Abstract

We revisit the problem of statistical sequence matching between two databases of sequences, initiated by Unnikrishnan (2015), and derive theoretical performance guarantees for the generalized likelihood ratio test (GLRT). We first consider the case where the number of matched pairs of sequences between the databases is known. In this case, the task is to accurately find the matched pairs of sequences among all possible matches between the sequences in the two databases. We analyze the performance of Unnikrishnan's GLRT and explicitly characterize the tradeoff between the mismatch and false reject probabilities under each hypothesis in both the large and small deviations regimes. Furthermore, we demonstrate the optimality of Unnikrishnan's GLRT under the generalized Neyman-Pearson criterion for both regimes and illustrate our theoretical results via numerical examples. Subsequently, we generalize our achievability analyses to the case where the number of matched pairs is unknown and an additional error probability needs to be considered. When one of the two databases contains a single sequence, the problem of statistical sequence matching specializes to the problem of multiple classification introduced by Gutman (1989). For this special case, our result for the small deviations regime strengthens the previous result of Zhou et al. (2020) by removing unnecessary conditions on the generating distributions.
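
To make the matching task concrete, the following is a minimal sketch of a GLRT-style matcher, not the exact test analyzed in the paper. It assumes a finite alphabet, equal-length sequences, equally sized databases in which every sequence has a match, and no reject option (the abstract's false reject probability implies the actual test also has one). The function names `empirical`, `pair_score`, and `match` are hypothetical. Each candidate pairing is scored by the sum of Jensen-Shannon-type divergences between the empirical distributions of the paired sequences, and the pairing with the smallest total score is returned.

```python
import itertools
import math
from collections import Counter


def empirical(seq, alphabet):
    """Empirical distribution (type) of seq over the given alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]


def kl(p, q):
    """KL divergence D(p || q); terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pair_score(p, q):
    """Divergence of two types from their average (equal-length assumption)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return kl(p, mid) + kl(q, mid)


def match(db1, db2, alphabet):
    """Return perm such that db1[i] is paired with db2[perm[i]].

    Brute force over all permutations, so only suitable for small databases.
    """
    t1 = [empirical(s, alphabet) for s in db1]
    t2 = [empirical(s, alphabet) for s in db2]
    best = min(
        itertools.permutations(range(len(db2))),
        key=lambda perm: sum(pair_score(t1[i], t2[j]) for i, j in enumerate(perm)),
    )
    return list(best)


if __name__ == "__main__":
    db1 = ["aaab" * 5, "abbb" * 5]
    db2 = ["bbba" * 5, "baaa" * 5]
    print(match(db1, db2, "ab"))
```

The exhaustive search over permutations keeps the sketch short; for larger databases the same objective could be minimized with an assignment solver instead.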

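Source journal

IEEE Transactions on Information Theory (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore: 5.70

Self-citation rate: 20.00%

Articles published: 514

Review time: 12 months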

About the journal: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.