Testing Dependency of Unlabeled Databases

IF 2.2 | Region 3: Computer Science | JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS
Vered Paslev;Wasim Huleihel
DOI: 10.1109/TIT.2024.3442977
Journal: IEEE Transactions on Information Theory, vol. 70, no. 10, pp. 7410-7431
Published: 2024-08-13 (Journal Article)
Open access: no
URL: https://ieeexplore.ieee.org/document/10634574/
Citations: 0

Abstract

Testing Dependency of Unlabeled Databases
In this paper, we investigate the problem of deciding whether two random databases $\textsf{X}\in\mathcal{X}^{n\times d}$ and $\textsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem: under the null hypothesis, the two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$ such that $\textsf{X}$ and $\textsf{Y}^{\sigma}$, a permuted version of $\textsf{Y}$, are statistically dependent with some known joint distribution but have the same marginal distributions as under the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of $d$ and of the eigenvalues of the likelihood function is below a certain threshold as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, no matter what the value of $n$ is. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive lower and upper bounds for strong detection (vanishing error) and for weak detection.
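To make the testing problem concrete, the following is a minimal simulation sketch, not the authors' method. It assumes a Gaussian model that the abstract does not specify: entries are standard normal, and under the alternative each entry of a row of Y has correlation `rho` with the matching entry of X before the rows of Y are shuffled. The statistic (the sum of inner products over all row pairs) is an illustrative permutation-invariant choice standing in for the centered log-likelihood test mentioned above. The names `sample_databases`, `sum_statistic`, and `empirical_risk` are hypothetical.

```python
import numpy as np


def sample_databases(n, d, rho, dependent, rng):
    """Draw (X, Y) under the null (independent) or the alternative (hidden row permutation).

    Assumption: i.i.d. standard Gaussian entries with per-entry correlation rho.
    """
    X = rng.standard_normal((n, d))
    Z = rng.standard_normal((n, d))
    if not dependent:
        return X, Z
    # Same N(0, 1) marginals as under the null, but correlated with X row by row.
    Y = rho * X + np.sqrt(1.0 - rho**2) * Z
    sigma = rng.permutation(n)  # unknown to the tester
    return X, Y[sigma]


def sum_statistic(X, Y):
    """Sum of <X_i, Y_j> over all row pairs (i, j); unchanged by row permutations of Y."""
    return float(X.sum(axis=0) @ Y.sum(axis=0))


def empirical_risk(n, d, rho, trials=500, seed=0):
    """Estimate the type-I plus type-II error rate of a simple threshold test."""
    rng = np.random.default_rng(seed)
    # Midpoint between the statistic's mean under the null (0) and the alternative (n*d*rho).
    threshold = 0.5 * rho * n * d
    false_alarms = 0
    misses = 0
    for _ in range(trials):
        X, Y = sample_databases(n, d, rho, False, rng)
        false_alarms += sum_statistic(X, Y) > threshold
        X, Y = sample_databases(n, d, rho, True, rng)
        misses += sum_statistic(X, Y) <= threshold
    return (false_alarms + misses) / trials


if __name__ == "__main__":
    for d in (4, 64, 1024):
        print(d, empirical_risk(n=50, d=d, rho=0.3))
```

Under these assumed Gaussian settings, the statistic has mean 0 under the null and mean n·d·rho under the alternative, while its standard deviation under the null is n·sqrt(d). The separation therefore scales like rho·sqrt(d) and does not grow with n, which is in the spirit of the abstract's statement that the detection threshold depends on d and the distribution rather than on n. This sketch does not reproduce the paper's thresholds.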
Source journal: IEEE Transactions on Information Theory (Engineering & Technology, Engineering: Electrical & Electronic)
CiteScore: 5.70
Self-citation rate: 20.00%
Articles published: 514
Review time: 12 months
About the journal: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.