Topic Set Size Design for Paired and Unpaired Data

Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval. Pub Date: 2018-09-10. DOI: 10.1145/3234944.3234971

T. Sakai

Citations: 1

Abstract

Topic set size design is an approach to determining the sample sizes of an experiment (e.g., number of topics) based on a statistical requirement, namely a desired statistical power or a cap on the confidence interval (CI) width for the difference in means. Previous work considered paired data cases for a desired power of the t-test and for a cap on CI width, as well as unpaired data cases for a desired power of one-way ANOVA. In the present study, we consider unpaired (i.e., two-sample) cases for the t-test and for the CI width. Since one-way ANOVA with two groups is strictly equivalent to the two-sample t-test, we compare the outcomes of the topic set size design results based on these two approaches, and show that the one-way ANOVA-based approach actually returns tighter sample sizes than the two-sample t-test approach. Moreover, we compare the paired and unpaired cases for both t-test-based and CI-based topic set size design approaches. Because estimating the variance of the score differences for the paired data setting is problematic, we recommend the use of our unpaired-data versions of t-test-based and CI-based topic set size design tools, as they only require a variance estimate for individual scores and the appropriate sample sizes for unpaired data are also large enough for paired data.
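The relationship between the paired and unpaired requirements can be illustrated with a minimal sketch. It is not the author's tool: it replaces the noncentral t computation that an exact power analysis would use with the textbook normal approximation, so it tends to return slightly smaller sample sizes than an exact method. All function and parameter names are hypothetical, the input values are illustrative only, and the link between the two settings assumes equal score variances for both systems, so that the variance of the score differences is 2 * var * (1 - rho) for a correlation rho between the paired scores.

```python
from math import ceil
from statistics import NormalDist


def z(p):
    """Quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(p)


def n_ttest_unpaired(alpha, beta, min_d, var):
    """Topics per group for a two-sample t-test (normal approximation).

    alpha: Type I error rate, beta: Type II error rate (power = 1 - beta),
    min_d: minimum detectable difference in means,
    var:   variance estimate for individual scores.
    """
    return ceil(2 * var * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / min_d ** 2)


def n_ttest_paired(alpha, beta, min_d, var_diff):
    """Topics for a paired t-test; var_diff is the variance of score differences."""
    return ceil(var_diff * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / min_d ** 2)


def n_ci_unpaired(alpha, delta, var):
    """Topics per group so that the CI width for the difference is at most delta."""
    return ceil(8 * var * z(1 - alpha / 2) ** 2 / delta ** 2)


def n_ci_paired(alpha, delta, var_diff):
    """Topics so that the paired-data CI width is at most delta."""
    return ceil(4 * var_diff * z(1 - alpha / 2) ** 2 / delta ** 2)


def var_diff_from(var, rho):
    """Variance of paired differences, assuming equal variances and correlation rho."""
    return 2 * var * (1 - rho)


if __name__ == "__main__":
    var = 0.05  # illustrative variance estimate for individual scores
    for rho in (0.0, 0.5, 0.9):
        vd = var_diff_from(var, rho)
        print(rho,
              n_ttest_unpaired(0.05, 0.20, 0.1, var),
              n_ttest_paired(0.05, 0.20, 0.1, vd),
              n_ci_unpaired(0.05, 0.1, var),
              n_ci_paired(0.05, 0.1, vd))
```

Under these assumptions the paired requirement never exceeds the unpaired one when rho is non-negative, and the two coincide at rho = 0. This is consistent with the abstract's point that a sample size chosen for unpaired data, which needs only a variance estimate for individual scores, is also large enough for paired data.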
