基于密度估计的方法确定大数据随机抽样分区的样本量

IF 4.6 3区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers of Computer Science Pub Date : 2023-12-16 DOI:10.1007/s11704-023-2356-x

{"title":"基于密度估计的方法确定大数据随机抽样分区的样本量","authors":"","doi":"10.1007/s11704-023-2356-x","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (<em>p.d.f.</em>); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of <em>p.d.f.</em> estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"60 1","pages":""},"PeriodicalIF":4.6000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Density estimation-based method to determine sample size for random sample partition of big data\",\"authors\":\"\",\"doi\":\"10.1007/s11704-023-2356-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<h3>Abstract</h3> <p>Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (<em>p.d.f.</em>); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of <em>p.d.f.</em> estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.</p>\",\"PeriodicalId\":12640,\"journal\":{\"name\":\"Frontiers of Computer Science\",\"volume\":\"60 1\",\"pages\":\"\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2023-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers of Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11704-023-2356-x\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers of Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11704-023-2356-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

摘要随机抽样分区（RSP）是一种新开发的大数据表示和管理模型，用于处理大数据近似计算问题。学术研究和实际应用证实，RSP 是一种高效的大数据处理和分析解决方案。然而，实施 RSP 的一个挑战是确定 RSP 数据块的适当样本大小。样本量大会增加大数据计算的负担，而样本量小又会导致 RSP 数据块的分布信息不足。为解决这一问题，本文提出了一种新颖的基于密度估计的方法（DEM）来确定 RSP 数据块的最佳样本量。首先，根据多元 Dvoretzky-Kiefer-Wolfowitz (DKW) 不等式，使用定点迭代 (FPI) 方法计算出理论样本量。其次，通过最小化在 RSP 数据块上构建的核密度估算器 (KDE) 的验证误差来确定实际样本量，以增加样本量。最后，通过一系列有说服力的实验来验证 DEM 的可行性、合理性和有效性。实验结果表明：(1) 从多元 DKW 不等式计算理论样本量时，FPI 方法的迭代函数是收敛的；(2) 在 RSP 数据块上构建的 KDE，其样本量由 DEM 确定，可以很好地近似概率密度函数（p.d.f.）；(3) 从 p.d.f. 估计的角度来看，DEM 比现有的样本量确定方法提供了更精确的样本量。这表明，DEM 是处理大数据 RSP 实施中样本量确定问题的一种可行方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Density estimation-based method to determine sample size for random sample partition of big data

Abstract

Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Frontiers of Computer Science COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

8.60

自引率

2.40%

发文量

799

审稿时长

6-12 weeks

期刊介绍： Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles in a wide range of topics, including: architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, interdisciplinary, etc. The journal especially encourages papers from new emerging and multidisciplinary areas, as well as papers reflecting the international trends of research and development and on special topics reporting progress made by Chinese computer scientists.