{"title":"从大型数据库中抽取代表性样本的蒙特卡罗抽样方法","authors":"Hong Guo, W. Hou, Feng Yan, Qiang Zhu","doi":"10.1109/SSDBM.2004.5","DOIUrl":null,"url":null,"abstract":"Sampling is important in areas like data mining, OLAP, selectivity estimation, clustering, etc. It has also become a necessity in social, economical, engineering, scientific, and statistical studies where databases are too large to handle. In this paper, a sampling method based on the Metropolis algorithm is proposed. Unlike the conventional uniform sampling methods, this method is able to select objects consistent with the underlying probability distribution. It is a simple, efficient, and powerful method suitable for all distributions. We have performed experiments to examine the qualities of the samples by comparing their statistical properties with the underlying population. The experimental results show that the samples selected by our method are bona fide representative.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"A Monte Carlo sampling method for drawing representative samples from large databases\",\"authors\":\"Hong Guo, W. Hou, Feng Yan, Qiang Zhu\",\"doi\":\"10.1109/SSDBM.2004.5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Sampling is important in areas like data mining, OLAP, selectivity estimation, clustering, etc. It has also become a necessity in social, economical, engineering, scientific, and statistical studies where databases are too large to handle. In this paper, a sampling method based on the Metropolis algorithm is proposed. Unlike the conventional uniform sampling methods, this method is able to select objects consistent with the underlying probability distribution. It is a simple, efficient, and powerful method suitable for all distributions. We have performed experiments to examine the qualities of the samples by comparing their statistical properties with the underlying population. The experimental results show that the samples selected by our method are bona fide representative.\",\"PeriodicalId\":383615,\"journal\":{\"name\":\"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.\",\"volume\":\"55 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SSDBM.2004.5\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SSDBM.2004.5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Monte Carlo sampling method for drawing representative samples from large databases
Sampling is important in areas like data mining, OLAP, selectivity estimation, clustering, etc. It has also become a necessity in social, economical, engineering, scientific, and statistical studies where databases are too large to handle. In this paper, a sampling method based on the Metropolis algorithm is proposed. Unlike the conventional uniform sampling methods, this method is able to select objects consistent with the underlying probability distribution. It is a simple, efficient, and powerful method suitable for all distributions. We have performed experiments to examine the qualities of the samples by comparing their statistical properties with the underlying population. The experimental results show that the samples selected by our method are bona fide representative.