The case for sampling on very large file systems

2014 30th Symposium on Mass Storage Systems and Technologies (MSST). Pub Date: 2014-06-02. DOI: 10.1109/MSST.2014.6855542

George Goldberg, Danny Harnik, D. Sotnikov


Citations: 2

Abstract


The case for sampling on very large file systems

Sampling has long been a prominent tool in statistics and analytics, first and foremost when very large amounts of data are involved. In the realm of very large file systems (and hierarchical data stores in general), however, sampling has mostly been ignored, and for several good reasons. Mainly, running sampling in such an environment introduces technical challenges that make the entire sampling process non-beneficial. In this work we demonstrate that there are cases for which sampling is very worthwhile in very large file systems. We address this topic in two aspects: (a) the technical side, where we design and implement solutions for efficient weighted sampling that is also distributed and one-pass, and that addresses multiple efficiency aspects; and (b) the usability aspect, in which we demonstrate several use cases in which weighted sampling over large file systems is extremely beneficial. In particular, we show use cases regarding estimation of compression ratios, testing and auditing, and offline collection of statistics on very large data stores.
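The abstract does not spell out the authors' algorithm, so the following is only a minimal sketch of what "weighted, one-pass, distributed" sampling can look like. It uses the well-known key-based weighted reservoir technique of Efraimidis and Spirakis (written in log form for numerical stability with large weights), which is not necessarily the method used in the paper. The function names `weighted_sample` and `merge_samples`, and the idea of using file size as the weight, are assumptions made for illustration.

```python
import heapq
import math
import random


def weighted_sample(items, k, rng=None):
    """One-pass weighted sample (without replacement) of up to k items.

    `items` is any iterable of (name, weight) pairs, e.g. (path, size in bytes).
    Each item gets the key log(u) / weight with u uniform in (0, 1]; the k items
    with the largest keys form the sample. This is the log form of the
    Efraimidis-Spirakis key u ** (1 / weight), so the ordering is the same.
    Returns a list of (key, name, weight) tuples in heap order.
    """
    if rng is None:
        rng = random.Random()
    heap = []  # min-heap on key: heap[0] is the weakest entry currently kept
    for name, weight in items:
        if weight <= 0:
            continue  # zero-weight items can never be selected
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        entry = (math.log(u) / weight, name, weight)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return heap


def merge_samples(samples, k):
    """Combine samples taken independently (e.g. one per node or subtree).

    Keeping the k largest keys across all partial samples yields the same
    result as a single pass over the union of the inputs.
    """
    return heapq.nlargest(k, (entry for sample in samples for entry in sample))
```

In this sketch each worker would call `weighted_sample` over its own portion of the namespace and send back only its `k` entries; `merge_samples` then combines them. The keys are kept in the output precisely so that partial samples stay mergeable.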
