BlobSeer: Efficient data management for data-intensive applications distributed at large-scale

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW). Pub Date: 2010-04-19. DOI: 10.1109/IPDPSW.2010.5470802

Bogdan Nicolae, Gabriel Antoniu, L. Bougé


Citations: 4

Abstract



As the rate, scale and variety of data increase in complexity, the need for flexible applications that can crunch huge amounts of heterogeneous data quickly and cost-effectively is of utmost importance. Such applications are data-intensive: in a typical scenario, they continuously acquire massive datasets (e.g. by crawling the Web or analyzing access logs) while performing computations over these changing datasets (e.g. building up-to-date search indexes). In order to achieve scalability and performance, data acquisitions and computations need to be distributed at large scale in infrastructures comprising hundreds and thousands of machines. As these applications focus on data rather than on computation, a heavy burden is put on the storage service employed to handle data management, because it must efficiently deal with massively parallel data accesses. In order to achieve this, a series of issues need to be addressed properly: scalable aggregation of storage space from the participating nodes with minimal overhead, the ability to store huge data objects, efficient fine-grain access to data subsets, high throughput even under heavy access concurrency, versioning, as well as fault tolerance and a high quality of service for access throughput. This paper introduces BlobSeer, an efficient distributed data management service that addresses the issues presented above. In BlobSeer, long sequences of bytes representing unstructured data are called blobs (Binary Large OBjects).
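To make two of the requirements listed above more concrete (fine-grain access to data subsets and versioning), here is a minimal single-process sketch in Python. It is not BlobSeer's API: the class `VersionedBlob` and its `write` and `read` methods are hypothetical names chosen for illustration, and the sketch copies the whole blob on every write, which a real distributed service would presumably avoid. It only illustrates the semantics of reading a byte range from a chosen version.

```python
class VersionedBlob:
    """Toy in-memory blob (hypothetical, not BlobSeer's API).

    Every write produces a new immutable snapshot, so older versions
    stay readable after later writes.
    """

    def __init__(self) -> None:
        # Version 0 is the empty blob.
        self._snapshots: list[bytes] = [b""]

    def write(self, offset: int, data: bytes) -> int:
        """Overwrite bytes starting at `offset`; return the new version number."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        latest = self._snapshots[-1]
        # Zero-pad when writing past the current end of the blob.
        base = latest.ljust(offset, b"\x00")
        updated = base[:offset] + data + base[offset + len(data):]
        # Naive full copy per version; kept simple for illustration only.
        self._snapshots.append(updated)
        return len(self._snapshots) - 1

    def read(self, version: int, offset: int, size: int) -> bytes:
        """Return up to `size` bytes starting at `offset` from the given version."""
        if not 0 <= version < len(self._snapshots):
            raise IndexError("unknown version")
        if offset < 0 or size < 0:
            raise ValueError("offset and size must be non-negative")
        return self._snapshots[version][offset:offset + size]


if __name__ == "__main__":
    blob = VersionedBlob()
    v1 = blob.write(0, b"hello world")
    v2 = blob.write(6, b"BLOBS")
    # The older version is unaffected by the later write.
    print(blob.read(v1, 6, 5))
    print(blob.read(v2, 6, 5))
```

A reader working on version `v1` keeps seeing the same bytes while a writer produces `v2`, which is the property that makes versioning attractive under heavy access concurrency.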
