A new paradigm in data intensive computing: Stork and the data-aware schedulers

2006 IEEE Challenges of Large Applications in Distributed Environments. Pub Date: 2006-07-10. DOI: 10.1109/CLADE.2006.1652048

T. Kosar


Citations: 28

Abstract

The unbounded increase in the computation and data requirements of scientific applications has necessitated the use of widely distributed compute and storage resources to meet the demand. In a widely distributed environment, data is no longer locally accessible and must therefore be retrieved and stored remotely. Efficient and reliable access to data sources and archiving destinations in such an environment brings new challenges. Placing data on temporary local storage devices offers many advantages, but such "data placements" also require careful management of storage resources and data movement, i.e., allocating storage space, staging in input data, staging out generated data, and de-allocating local storage after the data is safely stored at the destination. Traditional systems closely couple data placement and computation, and treat data placement as a side effect of computation. Data placement is either embedded in the computation, where it delays the computation, or performed by simple scripts that do not have the privileges of a job. The insufficiency of traditional systems and existing CPU-oriented schedulers in dealing with the complex data handling problem has given rise to a new era: that of data-aware schedulers. One of the first examples of such schedulers is the Stork data placement scheduler. In this paper, we discuss the limitations of traditional schedulers in handling the challenging data scheduling problem of large-scale distributed applications; give our vision for the new paradigm in data-intensive scheduling; and elaborate on our case study: the Stork data placement scheduler.
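The placement steps named in the abstract (allocate space, stage in, stage out, de-allocate) can be illustrated with a small sketch in which each step is a separately scheduled, retryable job rather than a side effect of the computation. This is not Stork's interface: `PlacementStep`, `PlacementLog`, `run_pipeline`, `demo`, the retry budget, and the in-memory dictionaries standing in for storage are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PlacementStep:
    """One data placement job, treated as a job in its own right."""
    name: str
    action: Callable[[], None]
    max_retries: int = 3  # hypothetical retry budget, not from the paper


@dataclass
class PlacementLog:
    """Records the names of steps that finished successfully."""
    completed: List[str] = field(default_factory=list)


def run_pipeline(steps: List[PlacementStep], log: PlacementLog) -> bool:
    """Run steps in order; retry a failing step, give up when retries run out."""
    for step in steps:
        for attempt in range(1, step.max_retries + 1):
            try:
                step.action()
                log.completed.append(step.name)
                break
            except OSError:
                # Assumption: transient transfer failures surface as OSError.
                if attempt == step.max_retries:
                    return False
    return True


def demo() -> Tuple[List[str], Dict[str, bytes], Dict[str, bytes]]:
    """Walk one input file through the full placement sequence."""
    remote: Dict[str, bytes] = {"input.dat": b"raw"}  # source and archive
    local: Dict[str, bytes] = {}                      # temporary local storage
    state = {"allocated": False}

    def allocate() -> None:
        state["allocated"] = True

    def stage_in() -> None:
        local["input.dat"] = remote["input.dat"]

    def compute() -> None:
        local["output.dat"] = local["input.dat"].upper()

    def stage_out() -> None:
        remote["output.dat"] = local["output.dat"]

    def release() -> None:
        # De-allocate only after the output is safely stored remotely.
        local.clear()
        state["allocated"] = False

    steps = [
        PlacementStep("allocate", allocate),
        PlacementStep("stage_in", stage_in),
        PlacementStep("compute", compute),
        PlacementStep("stage_out", stage_out),
        PlacementStep("release", release),
    ]
    log = PlacementLog()
    run_pipeline(steps, log)
    return log.completed, remote, local
```

Keeping each step as its own object is the design point: a failed transfer can be retried or rescheduled without rerunning the computation, which is the decoupling the abstract argues traditional systems lack.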
