Caveat emptor: making grid services dependable from the client side

2002 Pacific Rim International Symposium on Dependable Computing, 2002. Proceedings. Pub Date: 2002-12-16. DOI: 10.1109/PRDC.2002.1185626

M. Livny, D. Thain


Citations: 4

Abstract

Grid computing relies on fragile partnerships. Clients with hundreds or even thousands of pending service requests must seek out and form temporary alliances with remote servers eager to satisfy them. Yet, despite the high quality and reliability of these servers and their software, unexpected events and behavior are common. Communication networks, power systems, operating systems, middleware, and operator intervention all conspire to attack even the most carefully arranged client-server interaction. To survive in such an imperfect world, customers of grid resources must be equipped with resilient client software that tolerates failures while aggressively representing their interests.

Following our tradition of developing technology that harnesses the power of opportunistic resources, the Condor Project is actively engaged in developing the basic mechanisms for building dependable and effective grid computing clients. Guided by our experience and the practical needs of production users in disciplines as diverse as astronomy and sociology, the Project aims to equip users with powerful software that complements the reliability of the servers that they exploit. Our most visible product is the Condor-G job manager. Other research ventures, including the full Condor distributed system, offer valuable lessons in dependable client-side management.

Dependability has been explored in a number of branches of computing, ranging from database systems to programming languages. The hard-earned lessons from these fields are also essential to grid computing. Fundamental concepts such as timeouts, logging, checkpoints, transactions, leases, and atomic operations must be employed and expressed in basic protocols and interfaces for CPU and I/O access. Without these techniques, clients and servers lose track of each other’s state, leading to missed opportunities, wasted resources, incorrect results, and unnecessary failures. This principle is espoused in systems such as Condor-G and protocols such as the most recent version of GRAM.

In a Grid environment we must never view failure as a disaster. Rather, failures occur at every level and every interface, and must be expected and structured. No single failure should bring a computation to a halt, nor may any type of failure be retried indefinitely. Jobs may be retracted even from systems deemed reliable when better performance may be found elsewhere. In addition, we must always be careful to determine whether the source of a failure lies in the system or in the job itself. Examples of this principle are found in the DAGMan meta-scheduler and the fault-tolerant shell.
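
The failure-handling principles in the abstract (bounded retries, an overall time limit, logging of every attempt, and separating system failures from job failures) can be illustrated with a small client-side sketch. This is not code from Condor-G, DAGMan, or the fault-tolerant shell; the names `Outcome`, `Attempt`, `run_with_bounded_retry`, and the `submit` callback are hypothetical and exist only for illustration. It assumes that a system failure surfaces either as a `ConnectionError`/`TimeoutError` or as an explicit `SYSTEM_FAILURE` result.

```python
import time
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Sequence


class Outcome(Enum):
    SUCCESS = auto()
    JOB_FAILURE = auto()     # the job itself is faulty; another server will not help
    SYSTEM_FAILURE = auto()  # server or network trouble; another server may succeed


@dataclass
class Attempt:
    """One log record: which server was tried and what happened."""
    server: str
    outcome: Outcome
    detail: str = ""


def run_with_bounded_retry(
    submit: Callable[[str], Outcome],   # hypothetical: submits the job to one server
    servers: Sequence[str],
    max_attempts: int = 3,
    deadline_s: float = 60.0,
    backoff_s: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Attempt]:
    """Try servers in rotation; never retry forever, never retry a job failure."""
    log: List[Attempt] = []
    start = clock()
    for i in range(max_attempts):
        if clock() - start > deadline_s:
            break  # overall time limit reached: stop rather than retry indefinitely
        server = servers[i % len(servers)]
        try:
            outcome, detail = submit(server), ""
        except (ConnectionError, TimeoutError) as exc:
            # Assumption: these exceptions indicate a fault in the system, not the job.
            outcome, detail = Outcome.SYSTEM_FAILURE, repr(exc)
        log.append(Attempt(server, outcome, detail))
        if outcome is not Outcome.SYSTEM_FAILURE:
            break  # success, or a job failure that no other server can fix
        sleep(backoff_s * (2 ** i))  # exponential backoff before the next server
    return log
```

The returned log lets the caller see whether the final state was a success, a job failure to report to the user, or an exhausted retry budget, which is the distinction the abstract asks clients to preserve.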



