Kangaroo: Reliable Execution of Scientific Applications with DAG Programming Model
Kai Zhang, Kang Chen, Wei Xue
2011 40th International Conference on Parallel Processing Workshops, 2011-09-13
DOI: https://doi.org/10.1109/ICPPW.2011.28
As high performance computing (HPC) systems grow in scale, the likelihood of component failure rises, and with it the need for fault-tolerant systems. However, current fault tolerance mechanisms, including Replay, Checkpointing, and Redundant Execution, do not scale well in large-scale scientific computing. Kangaroo is a reliable execution engine for scientific applications. Parallel programs are modeled as directed acyclic graphs (DAGs) and executed on clusters under a graph-theory-based scheduling policy. Kangaroo provides effective execution of scalable parallel programs and transparently tolerates failures at runtime. In this paper, we describe the implementation of the Kangaroo system, discuss the design of its scheduling and fault tolerance, and evaluate its performance with a dense matrix inversion program. The results demonstrate that scheduling policies have a strong effect on program performance. They also demonstrate the feasibility and effectiveness of our approach to fault tolerance.
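The general idea of running a program as a DAG of tasks and recovering from a failure by re-executing only the failed task can be illustrated with a small sketch. This is not Kangaroo's API or its scheduling policy, which the abstract does not specify: the names `run_dag` and `make_flaky`, the retry limit, and the plain topological ordering are all assumptions made for illustration, and everything runs in a single process rather than on a cluster.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, max_retries=2):
    """Run each task after its predecessors have finished.

    tasks: dict mapping task name -> callable taking a dict of predecessor results
    deps:  dict mapping task name -> set of predecessor task names
    A task that raises RuntimeError is re-executed up to max_retries times
    (a hypothetical recovery policy, not the one used by Kangaroo).
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    results, attempts = {}, {}
    while sorter.is_active():
        for name in sorter.get_ready():
            inputs = {p: results[p] for p in deps.get(name, set())}
            for attempt in range(1, max_retries + 2):
                attempts[name] = attempt
                try:
                    results[name] = tasks[name](inputs)
                    break
                except RuntimeError:
                    if attempt == max_retries + 1:
                        raise  # retries exhausted, give up
            sorter.done(name)  # unblocks successors of this task
    return results, attempts


def make_flaky(fn, failures):
    """Wrap fn so its first `failures` calls raise, simulating a node failure."""
    state = {"left": failures}

    def wrapper(inputs):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated failure")
        return fn(inputs)

    return wrapper


# A four-task diamond: a feeds b and c, which both feed d.
tasks = {
    "a": lambda inp: 2,
    "b": make_flaky(lambda inp: inp["a"] + 3, failures=1),
    "c": lambda inp: inp["a"] * 10,
    "d": lambda inp: inp["b"] + inp["c"],
}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
results, attempts = run_dag(tasks, deps)
```

In this sketch task `b` fails once and is simply run again with the same inputs, so the caller of `run_dag` never sees the failure. The other tasks are not repeated, since their results are already held in `results`.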