CheCUDA: CUDA应用程序的检查点/重启工具

2009 International Conference on Parallel and Distributed Computing, Applications and Technologies Pub Date : 2009-12-08 DOI:10.1109/PDCAT.2009.78

H. Takizawa, Katsuto Sato, K. Komatsu, Hiroaki Kobayashi

{"title":"CheCUDA: CUDA应用程序的检查点/重启工具","authors":"H. Takizawa, Katsuto Sato, K. Komatsu, Hiroaki Kobayashi","doi":"10.1109/PDCAT.2009.78","DOIUrl":null,"url":null,"abstract":"In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a part of basic CUDA driver API calls in order to record the status changes on the main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data in the video memory to the main memory and then disabling the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications required especially on heterogeneous GPU cluster systems. This paper also shows the timing overhead for checkpointing.","PeriodicalId":312929,"journal":{"name":"2009 International Conference on Parallel and Distributed Computing, Applications and Technologies","volume":"429 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"87","resultStr":"{\"title\":\"CheCUDA: A Checkpoint/Restart Tool for CUDA Applications\",\"authors\":\"H. Takizawa, Katsuto Sato, K. Komatsu, Hiroaki Kobayashi\",\"doi\":\"10.1109/PDCAT.2009.78\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a part of basic CUDA driver API calls in order to record the status changes on the main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data in the video memory to the main memory and then disabling the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications required especially on heterogeneous GPU cluster systems. This paper also shows the timing overhead for checkpointing.\",\"PeriodicalId\":312929,\"journal\":{\"name\":\"2009 International Conference on Parallel and Distributed Computing, Applications and Technologies\",\"volume\":\"429 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"87\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 International Conference on Parallel and Distributed Computing, Applications and Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PDCAT.2009.78\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Conference on Parallel and Distributed Computing, Applications and Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDCAT.2009.78","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 87

摘要

在本文中，设计了一个名为CheCUDA的工具来检查点使用gpu作为加速器的CUDA应用程序。由于现有的检查点/重启实现不支持对GPU状态进行检查点，所以CheCUDA钩住了基本CUDA驱动程序API调用的一部分，以便在主内存上记录状态变化。在检查点，在将显存中的所有必要数据复制到主存然后禁用CUDA运行时后，CheCUDA将状态变化存储在一个文件中。在重新启动时，CheCUDA读取文件，重新初始化CUDA运行时，并恢复gpu上的资源，以便从存储状态重新启动。本文证明了CheCUDA的原型实现可以正确地检查点和重新启动使用基本api编写的CUDA应用程序。这也表明，即使进程使用GPU, CheCUDA也可以将进程从一台PC迁移到另一台PC。因此，CheCUDA不仅可以增强CUDA应用程序的可靠性，还可以实现CUDA应用程序所需的动态任务调度，特别是在异构GPU集群系统上。本文还展示了检查点的计时开销。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a part of basic CUDA driver API calls in order to record the status changes on the main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data in the video memory to the main memory and then disabling the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications required especially on heterogeneous GPU cluster systems. This paper also shows the timing overhead for checkpointing.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2009 International Conference on Parallel and Distributed Computing, Applications and Technologies

自引率

0.00%

发文量