Fault Tolerance and Scaling in e-Science Cloud Applications: Observations from the Continuing Development of MODISAzure

2010 IEEE Sixth International Conference on e-Science; Pub Date: 2010-12-07; DOI: 10.1109/ESCIENCE.2010.47

Jie Li, M. Humphrey, Y. Cheah, Y. Ryu, D. Agarwal, K. Jackson, C. Ingen


Citations: 28

Abstract

It can be natural to believe that many of the traditional issues of scale have been eliminated or at least greatly reduced via cloud computing. That is, if one can create a seemingly well-functioning cloud application that operates correctly on small or moderate-sized problems, then the very nature of cloud programming abstractions means that the same application will run equally well on potentially significantly larger problems. In this paper, we present our experiences taking MODISAzure, our satellite data processing system built on the Windows Azure cloud computing platform, from the proof-of-concept stage to the point of being able to run on significantly larger problem sizes (e.g., from national-scale data sizes to global-scale data sizes). To our knowledge, this is the longest-running eScience application on the nascent Windows Azure platform. We found that while many infrastructure-level issues were thankfully masked from us by the cloud infrastructure, it was valuable to design additional redundancy and fault-tolerance capabilities, such as transparent idempotent task retry and logging, to support debugging of user code encountering unanticipated data issues. Further, we found that using a commercial cloud means anticipating inconsistent performance and black-box behavior of virtualized compute instances, as well as leveraging changing platform capabilities over time. We believe that the experiences presented in this paper can help future eScience cloud application developers on Windows Azure and other commercial cloud providers.
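The abstract names idempotent task retry with logging as a useful fault-tolerance addition but gives no implementation details. The following is only a minimal sketch of that general idea, not the MODISAzure design: the function `run_idempotent`, the in-memory `completed` store, the task ID `tile-h08v05`, and the simulated failure are all hypothetical names introduced for illustration. It assumes a task's result can be keyed by a stable task ID, so that a repeated execution is skipped rather than redone.

```python
import logging
from typing import Callable, Dict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("task_retry_sketch")

# Hypothetical in-memory stand-in for durable result storage.
completed: Dict[str, str] = {}


def run_idempotent(task_id: str, work: Callable[[], str], max_attempts: int = 3) -> str:
    """Run `work` up to `max_attempts` times; skip it if a result is already stored."""
    if task_id in completed:
        log.info("task %s already done, reusing stored result", task_id)
        return completed[task_id]
    for attempt in range(1, max_attempts + 1):
        try:
            result = work()
        except Exception:
            # Log the full traceback so failures in user code can be debugged later.
            log.exception("task %s failed on attempt %d/%d", task_id, attempt, max_attempts)
            continue
        completed[task_id] = result
        return result
    raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")


# Simulated task (assumption): fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_tile_job() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated transient failure")
    return "tile-ok"


first = run_idempotent("tile-h08v05", flaky_tile_job)   # retried until it succeeds
second = run_idempotent("tile-h08v05", flaky_tile_job)  # served from the result store
```

Keying the stored result on the task ID is what makes a duplicate submission harmless: the second call returns the stored value without running the work again. A real deployment would replace the dictionary with durable storage and would likely bound retries per failure type.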


