A Generic Framework for Testing Parallel File Systems

2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS). Pub Date: 2016-11-13. DOI: 10.1109/PDSW-DISCS.2016.12

Jinrui Cao, Simeng Wang, Dong Dai, Mai Zheng, Yong Chen


Citations: 13

Abstract

Large-scale parallel file systems are of prime importance today. Despite this importance, their failure-recovery capability has been studied far less than that of local storage systems. Recent studies of local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concerns for the parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analyzing the failure recovery of parallel file systems.
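The record-and-replay idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `WriteCmd` record format, the function names, and the toy consistency rule are all hypothetical, and a real test would run the target system's own recovery procedure on each emulated state instead of checking a hand-written invariant. The sketch assumes a globally ordered trace of writes from all storage nodes and treats every prefix of that trace as a possible failure state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class WriteCmd:
    """One recorded disk write (hypothetical record format)."""
    node: str    # storage node that issued the write
    block: int   # block address on that node's disk
    data: bytes  # payload written


# A disk image maps (node, block) to the last payload written there.
DiskImage = Dict[Tuple[str, int], bytes]


def replay(trace: List[WriteCmd], upto: int) -> DiskImage:
    """Rebuild the on-disk state as if the failure hit after `upto` writes."""
    image: DiskImage = {}
    for cmd in trace[:upto]:
        image[(cmd.node, cmd.block)] = cmd.data
    return image


def crash_states(trace: List[WriteCmd]) -> Iterator[Tuple[int, DiskImage]]:
    """Yield every prefix of the trace as one emulated failure state."""
    for cut in range(len(trace) + 1):
        yield cut, replay(trace, cut)


def find_inconsistent_cuts(
    trace: List[WriteCmd],
    is_consistent: Callable[[DiskImage], bool],
) -> List[int]:
    """Return the cut points whose emulated state fails the checker."""
    return [cut for cut, image in crash_states(trace) if not is_consistent(image)]


# Toy invariant (an assumption for illustration only): if the metadata
# block exists, the data block it refers to must exist as well.
def toy_check(image: DiskImage) -> bool:
    has_metadata = ("mds", 0) in image
    has_data = ("oss", 7) in image
    return (not has_metadata) or has_data


# Toy trace: metadata reaches disk before the data it points to.
trace = [
    WriteCmd(node="mds", block=0, data=b"inode -> oss:7"),
    WriteCmd(node="oss", block=7, data=b"file contents"),
]

bad_cuts = find_inconsistent_cuts(trace, toy_check)
print(bad_cuts)
```

In this toy trace, a failure between the two writes leaves metadata that points at data which never reached disk, so that cut point is the one the checker flags. Enumerating prefixes is the simplest choice; reordering writes within a cut would cover more failure states at higher cost.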



