Fail-Slow at Scale
Haryadi S. Gunawi, Riza O. Suminto, R. Sears, Casey Golliher, S. Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, N. Bidokhti, C. McCaffrey, Deepthi Srinivasan, Biswaranjan Panda, A. Baptist, G. Grider, P. Fields, K. Harms, R. Ross, Andree Jacobson, R. Ricci, Kirk Webb, P. Alvaro, H. Runesha, M. Hao, Huaicheng Li
ACM Transactions on Storage (TOS), published 2018-10-03
DOI: 10.1145/3242086 (https://doi.org/10.1145/3242086)
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.