Sheng Li, Ke Chen, Ming-yu Hsieh, Naveen Muralimanohar, C. Kersey, J. Brockman, Arun Rodrigues, N. Jouppi
{"title":"百亿亿次计算中存储器可靠性的系统含义","authors":"Sheng Li, Ke Chen, Ming-yu Hsieh, Naveen Muralimanohar, C. Kersey, J. Brockman, Arun Rodrigues, N. Jouppi","doi":"10.1145/2063384.2063445","DOIUrl":null,"url":null,"abstract":"Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chip-kill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3× better system EDP than state-of-the-art tagged memory systems.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"53","resultStr":"{\"title\":\"System implications of memory reliability in exascale computing\",\"authors\":\"Sheng Li, Ke Chen, Ming-yu Hsieh, Naveen Muralimanohar, C. Kersey, J. Brockman, Arun Rodrigues, N. Jouppi\",\"doi\":\"10.1145/2063384.2063445\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chip-kill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3× better system EDP than state-of-the-art tagged memory systems.\",\"PeriodicalId\":358797,\"journal\":{\"name\":\"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-11-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"53\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2063384.2063445\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2063384.2063445","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
System implications of memory reliability in exascale computing
Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chip-kill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose to use BCH in tagged memory systems with commodity DRAMs where chipkill is impractical. Our proposed architecture achieves 2.3× better system EDP than state-of-the-art tagged memory systems.