X-PPR: Post Package Repair for CXL Memory
Chihun Song; Michael Jaemin Kim; Yan Sun; Houxiang Ji; Kyungsan Kim; TaeKyeong Ko; Jung Ho Ahn; Nam Sung Kim
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 97-100. Published 2025-03-21. DOI: 10.1109/LCA.2025.3552190
https://ieeexplore.ieee.org/document/10935665/
Citations: 0
Abstract
CXL is an emerging interface that can cost-efficiently expand the capacity and bandwidth of servers by recycling DRAM modules from retired servers. Such DRAM modules, however, will likely have many uncorrectable faulty words due to years of strenuous use in datacenters. To repair faulty words in the field, a few solutions based on Post Package Repair (PPR) and memory offlining have been proposed. Nonetheless, they are either unable to fix thousands of faulty words or prone to causing severe memory fragmentation, as they operate at the granularity of DRAM row and memory page addresses, respectively. In this work, for cost-efficient use of recycled DRAM modules with thousands of faulty words, we propose CXL-PPR (X-PPR), exploiting CXL's support for near-memory processing and variable memory access latency. We demonstrate that X-PPR implemented in a commercial CXL device with DDR4 DRAM modules can handle a faulty bit probability that is $3.3 \times 10^{4}$ times higher than ECC for a 512 GB DRAM module. Meanwhile, X-PPR degrades the performance of popular memory-intensive benchmarks only negligibly, thanks to two mechanisms that minimize the performance impact of the additional DRAM accesses required to repair faulty words.
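The granularity argument in the abstract can be made concrete with a small back-of-the-envelope sketch. It is not the authors' method and uses no figures from the paper: the 4 KiB page size, the 2 MiB huge-page size, the 64-byte word size, and the fault map are all assumptions chosen for illustration. It counts how much memory page-granularity offlining retires for a set of scattered faulty words, and how many huge-page-sized regions lose their contiguity as a result.

```python
# Illustrative sketch only. All sizes and the fault map are assumptions,
# not values taken from the X-PPR paper.
PAGE_BYTES = 4 * 1024               # assumed base page size
HUGE_PAGE_BYTES = 2 * 1024 * 1024   # assumed huge-page size
WORD_BYTES = 64                     # assumed size of one repairable word


def offlining_footprint(faulty_addrs):
    """Compare what is actually faulty with what page offlining retires."""
    words = {addr // WORD_BYTES for addr in faulty_addrs}
    pages = {addr // PAGE_BYTES for addr in faulty_addrs}
    huge_regions = {addr // HUGE_PAGE_BYTES for addr in faulty_addrs}
    return {
        "words_faulty": len(words),
        "bytes_faulty": len(words) * WORD_BYTES,
        "pages_offlined": len(pages),
        "bytes_offlined": len(pages) * PAGE_BYTES,
        # Regions that contain at least one offlined page and therefore
        # can no longer be handed out as one contiguous huge page.
        "huge_regions_broken": len(huge_regions),
    }


# Hypothetical fault map: 1,000 faulty words, one in each 2 MiB region.
faults = [i * HUGE_PAGE_BYTES + 128 for i in range(1000)]
stats = offlining_footprint(faults)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under these assumptions, offlining retires 64 times more memory than is actually faulty (one 4 KiB page per 64-byte word) and breaks every affected 2 MiB region. This is the kind of fragmentation that a repair operating at word granularity is meant to avoid.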
About the journal:
IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.