ACM Transactions on Architecture and Code Optimization: Latest Articles

A Survey of General-purpose Polyhedral Compilers
IF 1.6 · CAS Tier 3 · Computer Science
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-06-22 · DOI: 10.1145/3674735
Arun Thangamani, Vincent Loechner, Stéphane Genaud
Abstract: Since the 1990s, many polyhedral compilers have been written and distributed, either as source-to-source translators or integrated into general-purpose compilers. This paper surveys the implementations available as of 2024. We list and describe the most commonly available polyhedral schedulers and compiler implementations, then compare the general-purpose polyhedral compilers on two main criteria, robustness and performance, using the PolyBench/C benchmark suite.
Citations: 0
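Polyhedral compilers automate loop-nest transformations such as tiling, interchange, and fusion from a mathematical model of the iteration space. As a hand-written illustration (not taken from the survey), the sketch below tiles a matrix-multiply loop nest, the kind of schedule a polyhedral scheduler derives automatically; the tiled nest computes exactly the same result as the naive one while improving cache reuse:

```python
def matmul_naive(A, B, n):
    # Reference triple loop: C = A * B for n-by-n matrices.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t):
    # Loop tiling: iterate over t-by-t blocks so each block of A, B, C
    # stays hot in cache; min() handles the ragged edge when t does not
    # divide n. The iteration order changes, the result does not.
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

A polyhedral compiler proves such reorderings legal from dependence analysis rather than by testing, but the equivalence is easy to spot-check by comparing both versions on the same inputs.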
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-06-14 · DOI: 10.1145/3673653
Ataberk Olgun, Fatma Bostanci, Geraldo Francisco de Oliveira Junior, Yahya Can Tugrul, Rahul Bera, Abdullah Giray Yaglikci, Hasan Hassan, Oguz Ergin, Onur Mutlu
Abstract: Modern computing systems access main memory at coarse granularity (e.g., 512-bit cache blocks). Coarse-grained access wastes energy because the system does not use every individually accessed small portion (e.g., 64-bit words) of a cache block. In modern DRAM-based systems, two coarse-grained access mechanisms waste energy: large, fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations. We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfer and row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. It predicts the words in a cache block that will likely be accessed during the block's cache residency and (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM transfer size to the workload and (ii) activates a smaller set of cells containing the predicted words by carefully operating physically isolated portions of DRAM rows (mats). Activating fewer cells per access relaxes DRAM power-delivery constraints and lets the memory controller schedule DRAM accesses faster. We evaluate Sectored DRAM using 41 workloads from widely used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy of highly memory-intensive workloads by up to 33% (20% on average) while improving their performance by up to 36% (17% on average). Combined with the performance improvement, these DRAM energy savings yield system-wide energy savings of up to 23%. Sectored DRAM's chip area overhead is 1.7% of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy, does not reduce DRAM bandwidth, and can be implemented at low hardware cost: it provides 89% of the performance benefit of, consumes 12% less DRAM energy than, and takes 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). We hope Sectored DRAM's ideas and results help enable more efficient, higher-performance memory systems; we open-source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
Citations: 0
Scythe: A Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-27 · DOI: 10.1145/3666004
Kai Lu, Siqi Zhao, Haikang Shan, Qiang Wei, Guokuan Li, Jiguang Wan, Ting Yao, Huatao Wu, Daohui Wang
Abstract: Disaggregated memory separates compute and memory resources into independent pools connected by RDMA (Remote Direct Memory Access) networks, improving memory utilization, reducing cost, and enabling elastic scaling of compute and memory. However, existing RDMA-based distributed transactions on disaggregated memory suffer severe long-tail latency under high-contention workloads. In this paper, we propose Scythe, a novel low-latency RDMA-enabled distributed transaction system for disaggregated memory. Scythe reduces the latency of high-contention transactions in three ways: (1) a hot-aware concurrency control policy that uses optimistic concurrency control (OCC) for efficient transaction processing in low-conflict scenarios and, under high conflict, a timestamp-ordered OCC (TOCC) strategy based on fair locking to reduce retries and cross-node communication overhead; (2) an RDMA-friendly timestamp service for improved timestamp management; and (3) an RDMA-optimized RPC framework that improves RDMA bandwidth utilization. Evaluation shows that, compared to state-of-the-art distributed transaction systems, Scythe achieves more than 2.5× lower latency and 1.8× higher throughput under high-contention workloads.
Citations: 0
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-21 · DOI: 10.1145/3649455
Haitao Du, Yuhan Qin, Song Chen, Yi Kang
Abstract: DRAM is a performance bottleneck for many applications due to its high access latency. Prior work has mainly exploited data locality, introducing small but fast regions to cache frequently accessed data and thereby reduce average latency. These locality-based designs face three challenges in modern multi-core systems: (1) inter-application interference produces random memory traffic; (2) fairness concerns keep the memory controller from over-prioritizing data locality; and (3) write-intensive applications have much lower locality and evict many dirty entries. With frequent data movement between the fast in-DRAM cache and slow regular arrays, the movement overhead can even offset the performance and energy benefits of in-DRAM caching. In this article, we decouple data movement into two phases: Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache, and Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when the DRAM bank is idle. LRDA decouples the time-consuming restoration phase from activation, and DCSR hides restoration latency behind prevalent bank-level parallelism. We propose FASA-DRAM, which combines destructive activation and delayed restoration to provide both in-DRAM caching and proactive latency hiding. Our evaluation shows that FASA-DRAM improves average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead, and that it outperforms state-of-the-art designs in both performance and energy efficiency.
Citations: 0
Fixed-point Encoding and Architecture Exploration for Residue Number Systems
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-14 · DOI: 10.1145/3664923
Bobin Deng, Bhargava Nadendla, Kun Suo, Yixin Xie, Dan Chia-Tien Lo
Abstract: Residue Number Systems (RNS) show fascinating potential for integer addition/multiplication-intensive applications. The complexity of AI models has grown enormously in recent years, and from a computer-systems perspective, training these large-scale models within acceptable time and energy budgets has become a major concern. Matrix multiplication is a dominant, addition/multiplication-intensive subroutine in many prevailing AI models; however, machine-learning training typically requires real numbers, so RNS's benefits for integer applications do not transfer directly to AI training. State-of-the-art RNS real-number encodings, both floating-point and fixed-point, have defects and can be further enhanced. To carry RNS's benefits over to large-scale AI training, we propose a low-cost, high-accuracy RNS fixed-point representation: Single RNS Logical Partition (S-RNS-Logic-P) with Scaling-Down Postprocessing Multiplication (SD-Post-Mul). We also extend the implementation details of two other RNS fixed-point methods, Double RNS Concatenation (D-RNS-Concat) and S-RNS-Logic-P with Scaling-Down Preprocessing Multiplication (SD-Pre-Mul), and design the architectures of all three fixed-point multipliers. In empirical experiments, our S-RNS-Logic-P representation with SD-Post-Mul achieves lower latency and energy overhead while maintaining good accuracy. The method also extends readily to the Redundant Residue Number System (RRNS) to raise efficiency in error-tolerant domains, for example improving the error-correction efficiency of quantum computing.
Citations: 0
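For readers unfamiliar with RNS, a minimal sketch of plain integer RNS arithmetic may help. Note this illustrates only the textbook representation, not the paper's S-RNS-Logic-P fixed-point encoding; the moduli below are arbitrary illustrative choices. A value is stored as its residues modulo pairwise-coprime moduli, addition and multiplication are carry-free and digit-wise, and the Chinese Remainder Theorem recovers the integer:

```python
MODULI = (7, 11, 13)  # pairwise coprime; dynamic range = 7 * 11 * 13 = 1001

def to_rns(x):
    # Encode an integer as its residues modulo each modulus.
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    # Digit-wise addition: no carries propagate between residue channels.
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a, b):
    # Digit-wise multiplication: each channel is a small independent unit,
    # which is why RNS suits multiplication-heavy workloads.
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    # Chinese Remainder Theorem reconstruction back to an ordinary integer.
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m): modular inverse
    return x % M
```

The carry-free channels are the whole appeal: an RNS multiplier is a few small multipliers in parallel instead of one wide one, at the cost of expensive comparison, division, and (as the paper addresses) real-number scaling.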
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-14 · DOI: 10.1145/3664925
Dongmoon Min, Ilkwon Byun, Gyu-hyeon Lee, Jangwoo Kim
Abstract: For datacenter architects, the foremost goal is to minimize the datacenter's total cost of ownership for a target performance (TCO/performance). Since the major component of a datacenter is its server farm, the most effective way to reduce TCO/performance is to improve server performance and power efficiency. To achieve this, we argue that reducing each server's temperature to its most cost-effective point (temperature scaling) is highly promising. In this paper, we propose CoolDC, a novel, immediately applicable low-temperature cooling method that minimizes datacenter TCO. The key idea is to find and apply the most cost-effective sub-freezing temperature for the target servers and workloads. First, we apply immersion cooling to all servers to maintain a stable low temperature with little extra cooling and maintenance cost. Second, we define the TCO-optimal temperature range for datacenter operation (e.g., 248 K to 273 K, i.e., -25 °C to 0 °C) by carefully estimating all costs and benefits at low temperatures. Finally, we propose CoolDC, our immersion-cooled datacenter architecture that runs every workload at its own TCO-optimal temperature. With workload-aware temperature scaling, CoolDC achieves 12.7% and 13.4% lower TCO/performance than conventional air-cooled and immersion-cooled datacenters, respectively, without any modification to existing computers.
Citations: 0
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-13 · DOI: 10.1145/3664926
Hai Zhou, Dan Feng
Abstract: More and more storage systems use erasure codes to tolerate faults: a set of data blocks is encoded into a small number of parity blocks, and together these blocks form a stripe. When recovery is reconsidered at the multi-stripe level on heterogeneous network clusters, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains challenging and time-consuming. Previous work either uses greedy algorithms that can fall into local optima and deliver low recovery performance, or meta-heuristic algorithms with long running times and low solution-generation efficiency. In this paper, we propose Stripe-schedule Aware Repair (SARepair), a multi-stripe recovery technique for heterogeneous erasure-coded clusters based on RS codes. By carefully examining block metadata, SARepair intelligently adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates temporarily worse solutions to escape local optima and uses a rollback mechanism to adjust search regions and further reduce recovery time. Moreover, instead of reading blocks sequentially from each node, SARepair selectively schedules the read order of each block to reduce memory overhead. We extend SARepair to full-node recovery and adapt it to the LRC code. We prototype SARepair and show, via both simulations and Amazon EC2 experiments, that recovery performance improves by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.
Citations: 0
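To make the stripe-and-repair vocabulary concrete, here is a deliberately simplified sketch using a single XOR parity block (RAID-4 style), far simpler than the RS and LRC codes SARepair targets: any one lost block of a stripe can be rebuilt by XOR-ing the surviving blocks, which is why repair cost is dominated by how many blocks must be read and from where:

```python
def xor_blocks(blocks):
    # Byte-wise XOR of equal-length blocks.
    acc = bytes(len(blocks[0]))
    for blk in blocks:
        acc = bytes(a ^ b for a, b in zip(acc, blk))
    return acc

def make_stripe(data_blocks):
    # A stripe = k data blocks plus one parity block (single-fault tolerant).
    return data_blocks + [xor_blocks(data_blocks)]

def repair(stripe, lost_index):
    # Rebuild the lost block from every surviving block in the stripe.
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

With RS codes the rebuild is a Galois-field linear combination instead of XOR and there are multiple valid read sets per stripe; choosing which survivors to read, from which (heterogeneous) nodes, across many stripes at once is exactly the scheduling problem SARepair optimizes.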
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-09 · DOI: 10.1145/3663479
Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang
Abstract: The growing memory demands of modern applications have driven the adoption of far-memory technologies in data centers as cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges: its access latencies are significantly longer and more variable than local DRAM's. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency. While modern out-of-order processors can exploit a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous semantics of traditional load/store instructions, which occupy critical hardware resources for a long time; longer far-memory latencies exacerbate this limitation. This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and a supporting function unit, the Asynchronous Memory Access Unit (AMU), inside a contemporary out-of-order core. AMI separates memory-request issue from response handling to reduce resource occupation, and the AMU supports up to several hundred outstanding asynchronous memory requests by re-purposing a portion of the L2 cache as scratchpad memory (SPM) for sufficient temporary storage. Together with a coroutine-based programming framework, this scheme achieves significantly higher MLP for hiding far-memory latencies. Cycle-accurate simulation shows AMI achieves a 2.42× average speedup for memory-bound benchmarks with 1 μs of additional far-memory latency, and supports over 130 outstanding requests with a 26.86× speedup for GUPS (random access) at 5 μs latency, demonstrating how explicit MLP expression and latency adaptation tackle far-memory performance impacts.
Citations: 0
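The coroutine idea in the abstract can be sketched in miniature: split each access into an issue phase and a response phase, so many requests are outstanding at once instead of each one stalling until its data returns. This is a software analogue under assumed simplifications (Python generators standing in for AMI's asynchronous loads), not the paper's actual framework:

```python
def lookup(table, key, out):
    # Issue phase: in AMI terms, start an asynchronous load and suspend
    # immediately instead of stalling while the far-memory access is in flight.
    yield ("load", key)
    # Response phase: the data has "arrived"; consume it.
    out[key] = table[key]

def run_interleaved(table, keys):
    # A trivial scheduler: first let every coroutine issue its request
    # (all requests now overlap), then collect every response.
    out = {}
    coros = [lookup(table, k, out) for k in keys]
    for c in coros:
        next(c)              # issue phase for every request
    for c in coros:
        try:
            next(c)          # response phase
        except StopIteration:
            pass
    return out
```

With N coroutines in flight, total time approaches one far-memory latency plus N response-handling steps, rather than N full latencies end to end; the AMU provides the hardware bookkeeping that makes hundreds of such in-flight requests practical.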
Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-05-03 · DOI: 10.1145/3663478
Qiao Li, Yu Chen, Guanyu Wu, Yajuan Du, Min Ye, Xinbiao Gan, Jie Zhang, Zhirong Shen, Jiwu Shu, Chun Xue
Abstract: As NAND flash bit density and stacking technologies advance, storage capacity keeps increasing while reliability becomes an increasingly prominent concern. Low-density parity-check (LDPC) codes, as robust error-correcting codes, are extensively employed in flash memory; however, when the raw bit error rate (RBER) is prohibitively high, LDPC decoding introduces long latency. To study how LDPC performs on the latest 3D NAND flash, we conduct a comprehensive analysis of decoding performance using both a theoretically derived threshold-voltage distribution model (the Modeling-based method) and the actual voltage distribution collected from on-chip data through testing (the Ideal case). From LDPC decoding results under various interference conditions, we summarize four findings that give a better understanding of LDPC decoding behavior in 3D NAND flash. Following this characterization, we identify the decoding-performance gap between the Modeling-based method and the Ideal case: because of limits on the accuracy of the initial probability information, the modeled threshold-voltage distribution deviates from the actual distribution, producing a performance gap between the two. Observing the abnormal decoding behaviors under the Modeling-based method, we introduce an Offsetted Read Voltage (ΔRV) method that optimizes LDPC decoding by offsetting the read voltage in each layer of a flash block. Evaluation shows that ΔRV enhances the decoding performance of the Modeling-based method, reducing the total number of sensing levels needed for LDPC decoding by 0.67% to 18.92% on average across interference conditions, at P/E cycles from 3000 to 7000.
Citations: 0
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems
ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-04-26 · DOI: 10.1145/3661998
Junkaixuan Li, Yi Kang
Abstract: With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is the foundation of distributed graph processing, but dynamic changes in a graph shift an existing partition away from its optimized point and degrade system performance, so more efficient dynamic partitioning methods are needed. In this work, we propose GraphSER, a dynamic graph partitioning method for many-core systems. To improve cross-node spatial locality and reduce repartitioning overhead, we propose a stream-based edge repartition in which each computing node sequentially traverses its local edge list in parallel and migrates edges based on distance and replica degree. GraphSER needs no costly searching and prioritizes nodes, so it avoids poor cross-node spatial locality. Our evaluation shows that, compared to state-of-the-art edge-repartition software methods, GraphSER achieves an average speedup of 1.52×, up to a maximum of 2×; compared to the previous many-core hardware repartitioning method, GraphSER improves performance by 40% on average, up to a maximum of 117%.
Citations: 0
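As background for edge repartitioning, the sketch below shows a generic greedy streaming edge partitioner, the setting GraphSER builds on: each edge is assigned to the partition that already replicates one of its endpoints, with ties broken toward the least-loaded partition. This is an assumed simplification for illustration, not GraphSER's distance- and replica-degree-aware migration policy:

```python
def stream_partition(edges, k):
    # Streaming edge partitioning into k parts.
    parts = [set() for _ in range(k)]   # vertex replicas held by each partition
    loads = [0] * k                     # edges assigned to each partition
    assignment = []
    for u, v in edges:
        def score(p):
            # Prefer partitions already replicating u or v (fewer new
            # replicas, better locality); then prefer lighter load.
            return ((u in parts[p]) + (v in parts[p]), -loads[p])
        p = max(range(k), key=score)
        parts[p].update((u, v))
        loads[p] += 1
        assignment.append(p)
    return assignment, loads
```

A repartitioner like GraphSER revisits such an assignment as the graph changes, migrating edges whose placement has become poor; doing that by streaming local edge lists, rather than by global search, is what keeps its overhead low.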