ACM Transactions on Architecture and Code Optimization: Latest Articles, Page 9

MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3603113

Xinfeng Xie, Peng Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie

Abstract: With the growing number of data-intensive workloads, the GPU, the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near-bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general-purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging.

To address these issues, we propose Memory-centric Processing Unit (MPU), the first SIMT processor based on 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute-logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple activated row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPU's hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate 3.46× speedup and 2.57× energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.

Citations: 0

The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3600089

Yufeng Zhou, Alan L. Cox, Sandhya Dwarkadas, Xiaowan Dong

Abstract: As the volume of data processed by applications has increased, considerable attention has been paid to data address translation overheads, leading to the widespread use of larger page sizes ("superpages") and multi-level translation lookaside buffers (TLBs). However, far less attention has been paid to instruction address translation and its relation to TLB and pipeline structure. In prior work, we quantified the impact of using code superpages on a variety of widely used applications, ranging from compilers to web user-interface frameworks, and the impact of sharing page table pages for executables and shared libraries. In this article, we augment those results by first uncovering the effects that microarchitectural differences between Intel Skylake and AMD Zen+, particularly their different TLB organizations, have on instruction address translation overhead. This analysis provides some key insights into the microarchitectural design decisions that impact the cost of instruction address translation.

First, a lower-level (level 2) TLB that has both instruction and data mappings competing for space within the same structure allows better overall performance and utilization when using code superpages. Code superpages not only reduce instruction address translation overhead but also indirectly reduce data address translation overhead. In fact, for a few applications, the use of just a few code superpages has a larger impact on overall performance than the use of a much larger number of data superpages.

Second, a level 1 (L1) TLB with separate structures for different page sizes may require careful tuning of the superpage promotion policy for code and can lead to correspondingly suboptimal utilization of the level 2 TLB. In particular, increasing the number of superpages when the size of the L1 superpage structure is small may result in more L1 TLB misses for some applications. Moreover, on some microarchitectures, the cost of these misses can be highly variable, because replacement is delayed until all of the in-flight instructions mapped by the victim entry are retired. Hence, more superpage promotions can result in a performance regression.

Finally, our findings also make a case for first-class OS support for superpages on ordinary files containing executables and shared libraries, as well as a more aggressive superpage policy for code.
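
As a rough illustration of why superpages lower translation overhead, the sketch below computes TLB reach (entry count multiplied by page size) and the number of pages needed to map a code footprint. The entry counts and the 8 MiB footprint are hypothetical values chosen for the arithmetic; they are not taken from the article or from Skylake or Zen+ documentation.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_size


def pages_needed(footprint: int, page_size: int) -> int:
    """Number of pages required to map a footprint (ceiling division)."""
    return -(-footprint // page_size)


# Hypothetical structure sizes, for illustration only.
BASE_ENTRIES = 64       # assumed entries for 4 KiB pages
SUPERPAGE_ENTRIES = 8   # assumed entries for 2 MiB pages
CODE_FOOTPRINT = 8 * MIB

if __name__ == "__main__":
    print("4 KiB reach:", tlb_reach_bytes(BASE_ENTRIES, 4 * KIB) // KIB, "KiB")
    print("2 MiB reach:", tlb_reach_bytes(SUPERPAGE_ENTRIES, 2 * MIB) // MIB, "MiB")
    print("4 KiB pages for code:", pages_needed(CODE_FOOTPRINT, 4 * KIB))
    print("2 MiB pages for code:", pages_needed(CODE_FOOTPRINT, 2 * MIB))
```

Under these assumed sizes, the small superpage structure covers far more code than the larger base-page structure. It also holds only a handful of mappings, which is consistent with the abstract's warning that promoting too many superpages can increase L1 TLB misses.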

Citations: 0

Turn-based Spatiotemporal Coherence for GPUs

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3593054

Sooraj Puthoor, Mikko H. Lipasti

Citations: 0

TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3597611

Gokul Subramanian Ravi, Tushar Krishna, Mikko Lipasti

Abstract: The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity in real-world systems further complicates the ability to reach ideal wire-only delay.

In this work, we propose TNT, or Transparent Network Traversal. TNT targets ideal network latency by attempting source-to-destination network traversal as a single multi-cycle "long-hop", bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular, tile-scalable manner via a novel control path that performs neighbor-to-neighbor interactions but enables end-to-end transparent flit traversal. Further, TNT's fine-grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip.

Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains of up to 38%. Further, it can achieve more than 3x the benefits of the best/closest alternative research proposal, SMART [43].
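
The hop-quantization effect described above can be made concrete with a toy latency model. This is a sketch under stated assumptions, not TNT's actual timing model: the per-hop router and link cycle counts, the per-hop wire delay, and the clock period are all hypothetical parameters.

```python
import math


def hop_quantized_latency(hops: int, router_cycles: int, link_cycles: int) -> int:
    """Cycles when every hop pays a whole number of router and link cycles."""
    return hops * (router_cycles + link_cycles)


def long_hop_latency(hops: int, wire_delay_per_hop_ns: float, clock_period_ns: float) -> int:
    """Cycles when the whole path is one multi-cycle traversal.

    Only the accumulated wire delay is rounded up to a cycle boundary,
    once, at the destination.
    """
    return math.ceil(hops * wire_delay_per_hop_ns / clock_period_ns)


if __name__ == "__main__":
    # Hypothetical numbers: 6 hops, 1 router cycle and 1 link cycle per hop,
    # 0.25 ns of wire per hop, 1 ns clock.
    print("hop-by-hop cycles:", hop_quantized_latency(6, 1, 1))
    print("single long-hop cycles:", long_hop_latency(6, 0.25, 1.0))
```

The gap between the two results comes from rounding: the hop-by-hop path rounds up at every router, while the long-hop path rounds up once.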

Citations: 0

Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3593055

Ruobing Chen, Haosen Shi, Jinping Wu, Yusen Li, Xiaoguang Liu, Gang Wang

Abstract: Colocating multiple jobs on the same server has been widely applied for improving resource utilization in cloud datacenters. However, the colocated jobs contend for the shared resources, which can lead to significant performance degradation. An efficient approach for eliminating performance interference is to partition the shared resources among the colocated jobs. However, this makes resource management in datacenters very challenging. In this paper, we propose JointOPT, the first resource management framework that optimizes job assignment and resource partitioning jointly for improving the throughput of cloud datacenters. JointOPT uses a local-search-based algorithm to find a near-optimal job assignment configuration, and uses a deep reinforcement learning (DRL) based approach to dynamically partition the shared resources among the colocated jobs. To reduce the interaction overhead with real systems, it leverages deep learning to estimate job performance without running the jobs on real servers. We conduct extensive experiments to evaluate JointOPT, and the results show that JointOPT significantly outperforms the state-of-the-art baselines, with an advantage ranging from 13.3% to 47.7%.
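
The abstract does not specify JointOPT's local search, so the following is only a generic sketch of local search over job-to-server assignments. It uses swap-based hill climbing as a stand-in technique and a hypothetical pairwise interference table as the cost function. It does not model resource partitioning or the learned performance estimator.

```python
from itertools import combinations


def total_interference(assignment, interference):
    """Sum the pairwise interference of all jobs that share a server."""
    servers = {}
    for job, server in assignment.items():
        servers.setdefault(server, []).append(job)
    cost = 0.0
    for jobs in servers.values():
        for a, b in combinations(sorted(jobs), 2):
            cost += interference[(a, b)]
    return cost


def local_search(assignment, interference):
    """Swap jobs across servers for as long as a swap lowers the total cost."""
    current = dict(assignment)
    best = total_interference(current, interference)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(current), 2):
            if current[a] == current[b]:
                continue  # swapping within one server changes nothing
            current[a], current[b] = current[b], current[a]
            cost = total_interference(current, interference)
            if cost < best:
                best = cost
                improved = True
            else:
                current[a], current[b] = current[b], current[a]  # undo the swap
    return current, best


# Hypothetical interference scores, keyed by alphabetically ordered job pairs.
INTERFERENCE = {
    ("A", "B"): 5, ("A", "C"): 1, ("A", "D"): 1,
    ("B", "C"): 1, ("B", "D"): 1, ("C", "D"): 5,
}
START = {"A": 0, "B": 0, "C": 1, "D": 1}

if __name__ == "__main__":
    print(local_search(START, INTERFERENCE))
```

Hill climbing of this kind stops at a local optimum, which matches the abstract's wording of a near-optimal rather than optimal assignment.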

Citations: 0

Cache Programming for Scientific Loops Using Leases

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3600090

Benjamin Reber, Matthew Gould, Alexander H. Kneipp, Fangzhou Liu, Ian Prechtl, Chen Ding, Linlin Chen, Dorin Patru

Citations: 0

ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3587480

Abdul Rasheed Sahni, Hamza Omar, Usman Ali, Omer Khan

Abstract: With the ever-increasing virtualization of software and hardware, the privacy of user-sensitive data is a fundamental concern in computation outsourcing. Secure processors enable a trusted execution environment to guarantee security properties based on the principles of isolation, sealing, and integrity. However, the shared hardware resources within the microarchitecture are increasingly being used by co-located adversarial software to create timing-based side-channel attacks. State-of-the-art secure processors implement the strong isolation primitive to enable non-interference for shared hardware but suffer from frequent state purging and resource utilization overheads, leading to degraded performance. This article proposes ASM, an adaptive secure multicore architecture that enables a reconfigurable, yet strongly isolated execution environment. For outsourced security-critical processes, the proposed security kernel and hardware extensions allow either a given process to execute using all available cores or multiple processes to co-execute on strongly isolated clusters of cores. This spatio-temporal execution environment is configured based on the resource demands of processes, such that the secure processor mitigates state purging overheads and maximizes hardware resource utilization.

Citations: 0

GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3600091

Jin Zhao, Yu Zhang, Ligang He, Qikun Li, Xiang Zhang, Xinyu Jiang, Hui Yu, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, Ji Zhang, Xianzheng Song, Lin Wang, Jun Zhou

Abstract: With the increasing need for graph analysis, massive Concurrent iterative Graph Processing (CGP) jobs are usually performed on a common large-scale real-world graph. Although several solutions have been proposed, these CGP jobs are not coordinated with consideration of the inherent dependencies in graph data driven by graph topology. As a result, they suffer from redundant and fragmented accesses to the same underlying graph dispersed over a distributed platform, because the same graph is typically traversed irregularly by these jobs along different paths at the same time.

In this work, we develop GraphTune, which can be integrated into existing distributed graph processing systems, such as D-Galois, Gemini, PowerGraph, and Chaos, to efficiently perform CGP jobs and enhance system throughput. The key component of GraphTune is a dependency-aware synchronous execution engine in conjunction with several optimization strategies based on the constructed cross-iteration dependency graph of chunks. Specifically, GraphTune transparently regularizes the processing behavior of the CGP jobs in a novel synchronous way and assigns the chunks of graph data to be handled by them based on the topological order of the dependency graph so as to maximize performance. In this way, it can transform the irregular accesses of the chunks into more regular ones so that as many CGP jobs as possible can fully share the data accesses to the common graph. Meanwhile, it also efficiently synchronizes the communications launched by different CGP jobs based on the dependency graph to minimize the communication cost. We integrate it into four cutting-edge distributed graph processing systems and a popular out-of-core graph processing system to demonstrate the efficiency of GraphTune. Experimental results show that GraphTune improves the throughput of CGP jobs by 3.1 to 6.2, 3.8 to 8.5, 3.5 to 10.8, 4.3 to 12.4, and 3.8 to 6.9 times over D-Galois, Gemini, PowerGraph, Chaos, and GraphChi, respectively.
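
Scheduling chunks by the topological order of a dependency graph can be sketched with Kahn's algorithm. The chunk names and edges below are hypothetical, and the sketch assumes the chunk dependency graph is acyclic. How GraphTune constructs its cross-iteration dependency graph is not described in the abstract and is not modelled here.

```python
from collections import deque


def topological_order(dependents):
    """Return chunks in an order where each chunk follows all its prerequisites.

    `dependents` maps a chunk to the chunks that depend on it.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {chunk: 0 for chunk in dependents}
    for targets in dependents.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    # Start from chunks with no unmet dependencies; sort for determinism.
    ready = deque(sorted(c for c, d in indegree.items() if d == 0))
    order = []
    while ready:
        chunk = ready.popleft()
        order.append(chunk)
        for target in dependents.get(chunk, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(indegree):
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical chunk dependencies: c0 feeds c1 and c2, which both feed c3.
CHUNK_DEPENDENTS = {"c0": ["c1", "c2"], "c1": ["c3"], "c2": ["c3"], "c3": []}

if __name__ == "__main__":
    print(topological_order(CHUNK_DEPENDENTS))
```

If every concurrent job walks the chunks in this one shared order, a chunk loaded once can serve all jobs that need it, which is the access-sharing effect the abstract describes.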

Citations: 0

Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3605149

Jiazhi Jiang, Zijian Huang, Dan Huang, Jiangsu Du, Lin Chen, Ziguan Chen, Yutong Lu

Abstract: The tremendous success of the convolutional neural network (CNN) has made it ubiquitous in many fields of human endeavor. Many applications, such as biomedical analysis and scientific data analysis, involve analyzing volumetric data. This creates a huge demand for 3D-CNNs. Although accelerators such as GPUs may provide higher throughput on deep learning applications, they may not be available in all scenarios. The CPU, especially the many-core CPU with non-uniform memory access (NUMA) architecture, remains an attractive choice for deep learning inference in many scenarios.

In this article, we propose a distributed inference solution for 3D-CNNs that targets the emerging ARM many-core CPU platform. A hierarchical partition approach is proposed to accelerate 3D-CNN inference by exploiting characteristics of memory and cache on ARM many-core CPUs. Based on the hierarchical model partition approach, other optimization techniques such as NUMA-aware thread scheduling and optimization of 3D-img2row convolution are designed to exploit the potential of ARM many-core CPUs for 3D-CNNs. We evaluate our proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Our experimental results show that our solution can boost the performance of 3D-CNN inference and achieve much better scalability, with a negligible fluctuation in accuracy. When employing our 3D-CNN inference solution on ACL libraries, it can outperform naive ACL implementations by 11× to 50× on an ARM many-core processor. When employing our 3D-CNN inference solution on NCNN libraries, it can outperform the naive NCNN implementations by 5.2× to 14.2× on an ARM many-core processor.
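
The abstract names 3D-img2row convolution without detailing it. The sketch below shows the general img2row idea in three dimensions: each output voxel's receptive field is unfolded into one row, so the convolution becomes a dot product of every row with the flattened kernel. It assumes a single channel, stride 1, and no padding, and it is plain Python for clarity rather than the authors' optimized ARM implementation.

```python
def img2row_3d(volume, k):
    """Unfold a D x H x W volume into rows, one flattened k*k*k patch per output voxel."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    rows = []
    for z in range(depth - k + 1):
        for y in range(height - k + 1):
            for x in range(width - k + 1):
                rows.append([
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(k)
                    for dy in range(k)
                    for dx in range(k)
                ])
    return rows


def conv3d_via_rows(volume, kernel):
    """3D convolution (cross-correlation form) as row-by-kernel dot products."""
    k = len(kernel)
    flat_kernel = [
        kernel[dz][dy][dx]
        for dz in range(k)
        for dy in range(k)
        for dx in range(k)
    ]
    return [
        sum(a * b for a, b in zip(row, flat_kernel))
        for row in img2row_3d(volume, k)
    ]


def filled(size, value):
    """Build a size x size x size nested list filled with one value."""
    return [[[value] * size for _ in range(size)] for _ in range(size)]


if __name__ == "__main__":
    # A 3x3x3 volume of ones with a 2x2x2 kernel of ones gives 8 outputs.
    print(conv3d_via_rows(filled(3, 1), filled(2, 1)))
```

Because the unfolded rows form a dense matrix, the inner dot products can be handed to a tuned matrix-multiply routine, and the rows can be split across threads or NUMA nodes.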

Citations: 0

SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs

IF 1.6 | Zone 3, Computer Science

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-07-10 DOI: 10.1145/3608476

Dong Huang, D. Feng, Qian-Qian Liu, Bo Ding, Wei Zhao, Xueliang Wei, Wei Tong

Abstract: The Zoned Namespace (ZNS) Solid State Drive (SSD) is a nascent form of storage device that offers novel prospects for the Log Structured Merge Tree (LSM-tree). ZNS exposes erase blocks in the SSD as append-only zones, enabling the LSM-tree to gain awareness of the physical layout of data. Nevertheless, an LSM-tree on ZNS SSDs necessitates Garbage Collection (GC) owing to the mismatch between the gigantic zones and relatively small Sorted String Tables (SSTables). Through extensive experiments, we observe that a smaller zone size can reduce data migration in GC at the cost of a significant performance decline owing to inadequate parallelism exploitation. In this article, we present SplitZNS, which introduces small zones by tweaking the zone-to-chip mapping to maximize GC efficiency for LSM-trees on ZNS SSDs. Following the multi-level peculiarity of the LSM-tree and the inherent parallel architecture of ZNS SSDs, we propose a number of techniques to leverage and accelerate small zones and so alleviate the performance impact of underutilized parallelism: (1) selective use of small zones, to avoid exacerbating write slowdowns and stalls caused by their suboptimal performance; (2) SubZone Ring, which employs a per-chip FIFO buffer to imitate a large-zone writing style and thereby enhance parallelism utilization; (3) Read Prefetcher, which prefetches data concurrently through multiple chips during compactions; and (4) Read Scheduler, which assigns query requests the highest priority. We build a prototype integrated with SplitZNS to validate its efficiency and efficacy. Experimental results demonstrate that SplitZNS achieves up to 2.77× performance and reduces data migration considerably compared to lifetime-based data placement.

Citations: 0