IEEE Computer Architecture Letters: Latest Articles

Kobold: Simplified Cache Coherence for Cache-Attached Accelerators
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-04-21 DOI: 10.1109/LCA.2023.3269399
Jennifer Brana;Brian C. Schwedock;Yatin A. Manerkar;Nathan Beckmann
Abstract: The ever-increasing cost of data movement in computer systems is driving a new era of data-centric computing. One of the most common data-centric paradigms is near-data computing (NDC), where accelerators are placed inside the memory hierarchy to avoid the costly transfer of data to the core. NDC systems show immense potential to improve performance and energy efficiency. Unfortunately, adding accelerators into the memory hierarchy incurs significant complexity for system integration because accelerators often require cache-coherent access to memory. The complex coherence protocols required to handle both cores and cache-attached accelerators result in significantly higher verification costs as well as an increase in directory state and on-chip network traffic. Furthermore, these mechanisms can cause cache pollution and worsen baseline processor performance. To simplify the integration of cache-attached accelerators, we present Kobold, a new coherence protocol and implementation which restricts the added complexity of an accelerator to its local tile. Kobold introduces a new directory structure within the L2 cache to track the accelerator's private cache and maintain coherence between the core and accelerator. A minor modification to the LLC protocol also enables accelerators to improve performance by bypassing the local L2. We verified Kobold's stable-state coherence protocols using the Murphi model checker and estimated area overhead using Cacti 7. Kobold simplifies integration of cache-attached accelerators, adds only 0.09% area over the baseline caches, and provides clear performance advantages versus naïve extensions of existing directory coherence protocols.
Citations: 0
Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-04-19 DOI: 10.1109/LCA.2023.3268126
Jackson Melchert;Keyi Zhang;Yuchen Mei;Mark Horowitz;Christopher Torng;Priyanka Raina
Abstract: The architecture of a coarse-grained reconfigurable array (CGRA) interconnect has a significant effect on not only the flexibility of the resulting accelerator, but also its power, performance, and area. Design decisions that have complex trade-offs need to be explored to maintain efficiency and performance across a variety of evolving applications. This paper presents Canal, a Python-embedded domain-specific language (eDSL) and compiler for specifying and generating reconfigurable interconnects for CGRAs. Canal uses a graph-based intermediate representation (IR) that allows for easy hardware generation and tight integration with place and route tools. We evaluate Canal by constructing both a fully static interconnect and a hybrid interconnect with ready-valid signaling, and by conducting design space exploration of the interconnect architecture by modifying the switch box topology, the number of routing tracks, and the interconnect tile connections. Through the use of a graph-based IR for CGRA interconnects, the eDSL, and the interconnect generation system, Canal enables fast design space exploration and creation of CGRA interconnects.
Citations: 1
SmartIndex: Learning to Index Caches to Improve Performance
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-04-05 DOI: 10.1109/LCA.2023.3264478
Kevin Weston;Farabi Mahmud;Vahid Janfaza;Abdullah Muzahid
Abstract: Modern computers rely heavily on caches to achieve higher performance. Unfortunately, a cache indexing scheme can often cause an uneven distribution of addresses across cache sets resulting in many evictions of useful cache blocks. To address this issue, we propose SmartIndex, a self-optimized indexing scheme that leverages machine learning to actively learn the memory access pattern and dynamically adjust indexes to evenly distribute the cache lines across all sets in the cache, thereby reducing cache misses. Experimental results on a set of 26 memory-intensive applications show that for non-uniform applications, SmartIndex can reduce the misses per kilo instructions (MPKI) of a direct mapped cache by up to 39%, translating into an IPC speedup of 7.23% compared to the conventional power-of-two indexing scheme. Our experiments also show that SmartIndex can work with any cache associativity.
Citations: 0
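Illustrative note on the SmartIndex entry: the uneven set distribution that the abstract attributes to conventional indexing can be reproduced with a few lines of Python. This is only a sketch of the baseline power-of-two scheme, not of SmartIndex's learned indexing; the 64-byte line size, the 64-set direct-mapped geometry, and the function names are assumptions chosen for illustration.

```python
from collections import Counter

def power_of_two_index(addr: int, line_bytes: int = 64, num_sets: int = 64) -> int:
    """Conventional set index: drop the block offset, keep the low index bits."""
    return (addr // line_bytes) % num_sets

def set_histogram(addresses, line_bytes: int = 64, num_sets: int = 64) -> Counter:
    """Count how many of the given addresses map to each cache set."""
    return Counter(power_of_two_index(a, line_bytes, num_sets) for a in addresses)

# Assumed geometry: 64 sets * 64 B lines = 4096 B direct-mapped cache.
# A stride equal to the cache size sends every access to the same set,
# so each access evicts the block brought in by the previous one.
strided = [i * 4096 for i in range(16)]

# Consecutive lines, by contrast, spread over 16 different sets.
sequential = [i * 64 for i in range(16)]

print(set_histogram(strided))
print(len(set_histogram(sequential)))
```

With the strided pattern, all 16 addresses collide in a single set while the other 63 sets stay empty, which is the kind of imbalance an adaptive indexing scheme would aim to remove.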
An Intermediate Language for General Sparse Format Customization
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-28 DOI: 10.1109/LCA.2023.3262610
Jie Liu;Zhongyuan Zhao;Zijian Ding;Benjamin Brock;Hongbo Rong;Zhiru Zhang
Abstract: The inevitable trend of hardware specialization drives an increasing use of custom data formats in processing sparse workloads, which are typically memory-bound. These formats facilitate the automated generation of target-aware data layouts to improve memory access latency and bandwidth utilization. However, existing sparse tensor programming models and compilers offer little or no support for productively customizing the sparse formats. Moreover, since these frameworks adopt an attribute-based approach for format abstraction, they cannot easily be extended to support general format customization. To overcome this deficiency, we propose UniSparse, an intermediate language that provides a unified abstraction for representing and customizing sparse formats. We also develop a compiler leveraging the MLIR infrastructure, which supports adaptive customization of formats. We demonstrate the efficacy of our approach through experiments running commonly-used sparse linear algebra operations with hybrid formats on multiple different hardware targets, including an Intel CPU, an NVIDIA GPU, and a simulated processing-in-memory (PIM) device.
Citations: 0
XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-23 DOI: 10.1109/LCA.2023.3261136
Jueon Park;Hyojin Sung
Abstract: Deep learning (DL) model training must address the memory bottleneck to continue scaling. Processing-in-memory approaches can be a viable solution as they move computations near or into the memory, reducing substantial data movement. However, to deploy applications on such hardware, end-to-end software support is crucial for efficient computation mapping and scheduling as well as extensible code generation, but no consideration has been made for DL training workloads. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels to maximize parallelism based on GPU and NDPX costs, while providing a template-based code generator with low-level optimizations. The experiments showed that XLA-NDP provides up to a 41% speedup (24% on average) over the GPU baseline for four DL model training.
Citations: 0
Towards Improved Power Management in Cloud GPUs
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-22 DOI: 10.1109/LCA.2023.3278652
Pratyush Patel;Zibo Gong;Syeda Rizvi;Esha Choukse;Pulkit Misra;Thomas Anderson;Akshitha Sriraman
Abstract: As modern server GPUs are increasingly power intensive, better power management mechanisms can significantly reduce the power consumption, capital costs, and carbon emissions in large cloud datacenters. This letter uses diverse datacenter workloads to study the power management capabilities of modern GPUs. We find that current GPU management mechanisms have limited compatibility and monitoring support under cloud virtualization. They have sub-optimal, imprecise, and non-intuitive implementations of Dynamic Voltage and Frequency Scaling (DVFS) and power capping. Consequently, efficient GPU power management is not widely deployed in clouds today. To address these issues, we make actionable recommendations for GPU vendors and researchers.
Citations: 4
The Jaseci Programming Paradigm and Runtime Stack: Building Scale-Out Production Applications Easy and Fast
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-18 DOI: 10.1109/LCA.2023.3274038
Jason Mars;Yiping Kang;Roland Daynauth;Baichuan Li;Ashish Mahendra;Krisztian Flautner;Lingjia Tang
Abstract: Today's production scale-out applications include many sub-application components, such as storage backends, logging infrastructure and AI models. These components have drastically different characteristics, are required to work in collaboration, and interface with each other as microservices. This leads to increasingly high complexity in developing, optimizing, configuring, and deploying scale-out applications, raising the barrier to entry for most individuals and small teams. We developed a novel co-designed runtime system, Jaseci, and programming language, Jac, which aims to reduce this complexity. The key design principle throughout Jaseci's design is to raise the level of abstraction by moving as much of the scale-out data management, microservice componentization, and live update complexity into the runtime stack to be automated and optimized automatically. We use real-world AI applications to demonstrate Jaseci's benefit for application performance and developer productivity.
Citations: 0
Mitigating Timing-Based NoC Side-Channel Attacks With LLC Remapping
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-16 DOI: 10.1109/LCA.2023.3276709
Anurag Kar;Xueyang Liu;Yonghae Kim;Gururaj Saileshwar;Hyesoon Kim;Tushar Krishna
Abstract: Recent CPU microarchitectural attacks utilize contention over the NoC to mount covert and side-channel attacks on multicore CPUs and leak information from victim applications. We propose NoIR, a dynamic LLC slice selection mechanism using slice remapping to obfuscate interconnect contention patterns. NoIR reduces contention variance by 92.18% and mean IPC degradation due to cache invalidation is limited to 7.38% for SPEC CPU 2017 benchmarks for a 1000-access threshold. While previous defenses focused on redesigning the NoC and routing algorithms, we show that a top-down system-level approach can significantly raise the bar for a NoC security vulnerability with minimal modifications to the NoC hardware.
Citations: 0
RouteReplies: Alleviating Long Latency in Many-Chip-Module GPUs
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-13 DOI: 10.1109/LCA.2023.3255555
Xia Zhao;Guangda Zhang;Lu Wang;Yangmei Li;Yongjun Zhang
Abstract: GPU chip module count is expected to keep increasing to meet the strong scaling demands of parallel applications. In many-chip-module GPUs, memory access latency seriously limits the performance since the transferring latency between different GPU modules is very high, which cannot be easily hidden by switching between different ready threads. To handle this problem, we propose RouteReplies, which enables a GPU module to fetch data from other GPU modules in the routing path. Leveraging the data locality between different GPU modules, RouteReplies significantly reduces the memory access latency since the memory request does not need to fetch data from the faraway memory partition. For a set of applications exhibiting varying degrees of inter-module locality, RouteReplies reduces memory access latency and increases performance by 54.8% on average (up to 364.8%).
Citations: 0
Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2023-03-12 DOI: 10.1109/LCA.2023.3275909
Samer Kurzum;Gil Shomron;Freddy Gabbay;Uri Weiser
Abstract: Deep neural networks (DNNs) require abundant multiply-and-accumulate (MAC) operations. Thanks to DNNs’ ability to accommodate noise, some of the computational burden is commonly mitigated by quantization–that is, by using lower precision floating-point operations. Layer granularity is the preferred method, as it is easily mapped to commodity hardware. In this paper, we propose Dynamic Asymmetric Architecture (DAA), in which the micro-architecture decides what the precision of each MAC operation should be during runtime. We demonstrate a DAA with two data streams and a value-based controller that decides which data stream deserves the higher precision resource. We evaluate this mechanism in terms of accuracy on a number of convolutional neural networks (CNNs) and demonstrate its feasibility on top of a systolic array. Our experimental analysis shows that DAA potentially achieves 2x throughput improvement for ResNet-18 while saving 35% of the energy with less than 0.5% degradation in accuracy.
Citations: 0