{"title":"LADIO: Leakage-Aware Direct I/O for I/O-Intensive Workloads","authors":"Ipoom Jeong;Jiaqi Lou;Yongseok Son;Yongjoo Park;Yifan Yuan;Nam Sung Kim","doi":"10.1109/LCA.2023.3290427","DOIUrl":"10.1109/LCA.2023.3290427","url":null,"abstract":"The advancement in I/O technology has posed an unprecedented demand for high-performance processing on I/O data, leading to the development of Data Direct I/O (DDIO) technology. DDIO improves I/O processing efficiency by directly injecting all inbound I/O data into the last-level cache (LLC) in cooperation with any type of I/O device. Nonetheless, in certain scenarios with more than one I/O applications, DDIO may have sub-optimal performance caused by interference inside the LLC, resulting in the degradation of system performance. Especially, in this paper, we demonstrate that storage I/O on modern high-performance NVMe SSDs hardly benefits from DDIO, sometimes causing inefficient use of the shared LLC due to the “leaky DMA problem”. To address this problem, we propose \u0000<monospace>LADIO</monospace>\u0000, an adaptive approach that mitigates inter-application interference by dynamically controlling the DDIO functionality and reallocating LLC ways based on the leakage and locality of storage I/O data, respectively. In scenarios with heavy I/O interference, \u0000<monospace>LADIO</monospace>\u0000 improves the throughput of network-intensive applications by 20% while maintaining that of storage-intensive applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"77-80"},"PeriodicalIF":2.3,"publicationDate":"2023-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48507765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"T-CAT: Dynamic Cache Allocation for Tiered Memory Systems With Memory Interleaving","authors":"Hwanjun Lee;Seunghak Lee;Yeji Jung;Daehoon Kim","doi":"10.1109/LCA.2023.3290197","DOIUrl":"10.1109/LCA.2023.3290197","url":null,"abstract":"New memory interconnect technology, such as Intel's Compute Express Link (CXL), helps to expand memory bandwidth and capacity by adding CPU-less NUMA nodes to the main memory system, addressing the growing memory wall challenge. Consequently, modern computing systems embrace the heterogeneity in memory systems, composing the memory systems with a tiered memory system with near and far memory (e.g., local DRAM and CXL-DRAM). However, adopting NUMA interleaving, which can improve performance by exploiting node-level parallelism and utilizing aggregate bandwidth, to the tiered memory systems can face challenges due to differences in the access latency between the two types of memory, leading to potential performance degradation for memory-intensive workloads. By tackling the challenges, we first investigate the effects of the NUMA interleaving on the performance of the tiered memory systems. We observe that while NUMA interleaving is essential for applications demanding high memory bandwidth, it can negatively impact the performance of applications demanding low memory bandwidth. Next, we propose a dynamic cache management, called \u0000<monospace>T-CAT</monospace>\u0000, which partitions the last-level cache between near and far memory, aiming to mitigate performance degradation by accessing far memory. \u0000<monospace>T-CAT</monospace>\u0000 attempts to reduce the difference in the average access latency between near and far memory by re-sizing the cache partitions. Through dynamic cache management, \u0000<monospace>T-CAT</monospace>\u0000 can preserve the performance benefits of NUMA interleaving while mitigating performance degradation by the far memory accesses. Our experimental results show that \u0000<monospace>T-CAT</monospace>\u0000 improves performance by up to 17% compared to cases with NUMA interleaving without the cache management.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"73-76"},"PeriodicalIF":2.3,"publicationDate":"2023-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48978905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Accelerated Reusable Merkle Tree Generation for Bitcoin Blockchain Headers","authors":"Kiseok Jeon;Junghee Lee;Bumsoo Kim;James J. Kim","doi":"10.1109/LCA.2023.3289515","DOIUrl":"10.1109/LCA.2023.3289515","url":null,"abstract":"As the value of Bitcoin increases, the difficulty level of mining keeps increasing. This is generally addressed with application-specific integrated circuits (ASIC), but block candidates are still created by the software. The overhead of block candidate generation is relatively growing because the hash computation is boosted by ASIC. Additionally, it is getting harder to find the target nonce; If it is not found for a block candidate, a new block candidate must be generated. A new candidate can be generated to reduce the overhead of block candidate generation by modifying the coinbase without selecting and verifying transactions again. To this end, we propose a hardware accelerator for generating Merkle trees efficiently. The hash computation for Merkle tree generation is conducted with ASIC to reduce the overhead of block candidate generation, and the tree with only the modified coinbase is rapidly regenerated by reusing the intermediate results of the previously generated tree. Our simulation results demonstrate that the execution time can be reduced by up to 98.92% and power consumption by up to 99.73% when the number of transactions in a tree is 2048.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"69-72"},"PeriodicalIF":2.3,"publicationDate":"2023-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41591979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Containerized In-Storage Processing Model and Hardware Acceleration for Fully-Flexible Computational SSDs","authors":"Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Myoungsoo Jung","doi":"10.1109/lca.2023.3289828","DOIUrl":"https://doi.org/10.1109/lca.2023.3289828","url":null,"abstract":"In-storage processing (ISP) efficiently examines large datasets but faces performance and security challenges. We introduce DockerSSD, a flexible ISP model that runs various applications near flash without modification. It employs lightweight OS-level virtualization in modern SSDs for faster ISP and better storage intelligence with a high flexiblity. DockerSSD reuses existing Docker container images for real-time data processing without altering the storage interface or runtime. Our design includes a new communication method and virtual firmware, alongside automated container-related network and I/O handling hardware. DockerSSD achieves a 2× speed improvement and reduces system-level power by 35.7%, on average.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"26 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guard Cache: Creating Noisy Side-Channels","authors":"Fernando Mosquera;Krishna Kavi;Gayatri Mehta;Lizy John","doi":"10.1109/LCA.2023.3289710","DOIUrl":"10.1109/LCA.2023.3289710","url":null,"abstract":"Microarchitectural innovations such as deep cache hierarchies, out-of-order execution, branch prediction and speculative execution have made possible the design of processors that meet ever-increasing demands for performance. However, these innovations have inadvertently introduced vulnerabilities, which are exploited by side-channel attacks and attacks relying on speculative executions. Mitigating the attacks while preserving the performance has been a challenge. In this letter we present an approach to obfuscate cache timing, making it more difficult for side-channel attacks to succeed. We create \u0000<italic>false cache hits</i>\u0000 using a small \u0000<italic>Guard Cache</i>\u0000 with randomization, and \u0000<italic>false cache misses</i>\u0000 by randomly evicting cache lines. We show that our \u0000<italic>false hits</i>\u0000 and \u0000<italic>false misses</i>\u0000 cause very minimal performance penalties and our obfuscation can make it difficult for common side-channel attacks such as Prime &Probe, Flush &Reload or Evict &Time to succeed.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"97-100"},"PeriodicalIF":2.3,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41685733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA","authors":"Reoma Matsuo;Toru Koizumi;Hidetsugu Irie;Shuichi Sakai;Ryota Shioya","doi":"10.1109/LCA.2023.3289317","DOIUrl":"10.1109/LCA.2023.3289317","url":null,"abstract":"A graphics processing unit (GPU) is a processor that achieves high throughput by exploiting data parallelism. We found that many GPU workloads also contain instruction-level parallelism that can be extracted through out-of-order execution to provide additional performance improvement opportunities. We propose the TURBULENCE architecture for very low-cost out-of-order execution on GPUs. TURBULENCE consists of a novel ISA that introduces the concept of referencing operands by inter-instruction distance instead of register numbers, and a novel microarchitecture that executes the novel ISA. This distance-based operand has the property of not causing false dependencies. By exploiting this property, we achieve cost-effective out-of-order execution on GPUs without introducing expensive hardware such as a rename logic and a load-store queue. Simulation results show that TURBULENCE improves performance by 17.6% without increasing energy consumption over an existing GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"175-178"},"PeriodicalIF":1.4,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Practical 128-Bit General Purpose Microarchitectures","authors":"Chandana S. Deshpande;Arthur Perais;Frédéric Pétrot","doi":"10.1109/LCA.2023.3287762","DOIUrl":"10.1109/LCA.2023.3287762","url":null,"abstract":"Intel introduced 5-level paging mode to support 57-bit virtual address space in 2017. This, coupled to paradigms where backup storage can be accessed through load and store instructions (e.g., non volatile memories), lets us envision a future in which a 64-bit address space has become insufficient. In that event, the straightforward solution would be to adopt a flat 128-bit address space. In this early stage letter, we conduct high-level experiments that lead us to suggest a possible general-purpose processor micro-architecture providing 128-bit support with limited hardware cost.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"81-84"},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48523174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DVFaaS: Leveraging DVFS for FaaS Workflows","authors":"Achilleas Tzenetopoulos;Dimosthenis Masouros;Dimitrios Soudris;Sotirios Xydis","doi":"10.1109/LCA.2023.3288089","DOIUrl":"10.1109/LCA.2023.3288089","url":null,"abstract":"In this letter, we propose \u0000<italic>DVFaaS</i>\u0000, a per-core DVFS framework that utilizes control systems theory to assign \u0000<italic>just-enough</i>\u0000 frequency for the purpose of addressing the QoS requirements on serverless workflows comprising unseen functions. \u0000<italic>DVFaaS</i>\u0000 exploits the intermittent nature of serverless workflows, which enables staged control on distinguishable functions, which jointly contribute to the end-to-end latency. Our results show that \u0000<italic>DVFaaS</i>\u0000 considerably outperforms related work, reducing power consumption by up to 22%, with 2x fewer QoS violations.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"85-88"},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46172046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of a High-Performance, High-Endurance Key-Value SSD for Large-Key Workloads","authors":"Chanyoung Park;Chun-Yi Liu;Kyungtae Kang;Mahmut Kandemir;Wonil Choi","doi":"10.1109/LCA.2023.3282276","DOIUrl":"https://doi.org/10.1109/LCA.2023.3282276","url":null,"abstract":"Current KV-SSD design assumes a specific range of typical workloads, where the size of values is quite large while that of keys is relatively small. However, we find that (i) there exist another spectrum of workloads, whose key sizes are relatively large, compared to their value sizes, and (ii) the current KV-SSD design suffers from long tail latencies and low storage utilization under such large-key workloads. To this end, we present novel design of a KV-SSD (called LK-SSD), which can reduce tail latences and increase storage utilization under large-key workloads, and add an enhancement to it for longer device lifetime. Through extensive experiments, we show that LK-SSD is more suitable design for the large-key workloads, and also available for the typical workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"149-152"},"PeriodicalIF":2.3,"publicationDate":"2023-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kobold: Simplified Cache Coherence for Cache-Attached Accelerators","authors":"Jennifer Brana;Brian C. Schwedock;Yatin A. Manerkar;Nathan Beckmann","doi":"10.1109/LCA.2023.3269399","DOIUrl":"10.1109/LCA.2023.3269399","url":null,"abstract":"The ever-increasing cost of data movement in computer systems is driving a new era of data-centric computing. One of the most common data-centric paradigms is near-data computing (NDC), where accelerators are placed \u0000<italic>inside</i>\u0000 the memory hierarchy to avoid the costly transfer of data to the core. NDC systems show immense potential to improve performance and energy efficiency. Unfortunately, adding accelerators into the memory hierarchy incurs significant complexity for system integration because accelerators often require cache-coherent access to memory. The complex coherence protocols required to handle both cores and cache-attached accelerators result in significantly higher verification costs as well as an increase in directory state and on-chip network traffic. Furthermore, these mechanisms can cause cache pollution and worsen baseline processor performance. To simplify the integration of cache-attached accelerators, we present Kobold, a new coherence protocol and implementation which restricts the added complexity of an accelerator to its local tile. Kobold introduces a new directory structure within the L2 cache to track the accelerator's private cache and maintain coherence between the core and accelerator. A minor modification to the LLC protocol also enables accelerators to improve performance by bypassing the local L2. We verified Kobold's stable-state coherence protocols using the Murphi model checker and estimated area overhead using Cacti 7. Kobold simplifies integration of cache-attached accelerators, adds only 0.09% area over the baseline caches, and provides clear performance advantages versus naïve extensions of existing directory coherence protocols.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"41-44"},"PeriodicalIF":2.3,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43340299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}