Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee
{"title":"Architecting Compatible PIM Protocol for CPU-PIM Collaboration","authors":"Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee","doi":"10.1109/LCA.2024.3432936","DOIUrl":"10.1109/LCA.2024.3432936","url":null,"abstract":"Processing in Memory (PIM) technology is gaining traction with the introduction of several prototype products. However, the interfaces of existing PIM devices hinder CPU performance excessively by delaying normal memory requests for long periods during PIM operations. In this paper, we propose a new PIM command and protocol designed for compatibility across various PIM devices and host processors, focusing on DRAM standards with limited command space. Our proposed command, PIM-ACT, activates multiple banks simultaneously with assigning the specific PIM operation. It closely follows the functionality of the ACT command for straightforward control by the memory controller. We also explore memory scheduling policies that balance the latency of conventional memory requests with the throughput of PIM workloads. Our evaluation demonstrates the effectiveness of our approach in optimizing both PIM and conventional workload performance.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"183-186"},"PeriodicalIF":1.4,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141778103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos","authors":"Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu","doi":"10.1109/LCA.2024.3422492","DOIUrl":"10.1109/LCA.2024.3422492","url":null,"abstract":"As the need for processing long contexts in large language models (LLMs) increases, attention-based LLMs face significant challenges due to their high computation and memory requirements. To overcome this challenge, there have been several recent works that seek to alleviate attention's system-level bottlenecks. An approach that has been receiving a lot of attraction lately is state space models (SSMs) thanks to their ability to substantially reduce computational complexity and memory footprint. Despite the excitement around SSMs, there is a lack of an in-depth characterization and analysis on this important model architecture. In this paper, we delve into a representative SSM named Hungry Hungry Hippos (H3), examining its advantages as well as its current limitations. We also discuss future research directions on improving the efficiency of SSMs via hardware architectural support.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"154-157"},"PeriodicalIF":1.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141547667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirical Architectural Analysis on Performance Scalability of Petascale All-Flash Storage Systems","authors":"Mohammadamin Ajdari;Behrang Montazerzohour;Kimia Abdi;Hossein Asadi","doi":"10.1109/LCA.2024.3418874","DOIUrl":"10.1109/LCA.2024.3418874","url":null,"abstract":"In this paper, we \u0000<italic>first</i>\u0000 analyze a real storage system consisting of 72 SSDs utilizing either \u0000<italic>Hardware RAID</i>\u0000 (HW-RAID) or \u0000<italic>Software RAID</i>\u0000 (SW-RAID), and show that SW-RAID is up to 7× faster. We then reveal that with an increasing number of SSDs, the limited I/O parallelism in SAS controllers and multi-enclosure handshaking overheads cause a significant performance drop, minimizing the total \u0000<italic>I/O Per Second</i>\u0000 (IOPS) of a 144-SSD system to less than a single SSD. \u0000<italic>Second</i>\u0000, we disclose the most important architectural parameters that affect a large-scale storage system. \u0000<italic>Third</i>\u0000, we propose a framework that models a large-scale storage system and estimates the system IOPS and system resource usage for various architectures. We verify our framework against a real system and show its high accuracy. \u0000<italic>Lastly</i>\u0000, we analyze a use case of a 240-SSD system and reveal how our framework guides architects in storage system scaling.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"158-161"},"PeriodicalIF":1.4,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture","authors":"Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn","doi":"10.1109/LCA.2024.3418448","DOIUrl":"10.1109/LCA.2024.3418448","url":null,"abstract":"Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage, among which GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, their computational complexity in a single PBS impedes widespread adoption. However, at the application level, there are enough number of independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short due to their outdated implementations that cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"207-210"},"PeriodicalIF":1.4,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10570278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141532506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TeleVM: A Lightweight Virtual Machine for RISC-V Architecture","authors":"Tianzheng Li;Enfang Cui;Yuting Wu;Qian Wei;Yue Gao","doi":"10.1109/LCA.2024.3394835","DOIUrl":"10.1109/LCA.2024.3394835","url":null,"abstract":"Serverless computing has become an important paradigm in cloud computing due to its advantages such as fast large-scale deployment and pay-as-you-go charging model. Due to shared infrastructure and multi-tenant environments, serverless applications have high security requirements. Traditional virtual machines and containers cannot fully meet the requirements of serverless applications. Therefore, lightweight virtual machine technology has emerged, which can reduce overhead and boot time while ensuring security. In this letter, we propose TeleVM, a lightweight virtual machine for RISC-V architecture. TeleVM can achieve strong isolation through the hypervisor extension of RISC-V. Compared with traditional virtual machines, TeleVM only implements a small number of IO devices and functions, which can effectively reduce memory overhead and boot time. We compared TeleVM and QEMU+KVM through experiments. Compared to QEMU+KVM, the boot time and memory overhead of TeleVM have decreased by 74% and 90% respectively. This work further improves the cloud computing software ecosystem of RISC-V architecture and promotes the use of RISC-V architecture in cloud computing scenarios.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"121-124"},"PeriodicalIF":2.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Data Transfer Bottlenecks in Commercial PIM Systems: A Study With UPMEM-PIM","authors":"Dongjae Lee;Bongjoon Hyun;Taehun Kim;Minsoo Rhu","doi":"10.1109/LCA.2024.3387472","DOIUrl":"10.1109/LCA.2024.3387472","url":null,"abstract":"Due to emerging workloads that require high memory bandwidth, Processing-in-Memory (PIM) has gained significant attention and led several industrial PIM products to be introduced which are integrated with conventional computing systems. This letter characterizes the data transfer overheads between conventional DRAM address space and PIM address space within a PIM-integrated system using the commercialized PIM device made by UPMEM. Our findings highlight the need for optimization in PIM-integrated systems to address these overheads, offering critical insights for future PIM technologies.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"179-182"},"PeriodicalIF":1.4,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140569511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shiyan Yi;Yudi Qiu;Lingfei Lu;Guohao Xu;Yong Gong;Xiaoyang Zeng;Yibo Fan
{"title":"GATe: Streamlining Memory Access and Communication to Accelerate Graph Attention Network With Near-Memory Processing","authors":"Shiyan Yi;Yudi Qiu;Lingfei Lu;Guohao Xu;Yong Gong;Xiaoyang Zeng;Yibo Fan","doi":"10.1109/LCA.2024.3386734","DOIUrl":"10.1109/LCA.2024.3386734","url":null,"abstract":"Graph Attention Network (GAT) has gained widespread adoption thanks to its exceptional performance. The critical components of a GAT model involve aggregation and attention, which cause numerous main-memory access. Recently, much research has proposed near-memory processing (NMP) architectures to accelerate aggregation. However, graph attention requires additional operations distinct from aggregation, making previous NMP architectures less suitable for supporting GAT. In this paper, we propose GATe, a practical and efficient \u0000<underline>GAT</u>\u0000 acc\u0000<underline>e</u>\u0000lerator with NMP architecture. To the best of our knowledge, this is the first time that accelerates both attention and aggregation computation on DIMM. In the attention and aggregation phases, we unify feature vector access to reduce repetitive memory accesses and refine the computation flow to reduce communication. Furthermore, we introduce a novel sharding method that enhances the data reusability. Experiments show that our work achieves substantial speedup of up to 6.77× and 2.46×, respectively, compared to state-of-the-art NMP works GNNear and GraNDe.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"87-90"},"PeriodicalIF":2.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140569407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Area Efficient Architecture of a Novel Chaotic System for High Randomness Security in e-Health","authors":"Kyriaki Tsantikidou;Nicolas Sklavos","doi":"10.1109/LCA.2024.3387352","DOIUrl":"10.1109/LCA.2024.3387352","url":null,"abstract":"An e-Health application must be carefully designed, as a malicious attack has ethical and legal consequences. While common cryptography protocols enhance security, they also add high computation overhead. In this letter, an area efficient architecture of a novel chaotic system for high randomness security is proposed. It consists of the chaotic logistic map and a novel component that efficiently combines it with a block cipher's key generation function. The proposed architecture operates as both a key scheduling/management scheme and a stream cipher. All operations are implemented in an FPGA with appropriate resource utilization techniques. The proposed architecture achieves smaller area consumption, minimum 41.5%, compared to published cryptography architectures and a 5.7% increase in throughput-to-area efficiency compared to published chaotic designs. Finally, it passes all NIST randomness tests, presents avalanche effect and produces the highest number of random bits with a single seed compared to other published security systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"104-107"},"PeriodicalIF":2.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140569832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Importance of Generalizability in Machine Learning for Systems","authors":"Varun Gohil;Sundar Dev;Gaurang Upasani;David Lo;Parthasarathy Ranganathan;Christina Delimitrou","doi":"10.1109/LCA.2024.3384449","DOIUrl":"10.1109/LCA.2024.3384449","url":null,"abstract":"Using machine learning (ML) to tackle computer systems tasks is gaining popularity. One of the shortcomings of such ML-based approaches is the inability of models to generalize to out-of-distribution data i.e., data whose distribution is different than the training dataset. We showcase that this issue exists in cloud environments by analyzing various ML models used to improve resource balance in Google's fleet. We discuss the trade-offs associated with different techniques used to detect out-of-distribution data. Finally, we propose and demonstrate the efficacy of using Bayesian models to detect the model's confidence in its output when used to improve cloud server resource balance.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"95-98"},"PeriodicalIF":2.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140569515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MajorK: Majority Based kmer Matching in Commodity DRAM","authors":"Z. Jahshan;L. Yavits","doi":"10.1109/LCA.2024.3384259","DOIUrl":"10.1109/LCA.2024.3384259","url":null,"abstract":"Fast parallel search capabilities on large datasets are required across multiple application domains. One such domain is genome analysis, which requires high-performance \u0000<i>k</i>\u0000mer matching in large genome databases. Recently proposed solutions implemented \u0000<i>k</i>\u0000mer matching in DRAM, utilizing its sheer capacity and parallelism. However, their operation is essentially bit-serial, which ultimately limits the performance, especially when matching long strings, as customary in genome analysis pipelines. The proposed solution, MajorK, enables bit-parallel majority based \u0000<i>k</i>\u0000mer matching in an unmodified commodity DRAM. MajorK employs multiple DRAM row activation, where the search patterns (query \u0000<i>k</i>\u0000mers) are coded into DRAM addresses. We evaluate MajorK on viral genome \u0000<i>k</i>\u0000mer matching and show that it can achieve up to 2.7\u0000<inline-formula><tex-math>$ times $</tex-math></inline-formula>\u0000 higher performance while providing a better matching accuracy compared to state-of-the-art DRAM based \u0000<i>k</i>\u0000mer matching accelerators.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"83-86"},"PeriodicalIF":2.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140569825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}