{"title":"DRAMA: Commodity DRAM Based Content Addressable Memory","authors":"L. Yavits","doi":"10.1109/LCA.2023.3341830","DOIUrl":"10.1109/LCA.2023.3341830","url":null,"abstract":"Fast parallel search capabilities on large datasets provided by content addressable memories (CAM) are required across multiple application domains. However compared to RAM, CAMs feature high area overhead and power consumption, and as a result, they scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM) and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs compare operation in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single bit compare (XNOR) in DRAMA is identical to a regular DRAM read. AND and OR operations required for NAND CAM and NOR CAM respectively are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that DRAMA can achieve 3.6\u0000<inline-formula><tex-math>$ times $</tex-math></inline-formula>\u0000 higher performance and 19.6\u0000<inline-formula><tex-math>$ times $</tex-math></inline-formula>\u0000 lower power consumption compared to state-of-the-art CMOS CAM based genome classification accelerator.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"65-68"},"PeriodicalIF":2.3,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139160798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Deep Reinforcement Learning via Phase-Level Parallelism for Robotics Applications","authors":"Yang-Gon Kim;Yun-Ki Han;Jae-Kang Shin;Jun-Kyum Kim;Lee-Sup Kim","doi":"10.1109/LCA.2023.3341152","DOIUrl":"https://doi.org/10.1109/LCA.2023.3341152","url":null,"abstract":"Deep Reinforcement Learning (DRL) plays a critical role in controlling future intelligent machines like robots and drones. Constantly retrained by newly arriving real-world data, DRL provides optimal autonomous control solutions for adapting to ever-changing environments. However, DRL repeats inference and training that are computationally expensive on resource-constraint mobile/embedded platforms. Even worse, DRL produces a severe hardware underutilization problem due to its unique execution pattern. To overcome the inefficiency of DRL, we propose \u0000<italic>Train Early Start</i>\u0000, a new execution pattern for building the efficient DRL algorithm. \u0000<italic>Train Early Start</i>\u0000 parallelizes the inference and training execution, hiding the serialized performance bottleneck and improving the hardware utilization dramatically. Compared to the state-of-the-art mobile SoC, \u0000<italic>Train Early Start</i>\u0000 achieves 1.42x speedup and 1.13x energy efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"41-44"},"PeriodicalIF":2.3,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140063484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator","authors":"Courtney Golden;Dan Ilan;Caroline Huang;Niansong Zhang;Zhiru Zhang;Christopher Batten","doi":"10.1109/LCA.2023.3341389","DOIUrl":"https://doi.org/10.1109/LCA.2023.3341389","url":null,"abstract":"Recent work has explored compute-in-SRAM as a promising approach to overcome the traditional processor-memory performance gap. The recently released Associative Processing Unit (APU) from GSI Technology is, to our knowledge, the first commercial compute-in-SRAM accelerator. Prior work on this platform has focused on domain-specific acceleration using direct microcode programming and/or specialized libraries. In this letter, we demonstrate the potential for supporting a more general-purpose vector abstraction on the APU. We implement a virtual vector instruction set based on the recently proposed RISC-V Vector (RVV) extensions, analyze tradeoffs in instruction implementations, and perform detailed instruction microbenchmarking to identify performance benefits and overheads. This work is a first step towards general-purpose computing on domain-specific compute-in-SRAM accelerators.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"29-32"},"PeriodicalIF":2.3,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139976194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Intrinsic Redundancies in Dynamic Graph Neural Networks for Processing Efficiency","authors":"Deniz Gurevin;Caiwen Ding;Omer Khan","doi":"10.1109/LCA.2023.3340504","DOIUrl":"10.1109/LCA.2023.3340504","url":null,"abstract":"Modern dynamical systems are rapidly incorporating artificial intelligence to improve the efficiency and quality of complex predictive analytics. To efficiently operate on increasingly large datasets and intrinsically dynamic non-euclidean data structures, the computing community has turned to Graph Neural Networks (GNNs). We make a key observation that existing GNN processing frameworks do not efficiently handle the intrinsic dynamics in modern GNNs. The dynamic processing of GNN operates on the complete static graph at each time step, leading to repetitive redundant computations that introduce tremendous under-utilization of system resources. We propose a novel dynamic graph neural network (DGNN) processing framework that captures the dynamically evolving dataflow of the GNN semantics, i.e., graph embeddings and sparse connections between graph nodes. The framework identifies intrinsic redundancies in node-connections and captures representative node-sparse graph information that is readily ingested for processing by the system. Our evaluation on an NVIDIA GPU shows up to 3.5× speedup over the baseline setup that processes all nodes at each time step.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"170-174"},"PeriodicalIF":1.4,"publicationDate":"2023-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing the Reach and Reliability of Quantum Annealers by Pruning Longer Chains","authors":"Ramin Ayanzadeh;Moinuddin Qureshi","doi":"10.1109/LCA.2023.3340030","DOIUrl":"https://doi.org/10.1109/LCA.2023.3340030","url":null,"abstract":"Analog Quantum Computers (QCs), such as D-Wave's \u0000<italic>Quantum Annealers</i>\u0000 (\u0000<italic>QAs</i>\u0000) and QuEra's neutral atom platform, rival their digital counterparts in computing power. Existing QAs boast over 5,700 qubits, but their single-instruction operation model prevents using SWAP operations for making physically distant qubits adjacent. Instead, QAs use an \u0000<italic>embedding</i>\u0000 process to chain multiple \u0000<italic>physical qubits</i>\u0000 together, representing a \u0000<italic>program qubit</i>\u0000 with higher connectivity and reducing effective QA capacity by up to 33x. We observe that, post-embedding, nearly 25% of physical qubits remain unused, becoming trapped between chains. Additionally, we observe a “Power-Law” distribution in the chain lengths, where a few \u0000<italic>dominant chains</i>\u0000 possess significantly more qubits, thereby exerting a considerably more significant impact on both qubit utilization and isolation. Leveraging these insights, we propose \u0000<italic>Skipper</i>\u0000, a software technique designed to enhance the capacity and fidelity of QAs by skipping dominant chains and substituting their program qubit with two measurement outcomes. Using a 5761-qubit QA, we observed that by skipping up to eleven chains, the capacity increased by up to 59% (avg 28%), and the error decreased by up to 44% (avg 33%).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"25-28"},"PeriodicalIF":2.3,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139976212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tulip: Turn-Free Low-Power Network-on-Chip","authors":"Atiyeh Gheibi-Fetrat;Negar Akbarzadeh;Shaahin Hessabi;Hamid Sarbazi-Azad","doi":"10.1109/LCA.2023.3339646","DOIUrl":"https://doi.org/10.1109/LCA.2023.3339646","url":null,"abstract":"The semiconductor industry has seen significant technological advancements, leading to an increase in the number of processing cores in a system-on-chip (SoC). To facilitate communication among the numerous on-chip cores, a network-on-chip (NoC) is employed. One of the main challenges of designing NoCs is power management since the NoC consumes a significant portion of the total power of the SoC. Among the power-intensive components of the NoC, routers stand out. We observe that some power-intensive components of routers, responsible for implementing turn in the mesh topology, are underutilized compared to others. Therefore, we propose Tulip, a turn-free low-power network-in-chip, that avoids within-router turns by removing the corresponding components from the router structure. On a turn (e.g., at the end of the current dimension), Tulip forces the packet to be ejected and then reinjects it to the next dimension channel (i.e., the beginning of the path along the next dimension). Due to its deadlock-free nature, Tulip's scheme may be used orthogonally with any deterministic, partially-adaptive, and fully-adaptive routing algorithms, and can easily be extended for any n-dimensional mesh topology. Our analysis reveals that Tulip can reduce the static power and area by 24%−50% and 25%-55%, respectively, for 2D-5D mesh routers.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"5-8"},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139060173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-Accelerated Data Preprocessing for Personalized Recommendation Systems","authors":"Hyeseong Kim;Yunjae Lee;Minsoo Rhu","doi":"10.1109/LCA.2023.3336841","DOIUrl":"https://doi.org/10.1109/LCA.2023.3336841","url":null,"abstract":"Deep neural network (DNN)-based recommendation systems (RecSys) are one of the most successfully deployed machine learning applications in commercial services for predicting ad click-through rates or rankings. While numerous prior work explored hardware and software solutions to reduce the training time of RecSys, its end-to-end training pipeline including the data preprocessing stage has received little attention. In this work, we provide a comprehensive analysis of RecSys data preprocessing, root-causing the feature generation and normalization steps to cause a major performance bottleneck. Based on our characterization, we explore the efficacy of an FPGA-accelerated RecSys preprocessing system that achieves a significant 3.4–12.1× end-to-end speedup compared to the baseline CPU-based RecSys preprocessing system.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"7-10"},"PeriodicalIF":2.3,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139504430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Redundant Array of Independent Memory Devices","authors":"Peiyun Wu;Trung Le;Zhichun Zhu;Zhao Zhang","doi":"10.1109/LCA.2023.3334989","DOIUrl":"https://doi.org/10.1109/LCA.2023.3334989","url":null,"abstract":"DRAM memory reliability is increasingly a concern as recent studies found. In this letter, we propose RAIMD (Redundant Array of Independent Memory Devices), an energy-efficient memory organization with RAID-like error protection. In this organization, each memory device works as an independent memory module to serve a whole memory request and to support error detection and error recovery. It relies on the high data rate of modern memory device to minimize the performance impact of increased data transfer time. RAIMD provides chip-level error protection similar to Chipkill but with significant energy savings. Our simulation results indicate that RAIMD can save memory energy by 26.3% on average with a small performance overhead of 5.3% on DDR5-4800 memory systems for SPEC2017 multi-core workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"181-184"},"PeriodicalIF":2.3,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138633920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator","authors":"Haocong Luo;Yahya Can Tuğrul;F. Nisa Bostancı;Ataberk Olgun;A. Giray Yağlıkçı;Onur Mutlu","doi":"10.1109/LCA.2023.3333759","DOIUrl":"https://doi.org/10.1109/LCA.2023.3333759","url":null,"abstract":"We present Ramulator 2.0, a highly modular and extensible DRAM simulator that enables rapid and agile implementation and evaluation of design changes in the memory controller and DRAM to meet the increasing research effort in improving the performance, security, and reliability of memory systems. Ramulator 2.0 abstracts and models key components in a DRAM-based memory system and their interactions into shared \u0000<italic>interfaces</i>\u0000 and independent \u0000<italic>implementations</i>\u0000. Doing so enables easy modification and extension of the modeled functions of the memory controller and DRAM in Ramulator 2.0. The DRAM specification syntax of Ramulator 2.0 is concise and human-readable, facilitating easy modifications and extensions. Ramulator 2.0 implements a library of reusable templated lambda functions to model the functionalities of DRAM commands to simplify the implementation of new DRAM standards, including DDR5, LPDDR5, HBM3, and GDDR6. We showcase Ramulator 2.0's modularity and extensibility by implementing and evaluating a wide variety of RowHammer mitigation techniques that require \u0000<italic>different</i>\u0000 memory controller design changes. These techniques are added modularly as separate implementations \u0000<italic>without</i>\u0000 changing \u0000<italic>any</i>\u0000 code in the baseline memory controller implementation. Ramulator 2.0 is rigorously validated and maintains a fast simulation speed compared to existing cycle-accurate DRAM simulators.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"112-116"},"PeriodicalIF":2.3,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140822019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards an Accelerator for Differential and Algebraic Equations Useful to Scientists","authors":"Jonathan Garcia-Mallen;Shuohao Ping;Alex Miralles-Cordal;Ian Martin;Mukund Ramakrishnan;Yipeng Huang","doi":"10.1109/LCA.2023.3332318","DOIUrl":"10.1109/LCA.2023.3332318","url":null,"abstract":"We discuss our preliminary results in building a configurable accelerator for differential equation time stepping and iterative methods for algebraic equations. Relative to prior efforts in building hardware accelerators for numerical methods, our focus is on the following: 1) Demonstrating a higher order of numerical convergence that is needed to actually support existing numerical algorithms. 2) Providing the capacity for wide vectors of variables by keeping the hardware design components as simple as possible. 3) Demonstrating configurable hardware support for a variety of numerical algorithms that form the core of scientific computation libraries. These efforts are toward the goal of making the accelerator democratically accessible by computational scientists.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"185-188"},"PeriodicalIF":2.3,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135612504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}