ACM Transactions on Design Automation of Electronic Systems: Latest Publications

CuPBoP: Making CUDA a Portable Language
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-23 DOI: 10.1145/3659949
Ruobing Han, Jun Chen, Bhanu Garg, Xule Zhou, John Lu, Jeffrey Young, Jaewoong Sim, Hyesoon Kim
Abstract: CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem. To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive end-to-end support, superior performance, and hardware scalability. Existing solutions that translate CUDA source code into other high-level languages, however, fall short of these goals. In contrast to these source-to-source approaches, we present a novel framework, CuPBoP, which treats CUDA as a portable language in its own right. Compared to two commercial source-to-source solutions, CuPBoP offers broader coverage and superior performance for the CUDA-to-CPU migration. Additionally, we evaluate the performance of CuPBoP against manually optimized CPU programs, highlighting the differences between CPU programs derived from CUDA and those that are manually optimized. Furthermore, we demonstrate the hardware scalability of CuPBoP by showcasing its successful migration of CUDA to AMD GPUs. To promote further research in this field, we have released CuPBoP as an open-source resource.
Citations: 0
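The core idea of treating CUDA as a portable language, rather than translating its source, is that the same SPMD kernel abstraction can be re-targeted to a non-GPU backend. As a rough illustration only (CuPBoP itself works at the compiler level, and the kernel, launcher, and names below are invented for this sketch), a CUDA-style kernel can be executed on a CPU by iterating over the block/thread grid:

```python
# Illustrative sketch of running an SPMD kernel on a CPU backend; not CuPBoP's mechanism.
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread id, as in blockIdx.x * blockDim.x + threadIdx.x
    if i < len(out):                         # bounds check, since the grid may overshoot the data size
        out[i] = a[i] + b[i]

def launch_on_cpu(kernel, grid_dim, block_dim, *args):
    # a real backend would vectorize or multi-thread this loop; here it is sequential for clarity
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block, thread, block_dim, *args)

a = list(range(10))
b = list(range(10))
out = [0] * 10
launch_on_cpu(vector_add_kernel, 3, 4, a, b, out)
print(out)   # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```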
A Scenario-Based DVFS-Aware Hybrid Application Mapping Methodology for MPSoCs
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-23 DOI: 10.1145/3660633
J. Spieck, Stefan Wildermann, Jürgen Teich
Abstract: Sound techniques for mapping soft real-time applications to resources are indispensable for meeting the application deadlines and minimizing objectives such as energy consumption, particularly on heterogeneous MPSoC architectures. For applications with input-dependent workload variations, static mappings are not able to sufficiently cope with the run-time variation, which can lead to deadline misses or unnecessary energy consumption. As a remedy, hybrid application mapping (HAM) techniques combine a design-time optimization with run-time management that adapts the mappings dynamically to the changes of the arriving input. This paper focuses on scenario-based HAM techniques. Here, the application input space is systematically clustered such that data inside the same scenario exhibit similar characteristics concerning workload when being processed under the same operating points. This static clustering of the input space into data scenarios has proven to be a good abstraction layer for simplifying the design and employment of high-quality run-time managers. However, existing state-of-the-art scenario-based HAM approaches neglect or underutilize the synergistic interplay between mapping selection and the usage of dynamic voltage/frequency scaling (DVFS) when adapting to workload variation. By combining mapping and DVFS selection, variations in the input can either be compensated by a complete re-mapping of the application, incurring a potentially high reconfiguration overhead, or by just changing the DVFS settings of the resources, offering a low-overhead adaptation alternative and thus significantly reducing the necessary overhead compared to DVFS-agnostic HAM. Furthermore, DVFS enables a fine-grained adaptation of a mapped application to the input data variation, e.g., by slowing down tasks with no impact on the end-to-end latency for the current input using low-frequency DVFS settings. It is shown that this combined approach can save even more energy than a pure mapping adaptation scheme, especially in the presence of data scenarios. In particular, scenario-based design operates as a catalyst for eliciting the synergies between a combined DVFS and mapping optimization and the peculiarities inside a data scenario, i.e., exploiting the commonalities inside a data scenario by perfectly tailored DVFS settings and task mapping. In this scope, this paper proposes two supplementary scenario-based DVFS-aware HAM approaches that consistently outperform existing state-of-the-art mapping approaches in terms of the number of deadline misses and energy consumption, as we demonstrate in an empirical study on the basis of four different applications and three different architectures. It is also shown that these benefits still apply to target architectures with increasing mapping migration overheads, thwarting frequent mapping reconfigurations.
Citations: 0
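To make the trade-off described above concrete, the sketch below shows a hypothetical run-time manager that, for the current data scenario, picks the most energy-efficient operating point that still meets the deadline, charging a larger switching penalty when the task mapping changes than when only DVFS settings change. All class names, cost constants, and the selection rule are assumptions for illustration, not the paper's actual algorithm.

```python
# Minimal sketch of a scenario-based run-time manager that prefers a low-overhead DVFS
# change over a full re-mapping. OperatingPoint and the overhead constants are invented.
from dataclasses import dataclass

@dataclass
class OperatingPoint:
    mapping_id: int          # which task-to-core mapping to use
    dvfs_level: int          # voltage/frequency setting of the resources
    energy: float            # expected energy per iteration under this point
    latency: float           # expected end-to-end latency under this point

REMAP_OVERHEAD = 5.0e-3      # assumed cost (s) of migrating tasks to a new mapping
DVFS_OVERHEAD = 0.1e-3       # assumed cost (s) of only changing frequency settings

def select_operating_point(current: OperatingPoint,
                           candidates: list[OperatingPoint],
                           deadline: float) -> OperatingPoint:
    """Pick the most energy-efficient point that still meets the deadline,
    charging a higher switching penalty when the mapping itself changes."""
    best, best_energy = current, float("inf")
    for op in candidates:
        switch = DVFS_OVERHEAD if op.mapping_id == current.mapping_id else REMAP_OVERHEAD
        if op.latency + switch > deadline:
            continue                     # would miss the soft deadline
        if op.energy < best_energy:
            best, best_energy = op, op.energy
    return best
```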
Enhanced Compiler Technology for Software-based Hardware Fault Detection
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-22 DOI: 10.1145/3660524
Davide Baroffio, Federico Reghenzani, William Fornaciari
Abstract: Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults of dependable systems employing solely software solutions. This work extends an automatic compiler-based SIHFT hardening tool called ASPIS, enhancing it with novel protection mechanisms and overhead-reduction techniques, also providing an extensive analysis of its compliance with the non-trivial workload of the open-source Real-Time Operating System FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows how the system achieves remarkably high tolerance to single-event upsets, and a comparison between the SIHFT mechanisms implemented summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.
Citations: 0
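The detection principle behind compiler-based SIHFT tools of this kind is redundant execution with a consistency check: a transient fault corrupts at most one of the redundant copies, so divergent results reveal the upset. The sketch below shows that principle only; ASPIS applies it as compiler transformations over the program's instructions rather than as a Python decorator, and the decorator and function names here are illustrative.

```python
# Illustrative sketch of the duplicate-and-compare idea underlying software fault detection.
def duplicated(fn):
    """Run fn twice on the same inputs and flag a fault if the results diverge."""
    def hardened(*args):
        r1 = fn(*args)
        r2 = fn(*args)          # redundant execution; a compiler pass duplicates instructions instead
        if r1 != r2:
            raise RuntimeError("SIHFT check failed: possible single-event upset")
        return r1
    return hardened

@duplicated
def checksum(values):
    acc = 0
    for v in values:
        acc = (acc + v) & 0xFFFFFFFF
    return acc

print(checksum([1, 2, 3]))      # 6; a bit flip in one execution would trigger the check
```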
Load Balanced PIM-Based Graph Processing
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-18 DOI: 10.1145/3659951
Xiang Zhao, Song Chen, Yi Kang
Abstract: Graph processing is widely used for many modern applications, such as social networks, recommendation systems, and knowledge graphs. However, processing large-scale graphs on traditional Von Neumann architectures is challenging due to the irregular graph data and memory-bound graph algorithms. Processing-in-memory (PIM) architecture has emerged as a promising approach for accelerating graph processing by enabling computation to be performed directly in memory. Despite having many processing units and high local memory bandwidth, PIM often suffers from insufficient global communication bandwidth and high synchronization overhead due to load imbalance. This paper proposes GraphB, a novel PIM-based graph processing system, to address all these issues. From the algorithm perspective, we propose a degree-aware graph partitioning algorithm that can generate balanced partitioning at a low cost. From the architecture perspective, we introduce tile buffers incorporated with an on-chip 2D-Mesh, which provides high bandwidth for inter-node data transfer. Dataflow in GraphB is designed to enable computation-communication overlap and dynamic load balancing. In a PyMTL3-based cycle-accurate simulator with five real-world graphs and three common algorithms, GraphB achieves an average 2.2x and maximum 2.8x speedup compared to the SOTA PIM-based graph processing system GraphQ.
Citations: 0
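The degree-aware partitioning idea can be illustrated with a simple greedy heuristic: place high-degree vertices first and always into the partition with the smallest accumulated degree, so that per-partition edge work stays balanced. This is a minimal sketch in the spirit of the abstract, not the actual GraphB algorithm; the example graph and function name are illustrative.

```python
# Greedy degree-aware partitioning sketch: balance accumulated vertex degree across partitions.
def degree_aware_partition(adjacency: dict[int, list[int]], num_parts: int) -> dict[int, int]:
    load = [0] * num_parts                       # accumulated degree per partition
    assignment = {}
    # place vertices in descending degree order so the heaviest vertices are balanced first
    for v in sorted(adjacency, key=lambda u: len(adjacency[u]), reverse=True):
        target = min(range(num_parts), key=lambda p: load[p])
        assignment[v] = target
        load[target] += len(adjacency[v])
    return assignment

graph = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2], 4: [5], 5: [4]}
print(degree_aware_partition(graph, num_parts=2))
```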
Wages: The Worst Transistor Aging Analysis for Large-scale Analog Integrated Circuits via Domain Generalization
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-17 DOI: 10.1145/3659950
Tinghuan Chen, Hao Geng, Qi Sun, Sanping Wan, Yongsheng Sun, Huatao Yu, Bei Yu
Abstract: Transistor aging leads to the deterioration of analog circuit performance over time. The worst aging degradation is used to evaluate circuit reliability. It is extremely expensive to obtain, since several circuit stimuli need to be simulated. Reducing the collection cost of the worst degradation yields an inaccurate training dataset when a machine learning (ML) model is used to perform the estimation quickly. Motivated by the fact that there are many similar subcircuits in large-scale analog circuits, in this paper we propose Wages to train an ML model on an inaccurate dataset for the worst aging degradation estimation via a domain generalization technique. A sampling-based method on the feature space of the transistor and its neighborhood subcircuit is developed to replace inaccurate labels. A consistent estimation for the worst degradation is enforced to update model parameters. Label updating and model updating are performed alternately to train an ML model on the inaccurate dataset. Experimental results on the very advanced 5 nm technology node show that Wages can significantly reduce the label collection cost with a negligible estimation error for the worst aging degradations compared to the traditional methods.
Citations: 0
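The alternating label-updating/model-updating scheme can be sketched as follows, assuming the feature vectors of similar subcircuits lie close together in feature space. The neighbourhood construction, the worst-case label update rule, and the regressor choice are assumptions made for this illustration; they stand in for the paper's sampling-based label replacement and consistency-enforced updates.

```python
# Schematic alternation between label updating and model updating on an inaccurate dataset.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestRegressor

def train_on_inaccurate_labels(X, y_noisy, iters=5, k=8):
    y = y_noisy.copy()
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)                    # neighbourhood of similar (sub)circuits in feature space
    for _ in range(iters):
        model.fit(X, y)                          # model update on the current labels
        pred = model.predict(X)
        # label update: pull each label toward a conservative (worst-case) estimate
        # formed from the model's predictions over its feature-space neighbourhood
        y = np.maximum(y_noisy, pred[idx].max(axis=1))
    return model

X = np.random.rand(200, 6)                       # placeholder transistor/subcircuit features
y_noisy = X.sum(axis=1) + np.random.rand(200)    # placeholder inaccurate worst-degradation labels
model = train_on_inaccurate_labels(X, y_noisy)
```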
Capacity-Aware Wash Optimization with Dynamic Fluid Scheduling and Channel Storage for Continuous-Flow Microfluidic Biochips
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-17 DOI: 10.1145/3659952
Zhisheng Chen, Xu Hu, Wenzhong Guo, Genggeng Liu, Jiaxuan Wang, Tsung-Yi Ho, Xing Huang
Abstract: Continuous-flow microfluidic biochips are gaining increasing attention with promising applications for automatically executing various laboratory procedures in biology and biochemistry. Biochips with distributed channel-storage architectures enable each channel to switch between the roles of transportation and storage. Consequently, fluid transportation, caching, and fetch can occur concurrently through different flow paths. When two dissimilar types of fluidic flows occur through the same channels in a time-interleaved manner, the latter may be contaminated, as some residues of the former flow may be stuck at the channel wall during transportation. To remove the residues, wash operations are introduced as an essential step to avoid incorrect assay outcomes. However, existing work has assumed that the washing capacity of a buffer fluid is unlimited. In the actual scenario, a fixed-volume buffer fluid irrefutably possesses a limited washing capacity, which is successively consumed while washing away residues from the channels. Hence, a capacity-aware wash scheme is a basic requirement for dynamic fluid scheduling and channel storage. In this paper, we formulate a practical wash optimization problem for microfluidic biochips, which simultaneously considers the requirements of dynamic fluid scheduling, channel storage, and the washing capacity constraints of buffer fluids, and present an efficient design flow to solve this problem systematically. Given the high-level synthesis result of a biochemical application and the corresponding component placement solution, our goal is to complete a contamination-aware flow-path planning with short flow-channel length. Meanwhile, the biochemical application can be executed efficiently and correctly with an optimized capacity-aware wash scheme. Experimental results show that compared to a state-of-the-art washing method, the proposed method achieves an average reduction of 26.1%, 43.1%, and 34.1% across all the benchmarks with respect to the total channel length, total wash time, and execution time of bioassays, respectively.
Citations: 0
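A toy view of the washing-capacity constraint: every wash consumes part of a buffer fluid's finite capacity, so the schedule must interleave buffer fetches with wash operations. The residue values, capacity units, and greedy ordering below are invented for illustration and do not reflect the paper's flow-path planning algorithm.

```python
# Toy capacity-aware wash scheduling: fetch a fresh buffer whenever the remaining
# washing capacity cannot clean the next contaminated channel.
def schedule_washes(residues: dict[str, float], buffer_capacity: float):
    plan, remaining = [], 0.0
    # wash the most heavily contaminated channels first (an arbitrary illustrative order)
    for channel, residue in sorted(residues.items(), key=lambda kv: -kv[1]):
        if remaining < residue:
            plan.append(("fetch_buffer", buffer_capacity))
            remaining = buffer_capacity
        plan.append(("wash", channel, residue))
        remaining -= residue
    return plan

print(schedule_washes({"c1": 0.7, "c2": 0.4, "c3": 0.9}, buffer_capacity=1.0))
```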
Enhancing Lifetime and Performance of MLC NVM Caches using Embedded Trace Buffers
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-16 DOI: 10.1145/3659102
S. Sivakumar, John Jose, Vijaykrishnan Narayanan
Abstract: Large volumes of on-chip and off-chip memory are required by contemporary applications. Emerging non-volatile memory technologies including STT-RAM, PCM, and ReRAM are becoming popular for on-chip and off-chip memories as a result of their desirable properties. Compared to traditional memory technologies like SRAM and DRAM, they have minimal leakage current and high packing density. Non-volatile memories (NVMs), however, have low write endurance, high write latency, and high write energy. Non-volatile Single Level Cell (SLC) memories can store a single bit of data in each memory cell, whereas Multi Level Cells (MLC) can store two or more bits in each memory cell. Although MLC NVMs have substantially higher packing density than SLCs, their lifetime and access speed are key concerns. For a given cache size, MLC caches consume 1.84x less space and 2.62x less leakage power than SLC caches. We propose Trace buffer Assisted Non-volatile Memory Cache (TANC), an approach that increases the lifespan and performance of MLC-based last-level caches using the underutilised Embedded Trace Buffers (ETB). TANC improves the lifetime of MLC LLCs up to 4.36x and decreases average memory access time by 4% compared to SLC NVM LLCs, and by 6.41x and 11%, respectively, compared to baseline MLC LLCs.
Citations: 0
Modeling Retention Errors of 3D NAND Flash for Optimizing Data Placement
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-16 DOI: 10.1145/3659101
Huanhuan Tian, Jiewen Tang, Jun Li, Zhibing Sha, Fan Yang, Zhigang Cai, Jianwei Liao
Abstract: 3D NAND flash exhibits process variation (PV), which causes different raw bit error rates (RBER) across the layers of a flash block. This paper builds a mathematical model for estimating the retention errors of flash cells by considering the factor of layer-to-layer PV in 3D NAND flash memory, as well as the factors of program/erase (P/E) cycle and retention time of data. It then proposes classifying the layers of a flash block in 3D NAND flash memory into profitable and unprofitable categories, according to the error correction overhead. After understanding the retention error variation of different layers in 3D NAND flash, we design a data placement mechanism, which maps the write data onto a suitable layer of the flash block according to the data hotness and the error correction overhead of the layers, to boost read performance of 3D NAND flash. The experimental results demonstrate that our proposed retention error estimation model can yield an R² value of 0.966 on average, verifying the accuracy of the model. Based on the estimated retention error rates of layers, the proposed data placement mechanism can noticeably reduce the read latency by 29.8% on average, compared with state-of-the-art methods against retention errors for 3D NAND flash memory.
Citations: 0
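The placement idea can be sketched as follows: estimate a per-layer retention error rate from the P/E cycle count, retention time, and a layer-dependent process-variation factor, then steer hot data to the most reliable layers. The model form and all coefficients below are stand-ins, not the fitted model from the paper.

```python
# Hotness-aware layer selection driven by an assumed (illustrative) retention error model.
import math

def estimated_rber(pe_cycles: int, retention_days: float, layer_factor: float) -> float:
    # assumed monotone dependence on P/E cycles, retention time, and a per-layer PV factor
    return layer_factor * 1e-6 * (1 + 0.002 * pe_cycles) * math.log1p(retention_days)

def choose_layer(is_hot: bool, layer_factors: list[float],
                 pe_cycles: int, retention_days: float) -> int:
    rber = [estimated_rber(pe_cycles, retention_days, f) for f in layer_factors]
    order = sorted(range(len(rber)), key=lambda i: rber[i])
    # hot data goes to the most reliable layer; cold data to the least reliable, since it is read rarely
    return order[0] if is_hot else order[-1]

factors = [0.8, 1.0, 1.3, 1.7]        # hypothetical layer-to-layer PV factors
print(choose_layer(True, factors, pe_cycles=500, retention_days=30))
```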
WCPNet: Jointly Predicting Wirelength, Congestion and Power for FPGA Using Multi-task Learning
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-08 DOI: 10.1145/3656170
Juming Xian, Yan Xing, Shuting Cai, Weijun Li, Xiaoming Xiong, Zhengfa Hu
Abstract: To speed up design closure and improve the QoR of FPGA, supervised single-task machine learning techniques have been used to predict individual design metrics based on placement results. However, the design objective is to achieve optimal performance while considering multiple conflicting metrics. The single-task approaches predict each metric in isolation and neglect the potential correlations or dependencies among them. To address these limitations, this paper proposes a multi-task learning approach to jointly predict wirelength, congestion and power. By sharing the common feature representations and adopting a joint optimization strategy, the novel WCPNet models (including WCPNet-HS and WCPNet-SS) can not only predict the three metrics of different scales simultaneously, but also outperform the majority of single-task models in terms of both prediction performance and time cost, as demonstrated by the results of the cross-design experiment. By adopting the cross-stitch structure in the encoder, WCPNet-SS outperforms WCPNet-HS in prediction performance, but WCPNet-HS is faster because of its simpler parameter-sharing structure. The significance of the feature image_pinUtilization for predicting power and wirelength is demonstrated by the ablation experiment.
Citations: 0
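A hard-parameter-sharing model in the spirit of WCPNet-HS can be written in a few lines of PyTorch: one shared convolutional encoder over placement feature maps feeds three small heads for wirelength, congestion, and power, and the task losses are summed for joint optimization. Layer sizes, the scalar (rather than map-shaped) congestion output, and the uniform loss weighting are assumptions of this sketch, not the paper's architecture.

```python
# Minimal hard-parameter-sharing multi-task sketch: shared encoder, three task heads.
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(                 # shared representation over feature maps
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.wirelength_head = nn.Linear(64, 1)
        self.congestion_head = nn.Linear(64, 1)
        self.power_head = nn.Linear(64, 1)

    def forward(self, x):
        z = self.encoder(x)
        return self.wirelength_head(z), self.congestion_head(z), self.power_head(z)

model = SharedEncoderMTL()
features = torch.randn(2, 4, 64, 64)                  # placeholder pin-utilisation-style feature maps
wl, cong, pwr = model(features)
targets = [torch.zeros_like(t) for t in (wl, cong, pwr)]   # placeholder labels
loss = sum(nn.functional.mse_loss(p, t) for p, t in zip((wl, cong, pwr), targets))
loss.backward()                                       # joint optimization of all three tasks
```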
ARM-CO-UP: ARM COoperative Utilization of Processors
IF 1.4, CAS Tier 4, Computer Science
ACM Transactions on Design Automation of Electronic Systems Pub Date: 2024-04-08 DOI: 10.1145/3656472
Ehsan Aghapour, Dolly Sapra, Andy Pimentel, Anuj Pathania
Abstract: HMPSoCs combine different processors on a single chip. They enable powerful embedded devices, which increasingly perform ML inference tasks at the edge. State-of-the-art HMPSoCs can perform on-chip embedded inference using different processors, such as CPUs, GPUs, and NPUs. HMPSoCs can potentially overcome the limitation of low single-processor CNN inference performance and efficiency by cooperative use of multiple processors. However, standard inference frameworks for edge devices typically utilize only a single processor. We present the ARM-CO-UP framework built on the ARM-CL library. The ARM-CO-UP framework supports two modes of operation, Pipeline and Switch. It optimizes inference throughput using pipelined execution of network partitions for consecutive input frames in the Pipeline mode. It improves inference latency through layer-switched inference for a single input frame in the Switch mode. Furthermore, it supports layer-wise CPU/GPU DVFS in both modes for improving power efficiency and energy consumption. ARM-CO-UP is a comprehensive framework for multi-processor CNN inference that automates CNN partitioning and mapping, pipeline synchronization, processor type switching, layer-wise DVFS, and closed-source NPU integration.
Citations: 0
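The Pipeline mode described above amounts to a software pipeline over network partitions: each stage is pinned to one processor and consecutive frames flow through the stages concurrently, so throughput is limited by the slowest stage rather than by the whole network. The generic queue-and-thread sketch below illustrates that structure only; it is not the ARM-CO-UP implementation, and the toy stage functions stand in for CNN sub-graphs mapped to CPU, GPU, and NPU.

```python
# Generic pipelined execution of network partitions over consecutive input frames.
import threading, queue

def stage_worker(run_partition, inbox: queue.Queue, outbox: queue.Queue):
    while True:
        frame = inbox.get()
        if frame is None:                 # shutdown signal propagates down the pipeline
            outbox.put(None)
            break
        outbox.put(run_partition(frame))  # process this frame while other stages handle other frames

def run_pipeline(partitions, frames):
    queues = [queue.Queue() for _ in range(len(partitions) + 1)]
    threads = [threading.Thread(target=stage_worker, args=(p, queues[i], queues[i + 1]))
               for i, p in enumerate(partitions)]
    for t in threads:
        t.start()
    for f in frames:
        queues[0].put(f)
    queues[0].put(None)
    results = []
    while (out := queues[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results

# toy partitions standing in for sub-graphs mapped to CPU, GPU, and NPU
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline(stages, frames=range(5)))
```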