{"title":"Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences","authors":"Hulin Wang;Donglin Yang;Yaqi Xia;Zheng Zhang;Qigang Wang;Jianping Fan;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TC.2024.3389507","DOIUrl":"10.1109/TC.2024.3389507","url":null,"abstract":"Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the \u0000<inline-formula><tex-math>$O(n^{2})$</tex-math></inline-formula>\u0000 complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1852-1865"},"PeriodicalIF":3.7,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ara2: Exploring Single- and Multi-Core Vector Processing With an Efficient RVV 1.0 Compliant Open-Source Processor","authors":"Matteo Perotti;Matheus Cavalcante;Renzo Andri;Lukas Cavigelli;Luca Benini","doi":"10.1109/TC.2024.3388896","DOIUrl":"10.1109/TC.2024.3388896","url":null,"abstract":"Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: \u0000<inline-formula><tex-math>$sim$</tex-math></inline-formula>\u000040 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1822-1836"},"PeriodicalIF":3.7,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Construction of Reed-Solomon Erasure Codes With Four Parities Based on Systematic Vandermonde Matrices","authors":"Leilei Yu;Yunghsiang S. Han","doi":"10.1109/TC.2024.3387069","DOIUrl":"10.1109/TC.2024.3387069","url":null,"abstract":"In 2021, Tang et al. proposed an improved construction of Reed-Solomon (RS) erasure codes with four parity symbols to accelerate the computation of Reed-Muller (RM) transform-based RS algorithm. The idea is to change the original Vandermonde parity-check matrix into a systematic Vandermonde parity-check matrix. However, the construction relies on a computer search and requires that the size of the information vector of RS codes does not exceed \u0000<inline-formula><tex-math>$52$</tex-math></inline-formula>\u0000. This paper improves its idea and proposes a purely algebraic construction. The proposed method has a more explicit construction, a wider range of codeword lengths, and competitive encoding/erasure decoding computational complexity.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1875-1882"},"PeriodicalIF":3.7,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-Based Accelerators","authors":"Abhijit Das;Enrico Russo;Maurizio Palesi","doi":"10.1109/TC.2024.3386067","DOIUrl":"10.1109/TC.2024.3386067","url":null,"abstract":"The need to efficiently execute different Deep Neural Networks (DNNs) on the same computing platform, coupled with the requirement for easy scalability, makes Multi-Chip Module (MCM)-based accelerators a preferred design choice. Such an accelerator brings together heterogeneous sub-accelerators in the form of chiplets, interconnected by a Network-on-Package (NoP). This paper addresses the challenge of selecting the most suitable sub-accelerators, configuring them, determining their optimal placement in the NoP, and mapping the layers of a predetermined set of DNNs spatially and temporally. The objective is to minimise execution time and energy consumption during parallel execution while also minimising the overall cost, specifically the silicon area, of the accelerator. This paper presents MOHaM, a framework for multi-objective hardware-mapping co-optimisation for multi-DNN workloads on chiplet-based accelerators. MOHaM exploits a multi-objective evolutionary algorithm that has been specialised for the given problem by incorporating several customised genetic operators. MOHaM is evaluated against state-of-the-art Design Space Exploration (DSE) frameworks on different multi-DNN workload scenarios. The solutions discovered by MOHaM are Pareto optimal compared to those by the state-of-the-art. Specifically, MOHaM-generated accelerator designs can reduce latency by up to \u0000<inline-formula><tex-math>$96%$</tex-math></inline-formula>\u0000 and energy by up to \u0000<inline-formula><tex-math>$96.12%$</tex-math></inline-formula>\u0000.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 8","pages":"1883-1898"},"PeriodicalIF":3.6,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qinyun Cai;Guanghua Tan;Wangdong Yang;Xianhao He;Yuwei Yan;Keqin Li;Kenli Li
{"title":"COALA: A Compiler-Assisted Adaptive Library Routines Allocation Framework for Heterogeneous Systems","authors":"Qinyun Cai;Guanghua Tan;Wangdong Yang;Xianhao He;Yuwei Yan;Keqin Li;Kenli Li","doi":"10.1109/TC.2024.3385269","DOIUrl":"10.1109/TC.2024.3385269","url":null,"abstract":"Experienced developers often leverage well-tuned libraries and allocate their routines for computing tasks to enhance performance when building modern scientific and engineering applications. However, such well-tuned libraries are meticulously customized for specific target architectures or environments. Additionally, the performance of their routines is significantly impacted by the actual input data of computing tasks, which often remains uncertain until runtime. Accordingly, statically allocating these library routines may hinder the adaptability of applications and compromise performance, particularly in the context of heterogeneous systems. To address this issue, we propose the Compiler-Assisted Adaptive Library Routines Allocation (COALA) framework for heterogeneous systems. COALA is a fully automated mechanism that employs compiler assistance for dynamic allocation of the most suitable routine to each computing task on heterogeneous systems. It allows the deployment of varying allocation policies tailored to specific optimization targets. During the application compilation process, COALA reconstructs computing tasks and inserts a probe for each of these tasks. Probes serve the purpose of conveying vital information about the requirements of each task, including its computing objective, data size, and computing flops, to a user-level allocation component at runtime. Subsequently, the allocation component utilizes the probe information along with the allocation policy to assign the most optimal library routine for executing the computing tasks. In our prototype, we further introduce and deploy a performance-oriented allocation policy founded on a machine learning-based performance evaluation method for library routines. Experimental verification and evaluation on two heterogeneous systems reveal that COALA can significantly improve application performance, with gains of up to 4.3x for numerical simulation software and 4.2x for machine learning applications, and enhance system utilization by up to 27.8%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1724-1737"},"PeriodicalIF":3.7,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incendio: Priority-Based Scheduling for Alleviating Cold Start in Serverless Computing","authors":"Xinquan Cai;Qianlong Sang;Chuang Hu;Yili Gong;Kun Suo;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TC.2024.3386063","DOIUrl":"10.1109/TC.2024.3386063","url":null,"abstract":"In serverless computing, cold start results in long response latency. Existing approaches strive to alleviate the issue by reducing the number of cold starts. However, our measurement based on real-world production traces shows that the minimum number of cold starts does not equate to the minimum response latency, and solely focusing on optimizing the number of cold starts will lead to sub-optimal performance. The root cause is that functions have different priorities in terms of latency benefits by transferring a cold start to a warm start. In this paper, we propose \u0000<i>Incendio</i>\u0000, a serverless computing framework exploiting priority-based scheduling to minimize the overall response latency from the perspective of cloud providers. We reveal the priority of a function is correlated to multiple factors and design a priority model based on Spearman's rank correlation coefficient. We integrate a hybrid Prophet-LightGBM prediction model to dynamically manage runtime pools, which enables the system to prewarm containers in advance and terminate containers at the appropriate time. Furthermore, to satisfy the low-cost and high-accuracy requirements in serverless computing, we propose a Clustered Reinforcement Learning-based function scheduling strategy. The evaluations show that Incendio speeds up the native system by 1.4\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000, and achieves 23% and 14.8% latency reductions compared to two state-of-the-art approaches.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1780-1794"},"PeriodicalIF":3.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marco Angioli;Marcello Barbirotta;Abdallah Cheikh;Antonio Mastrandrea;Francesco Menichelli;Saeid Jamili;Mauro Olivieri
{"title":"Design, Implementation and Evaluation of a New Variable Latency Integer Division Scheme","authors":"Marco Angioli;Marcello Barbirotta;Abdallah Cheikh;Antonio Mastrandrea;Francesco Menichelli;Saeid Jamili;Mauro Olivieri","doi":"10.1109/TC.2024.3386060","DOIUrl":"10.1109/TC.2024.3386060","url":null,"abstract":"Integer division is key for various applications and often represents the performance bottleneck due to its inherent mathematical properties that limit its parallelization. This paper presents a new data-dependent variable latency division algorithm derived from the classic non-performing restoring method. The proposed technique exploits the relationship between the number of leading zeros in the divisor and in the partial remainder to dynamically detect and skip those iterations that result in a simple left shift. While a similar principle has been exploited in previous works, the proposed approach outperforms existing variable latency divider schemes in average latency and power consumption. We detail the algorithm and its implementation in four variants, offering versatility for the specific application requirements. For each variant, we report the average latency evaluated with different benchmarks, and we analyze the synthesis results for both FPGA and ASIC deployment, reporting clock speed, average execution time, hardware resources, and energy consumption, compared with existing fixed and variable latency dividers.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1767-1779"},"PeriodicalIF":3.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10494681","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Error-Detection Schemes for Analog Content-Addressable Memories","authors":"Ron M. Roth","doi":"10.1109/TC.2024.3386065","DOIUrl":"10.1109/TC.2024.3386065","url":null,"abstract":"Analog content-addressable memories (in short, a-CAMs) have been recently introduced as accelerators for machine-learning tasks, such as tree-based inference or implementation of nonlinear activation functions. The cells in these memories contain nanoscale memristive devices, which may be susceptible to various types of errors, such as manufacturing defects, inaccurate programming of the cells, or drifts in their contents over time. The objective of this work is to develop techniques for overcoming the reliability issues that are caused by such error events. To this end, several coding schemes are presented for the detection of errors in a-CAMs. These schemes consist of an encoding stage, a detection cycle (which is performed periodically), and some minor additions to the hardware. During encoding, redundancy symbols are programmed into a portion of the a-CAM (or, alternatively, are written into an external memory). During each detection cycle, a certain set of input vectors is applied to the a-CAM. The schemes differ in several ways, e.g., in the range of alphabet sizes that they are most suitable for, in the tradeoff that each provides between redundancy and hardware additions, or in the type of errors that they handle (Hamming metric versus \u0000<inline-formula><tex-math>$L_{1}$</tex-math></inline-formula>\u0000 metric).","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1795-1808"},"PeriodicalIF":3.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Grained Trace Collection, Analysis, and Management of Diverse Container Images","authors":"Zhuo Huang;Qi Zhang;Hao Fan;Song Wu;Chen Yu;Hai Jin;Jun Deng;Jing Gu;Zhimin Tang","doi":"10.1109/TC.2024.3383966","DOIUrl":"10.1109/TC.2024.3383966","url":null,"abstract":"Container technology is getting popular in cloud environments due to its lightweight feature and convenient deployment. The container registry plays a critical role in container-based clouds, as many container startups involve downloading layer-structured container images from a container registry. However, the container registry is struggling to efficiently manage images (i.e., transfer and store) with the emergence of diverse services and new image formats. The reason is that the container registry manages images uniformly at layer granularity. On the one hand, such uniform layer-level management probably cannot fit the various requirements of different kinds of containerized services well. On the other hand, new image formats organizing data in blocks or files cannot benefit from such uniform layer-level image management. In this paper, we perform the first analysis of image traces at multiple granularities (i.e., image-, layer-, and file-level) for various services and provide an in-depth comparison of different image formats. The traces are collected from a production-level container registry, amounting to 24 million requests and involving more than 184 TB of transferred data. We provide a number of valuable insights, including request patterns of services, file-level access patterns, and bottlenecks associated with different image formats. Based on these insights, we also propose two optimizations to improve image transfer and application deployment.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1698-1710"},"PeriodicalIF":3.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10494783","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis and Mitigation of Shared Resource Contention on Heterogeneous Multicore: An Industrial Case Study","authors":"Michael Bechtel;Heechul Yun","doi":"10.1109/TC.2024.3386059","DOIUrl":"10.1109/TC.2024.3386059","url":null,"abstract":"In this paper, we present a solution to the industrial challenge put forth by ARM in 2022. We systematically analyze the effect of shared resource contention to an augmented reality head-up display (AR-HUD) case-study application of the industrial challenge on a heterogeneous multicore platform, NVIDIA Jetson Nano. We configure the AR-HUD application such that it can process incoming image frames in real-time at 20Hz on the platform. We use Microarchitectural Denial-of-Service (DoS) attacks as aggressor workloads of the challenge and show that they can dramatically impact the latency and accuracy of the AR-HUD application. This results in significant deviations of the estimated trajectories from known ground truths, despite our best effort to mitigate their influence by using cache partitioning and real-time scheduling of the AR-HUD application. To address the challenge, we propose RT-Gang++, a partitioned real-time gang scheduling framework with last-level cache (LLC) and integrated GPU bandwidth throttling capabilities. By applying RT-Gang++, we are able to achieve desired level of performance of the AR-HUD application even in the presence of fully loaded aggressor tasks.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 7","pages":"1753-1766"},"PeriodicalIF":3.7,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}