Improving Utilization of Dataflow Unit for Multi-Batch Processing
Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An
ACM Transactions on Architecture and Code Optimization, published 2023-12-18. https://doi.org/10.1145/3637906

Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt efficiently to diverse algorithms and requirements. First, we propose a novel reconfigurable interconnection structure that can organize execution units into different cluster topologies to accommodate different degrees of data-level parallelism. Second, we decouple the threads within each DFG node into consecutive pipeline stages and provide architectural support for them; by time-multiplexing these stages, the dataflow hardware achieves much higher utilization and performance. In addition, the task-based programming model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× better energy efficiency (performance per watt) than a GPU (V100) and 2.01× better energy efficiency than state-of-the-art dataflow architectures.

WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs
Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, Yi Jiang
ACM Transactions on Architecture and Code Optimization, published 2023-12-13. https://doi.org/10.1145/3637488

ZNS SSDs divide the storage space into sequential-write zones, reducing the costs of DRAM utilization, garbage collection, and over-provisioning. The sequential-write property of zones is well suited to LSM-based databases, where random writes are organized into sequential writes to improve performance. However, the current compaction mechanism of the LSM-tree results in widely varying access frequencies (i.e., hotness) of data and thus incurs an extreme imbalance in the distribution of erasure counts across zones. This imbalance significantly limits the lifetime of SSDs. Moreover, the current zone-reset method involves a large number of unnecessary erase operations on unused blocks, further shortening the SSD lifetime.

Considering the access pattern of the LSM-tree, this paper proposes a wear-aware zone-management technique, termed WA-Zone, to effectively balance inter- and intra-zone wear in ZNS SSDs. In WA-Zone, a wear-aware zone allocator is first proposed to dynamically allocate data with different hotness to zones with corresponding lifetimes, enabling an even distribution of erasure counts across zones. A partial-erase-based zone-reset method is then presented to avoid unnecessary erase operations. Furthermore, because the novel zone-reset method might lead to an unbalanced distribution of erasure counts across the blocks within a zone, a wear-aware block allocator is proposed. Experimental results based on the FEMU emulator demonstrate that WA-Zone enhances ZNS-SSD lifetime by 5.23× compared with the baseline scheme.

WIPE: a Write-Optimized Learned Index for Persistent Memory
Zhonghua Wang, Chen Ding, Fengguang Song, Kai Lu, Jiguang Wan, Zhihu Tan, Changsheng Xie, Guokuan Li
ACM Transactions on Architecture and Code Optimization, published 2023-11-28. https://doi.org/10.1145/3634915

Learned indexes, which utilize machine-learning models to accelerate locating positions in sorted data, have gained increasing attention in many big-data scenarios. Using efficient learned models, learned indexes build large nodes and flat structures, thereby greatly improving performance. However, most state-of-the-art learned indexes are designed for DRAM, and there is hence an urgent need to enable high-performance learned indexes for emerging Non-Volatile Memory (NVM). In this paper, we first evaluate and analyze the performance of existing learned indexes on NVM. We find that these learned indexes suffer severe write amplification and write-performance degradation due to the requirement of maintaining large sorted/semi-sorted data nodes. To tackle these problems, we propose WIPE, a write-optimized persistent learned index with a novel three-tiered architecture that adopts unsorted fine-granularity data nodes to achieve high write performance on NVM. Within WIPE, we devise a new root-node construction algorithm to accelerate searching the numerous small data nodes. The algorithm ensures a stable flat structure and high read performance on large datasets by introducing an intermediate layer (i.e., index nodes) and achieving accurate prediction of index-node positions from the root node. Our extensive experiments on Intel DCPMM show that WIPE improves write throughput and read throughput by up to 3.9× and 7×, respectively, compared to state-of-the-art learned indexes. WIPE can also recover from a system crash in ∼18 ms. WIPE is freely available as an open-source software package.

Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators
Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu
ACM Transactions on Architecture and Code Optimization, published 2023-11-22. https://doi.org/10.1145/3633332

General Matrix Multiplication (GEMM) is a computationally expensive operation used in many applications such as machine learning. Hardware accelerators are increasingly popular for speeding up GEMM computation; Tiled Matrix Multiplication (TMUL) in recent Intel processors is one example. Unfortunately, the TMUL hardware is susceptible to errors, necessitating online error detection. Algorithm-Based Error Detection (ABED) is a powerful technique for detecting errors in matrix multiplication. In this paper, we consider an implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previous work addressed rounding errors in ABED with a fixed error bound: if the detection threshold is set too low, it triggers false alarms, while a loose bound allows errors to escape detection. In this paper, we propose an adaptive error threshold that takes the TMUL input values into account to address both false triggers and error escapes, and we provide a taxonomy of the various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware; consequently, we relax it so that it can be computed easily in hardware. While ABED ensures error-free computation, it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique that ensures full coverage of all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault-injection experiments and show that our approach produces no false alarms or detection escapes for observable errors. Additional fault-injection experiments on a Deep Neural Network (DNN) model show that when a fault is not detected, it does not cause any misclassification.

{"title":"Abakus: Accelerating k-mer Counting With Storage Technology","authors":"Lingxi Wu, Minxuan Zhou, Weihong Xu, Ashish Venkat, Tajana Rosing, Kevin Skadron","doi":"10.1145/3632952","DOIUrl":"https://doi.org/10.1145/3632952","url":null,"abstract":"<p>This work seeks to leverage Processing-with-storage-technology (PWST) to accelerate a key bioinformatics kernel called <i>k</i>-mer counting, which involves processing large files of sequence data on the disk to build a histogram of fixed-size genome sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer varying degrees of tradeoffs in terms of performance, efficiency, and hardware implementation complexity. The key to these designs is a set of domain-specific hardware extensions to accelerate the key operations for <i>k</i>-mer counting at various levels of the SSD hierarchy, with the goal of enhancing the limited computing capabilities of conventional SSDs, while exploiting the parallelism of the multi-channel, multi-way SSDs. Our evaluation suggests that Abakus can achieve 8.42 ×, 6.91 ×, and 2.32 × speedup over the CPU-, GPU-, and near-data processing solutions.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"55 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coherence Attacks and Countermeasures in Interposer-Based Chiplet Systems
Gino A. Chacon, Charles Williams, Johann Knechtel, Ozgur Sinanoglu, Paul V. Gratz, Vassos Soteriou
ACM Transactions on Architecture and Code Optimization, published 2023-11-20. https://doi.org/10.1145/3633461

Industry is moving towards large-scale hardware systems that bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger, sophisticated system. However, these benefits come at the cost of new security challenges, especially when integrating chiplets from untrusted or not fully trusted third-party vendors.

In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks and demonstrate how they can be orchestrated to pose a significant threat to interposer-based systems. Second, we propose a novel scheme that uses an active interposer as a generic, secure-by-construction platform forming a physical root of trust for modern 2.5D systems. The implementation of our scheme is confined to the interposer, resulting in little cost and leaving the chiplets and the coherence system untouched. We show that our scheme prevents a range of coherence attacks with low overhead on system performance (∼4%). Further, we demonstrate that our scheme scales efficiently as system size and memory capacities increase, resulting in reduced performance overheads.

{"title":"COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops: ACM Transactions on Architecture and Code Optimization: Vol 0, No ja","authors":"Prasoon Mishra, V. Krishna Nandivada","doi":"10.1145/3633331","DOIUrl":"https://doi.org/10.1145/3633331","url":null,"abstract":"<p>Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP. </p><p>In contrast to the traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loops may take the same amount of time. (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (work-stealing scheme based on the amount of remaining workload). Since in irregular loops like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware across a varying number of threads; leading to a total of 275 number of configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"59 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs
Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, Yida Wang
ACM Transactions on Architecture and Code Optimization, published 2023-11-17. https://doi.org/10.1145/3632956

Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite its effectiveness, low-precision computation has not been commonly applied to fast convolutions, such as the Winograd algorithm, due to numerical issues. In this paper, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by the transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach achieves average operator speedups of 1.84× and 1.91× over state-of-the-art implementations in the vendor library while keeping accuracy loss at a reasonable level.

QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs
Hao Fan, Yiliang Ye, Shadi Ibrahim, Zhuo Huang, Xingru Li, WeiBin Xue, Song Wu, Chen Yu, Xuanhua Shi, Hai Jin
ACM Transactions on Architecture and Code Optimization, published 2023-11-14. https://doi.org/10.1145/3632955

Solid State Drives (SSDs) are widely used in data-intensive scenarios due to their high performance and decreasing cost. However, in shared environments, concurrent workloads can interfere with each other, leading to violations of Quality of Service (QoS). While QoS mechanisms such as fairness guarantees and latency constraints have been integrated into SSDs, existing transaction-processing frameworks offer limited QoS guarantees and can significantly degrade overall performance in a shared environment. The reason is that the internal components of an SSD, originally designed to exploit parallelism, struggle to coordinate effectively when QoS mechanisms are applied to them. This paper proposes a novel QoS-enhanced transaction-processing framework, called QoS-pro, which enhances QoS guarantees for concurrent workloads while maintaining high parallelism for SSDs. QoS-pro achieves this by redesigning transaction-processing procedures to fully exploit the parallelism of shared SSDs and by enhancing QoS-oriented transaction translation and scheduling with parallelism features in mind. In terms of fairness guarantees, QoS-pro outperforms state-of-the-art methods, achieving a 96% fairness improvement and a 64% maximum-latency reduction. QoS-pro also shows almost no throughput loss compared with parallelism-oriented methods. Additionally, QoS-pro triggers the fewest garbage collection (GC) operations and minimally affects concurrently running workloads during GC.

{"title":"Fine-Grain Quantitative Analysis of Demand Paging in Unified Virtual Memory","authors":"Tyler Allen, Bennett Cooper, Rong Ge","doi":"10.1145/3632953","DOIUrl":"https://doi.org/10.1145/3632953","url":null,"abstract":"The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such abstraction and offers a functionally equivalent testbed for in-depth performance study for both UVM and future Linux Heterogeneous Memory Management (HMM) compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate root causes of UVM overhead, a non-trivial task due to complex interactions of multiple hardware and software constituents and the desired cost granularity. In our prior work, we delved deeply into UVM system architecture and showed internal behaviors of page fault servicing in batches. We provided quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription. We revealed the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead present across implementations, warranting close attention. This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap governed by synchronous and asynchronous operations. Our multi-GPU analysis shows reduced cost of GPU-GPU batch workloads compared to CPU-GPU workloads. We further demonstrate that while specialized interconnects, NVLink, can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those that interface with HMM, and the development of interconnects.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"11 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}