ACM Transactions on Architecture and Code Optimization最新文献

筛选
英文 中文
Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses Tyche:用于间接内存访问的高效通用预取器
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-22 DOI: 10.1145/3641853
Feng Xue, Chenji Han, Xinyu Li, Junliang Wu, Tingting Zhang, Tianyi Liu, Yifan Hao, Zidong Du, Qi Guo, Fuxin Zhang
{"title":"Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses","authors":"Feng Xue, Chenji Han, Xinyu Li, Junliang Wu, Tingting Zhang, Tianyi Liu, Yifan Hao, Zidong Du, Qi Guo, Fuxin Zhang","doi":"10.1145/3641853","DOIUrl":"https://doi.org/10.1145/3641853","url":null,"abstract":"<p>Indirect memory accesses (IMAs, i.e., <i>A</i>[<i>f</i>(<i>B</i>[<i>i</i>])]) are typical memory access patterns in applications such as graph analysis, machine learning, and database. IMAs are composed of producer-consumer pairs, where the consumers’ memory addresses are derived from the producers’ memory data. Due to the built-in value-dependent feature, IMAs exhibit poor locality, making prefetching ineffective. Hindered by the challenges of recording the potentially complex graphs of instruction dependencies among IMA producers and consumers, current state-of-the-art hardware prefetchers either <b>(a)</b> exhibit inadequate IMA identification abilities or <b>(b)</b> rely on the run-ahead mechanism to prefetch IMAs intermittently and insufficiently. </p><p>To solve this problem, we propose Tyche<sup>1</sup>, an efficient and general hardware prefetcher to enhance IMA performance. Tyche adopts a bilateral propagation mechanism to precisely excavate the instruction dependencies in simple chains with moderate length (rather than complex graphs). Based on the exact instruction dependencies, Tyche can accurately identify various IMA patterns, including nonlinear ones, and generate accurate prefetching requests continuously. Evaluated on broad benchmarks, Tyche achieves an average performance speedup of 16.2% over the state-of-the-art spatial prefetcher Berti. More importantly, Tyche outperforms the state-of-the-art IMA prefetchers IMP, Gretch, and Vector Runahead, by 15.9%, 12.8%, and 10.7%, respectively, with a lower storage overhead of only 0.57KB.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139517190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL Rcmp:通过 CXL 重构基于 RDMA 的内存分解
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-19 DOI: 10.1145/3634916
Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, Huatao Wu
{"title":"Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL","authors":"Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, Huatao Wu","doi":"10.1145/3634916","DOIUrl":"https://doi.org/10.1145/3634916","url":null,"abstract":"<p>Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches have physical distance limitation and cannot be deployed across racks.</p><p>In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. The significant feature is that Rcmp improves the performance of RDMA-based systems via CXL, and leverages RDMA to overcome CXL’s distance limitation. To address the challenges of the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides a global page-based memory space management and enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communications, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance by using micro-benchmarks and running a key-value store with YCSB benchmarks. The results show that Rcmp can achieve 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp can scale well with the increasing number of nodes without compromising performance.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139499290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey 处理稀疏矩阵和向量的专用硬件加速器:调查
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-17 DOI: 10.1145/3640542
Valentin Isaac–Chassande, Adrian Evans, Yves Durand, Frédéric Rousseau
{"title":"Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey","authors":"Valentin Isaac–Chassande, Adrian Evans, Yves Durand, Frédéric Rousseau","doi":"10.1145/3640542","DOIUrl":"https://doi.org/10.1145/3640542","url":null,"abstract":"<p>Performance in scientific and engineering applications such as computational physics, algebraic graph problems or Convolutional Neural Networks (CNN), is dominated by the manipulation of large sparse matrices – matrices with a large number of zero elements. Specialized software using data formats for sparse matrices has been optimized for the main kernels of interest: SpMV and SpMSpM matrix multiplications, but due to the indirect memory accesses, the performance is still limited by the memory hierarchy of conventional computers. Recent work shows that specific hardware accelerators can reduce memory traffic and improve the execution time of sparse matrix multiplication, compared to the best software implementations. The performance of these sparse hardware accelerators depends on the choice of the sparse format, <i>COO</i>, <i>CSR</i>, etc, the algorithm, <i>inner-product</i>, <i>outer-product</i>, <i>Gustavson</i>, and many hardware design choices. In this article, we propose a systematic survey which identifies the design choices of state-of-the-art accelerators for sparse matrix multiplication kernels. We introduce the necessary concepts and then present, compare and classify the main sparse accelerators in the literature, using consistent notations. Finally, we propose a taxonomy for these accelerators to help future designers make the best choices depending of their objectives.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"53 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139483484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Cost-aware service placement and scheduling in the Edge-Cloud Continuum 边缘-云连续体中的成本感知服务安置和调度
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-16 DOI: 10.1145/3640823
Samuel Rac, Mats Brorsson
{"title":"Cost-aware service placement and scheduling in the Edge-Cloud Continuum","authors":"Samuel Rac, Mats Brorsson","doi":"10.1145/3640823","DOIUrl":"https://doi.org/10.1145/3640823","url":null,"abstract":"<p>The edge to data center computing continuum is the aggregation of computing resources located anywhere between the network edge (e.g., close to 5G antennas), and servers in traditional data centers. Kubernetes is the de facto standard for the orchestration of services in data center environments, where it is very efficient. It, however, fails to give the same performance when including edge resources. At the edge, resources are more limited, and networking conditions are changing over time. In this paper, we present a methodology that lowers the costs of running applications in the edge-to-cloud computing continuum. This methodology can adapt to changing environments, e.g., moving end-users. We are also monitoring some Key Performance Indicators of the applications to ensure that cost optimizations do not negatively impact their Quality of Service. In addition, to ensure that performances are optimal even when users are moving, we introduce a background process that periodically checks if a better location is available for the service and, if so, moves the service. To demonstrate the performance of our scheduling approach, we evaluate it using a vehicle cooperative perception use case, a representative 5G application. With this use case, we can demonstrate that our scheduling approach can robustly lower the cost in different scenarios, while other approaches that are already available fail in either being adaptive to changing environments or will have poor cost-effectiveness in some scenarios.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139483853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Instruction Inflation Analyzing Framework for Dynamic Binary Translators 动态二进制翻译器的指令膨胀分析框架
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-15 DOI: 10.1145/3640813
Benyi Xie, Yue Yan, Chenghao Yan, Sicheng Tao, Zhuangzhuang Zhang, Xinyu Li, Yanzhi Lan, Xiang Wu, Tianyi Liu, Tingting Zhang, Fuxin Zhang
{"title":"An Instruction Inflation Analyzing Framework for Dynamic Binary Translators","authors":"Benyi Xie, Yue Yan, Chenghao Yan, Sicheng Tao, Zhuangzhuang Zhang, Xinyu Li, Yanzhi Lan, Xiang Wu, Tianyi Liu, Tingting Zhang, Fuxin Zhang","doi":"10.1145/3640813","DOIUrl":"https://doi.org/10.1145/3640813","url":null,"abstract":"<p>Dynamic binary translators (DBTs) are widely used to migrate applications between different instruction set architectures (ISAs). Despite extensive research to improve DBT performance, noticeable overhead remains, preventing near-native performance, especially when translating from complex instruction set computer (CISC) to reduced instruction set computer (RISC). For computational workloads, the main overhead stems from translated code quality. Experimental data show state-of-the-art DBT products have dynamic code inflation of at least 1.46. This indicates on average over 1.46 host instructions are needed to emulate one guest instruction. Worse, inflation closely correlates with translated code quality. However, the detailed sources of instruction inflation remain unclear. </p><p>To understand the sources of inflation, we present Deflater, an instruction inflation analysis framework comprising a mathematical model, a collection of black-box unit tests called BenchMIAOes, and a trace-based simulator called InflatSim. The mathematical model calculates overall inflation based on the inflation of individual instructions and translation block (TB) optimizations. BenchMIAOes extract model parameters from DBTs without accessing DBT source code. InflatSim implements the model and uses the extracted parameters from BenchMIAOes to simulate a given DBT’s behavior. Deflater is a valuable tool to guide DBT analysis and improvement. Using Deflater, we simulated inflation for three state-of-the-art CISC-to-RISC DBTs: ExaGear, Rosetta2, and LATX, with inflation errors of 5.63%, 5.15%, and 3.44% respectively for SPEC CPU 2017, gaining insights into these commercial DBTs. Deflater also efficiently models inflation for the open-source DBT QEMU and suggests optimizations that can substantially reduce inflation. Implementing the suggested optimizations confirms Deflater’s effective guidance, with 4.65% inflation error, and gains 5.47x performance improvement.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"22 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139498826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Assessing the Impact of Compiler Optimizations on GPUs Reliability 评估编译器优化对 GPU 可靠性的影响
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-12 DOI: 10.1145/3638249
Fernando Fernandes dos Santos, Luigi Carro, Flavio Vella, Paolo Rech
{"title":"Assessing the Impact of Compiler Optimizations on GPUs Reliability","authors":"Fernando Fernandes dos Santos, Luigi Carro, Flavio Vella, Paolo Rech","doi":"10.1145/3638249","DOIUrl":"https://doi.org/10.1145/3638249","url":null,"abstract":"<p>Graphics Processing Units (GPUs) compilers have evolved in order to support general-purpose programming languages for multiple architectures. NVIDIA CUDA Compiler (NVCC) has many compilation levels before generating the machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped in the underlying hardware; thus, as we show in this paper, they can also affect GPU reliability. We evaluate the effects on the GPU error rate of the optimization flags applied at the NVCC Parallel Thread Execution (PTX) compiling phase by analyzing two NVIDIA GPU architectures (Kepler and Volta) and two compiler versions (NVCC 10.2 and 11.3). We compare and combine fault propagation analysis based on software fault injection, hardware utilization distribution obtained with application-level profiling, and machine instructions radiation-induced error rate measured with beam experiments. We consider eight different workloads and 144 combinations of compilation flags, and we show that optimizations can impact the GPUs’ error rate of up to an order of magnitude. Additionally, through accelerated neutron beam experiments on a NVIDIA Kepler GPU, we show that the error rate of the unoptimized GEMM (-O0 flag) is lower than the optimized GEMM’s (-O3 flag) error rate. When the performance is evaluated together with the error rate, we show that the most optimized versions (-O1 and -O3) always produce a higher amount of correct data than the unoptimized code (-O0).</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139461050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs 适用于紧凑型异构 CNN 的高效混合深度学习加速器
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2024-01-08 DOI: 10.1145/3639823
Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso
{"title":"An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs","authors":"Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso","doi":"10.1145/3639823","DOIUrl":"https://doi.org/10.1145/3639823","url":null,"abstract":"<p>Resource-efficient Convolutional Neural Networks (CNNs) are gaining more attention. These CNNs have relatively low computational and memory requirements. A common denominator among such CNNs is having more heterogeneity than traditional CNNs. This heterogeneity is present at two levels: intra-layer-type and inter-layer-type. Generic accelerators do not capture these levels of heterogeneity, which harms their efficiency. Consequently, researchers have proposed model-specific accelerators with dedicated engines. When designing an accelerator with dedicated engines, one option is to dedicate an engine per CNN layer. We refer to accelerators designed with this approach as single-engine single-layer (SESL). This approach enables optimizing each engine for its specific layer. However, such accelerators are resource-demanding and unscalable. Another option is to design a minimal number of dedicated engines such that each engine handles all layers of one type. We refer to these accelerators as single-engine multiple-layer (SEML). single-engine multiple-layer accelerators capture the inter-layer-type, but not the intra-layer-type heterogeneity. </p><p>We propose FiBHA (Fixed Budget Hybrid CNN Accelerator), a hybrid accelerator composed of a single-engine single-layer Layer part and a single-engine multiple-layer part, each processing a subset of CNN layers. FiBHA captures more heterogeneity than single-engine multiple-layer while being more resource-aware and scalable than single-engine single-layer. Moreover, we propose a novel module, Fused Inverted Residual Bottleneck (FIRB), a fine-grained and memory-light single-engine single-layer architecture building block. The proposed architecture is implemented and evaluated using high-level synthesis (HLS) on different FPGAs representing various resource budgets. Our evaluation shows that FiBHA improves the throughput by up to 4<i>x</i> and 2.5<i>x</i> compared to state-of-the-art single-engine single-layer and single-engine multiple-layer accelerators, respectively. Moreover, FiBHA reduces memory and energy consumption compared to a single-engine multiple-layer accelerator. The evaluation also shows that FIRB reduces the required memory by up to (54% ), and energy requirements by up to (35% ) compared to traditional pipelining.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139414715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis ApHMM:加速轮廓隐马尔可夫模型,实现快速节能的基因组分析
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2023-12-28 DOI: 10.1145/3632950
Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, Onur Mutlu
{"title":"ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis","authors":"Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, Onur Mutlu","doi":"10.1145/3632950","DOIUrl":"https://doi.org/10.1145/3632950","url":null,"abstract":"<p>Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. </p><p>We introduce <i>ApHMM</i>, the <i>first</i> flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out unnecessary computations using a hardware-based filter, and 4) minimizing redundant computations. </p><p>ApHMM achieves substantial speedups of 15.55 × - 260.03 ×, 1.83 × - 5.34 ×, and 27.97 × when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29 × - 59.94 ×, 1.03 × - 1.75 ×, and 1.03 × - 1.95 ×, respectively, while improving their energy efficiency by 64.24 × - 115.46 ×, 1.75 ×, 1.96 ×.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139067804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs 探索 GPU 上稀疏张量倍密集矩阵的数据布局
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2023-12-28 DOI: 10.1145/3633462
Khalid Ahmad, Cris Cecka, Michael Garland, Mary Hall
{"title":"Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs","authors":"Khalid Ahmad, Cris Cecka, Michael Garland, Mary Hall","doi":"10.1145/3633462","DOIUrl":"https://doi.org/10.1145/3633462","url":null,"abstract":"<p>An important sparse tensor computation is sparse-tensor-dense-matrix multiplication (SpTM), which is used in tensor decomposition and applications. SpTM is a multi-dimensional analog to sparse-matrix-dense-matrix multiplication (SpMM). In this paper, we employ a hierarchical tensor data layout that can unfold a multidimensional tensor to derive a 2D matrix, making it possible to compute SpTM using SpMM kernel implementations for GPUs. We compare two SpMM implementations to the state-of-the-art PASTA sparse tensor contraction implementation using: (1) SpMM with hierarchical tensor data layout; and, (2) unfolding followed by an invocation of cuSPARSE’s SpMM. Results show that SpMM can outperform PASTA 70.9% of the time, but none of the three approaches is best overall. Therefore, we use a decision tree classifier to identify the best performing sparse tensor contraction kernel based on precomputed properties of the sparse tensor.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"33 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139068139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Concise Concurrent B+-Tree for Persistent Memory 用于持久内存的简明并发 B+ 树
IF 1.6 3区 计算机科学
ACM Transactions on Architecture and Code Optimization Pub Date : 2023-12-25 DOI: 10.1145/3638717
Yan Wei, Zhang Xingjun
{"title":"A Concise Concurrent B+-Tree for Persistent Memory","authors":"Yan Wei, Zhang Xingjun","doi":"10.1145/3638717","DOIUrl":"https://doi.org/10.1145/3638717","url":null,"abstract":"<p>Persistent memory (PM) presents a unique opportunity for designing data management systems that offer improved performance, scalability, and instant restart capability. As a widely used data structure for managing data in such systems, B<sup>+</sup>-Tree must address the challenges presented by PM in both data consistency and device performance. However, existing studies suffer from significant performance degradation when maintaining data consistency on PM. To settle this problem, we propose a new concurrent B<sup>+</sup>-Tree, CC-Tree, optimized for PM. CC-Tree ensures data consistency while providing high concurrent performance, thanks to several technologies, including partitioned metadata, log-free split, and lock-free read. We conducted experiments using state-of-the-art indices, and the results demonstrate significant performance improvements, including approximately 1.2-1.6x search, 1.5-1.7x insertion, 1.5-2.8x update, 1.9-4x deletion, 0.9-10x range scan, and up to 1.55-1.82x in hybrid workloads.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"89 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139036626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信