arXiv - CS - Performance: Latest Papers

Lock-Free Computation of PageRank in Dynamic Graphs
arXiv - CS - Performance Pub Date: 2024-07-28 DOI: arxiv-2407.19562
Subhajit Sahu
{"title":"Lock-Free Computation of PageRank in Dynamic Graphs","authors":"Subhajit Sahu","doi":"arxiv-2407.19562","DOIUrl":"https://doi.org/arxiv-2407.19562","url":null,"abstract":"PageRank is a metric that assigns importance to the vertices of a graph based\u0000on its neighbors and their scores. Recently, there has been increasing interest\u0000in computing PageRank on dynamic graphs, where the graph structure evolves due\u0000to edge insertions and deletions. However, traditional barrier-based approaches\u0000for updating PageRanks encounter significant wait times on certain graph\u0000structures, leading to high overall runtimes. Additionally, the growing trend\u0000of multicore architectures with increased core counts has raised concerns about\u0000random thread delays and failures. In this study, we propose a lock-free\u0000algorithm for updating PageRank scores on dynamic graphs. First, we introduce\u0000our Dynamic Frontier (DF) approach, which identifies and processes vertices\u0000likely to change PageRanks with minimal overhead. Subsequently, we integrate DF\u0000with our lock-free and fault-tolerant PageRank ($DF_{LF}$), incorporating a\u0000helping mechanism among threads between its two phases. Experimental results\u0000demonstrate that $DF_{LF}$ not only eliminates waiting times at iteration\u0000barriers but also withstands random thread delays and crashes. On average, it\u0000is 4.6x faster than lock-free Naive-dynamic PageRank ($ND_{LF}$).","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection
arXiv - CS - Performance Pub Date: 2024-07-26 DOI: arxiv-2407.19125
Ryan Barron, Maksim E. Eren, Manish Bhattarai, Ismael Boureima, Cynthia Matuszek, Boian S. Alexandrov (Theoretical Division and Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA)
{"title":"Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection","authors":"Ryan BarronTheoretical Division, Los Alamos National Laboratory, Los Alamos, USADepartment of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA, Maksim E. ErenAdvanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USADepartment of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA, Manish BhattaraiTheoretical Division, Los Alamos National Laboratory, Los Alamos, USA, Ismael BoureimaTheoretical Division, Los Alamos National Laboratory, Los Alamos, USA, Cynthia MatuszekAdvanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USADepartment of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA, Boian S. AlexandrovTheoretical Division, Los Alamos National Laboratory, Los Alamos, USA","doi":"arxiv-2407.19125","DOIUrl":"https://doi.org/arxiv-2407.19125","url":null,"abstract":"In several Machine Learning (ML) clustering and dimensionality reduction\u0000approaches, such as non-negative matrix factorization (NMF), RESCAL, and\u0000K-Means clustering, users must select a hyper-parameter k to define the number\u0000of clusters or components that yield an ideal separation of samples or clean\u0000clusters. This selection, while difficult, is crucial to avoid overfitting or\u0000underfitting the data. Several ML applications use scoring methods (e.g.,\u0000Silhouette and Davies Boulding scores) to evaluate the cluster pattern\u0000stability for a specific k. The score is calculated for different trials over a\u0000range of k, and the ideal k is heuristically selected as the value before the\u0000model starts overfitting, indicated by a drop or increase in the score\u0000resembling an elbow curve plot. While the grid-search method can be used to\u0000accurately find a good k value, visiting a range of k can become time-consuming\u0000and computationally resource-intensive. In this paper, we introduce the Binary\u0000Bleed method based on binary search, which significantly reduces the k search\u0000space for these grid-search ML algorithms by truncating the target k values\u0000from the search space using a heuristic with thresholding over the scores.\u0000Binary Bleed is designed to work with single-node serial, single-node\u0000multi-processing, and distributed computing resources. In our experiments, we\u0000demonstrate the reduced search space gain over a naive sequential search of the\u0000ideal k and the accuracy of the Binary Bleed in identifying the correct k for\u0000NMFk, K-Means pyDNMFk, and pyDRESCALk with Silhouette and Davies Boulding\u0000scores. We make our implementation of Binary Bleed for the NMF algorithm\u0000available on GitHub.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SCALE: Self-regulated Clustered federAted LEarning in a Homogeneous Environment
arXiv - CS - Performance Pub Date: 2024-07-25 DOI: arxiv-2407.18387
Sai Puppala, Ismail Hossain, Md Jahangir Alam, Sajedul Talukder, Zahidur Talukder, Syed Bahauddin
{"title":"SCALE: Self-regulated Clustered federAted LEarning in a Homogeneous Environment","authors":"Sai Puppala, Ismail Hossain, Md Jahangir Alam, Sajedul Talukder, Zahidur Talukder, Syed Bahauddin","doi":"arxiv-2407.18387","DOIUrl":"https://doi.org/arxiv-2407.18387","url":null,"abstract":"Federated Learning (FL) has emerged as a transformative approach for enabling\u0000distributed machine learning while preserving user privacy, yet it faces\u0000challenges like communication inefficiencies and reliance on centralized\u0000infrastructures, leading to increased latency and costs. This paper presents a\u0000novel FL methodology that overcomes these limitations by eliminating the\u0000dependency on edge servers, employing a server-assisted Proximity Evaluation\u0000for dynamic cluster formation based on data similarity, performance indices,\u0000and geographical proximity. Our integrated approach enhances operational\u0000efficiency and scalability through a Hybrid Decentralized Aggregation Protocol,\u0000which merges local model training with peer-to-peer weight exchange and a\u0000centralized final aggregation managed by a dynamically elected driver node,\u0000significantly curtailing global communication overhead. Additionally, the\u0000methodology includes Decentralized Driver Selection, Check-pointing to reduce\u0000network traffic, and a Health Status Verification Mechanism for system\u0000robustness. Validated using the breast cancer dataset, our architecture not\u0000only demonstrates a nearly tenfold reduction in communication overhead but also\u0000shows remarkable improvements in reducing training latency and energy\u0000consumption while maintaining high learning performance, offering a scalable,\u0000efficient, and privacy-preserving solution for the future of federated learning\u0000ecosystems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141873317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator
arXiv - CS - Performance Pub Date: 2024-07-24 DOI: arxiv-2407.17623
Yukai Chen, Simei Yang, Debjyoti Bhattacharjee, Francky Catthoor, Arindam Mallik
{"title":"SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator","authors":"Yukai Chen, Simei Yang, Debjyoti Bhattacharjee, Francky Catthoor, Arindam Mallik","doi":"arxiv-2407.17623","DOIUrl":"https://doi.org/arxiv-2407.17623","url":null,"abstract":"The design of energy-efficient, high-performance, and reliable Convolutional\u0000Neural Network (CNN) accelerators involves significant challenges due to\u0000complex power and thermal management issues. This paper introduces SAfEPaTh, a\u0000novel system-level approach for accurately estimating power and temperature in\u0000tile-based CNN accelerators. By addressing both steady-state and\u0000transient-state scenarios, SAfEPaTh effectively captures the dynamic effects of\u0000pipeline bubbles in interlayer pipelines, utilizing real CNN workloads for\u0000comprehensive evaluation. Unlike traditional methods, it eliminates the need\u0000for circuit-level simulations or on-chip measurements. Our methodology\u0000leverages TANIA, a cutting-edge hybrid digital-analog tile-based accelerator\u0000featuring analog-in-memory computing cores alongside digital cores. Through\u0000rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh's\u0000capability to accurately estimate power and temperature within 500 seconds,\u0000encompassing CNN model accelerator mapping exploration and detailed power and\u0000thermal estimations. This efficiency and accuracy make SAfEPaTh an invaluable\u0000tool for designers, enabling them to optimize performance while adhering to\u0000stringent power and thermal constraints. Furthermore, SAfEPaTh's adaptability\u0000extends its utility across various CNN models and accelerator architectures,\u0000underscoring its broad applicability in the field. This study contributes\u0000significantly to the advancement of energy-efficient and reliable CNN\u0000accelerator designs, addressing critical challenges in dynamic power and\u0000thermal management.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"142 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141785657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer
arXiv - CS - Performance Pub Date: 2024-07-22 DOI: arxiv-2407.16026
Aness Al-Qawlaq, Ajay Kumar M, Deepu John
{"title":"KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer","authors":"Aness Al-Qawlaq, Ajay Kumar M, Deepu John","doi":"arxiv-2407.16026","DOIUrl":"https://doi.org/arxiv-2407.16026","url":null,"abstract":"This paper explores the adaptation of Transformerbased models for edge\u0000devices through the quantisation and hardware acceleration of the ARM Keyword\u0000Transformer (KWT) model on a RISC-V platform. The model was targeted to run on\u000064kB RAM in bare-metal C using a custom-developed edge AI library. KWT-1 was\u0000retrained to be 369 times smaller, with only a 10% loss in accuracy through\u0000reducing output classes from 35 to 2. The retraining and quantisation reduced\u0000model size from 2.42 MB to 1.65 kB. The integration of custom RISC-V\u0000instructions that accelerated GELU and SoftMax operations enabled a 5x speedup\u0000and thus ~5x power reduction in inference, with inference clock cycle counts\u0000decreasing from 26 million to 5.5 million clock cycles while incurring a small\u0000area overhead of approximately 29%. The results demonstrate a viable method for\u0000porting and accelerating Transformer-based models in low-power IoT devices.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"356 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141778972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Simopt -- Simulation pass for Speculative Optimisation of FPGA-CAD flow
arXiv - CS - Performance Pub Date: 2024-07-22 DOI: arxiv-2408.12676
Eashan Wadhwa, Shanker Shreejith
{"title":"Simopt -- Simulation pass for Speculative Optimisation of FPGA-CAD flow","authors":"Eashan Wadhwa, Shanker Shreejith","doi":"arxiv-2408.12676","DOIUrl":"https://doi.org/arxiv-2408.12676","url":null,"abstract":"Behavioural simulation is deployed in CAD flow to verify the functional\u0000correctness of a Register Transfer Level (RTL) design. Metadata extracted from\u0000behavioural simulation could be used to optimise and/or speed up subsequent\u0000steps in the hardware design flow. In this paper, we propose Simopt, a tool\u0000flow that extracts simulation metadata to improve the timing performance of the\u0000design by introducing latency awareness during the placement phase and\u0000subsequently improving the routing time of the post-placed netlist using vendor\u0000tools. For our experiments, we adapt the open-source Yosys flow to perform\u0000Simopt-aware placement. Our results show that using the Simopt-pass in the\u0000design implementation flow results in up to 38.2% reduction in timing\u0000performance (latency) of the design.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Bicameral Cache: a split cache for vector architectures
arXiv - CS - Performance Pub Date: 2024-07-22 DOI: arxiv-2407.15440
Susana Rebolledo Ruiz, Borja Perez, Jose Luis Bosque, Peter Hsu
{"title":"The Bicameral Cache: a split cache for vector architectures","authors":"Susana Rebolledo Ruiz, Borja Perez, Jose Luis Bosque, Peter Hsu","doi":"arxiv-2407.15440","DOIUrl":"https://doi.org/arxiv-2407.15440","url":null,"abstract":"The Bicameral Cache is a cache organization proposal for a vector\u0000architecture that segregates data according to their access type,\u0000distinguishing scalar from vector references. Its aim is to avoid both types of\u0000references from interfering in each other's data locality, with a special focus\u0000on prioritizing the performance on vector references. The proposed system\u0000incorporates an additional, non-polluting prefetching mechanism to help\u0000populate the long vector cache lines in advance to increase the hit rate by\u0000further exploiting the spatial locality on vector data. Its evaluation was\u0000conducted on the Cavatools simulator, comparing the performance to a standard\u0000conventional cache, over different typical vector benchmarks for several vector\u0000lengths. The results proved the proposed cache speeds up performance on\u0000stride-1 vector benchmarks, while hardly impacting non-stride-1's. In addition,\u0000the prefetching feature consistently provided an additional value.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"57 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141778973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
arXiv - CS - Performance Pub Date: 2024-07-19 DOI: arxiv-2407.13996
Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li
{"title":"Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference","authors":"Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li","doi":"arxiv-2407.13996","DOIUrl":"https://doi.org/arxiv-2407.13996","url":null,"abstract":"Colocating high-priority, latency-sensitive (LS) and low-priority,\u0000best-effort (BE) DNN inference services reduces the total cost of ownership\u0000(TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts\u0000and PCIe bus contentions, existing GPU sharing solutions are unable to avoid\u0000resource conflicts among concurrently executing tasks, failing to achieve both\u0000low latency for LS tasks and high throughput for BE tasks. To bridge this gap,\u0000this paper presents Missile, a general GPU sharing solution for multi-tenant\u0000DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware\u0000resource isolation between multiple LS and BE DNN tasks at software level.\u0000Through comprehensive reverse engineering, Missile first reveals a general VRAM\u0000channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel\u0000conflicts using software-level cache coloring. It also isolates the PCIe bus\u0000and fairly allocates PCIe bandwidth using completely fair scheduler. We\u0000evaluate 12 mainstream DNNs with synthetic and real-world workloads on four\u0000GPUs. The results show that compared to the state-of-the-art GPU sharing\u0000solutions, Missile reduces tail latency for LS services by up to ~50%, achieves\u0000up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants\u0000on-demand for optimal performance.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2013 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
arXiv - CS - Performance Pub Date: 2024-07-19 DOI: arxiv-2407.14417
HamidReza Imani, Abdolah Amirany, Tarek El-Ghazawi
{"title":"Mixture of Experts with Mixture of Precisions for Tuning Quality of Service","authors":"HamidReza Imani, Abdolah Amirany, Tarek El-Ghazawi","doi":"arxiv-2407.14417","DOIUrl":"https://doi.org/arxiv-2407.14417","url":null,"abstract":"The increasing demand for deploying large Mixture-of-Experts (MoE) models in\u0000resource-constrained environments necessitates efficient approaches to address\u0000their high memory and computational requirements challenges. Moreover, given\u0000that tasks come in different user-defined constraints and the available\u0000resources change over time in multi-tenant environments, it is necessary to\u0000design an approach which provides a flexible configuration space. This paper\u0000presents an adaptive serving approach for the efficient deployment of MoE\u0000models, capitalizing on partial quantization of the experts. By dynamically\u0000determining the number of quantized experts and their distribution across CPU\u0000and GPU, our approach explores the Pareto frontier and offers a fine-grained\u0000range of configurations for tuning throughput and model quality. Our evaluation\u0000on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model for three language\u0000modelling benchmarks demonstrates that the throughput of token generation can\u0000be adjusted from 0.63 to 13.00 token per second. This enhancement comes with a\u0000marginal perplexity increase of 2.62 to 2.80, 6.48 to 7.24, and 3.24 to 3.53\u0000for WikiText2, PTB, and C4 datasets respectively under maximum quantization.\u0000These results highlight the practical applicability of our approach in dynamic\u0000and accuracy-sensitive applications where both memory usage and output quality\u0000are important.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Data-driven Forecasting of Deep Learning Performance on GPUs
arXiv - CS - Performance Pub Date: 2024-07-18 DOI: arxiv-2407.13853
Seonho Lee, Amar Phanishayee, Divya Mahajan
{"title":"Data-driven Forecasting of Deep Learning Performance on GPUs","authors":"Seonho Lee, Amar Phanishayee, Divya Mahajan","doi":"arxiv-2407.13853","DOIUrl":"https://doi.org/arxiv-2407.13853","url":null,"abstract":"Deep learning kernels exhibit predictable memory accesses and compute\u0000patterns, making GPUs' parallel architecture well-suited for their execution.\u0000Software and runtime systems for GPUs are optimized to better utilize the\u0000stream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As\u0000deep learning models and GPUs evolve, access to newer GPUs is often limited,\u0000raising questions about the performance of new model architectures on existing\u0000GPUs, existing models on new GPUs, and new model architectures on new GPUs. To\u0000address these questions, we introduce NeuSight, a framework to predict the\u0000performance of various deep learning models, for both training and inference,\u0000on unseen GPUs without requiring actual execution. The framework leverages both\u0000GPU hardware behavior and software library optimizations to estimate end-to-end\u0000performance. Previous work uses regression models that capture linear trends or\u0000multilayer perceptrons to predict the overall latency of deep learning kernels\u0000on GPUs. These approaches suffer from higher error percentages when forecasting\u0000performance on unseen models and new GPUs. Instead, NeuSight decomposes the\u0000prediction problem into smaller problems, bounding the prediction through\u0000fundamental performance laws. NeuSight decomposes a single deep learning kernel\u0000prediction into smaller working sets called tiles, which are executed\u0000independently on the GPU. Tile-granularity predictions are determined using a\u0000machine learning approach and aggregated to estimate end-to-end latency.\u0000NeuSight outperforms prior work across various deep learning workloads and the\u0000latest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in\u0000predicting the latency of GPT3 model for training and inference on H100,\u0000compared to state-of-the-art prior works, where both GPT3 and H100 were not\u0000used to train the framework.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0