Profiling and Optimization of Multicard GPU Machine Learning Jobs

IF 1.5 · CAS Quartile 4 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)
Marcin Lawenda, Kyrylo Khloponin, Krzesimir Samborski, Łukasz Szustak
{"title":"多卡GPU机器学习作业的分析与优化","authors":"Marcin Lawenda,&nbsp;Kyrylo Khloponin,&nbsp;Krzesimir Samborski,&nbsp;Łukasz Szustak","doi":"10.1002/cpe.70196","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>The article discusses various model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition are analyzed, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing. Changing the tensor layout in PyTorch DataLoader from NCHW to NHWC and enabling <i>pin</i>_<i>memory</i> has proven to be very beneficial and easy to implement. Furthermore, the impact of different performance techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of LLMs was investigated. LoRA allows for faster tuning, while requiring less VRAM compared to DPO. On the other hand, QAT is the most resource-intensive method, with the slowest processing times. A significant portion of LLM tuning time is attributed to initializing new kernels and synchronizing multiple threads when memory operations are not dominant.</p>\n </div>","PeriodicalId":55214,"journal":{"name":"Concurrency and Computation-Practice & Experience","volume":"37 18-20","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Profiling and Optimization of Multicard GPU Machine Learning Jobs\",\"authors\":\"Marcin Lawenda,&nbsp;Kyrylo Khloponin,&nbsp;Krzesimir Samborski,&nbsp;Łukasz Szustak\",\"doi\":\"10.1002/cpe.70196\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>The article discusses various model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition are analyzed, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing. Changing the tensor layout in PyTorch DataLoader from NCHW to NHWC and enabling <i>pin</i>_<i>memory</i> has proven to be very beneficial and easy to implement. Furthermore, the impact of different performance techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of LLMs was investigated. LoRA allows for faster tuning, while requiring less VRAM compared to DPO. On the other hand, QAT is the most resource-intensive method, with the slowest processing times. 
A significant portion of LLM tuning time is attributed to initializing new kernels and synchronizing multiple threads when memory operations are not dominant.</p>\\n </div>\",\"PeriodicalId\":55214,\"journal\":{\"name\":\"Concurrency and Computation-Practice & Experience\",\"volume\":\"37 18-20\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2025-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Concurrency and Computation-Practice & Experience\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70196\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Concurrency and Computation-Practice & Experience","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70196","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

The article discusses various model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition are analyzed, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing. Changing the tensor layout in PyTorch DataLoader from NCHW to NHWC and enabling pin_memory has proven to be very beneficial and easy to implement. Furthermore, the impact of different performance techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of LLMs was investigated. LoRA allows for faster tuning, while requiring less VRAM compared to DPO. On the other hand, QAT is the most resource-intensive method, with the slowest processing times. A significant portion of LLM tuning time is attributed to initializing new kernels and synchronizing multiple threads when memory operations are not dominant.
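
The abstract names two concrete, code-level optimizations: switching image batches from NCHW to NHWC layout and enabling pin_memory in the PyTorch DataLoader. The paper's training code is not reproduced on this page, so the following is only a minimal sketch of how such a change is commonly expressed in PyTorch (using torch.channels_last for NHWC); the dataset, model, batch size, and worker count are placeholders, not values from the study.

```python
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

# Placeholder dataset and model; the paper's actual image-recognition
# pipeline, batch size, and worker count are not reproduced here.
transform = transforms.ToTensor()
dataset = torchvision.datasets.FakeData(size=256, image_size=(3, 224, 224),
                                        transform=transform)

# pin_memory=True makes the DataLoader allocate page-locked host memory,
# which allows faster (and asynchronous) host-to-GPU copies.
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Keep the model weights in channels_last (NHWC) memory format as well.
model = torchvision.models.resnet50().to(device, memory_format=torch.channels_last)

for images, labels in loader:
    # Batches leave the DataLoader as NCHW tensors; converting them to
    # channels_last (NHWC) lets cuDNN pick layout-friendly convolution kernels.
    images = images.to(device, memory_format=torch.channels_last, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    _ = model(images)
    break  # one step is enough for the illustration
```

The LoRA versus DPO/QAT comparison in the abstract concerns LLM tuning; one common way to apply LoRA is through the Hugging Face peft library, sketched below. The base model name and the LoRA hyperparameters (r, lora_alpha, dropout) are illustrative assumptions, not values reported by the authors.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in base model for illustration; the paper's LLM is not specified here.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,              # rank of the low-rank adapter matrices (assumed)
    lora_alpha=16,    # scaling factor for the adapter update (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Only the small adapter matrices are trainable, which is why LoRA needs
# less VRAM and tunes faster than full-parameter approaches.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```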

Source Journal
Concurrency and Computation-Practice & Experience (Engineering & Technology / Computer Science: Theory & Methods)
CiteScore: 5.00
Self-citation rate: 10.00%
Articles published: 664
Review time: 9.6 months
Journal introduction: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers, and authoritative research review papers, in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.