Profiling and Optimization of Multicard GPU Machine Learning Jobs
Marcin Lawenda, Kyrylo Khloponin, Krzesimir Samborski, Łukasz Szustak
Concurrency and Computation: Practice and Experience, vol. 37, issue 18-20, published 2025-07-22. DOI: 10.1002/cpe.70196. https://onlinelibrary.wiley.com/doi/10.1002/cpe.70196
The article discusses various model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, adapted to different hardware and software configurations, are analyzed, including distributed data parallelism and distributed hardware processing. Changing the tensor layout in the PyTorch DataLoader from NCHW to NHWC and enabling pin_memory proved to be highly beneficial and easy to implement. Furthermore, the impact of different performance techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of LLMs was investigated. LoRA allows for faster tuning while requiring less VRAM than DPO. QAT, by contrast, is the most resource-intensive method, with the slowest processing times. A significant portion of LLM tuning time is attributed to initializing new kernels and synchronizing multiple threads when memory operations are not dominant.
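The distributed data parallelism mentioned above is typically expressed in PyTorch with DistributedDataParallel, one process per GPU. The sketch below is a minimal, generic single-node setup; the model, dataset, batch size, and launch method are illustrative assumptions, not the configuration evaluated in the article.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU, typically launched with `torchrun --nproc_per_node=<gpus>`.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; any image classifier could be substituted.
    model = torch.nn.Linear(224 * 224 * 3, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks

    dataset = TensorDataset(torch.randn(1024, 224 * 224 * 3), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank a disjoint shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, pin_memory=True)

    loss_fn = torch.nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for x, y in loader:
        x = x.cuda(local_rank, non_blocking=True)
        y = y.cuda(local_rank, non_blocking=True)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```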
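The NHWC layout change and pin_memory setting correspond to standard PyTorch options: the channels_last memory format for the model and inputs, and pin_memory=True in the DataLoader. A minimal sketch follows; the dataset, model, and batch size are placeholders rather than the article's setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic NCHW image batches for illustration only.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))

# pin_memory=True allocates page-locked host memory, allowing faster,
# asynchronous host-to-GPU copies (paired with non_blocking=True below).
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Converting the model and inputs to channels_last (NHWC) lets cuDNN select
# NHWC-optimized kernels on recent GPUs without changing tensor shapes.
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
model = model.to(memory_format=torch.channels_last)

for images, labels in loader:
    images = images.to(device, memory_format=torch.channels_last, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    out = model(images)
```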
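The LoRA results refer to low-rank adapter tuning, which is commonly set up with the Hugging Face peft library. The sketch below is a generic example under assumed settings; the base checkpoint, rank, and target module names are illustrative, not the configuration used in the article.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model; any causal LM checkpoint could be substituted.
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA injects small trainable low-rank adapters into the attention projections,
# so only a fraction of the parameters receive gradients and optimizer state,
# which is why it needs less VRAM than full-weight tuning methods such as DPO.
lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices (assumed)
    lora_alpha=16,                        # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # reports how few parameters are actually tuned
```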
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.