Profiling and Optimization of Multicard GPU Machine Learning Jobs
Marcin Lawenda, Kyrylo Khloponin, Krzesimir Samborski, Łukasz Szustak
Concurrency and Computation: Practice and Experience, vol. 37, issue 18-20, published 2025-07-22. DOI: 10.1002/cpe.70196. https://onlinelibrary.wiley.com/doi/10.1002/cpe.70196
The article discusses various model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, adapted to different hardware and software configurations, are analyzed, including distributed data parallelism and distributed hardware processing. Changing the tensor layout in the PyTorch DataLoader from NCHW to NHWC and enabling pin_memory proved to be highly beneficial and easy to implement. Furthermore, the impact of different performance techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of LLMs was investigated. LoRA allows for faster tuning while requiring less VRAM than DPO. QAT, by contrast, is the most resource-intensive method, with the slowest processing times. A significant portion of LLM tuning time is attributed to initializing new kernels and synchronizing multiple threads when memory operations are not dominant.
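The distributed data parallelism mentioned above is typically expressed in PyTorch with DistributedDataParallel, one process per GPU. The sketch below is a minimal, generic single-node setup; the model, dataset, batch size, and launch method are illustrative assumptions, not the configuration evaluated in the article.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU, typically launched with `torchrun --nproc_per_node=<gpus>`.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; any image classifier could be substituted.
    model = torch.nn.Linear(224 * 224 * 3, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks

    dataset = TensorDataset(torch.randn(1024, 224 * 224 * 3), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank a disjoint shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, pin_memory=True)

    loss_fn = torch.nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for x, y in loader:
        x = x.cuda(local_rank, non_blocking=True)
        y = y.cuda(local_rank, non_blocking=True)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```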
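The NHWC layout change and pin_memory setting correspond to standard PyTorch options: the channels_last memory format for the model and inputs, and pin_memory=True in the DataLoader. A minimal sketch follows; the dataset, model, and batch size are placeholders rather than the article's setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic NCHW image batches for illustration only.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))

# pin_memory=True allocates page-locked host memory, allowing faster,
# asynchronous host-to-GPU copies (paired with non_blocking=True below).
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Converting the model and inputs to channels_last (NHWC) lets cuDNN select
# NHWC-optimized kernels on recent GPUs without changing tensor shapes.
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
model = model.to(memory_format=torch.channels_last)

for images, labels in loader:
    images = images.to(device, memory_format=torch.channels_last, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    out = model(images)
```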
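The LoRA results refer to low-rank adapter tuning, which is commonly set up with the Hugging Face peft library. The sketch below is a generic example under assumed settings; the base checkpoint, rank, and target module names are illustrative, not the configuration used in the article.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model; any causal LM checkpoint could be substituted.
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA injects small trainable low-rank adapters into the attention projections,
# so only a fraction of the parameters receive gradients and optimizer state,
# which is why it needs less VRAM than full-weight tuning methods such as DPO.
lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices (assumed)
    lora_alpha=16,                        # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # reports how few parameters are actually tuned
```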
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.