Insights into resource utilization of code small language models serving with runtime engines and execution providers

IF 4.1 · Zone 2, Computer Science · JCR Q1, Computer Science, Software Engineering

Journal of Systems and Software · Pub Date: 2025-07-28 · DOI: 10.1016/j.jss.2025.112574

Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández

Volume 230, Article 112574. Full text: https://www.sciencedirect.com/science/article/pii/S0164121225002432

Citations: 0

Abstract

The rapid growth of language models, particularly for code generation, demands substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing the resource utilization of language model inference is therefore crucial, and Small Language Models (SLMs) offer a promising way to reduce resource demands. Our goal is to analyze how deep learning serving configurations, defined as combinations of runtime engines and execution providers, affect resource utilization in terms of energy consumption, execution time, and computing-resource utilization, from the point of view of software engineers conducting inference with code generation SLMs. We ran a technology-oriented, multi-stage experimental pipeline with twelve code generation SLMs to measure these three metrics across the configurations. Significant differences emerged across configurations. Configurations using the CUDA execution provider outperformed those using the CPU execution provider in both energy consumption and execution time. Among all configurations, TORCH paired with CUDA was the most energy efficient, with energy savings ranging from 37.99% to 89.16% compared to the other serving configurations. Similarly, optimized runtime engines such as ONNX with the CPU execution provider achieved energy savings ranging from 8.98% to 72.04% among CPU-based configurations. TORCH paired with CUDA also exhibited efficient computing-resource utilization. The choice of serving configuration thus significantly impacts resource utilization. While further research is needed, we recommend choosing among the above configurations according to software engineers’ requirements to improve the resource efficiency of serving.

Editor’s note: Open Science material was validated by the Journal of Systems and Software Open Science Board.
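To make the abstract's terms concrete, the sketch below models a serving configuration as a (runtime engine, execution provider) pair and shows one common way a relative energy saving can be computed. It is only an illustration under stated assumptions: the names `ServingConfig`, `energy_joules`, and `saving_percent` are hypothetical, the power traces are invented placeholder values rather than measurements from the paper, and the paper's own measurement pipeline may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    """A serving configuration: a runtime engine plus an execution provider."""
    engine: str    # e.g. "TORCH" or "ONNX", as named in the abstract
    provider: str  # e.g. "CUDA" or "CPU"


def energy_joules(power_watts, interval_s):
    """Approximate energy as the sum of power samples times a fixed sampling interval."""
    return sum(power_watts) * interval_s


def saving_percent(baseline_j, candidate_j):
    """Relative energy saving of the candidate versus the baseline, in percent."""
    if baseline_j <= 0:
        raise ValueError("baseline energy must be positive")
    return (baseline_j - candidate_j) / baseline_j * 100.0


# Hypothetical power traces in watts, sampled once per second.
# These are placeholders for illustration, NOT results from the paper.
traces = {
    ServingConfig("TORCH", "CUDA"): [50.0, 60.0, 55.0],
    ServingConfig("ONNX", "CPU"): [80.0, 90.0, 85.0, 88.0, 82.0],
}

energy = {cfg: energy_joules(samples, interval_s=1.0) for cfg, samples in traces.items()}
best = min(energy, key=energy.get)  # configuration with the lowest total energy

for cfg, joules in energy.items():
    saved = saving_percent(joules, energy[best])
    print(f"{cfg.engine}+{cfg.provider}: {joules:.1f} J, best config saves {saved:.2f}% against it")
```

Under this definition a saving is always relative to the configuration being compared against, which is why the abstract reports ranges such as "37.99% up to 89.16%" rather than a single number.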


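Source journal: Journal of Systems and Software (Engineering and Technology; Computer Science: Theory and Methods)

CiteScore: 8.60

Self-citation rate: 5.70%

Articles published: 193

Review time: 16 weeks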

About the journal: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:

• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes

The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.