Usage patterns of software product metrics in assessing developers’ output: A comprehensive study

Impact factor: 4.3 · Zone 2 (Computer Science) · JCR Q2 (Computer Science, Information Systems)

Information and Software Technology · Pub Date: 2025-10-17 · DOI: 10.1016/j.infsof.2025.107935

Wentao Chen, Huiqun Yu, Guisheng Fan, Zijie Huang, Yuguo Liang

Volume 189, Article 107935. Publisher page: https://www.sciencedirect.com/science/article/pii/S0950584925002745

Citations: 0

Abstract

Context:

Accurate assessment of developers’ output is crucial for both software engineering research and industrial practice. This assessment often relies on software product metrics such as lines of code (LOC) and quality metrics from static analysis tools. However, existing research lacks a comprehensive understanding of the usage patterns of product metrics, and a single metric is increasingly vulnerable to manipulation, particularly with the emergence of large language models (LLMs).

Objectives:

This study aims to investigate (1) how developers can intentionally manipulate commonly used metrics like LOC by using LLMs, (2) whether complex efficiency metrics provide consistent advantages over simpler metrics, and (3) the reliability and cost-effectiveness of quality metrics derived from tools such as SonarQube.

Methods:

We conduct empirical analyses involving three LLMs to achieve metric manipulation and evaluate product metric reliability across nine open-source projects. We further validate our findings through a collaboration with a large financial institution facing fairness concerns in developers’ output due to inappropriate metric usage.

Results:

We observe that developers can inflate LOC by an average of 60.86% using LLMs, leading to anomalous assessments. Complex efficiency metrics do not yield consistent performance improvements relative to their computational costs. Furthermore, quality metrics from SonarQube and PMD often fail to capture real quality changes and are expensive to compute. The software metric migration plan based on our findings effectively reduces evaluation anomalies in the industry and standardizes developers’ commits, confirming our conclusions’ practical validity.
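The inflation figure above is a relative increase in LOC. The abstract does not state how LOC is counted, so the following is only a minimal sketch under assumed definitions: `count_loc` (a hypothetical helper) counts non-blank lines that are not pure `#` comments, and `loc_inflation_percent` (also hypothetical) reports the growth of a rewritten snippet over the original. The two snippets are invented for illustration and are not data from the study.

```python
def count_loc(source: str) -> int:
    """Count non-blank lines that are not pure '#' comment lines.

    This is one simple LOC definition, assumed here for illustration.
    """
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def loc_inflation_percent(original: str, rewritten: str) -> float:
    """Relative LOC growth of `rewritten` over `original`, in percent."""
    base = count_loc(original)
    if base == 0:
        raise ValueError("original has no countable lines")
    return (count_loc(rewritten) - base) / base * 100.0


# Hypothetical example: the same behavior written compactly and verbosely.
compact = """
def total(xs):
    return sum(x for x in xs if x > 0)
"""

verbose = """
def total(xs):
    result = 0
    for x in xs:
        if x > 0:
            result = result + x
    return result
"""

inflation = loc_inflation_percent(compact, verbose)
print(f"LOC inflation: {inflation:.2f}%")
```

Both snippets compute the same result, yet the verbose one has three times as many counted lines, which illustrates why LOC taken alone is easy to game.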

Conclusion:

Our findings highlight critical limitations in current metric practices and demonstrate how thoughtful usage patterns of product metrics can improve fairness in developer evaluation. This work bridges the gap between academic insights and industrial needs, offering practical guidance for more reliable developers’ output assessment.


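Source journal

Information and Software Technology (Engineering Technology, Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Publication volume: 164

Review time: 9.6 weeks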

Journal description:

Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development.

Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information.

The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.