Measuring the quality of generative AI systems: Mapping metrics to quality characteristics — Snowballing literature review
Liang Yu, Emil Alégroth, Panagiota Chatzipetrou, Tony Gorschek
Information and Software Technology, Volume 186, Article 107802 (published 2025-06-18). DOI: 10.1016/j.infsof.2025.107802
Citations: 0
Abstract
Context
Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs) have revolutionized tasks that previously required significant human effort, attracting considerable interest from industry stakeholders. This growing interest has accelerated the integration of AI models into various industrial applications. However, such integration introduces challenges to product quality, as conventional quality measurement methods may fail to adequately assess GenAI systems. Consequently, evaluation techniques for GenAI systems need to be adapted and refined, and examining the current state and applicability of evaluation techniques for GenAI system outputs is essential.
Objective
This study aims to explore the current metrics, methods, and processes for assessing the outputs of GenAI systems, as well as the potential for risky outputs.
Method
We performed a snowballing literature review to identify metrics, evaluation methods, and evaluation processes from 43 selected papers.
Results
We identified 28 metrics and mapped them to four quality characteristics defined by the ISO/IEC 25023 standard for software systems. Additionally, we identified three types of evaluation methods for measuring the quality of system outputs and a three-step process for assessing faulty system outputs. Based on these insights, we suggested a five-step framework for measuring the quality of systems that utilize GenAI models.
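To make the idea of such a mapping concrete, the minimal Python sketch below shows one way a metric-to-characteristic mapping could be represented and queried. All metric and characteristic names here are illustrative placeholders chosen for the example; they are not the 28 metrics or the four ISO/IEC 25023 characteristics reported in the study.

```python
from collections import defaultdict

# Hypothetical mapping of GenAI output-quality metrics to ISO/IEC 25023-style
# quality characteristics. The names below are illustrative placeholders,
# not the metrics or characteristics identified in the paper.
METRIC_TO_CHARACTERISTIC = {
    "bleu": "functional correctness",
    "rouge_l": "functional correctness",
    "factual_consistency": "functional appropriateness",
    "response_latency_ms": "time behaviour",
    "toxicity_score": "user error protection",
}

def metrics_for(characteristic: str) -> list[str]:
    """Return all metrics mapped to a given quality characteristic."""
    return [m for m, c in METRIC_TO_CHARACTERISTIC.items() if c == characteristic]

if __name__ == "__main__":
    # Group metrics by characteristic to reproduce a mapping table,
    # i.e., which candidate metrics a practitioner could select per characteristic.
    by_characteristic = defaultdict(list)
    for metric, characteristic in METRIC_TO_CHARACTERISTIC.items():
        by_characteristic[characteristic].append(metric)
    for characteristic, metrics in sorted(by_characteristic.items()):
        print(f"{characteristic}: {', '.join(metrics)}")
```

A table-like structure of this kind is all that is needed to support the selection step the authors describe: given a target quality characteristic, look up the candidate metrics mapped to it.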
Conclusion
Our findings provide a mapping of candidate metrics to the quality characteristics of GenAI systems they can measure, accompanied by step-by-step processes to assist practitioners in conducting quality assessments.
Journal overview:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results, and more. See the Guide for Authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.