Evaluation and incident prevention in an enterprise AI assistant

IF 3.2 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Ai Magazine Pub Date : 2025-09-08 DOI:10.1002/aaai.70028

Akash V. Maharaj, David Arbour, Daniel Lee, Uttaran Bhattacharya, Anup Rao, Austin Zane, Avi Feller, Kun Qian, Sajjadur Rahman, Yunyao Li

{"title":"Evaluation and incident prevention in an enterprise AI assistant","authors":"Akash V. Maharaj, David Arbour, Daniel Lee, Uttaran Bhattacharya, Anup Rao, Austin Zane, Avi Feller, Kun Qian, Sajjadur Rahman, Yunyao Li","doi":"10.1002/aaai.70028","DOIUrl":null,"url":null,"abstract":"<p>Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical “severity” framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted approach opens avenues for various classes of enhancements, including human-AI collaborative evaluation, paving the way for more robust and trustworthy AI systems. </p>","PeriodicalId":7854,"journal":{"name":"Ai Magazine","volume":"46 3","pages":""},"PeriodicalIF":3.2000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/aaai.70028","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ai Magazine","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/aaai.70028","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical “severity” framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted approach opens avenues for various classes of enhancements, including human-AI collaborative evaluation, paving the way for more robust and trustworthy AI systems.

Abstract Image

查看原文本刊更多论文

企业AI助手的评估与事件预防

企业人工智能助手越来越多地部署在准确性至关重要的领域，使得每个错误的输出都可能成为重大事件。本文提出了一个全面的框架，用于监控、基准测试和持续改进由多个团队积极开发的这种复杂的多组件系统。我们的方法包含三个关键要素：(1)用于事件检测的分层“严重性”框架，该框架可以识别和分类错误，同时归因于特定组件的错误率，促进有针对性的改进；(2)用于基准构建、评估和部署的可扩展和原则性方法，旨在适应多个开发团队，减轻过度拟合风险，并评估系统修改的下游影响；(3)利用多维评价的持续改进策略，使识别和实施各种改进机会成为可能。通过采用这一整体框架，组织可以系统地提高其人工智能助手的可靠性和性能，确保其在关键企业环境中的有效性。最后，我们讨论了这种多方面的方法如何为各种类型的增强开辟道路，包括人类-人工智能协作评估，为更强大和值得信赖的人工智能系统铺平道路。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Ai Magazine 工程技术-计算机：人工智能

CiteScore

3.90

自引率

11.10%

发文量

审稿时长

>12 weeks

期刊介绍： AI Magazine publishes original articles that are reasonably self-contained and aimed at a broad spectrum of the AI community. Technical content should be kept to a minimum. In general, the magazine does not publish articles that have been published elsewhere in whole or in part. The magazine welcomes the contribution of articles on the theory and practice of AI as well as general survey articles, tutorial articles on timely topics, conference or symposia or workshop reports, and timely columns on topics of interest to AI scientists.