Evaluation and incident prevention in an enterprise AI assistant

IF 3.2 · Zone 4 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
AI Magazine · Pub Date: 2025-09-08 · DOI: 10.1002/aaai.70028
Akash V. Maharaj, David Arbour, Daniel Lee, Uttaran Bhattacharya, Anup Rao, Austin Zane, Avi Feller, Kun Qian, Sajjadur Rahman, Yunyao Li
Journal Article · AI Magazine, volume 46, issue 3 · https://onlinelibrary.wiley.com/doi/10.1002/aaai.70028
Citations: 0

Abstract

Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical “severity” framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted approach opens avenues for various classes of enhancements, including human-AI collaborative evaluation, paving the way for more robust and trustworthy AI systems.
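
As a rough illustration of element (1), the sketch below shows one way reviewed responses could be tallied into component-specific error rates per severity level. It is a minimal sketch under stated assumptions, not the authors' implementation: the severity labels (`sev0`, `sev1`, `sev2`), the component names (`retrieval`, `generation`), the `Judgment` record, and the `error_rates_by_component` helper are all hypothetical and do not come from the abstract.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical severity labels, chosen for illustration only.
SEVERITIES = ("sev0", "sev1", "sev2")


@dataclass(frozen=True)
class Judgment:
    """One reviewed assistant response (hypothetical record layout)."""
    response_id: str
    severity: Optional[str]   # None means the reviewer found no error
    component: Optional[str]  # component blamed for the error, if any


def error_rates_by_component(judgments):
    """Return {component: {severity: share of all reviewed responses}}."""
    total = len(judgments)
    counts = defaultdict(Counter)
    for j in judgments:
        if j.severity is not None:
            counts[j.component][j.severity] += 1
    # Counter returns 0 for severities a component never triggered.
    return {
        component: {sev: by_sev[sev] / total for sev in SEVERITIES}
        for component, by_sev in counts.items()
    }


# Made-up sample data for the usage example.
sample = [
    Judgment("r1", None, None),
    Judgment("r2", "sev0", "retrieval"),
    Judgment("r3", "sev2", "generation"),
    Judgment("r4", "sev0", "retrieval"),
]
rates = error_rates_by_component(sample)
```

Dividing by the total number of reviewed responses, rather than by the number of errors, keeps the rates comparable across components, so the component with the largest share of high-severity errors is a natural first target for improvement.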


Source journal
AI Magazine
Category: Engineering and Technology, Computer Science: Artificial Intelligence
CiteScore: 3.90
Self-citation rate: 11.10%
Articles published: 61
Review time: >12 weeks
Journal description: AI Magazine publishes original articles that are reasonably self-contained and aimed at a broad spectrum of the AI community. Technical content should be kept to a minimum. In general, the magazine does not publish articles that have been published elsewhere in whole or in part. The magazine welcomes the contribution of articles on the theory and practice of AI as well as general survey articles, tutorial articles on timely topics, conference or symposia or workshop reports, and timely columns on topics of interest to AI scientists.