InCA: Rethinking In-Car Conversational System Assessment Leveraging Large Language Models

Friedl, Ken E., Khan, Abbas Goher, Sahoo, Soumya Ranjan, Rony, Md Rashad Al Hasan, Germies, Jana, Süß, Christian
arXiv (Cornell University), published 2023-11-13
DOI: 10.48550/arxiv.2311.07469 (https://doi.org/10.48550/arxiv.2311.07469)

Abstract

The assessment of advanced generative large language models (LLMs) poses a significant challenge, given their growing complexity. Furthermore, evaluating the performance of LLM-based applications across industries, as measured by Key Performance Indicators (KPIs), is a complex undertaking that requires a deep understanding of industry use cases and the expected system behavior. In the automotive industry, existing evaluation metrics prove inadequate for assessing in-car conversational question answering (ConvQA) systems. The unique demands of these systems, where answers may relate to driver or car safety and are confined to the car domain, highlight the limitations of current metrics. To address these challenges, this paper introduces a set of KPIs tailored for evaluating the performance of in-car ConvQA systems, along with datasets specifically designed for these KPIs. A preliminary and comprehensive empirical evaluation substantiates the efficacy of the proposed approach. Furthermore, we investigate the impact of employing varied personas in prompts and find that doing so enhances the model's capacity to simulate diverse viewpoints in assessments, mirroring how individuals with different backgrounds perceive a topic.
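The idea of varying personas in evaluation prompts can be illustrated with a minimal sketch. This is not the paper's implementation: the persona descriptions, the KPI name, the 1-to-5 scale, the prompt wording, and the `stub_judge` function (a stand-in for a real LLM call) are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Persona:
    name: str
    description: str


# Hypothetical personas; the abstract does not list the ones the authors used.
PERSONAS: List[Persona] = [
    Persona("safety_engineer",
            "an automotive safety engineer who prioritizes driver and vehicle safety"),
    Persona("everyday_driver",
            "an everyday driver who wants short, practical answers"),
]


def build_prompt(persona: Persona, kpi: str, question: str, answer: str) -> str:
    """Compose one persona-conditioned assessment prompt (wording is assumed)."""
    return (
        f"You are {persona.description}.\n"
        f"Rate the answer below for the KPI '{kpi}' "
        "on a scale from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )


def assess(judge: Callable[[str], str], kpi: str,
           question: str, answer: str) -> Dict[str, int]:
    """Ask the judge once per persona and collect the integer scores."""
    scores: Dict[str, int] = {}
    for persona in PERSONAS:
        reply = judge(build_prompt(persona, kpi, question, answer))
        scores[persona.name] = int(reply.strip())
    return scores


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call; returns a fixed score based on the persona text."""
    return "3" if "safety engineer" in prompt else "4"


if __name__ == "__main__":
    scores = assess(
        stub_judge,
        kpi="safety relevance",  # hypothetical KPI name
        question="Can I keep driving with the tire pressure warning on?",
        answer="Yes, it is fine to ignore it.",
    )
    print(scores, mean(scores.values()))
```

In practice `stub_judge` would be replaced by a call to whichever LLM serves as the assessor. Keeping the per-persona scores separate, rather than only averaging them, preserves the differing viewpoints the abstract describes.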