{"title":"在线与离线:社交聊天机器人的第一方和第三方评价比较研究","authors":"Ekaterina Svikhnushina, Pearl Pu","doi":"arxiv-2409.07823","DOIUrl":null,"url":null,"abstract":"This paper explores the efficacy of online versus offline evaluation methods\nin assessing conversational chatbots, specifically comparing first-party direct\ninteractions with third-party observational assessments. By extending a\nbenchmarking dataset of user dialogs with empathetic chatbots with offline\nthird-party evaluations, we present a systematic comparison between the\nfeedback from online interactions and the more detached offline third-party\nevaluations. Our results reveal that offline human evaluations fail to capture\nthe subtleties of human-chatbot interactions as effectively as online\nassessments. In comparison, automated third-party evaluations using a GPT-4\nmodel offer a better approximation of first-party human judgments given\ndetailed instructions. This study highlights the limitations of third-party\nevaluations in grasping the complexities of user experiences and advocates for\nthe integration of direct interaction feedback in conversational AI evaluation\nto enhance system development and user satisfaction.","PeriodicalId":501541,"journal":{"name":"arXiv - CS - Human-Computer Interaction","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots\",\"authors\":\"Ekaterina Svikhnushina, Pearl Pu\",\"doi\":\"arxiv-2409.07823\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper explores the efficacy of online versus offline evaluation methods\\nin assessing conversational chatbots, specifically comparing first-party direct\\ninteractions with third-party observational assessments. By extending a\\nbenchmarking dataset of user dialogs with empathetic chatbots with offline\\nthird-party evaluations, we present a systematic comparison between the\\nfeedback from online interactions and the more detached offline third-party\\nevaluations. Our results reveal that offline human evaluations fail to capture\\nthe subtleties of human-chatbot interactions as effectively as online\\nassessments. In comparison, automated third-party evaluations using a GPT-4\\nmodel offer a better approximation of first-party human judgments given\\ndetailed instructions. 
This study highlights the limitations of third-party\\nevaluations in grasping the complexities of user experiences and advocates for\\nthe integration of direct interaction feedback in conversational AI evaluation\\nto enhance system development and user satisfaction.\",\"PeriodicalId\":501541,\"journal\":{\"name\":\"arXiv - CS - Human-Computer Interaction\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Human-Computer Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07823\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Human-Computer Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots
This paper explores the efficacy of online versus offline evaluation methods for assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmarking dataset of user dialogs with empathetic chatbots to include offline third-party evaluations, we present a systematic comparison between feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model approximate first-party human judgments more closely when given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for integrating direct interaction feedback into conversational AI evaluation to enhance system development and user satisfaction.
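
To make the automated third-party setup mentioned in the abstract concrete, the sketch below shows how an LLM judge might be prompted to rate a user-chatbot dialog. It is a minimal illustration under stated assumptions, not the authors' actual protocol: the rubric wording, the rate_dialog helper, the 1-5 empathy scale, and the use of the OpenAI Python client are all assumptions introduced here for illustration.

```python
# Minimal sketch of automated third-party evaluation with an LLM judge.
# Assumptions: the OpenAI Python SDK (>=1.0) is installed and OPENAI_API_KEY is set;
# the rubric text and dialog format are hypothetical, not the paper's instructions.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are evaluating an empathetic chatbot from a third-party perspective. "
    "Read the dialog and rate, on a 1-5 scale, how well the chatbot recognized "
    "and responded to the user's emotions. Reply with a single integer."
)

def rate_dialog(dialog_text: str, model: str = "gpt-4") -> int:
    """Ask the LLM judge for a 1-5 empathy rating of one user-chatbot dialog."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": dialog_text},
        ],
        temperature=0,  # keep ratings as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    sample = (
        "User: I failed my exam today.\n"
        "Chatbot: I'm so sorry, that must feel discouraging. "
        "Do you want to talk about it?"
    )
    print(rate_dialog(sample))
```

In a study like this one, such ratings would be collected over the full dialog set and compared against the first-party judgments users gave during the online interactions.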