{"title":"Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots","authors":"Ekaterina Svikhnushina, Pearl Pu","doi":"arxiv-2409.07823","DOIUrl":null,"url":null,"abstract":"This paper explores the efficacy of online versus offline evaluation methods\nin assessing conversational chatbots, specifically comparing first-party direct\ninteractions with third-party observational assessments. By extending a\nbenchmarking dataset of user dialogs with empathetic chatbots with offline\nthird-party evaluations, we present a systematic comparison between the\nfeedback from online interactions and the more detached offline third-party\nevaluations. Our results reveal that offline human evaluations fail to capture\nthe subtleties of human-chatbot interactions as effectively as online\nassessments. In comparison, automated third-party evaluations using a GPT-4\nmodel offer a better approximation of first-party human judgments given\ndetailed instructions. This study highlights the limitations of third-party\nevaluations in grasping the complexities of user experiences and advocates for\nthe integration of direct interaction feedback in conversational AI evaluation\nto enhance system development and user satisfaction.","PeriodicalId":501541,"journal":{"name":"arXiv - CS - Human-Computer Interaction","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Human-Computer Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper explores the efficacy of online versus offline evaluation methods for assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. We extend a benchmarking dataset of user dialogs with empathetic chatbots by adding offline third-party evaluations, and present a systematic comparison between feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In contrast, automated third-party evaluations using a GPT-4 model, when given detailed instructions, approximate first-party human judgments more closely. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for the integration of direct interaction feedback into conversational AI evaluation to enhance system development and user satisfaction.
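As a rough illustration of what an automated third-party evaluation of this kind can look like, the sketch below asks GPT-4 to rate a logged human-chatbot dialog against a small rubric. This is not the paper's protocol: the rating dimensions, prompt wording, and score scale are hypothetical placeholders, and the snippet only assumes the standard OpenAI Python client with an API key available in the environment.

```python
# Minimal sketch of an automated third-party (LLM-as-rater) evaluation.
# Illustrative only: the rubric, dimensions, and prompt wording are
# hypothetical and not taken from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_dialog(dialog_text: str) -> str:
    """Ask GPT-4 to act as an external rater of a human-chatbot dialog."""
    instructions = (
        "You are an external evaluator reading a conversation between a user "
        "and an empathetic chatbot. Rate the chatbot from 1 (poor) to 5 "
        "(excellent) on: empathy, coherence, and overall user satisfaction. "
        "Return one line per dimension in the form '<dimension>: <score>'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic ratings for comparability across dialogs
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": dialog_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample = (
        "User: I failed my exam today.\n"
        "Bot: I'm so sorry, that sounds really discouraging. "
        "Do you want to talk about what happened?"
    )
    print(rate_dialog(sample))
```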