{"title":"Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots","authors":"Ekaterina Svikhnushina, Pearl Pu","doi":"arxiv-2409.07823","DOIUrl":null,"url":null,"abstract":"This paper explores the efficacy of online versus offline evaluation methods\nin assessing conversational chatbots, specifically comparing first-party direct\ninteractions with third-party observational assessments. By extending a\nbenchmarking dataset of user dialogs with empathetic chatbots with offline\nthird-party evaluations, we present a systematic comparison between the\nfeedback from online interactions and the more detached offline third-party\nevaluations. Our results reveal that offline human evaluations fail to capture\nthe subtleties of human-chatbot interactions as effectively as online\nassessments. In comparison, automated third-party evaluations using a GPT-4\nmodel offer a better approximation of first-party human judgments given\ndetailed instructions. This study highlights the limitations of third-party\nevaluations in grasping the complexities of user experiences and advocates for\nthe integration of direct interaction feedback in conversational AI evaluation\nto enhance system development and user satisfaction.","PeriodicalId":501541,"journal":{"name":"arXiv - CS - Human-Computer Interaction","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Human-Computer Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper explores the efficacy of online versus offline evaluation methods for assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. We extend a benchmarking dataset of user dialogs with empathetic chatbots by adding offline third-party evaluations, and present a systematic comparison between feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In contrast, automated third-party evaluations using a GPT-4 model, when given detailed instructions, approximate first-party human judgments more closely. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for the integration of direct interaction feedback into conversational AI evaluation to enhance system development and user satisfaction.
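As a rough illustration of what an automated third-party evaluation of this kind can look like, the sketch below asks GPT-4 to rate a logged human-chatbot dialog against a small rubric. This is not the paper's protocol: the rating dimensions, prompt wording, and score scale are hypothetical placeholders, and the snippet only assumes the standard OpenAI Python client with an API key available in the environment.

```python
# Minimal sketch of an automated third-party (LLM-as-rater) evaluation.
# Illustrative only: the rubric, dimensions, and prompt wording are
# hypothetical and not taken from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_dialog(dialog_text: str) -> str:
    """Ask GPT-4 to act as an external rater of a human-chatbot dialog."""
    instructions = (
        "You are an external evaluator reading a conversation between a user "
        "and an empathetic chatbot. Rate the chatbot from 1 (poor) to 5 "
        "(excellent) on: empathy, coherence, and overall user satisfaction. "
        "Return one line per dimension in the form '<dimension>: <score>'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic ratings for comparability across dialogs
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": dialog_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample = (
        "User: I failed my exam today.\n"
        "Bot: I'm so sorry, that sounds really discouraging. "
        "Do you want to talk about what happened?"
    )
    print(rate_dialog(sample))
```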