{"title":"在线与离线:社交聊天机器人的第一方和第三方评价比较研究","authors":"Ekaterina Svikhnushina, Pearl Pu","doi":"arxiv-2409.07823","DOIUrl":null,"url":null,"abstract":"This paper explores the efficacy of online versus offline evaluation methods\nin assessing conversational chatbots, specifically comparing first-party direct\ninteractions with third-party observational assessments. By extending a\nbenchmarking dataset of user dialogs with empathetic chatbots with offline\nthird-party evaluations, we present a systematic comparison between the\nfeedback from online interactions and the more detached offline third-party\nevaluations. Our results reveal that offline human evaluations fail to capture\nthe subtleties of human-chatbot interactions as effectively as online\nassessments. In comparison, automated third-party evaluations using a GPT-4\nmodel offer a better approximation of first-party human judgments given\ndetailed instructions. This study highlights the limitations of third-party\nevaluations in grasping the complexities of user experiences and advocates for\nthe integration of direct interaction feedback in conversational AI evaluation\nto enhance system development and user satisfaction.","PeriodicalId":501541,"journal":{"name":"arXiv - CS - Human-Computer Interaction","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots\",\"authors\":\"Ekaterina Svikhnushina, Pearl Pu\",\"doi\":\"arxiv-2409.07823\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper explores the efficacy of online versus offline evaluation methods\\nin assessing conversational chatbots, specifically comparing first-party direct\\ninteractions with third-party observational assessments. By extending a\\nbenchmarking dataset of user dialogs with empathetic chatbots with offline\\nthird-party evaluations, we present a systematic comparison between the\\nfeedback from online interactions and the more detached offline third-party\\nevaluations. Our results reveal that offline human evaluations fail to capture\\nthe subtleties of human-chatbot interactions as effectively as online\\nassessments. In comparison, automated third-party evaluations using a GPT-4\\nmodel offer a better approximation of first-party human judgments given\\ndetailed instructions. 
This study highlights the limitations of third-party\\nevaluations in grasping the complexities of user experiences and advocates for\\nthe integration of direct interaction feedback in conversational AI evaluation\\nto enhance system development and user satisfaction.\",\"PeriodicalId\":501541,\"journal\":{\"name\":\"arXiv - CS - Human-Computer Interaction\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Human-Computer Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07823\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Human-Computer Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots
This paper explores the efficacy of online versus offline evaluation methods for assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmarking dataset of user dialogs with empathetic chatbots to include offline third-party evaluations, we present a systematic comparison between feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model approximate first-party human judgments more closely when given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for integrating direct interaction feedback into conversational AI evaluation to enhance system development and user satisfaction.
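
To make the automated third-party setup mentioned in the abstract concrete, the sketch below shows how an LLM judge might be prompted to rate a user-chatbot dialog. It is a minimal illustration under stated assumptions, not the authors' actual protocol: the rubric wording, the rate_dialog helper, the 1-5 empathy scale, and the use of the OpenAI Python client are all assumptions introduced here for illustration.

```python
# Minimal sketch of automated third-party evaluation with an LLM judge.
# Assumptions: the OpenAI Python SDK (>=1.0) is installed and OPENAI_API_KEY is set;
# the rubric text and dialog format are hypothetical, not the paper's instructions.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are evaluating an empathetic chatbot from a third-party perspective. "
    "Read the dialog and rate, on a 1-5 scale, how well the chatbot recognized "
    "and responded to the user's emotions. Reply with a single integer."
)

def rate_dialog(dialog_text: str, model: str = "gpt-4") -> int:
    """Ask the LLM judge for a 1-5 empathy rating of one user-chatbot dialog."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": dialog_text},
        ],
        temperature=0,  # keep ratings as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    sample = (
        "User: I failed my exam today.\n"
        "Chatbot: I'm so sorry, that must feel discouraging. "
        "Do you want to talk about it?"
    )
    print(rate_dialog(sample))
```

In a study like this one, such ratings would be collected over the full dialog set and compared against the first-party judgments users gave during the online interactions.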