Meta-evaluation of Conversational Search Evaluation Metrics

Zeyang Liu, K. Zhou, Max L. Wilson
{"title":"会话搜索评价指标的元评价","authors":"Zeyang Liu, K. Zhou, Max L. Wilson","doi":"10.1145/3445029","DOIUrl":null,"url":null,"abstract":"Conversational search systems, such as Google assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging, given that any natural language responses could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent of how those measures effectively capture user preference remain to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics vary significantly across different scenarios, whereas consistent with prior studies, existing metrics only achieve weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.","PeriodicalId":6934,"journal":{"name":"ACM Transactions on Information Systems (TOIS)","volume":"1 1","pages":"1 - 42"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Meta-evaluation of Conversational Search Evaluation Metrics\",\"authors\":\"Zeyang Liu, K. Zhou, Max L. Wilson\",\"doi\":\"10.1145/3445029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Conversational search systems, such as Google assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging, given that any natural language responses could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent of how those measures effectively capture user preference remain to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. 
By conducting experiments on two test collections, we find that the performance of different metrics vary significantly across different scenarios, whereas consistent with prior studies, existing metrics only achieve weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.\",\"PeriodicalId\":6934,\"journal\":{\"name\":\"ACM Transactions on Information Systems (TOIS)\",\"volume\":\"1 1\",\"pages\":\"1 - 42\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Information Systems (TOIS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3445029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems (TOIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3445029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8

Abstract

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and users commonly interact over multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important, namely adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios; consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. Considering all three perspectives, METEOR is, comparatively speaking, the best existing single-turn metric. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
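To make the fidelity perspective concrete, the sketch below (our illustration, not the authors' code or data) scores hypothetical system responses against reference responses with METEOR via NLTK, then measures how well the metric's ranking agrees with per-turn user satisfaction labels using Kendall's tau from SciPy. All example turns, satisfaction values, and the choice of NLTK/SciPy are assumptions made purely for illustration.

```python
# Minimal sketch of a fidelity-style meta-evaluation: compare a
# single-turn metric (METEOR) against user satisfaction labels.
# Requires: nltk (with the 'wordnet' corpus downloaded) and scipy.

from nltk.translate.meteor_score import meteor_score
from scipy.stats import kendalltau

# Hypothetical (reference response, system response, user satisfaction) triples.
turns = [
    ("the flight departs at 9 am from gatwick", "it leaves gatwick at 9 am", 4),
    ("tomorrow will be rainy with highs of 12c", "sunny all day tomorrow", 1),
    ("the museum opens at 10 and closes at 6", "it is open from 10 am to 6 pm", 5),
]

metric_scores, satisfaction = [], []
for reference, response, sat in turns:
    # Recent NLTK versions expect pre-tokenized input; plain whitespace
    # tokenization is used here only for illustration.
    score = meteor_score([reference.split()], response.split())
    metric_scores.append(score)
    satisfaction.append(sat)

# Fidelity check: concordance between the metric's ranking of responses
# and the users' satisfaction ranking.
tau, p_value = kendalltau(metric_scores, satisfaction)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```

The same pattern extends to other single-turn metrics (e.g., BLEU or ROUGE) or to session-level aggregates by swapping the scoring function; the concordance statistic is what the weak-to-moderate correlations reported in the abstract refer to.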