Assessor Differences and User Preferences in Tweet Timeline Generation

Yulu Wang, G. Sherman, Jimmy J. Lin, Miles Efron
{"title":"Assessor Differences and User Preferences in Tweet Timeline Generation","authors":"Yulu Wang, G. Sherman, Jimmy J. Lin, Miles Efron","doi":"10.1145/2766462.2767699","DOIUrl":null,"url":null,"abstract":"In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology is human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2766462.2767699","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30

Abstract

In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology is human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
评估者差异和用户偏好在推文时间轴生成
在信息检索评价中,当两种系统的有效性存在差异时,人们可能会提出三个相关问题。首先,这些差异在统计上是否显著?第二,相对于评估者的差异,比较是否稳定?最后,这种差异对用户来说真的有意义吗?本文在TREC 2014微博轨道中新引入的推文时间线生成任务的背景下解决了关于评估者差异和用户偏好的最后两个问题,其中系统的目标是构建非冗余推文的信息摘要,以满足用户的信息需求。评估方法的核心是人类生成的包含实质相似信息的推文语义集群。我们表明,就聚类的评估者差异而言,评估是稳定的,即使用户没有明确地意识到系统正在执行的语义聚类,用户偏好通常也与有效性指标相关。尽管我们的分析仅限于这一特定任务,但我们相信,通过建立信息单元之间的语义等价,可以将经验教训推广到其他评估中,例如基于金块的问题回答评估和时间摘要评估。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信