{"title":"Assessor Differences and User Preferences in Tweet Timeline Generation","authors":"Yulu Wang, G. Sherman, Jimmy J. Lin, Miles Efron","doi":"10.1145/2766462.2767699","DOIUrl":null,"url":null,"abstract":"In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology is human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2766462.2767699","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 30
Abstract
In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology is human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
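To make the cluster-based evaluation concrete, the sketch below illustrates two metrics in the spirit of the tweet timeline generation setup: recall over human-generated semantic clusters and a precision that penalizes redundant tweets. This is a minimal illustration under assumed data structures (a ranked list of tweet IDs and a mapping from cluster ID to member tweet IDs); the function names and exact scoring details are hypothetical and not taken from the official TREC 2014 Microblog track scorer.

```python
from typing import Dict, List, Set


def cluster_recall(timeline: List[str], clusters: Dict[str, Set[str]]) -> float:
    """Fraction of semantic clusters covered by at least one tweet in the timeline."""
    if not clusters:
        return 0.0
    covered = sum(
        1 for members in clusters.values()
        if any(tweet_id in members for tweet_id in timeline)
    )
    return covered / len(clusters)


def redundancy_penalized_precision(timeline: List[str], clusters: Dict[str, Set[str]]) -> float:
    """Fraction of timeline tweets that are the first hit on some cluster.

    Tweets that repeat an already-covered cluster, or that belong to no
    relevant cluster, earn no credit, so redundancy lowers precision.
    """
    if not timeline:
        return 0.0
    seen_clusters: Set[str] = set()
    credited = 0
    for tweet_id in timeline:
        hit = next((cid for cid, members in clusters.items() if tweet_id in members), None)
        if hit is not None and hit not in seen_clusters:
            seen_clusters.add(hit)
            credited += 1
    return credited / len(timeline)


# Toy example (hypothetical IDs): two clusters, three tweets returned, one redundant.
clusters = {"c1": {"t1", "t2"}, "c2": {"t3"}}
timeline = ["t1", "t2", "t3"]
print(cluster_recall(timeline, clusters))                 # 1.0 -- both clusters covered
print(redundancy_penalized_precision(timeline, clusters)) # ~0.67 -- t2 repeats cluster c1
```

Evaluating against clusters rather than individual tweets is what makes the assessor-difference question interesting: different assessors may group the same tweets into different clusters, and the paper examines whether system comparisons remain stable under such variation.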