The Impact of Fixed-Cost Pooling Strategies on Test Collection Bias
Aldo Lipani, G. Zuccon, M. Lupu, B. Koopman, A. Hanbury
DOI: 10.1145/2970398.2970429

In Information Retrieval, test collections are usually built using the pooling method, and many pooling strategies have been developed for it. Herein, we address the question of identifying the best pooling strategy when evaluating systems using precision-oriented measures in the presence of budget constraints on the number of documents to be evaluated. As a quality measure, we use the bias introduced by the pooling strategy, quantified both in terms of the Mean Absolute Error of the scores and in terms of ranking errors. Based on experiments on 15 test collections, we conclude that, for precision-oriented measures, the best strategies are based on Rank-Biased Precision (RBP). These results can inform collection builders because they suggest that, under fixed assessment budget constraints, RBP-based sampling produces less biased pools than the alternatives.
{"title":"Optimization Method for Weighting Explicit and Latent Concepts in Clinical Decision Support Queries","authors":"Saeid Balaneshinkordan, Alexander Kotov","doi":"10.1145/2970398.2970418","DOIUrl":"https://doi.org/10.1145/2970398.2970418","url":null,"abstract":"Accurately answering verbose queries that describe a clinical case and aim at finding articles in a collection of medical literature requires capturing many explicit and latent aspects of complex information needs underlying such queries. Proper representation of these aspects often requires query analysis to identify the most important query concepts as well as query transformation by adding new concepts to a query, which can be extracted from the top retrieved documents or medical knowledge bases. Traditionally, query analysis and expansion have been done separately. In this paper, we propose a method for representing verbose domain-specific queries based on weighted unigram, bigram, and multi-term concepts in the query itself, as well as extracted from the top retrieved documents and external knowledge bases. We also propose a graduated non-convexity optimization framework, which allows to unify query analysis and expansion by jointly determining the importance weights for the query and expansion concepts depending on their type and source. Experiments using a collection of PubMed articles and TREC Clinical Decision Support (CDS) track queries indicate that applying our proposed method results in significant improvement of retrieval accuracy over state-of-the-art methods for ad hoc and medical IR.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132169917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Analysis of the Cost and Benefit of Search Interactions","authors":"L. Azzopardi, G. Zuccon","doi":"10.1145/2970398.2970412","DOIUrl":"https://doi.org/10.1145/2970398.2970412","url":null,"abstract":"Interactive Information Retrieval (IR) systems often provide various features and functions, such as query suggestions and relevance feedback, that a user may or may not decide to use. The decision to take such an option has associated costs and may lead to some benefit. Thus, a savvy user would take decisions that maximises their net benefit. In this paper, we formally model the costs and benefits of various decisions that users, implicitly or explicitly, make when searching. We consider and analyse the following scenarios: (i) how long a user's query should be? (ii) should the user pose a specific or vague query? (iii) should the user take a suggestion or re-formulate? (iv) when should a user employ relevance feedback? and (v) when would the \"find similar\" functionality be worthwhile to the user? To this end, we build a series of cost-benefit models exploring a variety of parameters that affect the decisions at play. Through the analyses, we are able to draw a number of insights into different decisions, provide explanations for observed behaviours and generate numerous testable hypotheses. This work not only serves as a basis for future empirical work, but also as a template for developing other cost-benefit models involving human-computer interaction.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130014603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Utility Maximization Framework for Privacy Preservation of User Generated Content","authors":"Yi Fang, Archana Godavarthy, Haibing Lu","doi":"10.1145/2970398.2970417","DOIUrl":"https://doi.org/10.1145/2970398.2970417","url":null,"abstract":"The prodigious amount of user-generated content continues to grow at an enormous rate. While it greatly facilitates the flow of information and ideas among people and communities, it may pose great threat to our individual privacy. In this paper, we demonstrate that the private traits of individuals can be inferred from user-generated content by using text classification techniques. Specifically, we study three private attributes on Twitter users: religion, political leaning, and marital status. The ground truth labels of the private traits can be readily collected from the Twitter bio field. Based on the tweets posted by the users and their corresponding bios, we show that text classification yields a high accuracy of identification of these personal attributes, which poses a great privacy risk on user-generated content. We further propose a constrained utility maximization framework for preserving user privacy. The goal is to maximize the utility of data when modifying the user-generated content, while degrading the prediction performance of the adversary. The KL divergence is minimized between the prior knowledge about the private attribute and the posterior probability after seeing the user-generated data. Based on this proposed framework, we investigate several specific data sanitization operations for privacy preservation: add, delete, or replace words in the tweets. We derive the exact transformation of the data under each operation. The experiments demonstrate the effectiveness of the proposed framework.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134371509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilizing Knowledge Bases in Text-centric Information Retrieval","authors":"Laura Dietz, Alexander Kotov, E. Meij","doi":"10.1145/2970398.2970441","DOIUrl":"https://doi.org/10.1145/2970398.2970441","url":null,"abstract":"General-purpose knowledge bases are increasingly growing in terms of depth (content) and width (coverage). Moreover, algorithms for entity linking and entity retrieval have improved tremendously in the past years. These developments give rise to a new line of research that exploits and combines these developments for the purposes of text-centric information retrieval applications. This tutorial focuses on a) how to retrieve a set of entities for an ad-hoc query, or more broadly, assessing relevance of KB elements for the information need, b) how to annotate text with such elements, and c) how to use this information to assess the relevance of text. We discuss different kinds of information available in a knowledge graph and how to leverage each most effectively. We start the tutorial with a brief overview of different types of knowledge bases, their structure and information contained in popular general-purpose and domain-specific knowledge bases. In particular, we focus on the representation of entity-centric information in the knowledge base through names, terms, relations, and type taxonomies. Next, we will provide a recap on ad-hoc object retrieval from knowledge graphs as well as entity linking and retrieval. This is essential technology, which the remainder of the tutorial builds on. Next we will cover essential components within successful entity linking systems, including the collection of entity name information and techniques for disambiguation with contextual entity mentions. We will present the details of four previously proposed systems that successfully leverage knowledge bases to improve ad-hoc document retrieval. These systems combine the notion of entity retrieval and semantic search on one hand, with text retrieval models and entity linking on the other. Finally, we also touch on entity aspects and links in the knowledge graph as it can help to understand the entities' context. This tutorial is the first to compile, summarize, and disseminate progress in this emerging area and we provide both an overview of state-of-the-art methods and outline open research problems to encourage new contributions.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116122292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to Rank with Labeled Features","authors":"Fernando Diaz","doi":"10.1145/2970398.2970435","DOIUrl":"https://doi.org/10.1145/2970398.2970435","url":null,"abstract":"Classic learning to rank algorithms are trained using a set of labeled documents, pairs of documents, or rankings of documents. Unfortunately, in many situations, gathering such labels requires significant overhead in terms of time and money. We present an algorithm for training a learning to rank model using a set of labeled features elicited from system designers or domain experts. Labeled features incorporate a system designer's belief about the correlation between certain features and relative relevance. We demonstrate the efficacy of our model on a public learning to rank dataset. Our results show that we outperform our baselines even when using as little as a single feature label.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121754700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EventMiner: Mining Events from Annotated Documents","authors":"Dhruv Gupta, Jannik Strotgen, K. Berberich","doi":"10.1145/2970398.2970411","DOIUrl":"https://doi.org/10.1145/2970398.2970411","url":null,"abstract":"Events are central in human history and thus also in Web queries, in particular if they relate to history or news. However, ambiguity issues arise as queries may refer to ambiguous events differing in time, geography, or participating entities. Thus, users would greatly benefit if search results were presented along different events. In this paper, we present EventMiner, an algorithm that mines events from top-k pseudo-relevant documents for a given query. It is a probabilistic framework that leverages semantic annotations in the form of temporal expressions, geographic locations, and named entities to analyze natural language text and determine important events. Using a large news corpus, we show that using semantic annotations, EventMiner detects important events and presents documents covering the identified events in the order of their importance.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127068090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Estimation of Topics and Hashtag Relevance in Cross-Lingual Tweets","authors":"Procheta Sen, Debasis Ganguly, G. Jones","doi":"10.1145/2970398.2970425","DOIUrl":"https://doi.org/10.1145/2970398.2970425","url":null,"abstract":"Twitter is a widely used platform for sharing news articles. An emerging trend in multi-lingual communities is to share non-English news articles using English tweets in order to spread the news to a wider audience. In general, the choice of relevant hashtags for such tweets depends on the topic of the non-English news article. In this paper, we address the problem of automatically detecting the relevance of the hashtags of such tweets. More specifically, we propose a generative model to jointly model the topics within an English tweet and those within the non-English news article shared from it to predict the relevance of the hashtags of the tweet. For conducting experiments, we compiled a collection of English tweets that share news articles in Bengali (a South Asian language). Our experiments on this dataset demonstrate that this joint estimation based approach using the topics from both the non-English news articles and the tweets proves to be more effective for relevance estimation than that of only using the topics of a tweet itself.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124865063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From \"More Like This\" to \"Better Than This\"","authors":"Haggai Roitman, D. Cohen, S. Hummel","doi":"10.1145/2970398.2970421","DOIUrl":"https://doi.org/10.1145/2970398.2970421","url":null,"abstract":"In this paper we address a novel retrieval problem we term the \"Better Than This\" problem. For a given pair of a user query to be answered by some search engine and a single example answer provided by the user that may or may not be a correct answer to the query, we determine whether or not there exists some better answer within the search engine. The approach we take is to test whether the user's provided answer can be used for relevance feedback in order to improve the ability of the search engine to better answer the user's query. If this is indeed the case, then we determine that the original answer provided by the user is good enough and there is no need to consider a better alternative. Otherwise, we decide that the best alternative that the search engine can provide should be considered as a better answer. Using a simulation based evaluation, we demonstrate that, our approach provides a better decision making solution to this problem, compared to several other alternatives.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126821202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rank-at-a-Time Query Processing","authors":"Ahmed Elbagoury, Matt Crane, Jimmy J. Lin","doi":"10.1145/2970398.2970434","DOIUrl":"https://doi.org/10.1145/2970398.2970434","url":null,"abstract":"Query processing strategies for ranked retrieval have been studied for decades. In this paper we propose a new strategy, which we call rank-at-a-time query processing, that evaluates documents in descending order of quantized scores and is able to directly compute the final document ranking via a sequence of boolean intersections. We show that such a strategy is equivalent to a second-order restricted composition of per-term scores. Rank-at-a-time query processing has the advantage that it is anytime score-safe, which means that the retrieval algorithm can self-adapt to produce an exact ranking given an arbitrary latency constraint. Due to the combinatorial nature of compositions, however, a naive implementation is too slow to be of practical use. To address this issue, we introduce a hybrid variant that is able to reduce query latency to a point that is on par with state-of-the-art retrieval engines.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126846169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}