{"title":"Neural Models for Full Text Search","authors":"Nick Craswell","doi":"10.1145/3018661.3042065","DOIUrl":"https://doi.org/10.1145/3018661.3042065","url":null,"abstract":"A fundamental concern in search engines is to determine which documents have the best content for satisfying the user, based on analysis of the user's query and the text of documents. For this query-content match, many learning to rank systems make use of IR features developed in the 1990s in the TREC framework. Such features are still important in a variety of search tasks, and particularly in the long tail where clicks, links and social media signals become sparse. I will present our current progress, in particular three different neural models, with the goal of surpassing the 1990s models in full text search. This will include evidence, using proprietary Bing datasets, that large-scale training data can be useful. I will also argue that for the field to make progress on query-content relevance modeling, it may be valuable to set up a shared blind evaluation similar to 1990s TREC, possibly with large-scale training data.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"491 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133084543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Representation Learning with Pair-wise Constraints for Collaborative Ranking","authors":"Fuzhen Zhuang, Dan Luo, Nicholas Jing Yuan, Xing Xie, Qing He","doi":"10.1145/3018661.3018720","DOIUrl":"https://doi.org/10.1145/3018661.3018720","url":null,"abstract":"Last decades have witnessed a vast amount of interest and research in recommendation systems. Collaborative filtering, which uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users, is one of the most successful approaches to build recommendation systems. Most previous collaborative filtering approaches employ the matrix factorization techniques to learn latent user feature profiles and item feature profiles. Also many subsequent works are proposed to incorporate users' social network information and items' attributions to further improve recommendation performance under the matrix factorization framework. However, the matrix factorization based methods may not make full use of the rating information, leading to unsatisfying performance. Recently deep learning has been approved to be able to find good representations in natural language processing, image classification, and so on. Along this line, we propose a collaborative ranking framework via representation learning with pair-wise constraints (REAP for short), in which autoencoder is used to simultaneously learn the latent factors of both users and items and pair-wise ranked loss defined by (user, item) pairs is considered. Extensive experiments are conducted on five data sets to demonstrate the effectiveness of the proposed framework.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122058009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning from User Interactions in Personal Search via Attribute Parameterization","authors":"Michael Bendersky, Xuanhui Wang, Donald Metzler, Marc Najork","doi":"10.1145/3018661.3018712","DOIUrl":"https://doi.org/10.1145/3018661.3018712","url":null,"abstract":"User interaction data (e.g., click data) has proven to be a powerful signal for learning-to-rank models in web search. However, such models require observing multiple interactions across many users for the same query-document pair to achieve statistically meaningful gains. Therefore, utilizing user interaction data for improving search over personal, rather than public, content is a challenging problem. First, the documents (e.g., emails or private files) are not shared across users. Second, user search queries are of personal nature (e.g., \"alice's address\") and may not generalize well across users. In this paper, we propose a solution to these challenges, by projecting user queries and documents into a multi-dimensional space of fine-grained and semantically coherent attributes. We then introduce a novel parameterization technique to overcome sparsity in the multi-dimensional attribute space. Attribute parameterization enables effective usage of cross-user interactions for improving personal search quality -- which is a first such published result, to the best of our knowledge. Experiments with a dataset derived from interactions of users of one of the world's largest personal search engines demonstrate the effectiveness of the proposed attribute parameterization technique.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122368132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concept Embedded Convolutional Semantic Model for Question Retrieval","authors":"P. Wang, Yong Zhang, Lei Ji, Jun Yan, Lianwen Jin","doi":"10.1145/3018661.3018687","DOIUrl":"https://doi.org/10.1145/3018661.3018687","url":null,"abstract":"The question retrieval, which aims to find similar questions of a given question, is playing pivotal role in various question answering (QA) systems. This task is quite challenging mainly on three aspects: lexical gap, polysemy and word order. In this paper, we propose a unified framework to simultaneously handle these three problems. We use word combined with corresponding concept information to handle the polysemous problem. The concept embedding and word embedding are learned at the same time from both context-dependent and context-independent view. The lexical gap problem is handled since the semantic information has been encoded into the embedding. Then, we propose to use a high-level feature embedded convolutional semantic model to learn the question embedding by inputting the concept embedding and word embedding without manually labeling training data. The proposed framework nicely represent the hierarchical structures of word information and concept information in sentences with their layer-by-layer composition and pooling. Finally, the framework is trained in a weakly-supervised manner on question answer pairs, which can be directly obtained without manually labeling. Experiments on two real question answering datasets show that the proposed framework can significantly outperform the state-of-the-art solutions.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117027305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Location Influence in Location-based Social Networks","authors":"M. Saleem, Rohit Kumar, T. Calders, Xike Xie, T. Pedersen","doi":"10.1145/3018661.3018705","DOIUrl":"https://doi.org/10.1145/3018661.3018705","url":null,"abstract":"Location-based social networks (LBSN) are social networks complemented with location data such as geo-tagged activity data of its users. In this paper, we study how users of a LBSN are navigating between locations and based on this information we select the most influential locations. In contrast to existing works on influence maximization, we are not per se interested in selecting the users with the largest set of friends or the set of locations visited by the most users; instead, we introduce a notion of location influence that captures the ability of a set of locations to reach out geographically. We provide an exact on-line algorithm and a more memory-efficient but approximate variant based on the HyperLogLog sketch to maintain a data structure called Influence Oracle that allows to efficiently find a top-k set of influential locations. Experiments show that our algorithms are efficient and scalable and that our new location influence notion favors diverse sets of locations with a large geographical spread.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124342659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Concise Integer Linear Programming Formulation for Implicit Search Result Diversification","authors":"Haitao Yu, A. Jatowt, Roi Blanco, Hideo Joho, J. Jose, Long Chen, Fajie Yuan","doi":"10.1145/3018661.3018710","DOIUrl":"https://doi.org/10.1145/3018661.3018710","url":null,"abstract":"To cope with ambiguous and/or underspecified queries, search result diversification (SRD) is a key technique that has attracted a lot of attention. This paper focuses on implicit SRD, where the possible subtopics underlying a query are unknown beforehand. We formulate implicit SRD as a process of selecting and ranking k exemplar documents that utilizes integer linear programming (ILP). Unlike the common practice of relying on approximate methods, this formulation enables us to obtain the optimal solution of the objective function. Based on four benchmark collections, our extensive empirical experiments reveal that: (1) The factors, such as different initial runs, the number of input documents, query types and the ways of computing document similarity significantly affect the performance of diversification models. Careful examinations of these factors are highly recommended in the development of implicit SRD methods. (2) The proposed method can achieve substantially improved performance over the state-of-the-art unsupervised methods for implicit SRD.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126644105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating Illustrative Snippets for Open Data on the Web","authors":"Gong Cheng, Cheng Jin, Wentao Ding, Danyun Xu, Yuzhong Qu","doi":"10.1145/3018661.3018670","DOIUrl":"https://doi.org/10.1145/3018661.3018670","url":null,"abstract":"To embrace the open data movement, increasingly many datasets have been published on the Web to be reused. Users, when assessing the usefulness of an unfamiliar dataset, need means to quickly inspect its contents. To satisfy the needs, we propose to automatically extract an optimal small portion from a dataset, called a snippet, to concisely illustrate the contents of the dataset. We consider the quality of a snippet from three aspects: coverage, familiarity, and cohesion, which are jointly formulated in a new combinatorial optimization problem called the maximum-weight-and-coverage connected graph problem (MwcCG). We give a constant-factor approximation algorithm for this NP-hard problem, and experiment with our solution on real-world datasets. Our quantitative analysis and user study show that our approach outperforms a baseline approach.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126533004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning at Amazon","authors":"R. Herbrich","doi":"10.1145/3018661.3022764","DOIUrl":"https://doi.org/10.1145/3018661.3022764","url":null,"abstract":"In this talk I will give an introduction into the field of machine learning and discuss why it is a crucial technology for Amazon. Machine learning is the science of automatically extracting patterns from data in order to make automated predictions of future data. One way to differentiate machine learning tasks is by the following two factors: (1) How much noise is contained in the data? and (2) How far into the future is the prediction task? The former presents a limit to the learnability of task --- regardless which learning algorithm is used --- whereas the latter has a crucial implication on the representation of the predictions: while most tasks in search and advertising typically only forecast minutes into the future, tasks in e-commerce can require predictions up to a year into the future. The further the forecast horizon, the more important it is to take account of uncertainty in both the learning algorithm and the representation of the predictions. I will discuss which learning frameworks are best suited for the various scenarios, that is, short-term predictions with little noise vs. long-term predictions with lots of noise, and present some ideas to combine representation learning with probabilistic methods. In the second half of the talk, I will give an overview of the applications of machine learning at Amazon ranging from demand forecasting, machine translation to automation of computer vision tasks and robotics. I will also discuss the importance of tools for data scientist and share learnings on bringing machine learning algorithms into products.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121576011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Text Embeddings for Information Retrieval","authors":"Bhaskar Mitra, Nick Craswell","doi":"10.1145/3018661.3022755","DOIUrl":"https://doi.org/10.1145/3018661.3022755","url":null,"abstract":"In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing tasks, such as language modelling and machine translation. This suggests that neural models will also achieve good performance on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using a semantic rather than lexical matching. Although initial iterations of neural models do not outperform traditional lexical-matching baselines, the level of interest and effort in this area is increasing, potentially leading to a breakthrough. The popularity of the recent SIGIR 2016 workshop on Neural Information Retrieval provides evidence to the growing interest in neural models for IR. While recent tutorials have covered some aspects of deep learning for retrieval tasks, there is a significant scope for organizing a tutorial that focuses on the fundamentals of representation learning for text retrieval. The goal of this tutorial will be to introduce state-of-the-art neural embedding models and bridge the gap between these neural models with early representation learning approaches in IR (e.g., LSA). We will discuss some of the key challenges and insights in making these models work in practice, and demonstrate one of the toolsets available to researchers interested in this area.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123218046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Click Through Rate Prediction for Local Search Results","authors":"Fidel Cacheda, Nicola Barbieri, Roi Blanco","doi":"10.1145/3018661.3018683","DOIUrl":"https://doi.org/10.1145/3018661.3018683","url":null,"abstract":"With the ubiquity of internet access and location services provided by smartphone devices, the volume of queries issued by users to find products and services that are located near them is rapidly increasing. Local search engines help users in this task by matching queries with a predefined geographical connotation (\"local queries\") against a database of local business listings. Local search differs from traditional web-search because to correctly capture users' click behavior, the estimation of relevance between query and candidate results must be integrated with geographical signals, such as distance. The intuition is that users prefer businesses that are physically closer to them. However, this notion of closeness is likely to depend upon other factors, like the category of the business, the quality of the service provided, the density of businesses in the area of interest, etc. In this paper we perform an extensive analysis of online users' behavior and investigate the problem of estimating the click-through rate on local search (LCTR) by exploiting the combination of standard retrieval methods with a rich collection of geo and business-dependent features. We validate our approach on a large log collected from a real-world local search service. Our evaluation shows that the non-linear combination of business information, geo-local and textual relevance features leads to a significant improvements over state of the art alternative approaches based on a combination of relevance, distance and business reputation.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124926352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}