Linkai Weng, Zhiwei Li, Rui Cai, Yaoxue Zhang, Yuezhi Zhou, L. Yang, Lei Zhang
{"title":"Query by document via a decomposition-based two-level retrieval approach","authors":"Linkai Weng, Zhiwei Li, Rui Cai, Yaoxue Zhang, Yuezhi Zhou, L. Yang, Lei Zhang","doi":"10.1145/2009916.2009985","DOIUrl":"https://doi.org/10.1145/2009916.2009985","url":null,"abstract":"Retrieving similar documents from a large-scale text corpus according to a given document is a fundamental technique for many applications. However, most existing indexing techniques have difficulty addressing this problem due to special properties of a document query, e.g. high dimensionality, sparse representation and semantic concern. To address this problem, we propose a two-level retrieval solution based on a document decomposition idea. A document is decomposed into a compact vector and a few document-specific keywords by a dimension reduction approach. The compact vector embodies the major semantics of a document, and the document-specific keywords complement the discriminative power lost in the dimension reduction process. We adopt locality sensitive hashing (LSH) to index the compact vectors, which guarantees that a set of related documents can be found quickly according to the vector of a query document. Then we re-rank documents in this set by their document-specific keywords. In experiments, we obtained promising results on various datasets in terms of both accuracy and performance. We demonstrated that this solution is able to index a large-scale corpus for efficient similarity-based document retrieval.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124178374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified framework for recommendations based on quaternary semantic analysis","authors":"Wei Chen, W. Hsu, M. Lee","doi":"10.1145/2009916.2010052","DOIUrl":"https://doi.org/10.1145/2009916.2010052","url":null,"abstract":"Social network systems such as Facebook and YouTube have played a significant role in capturing both explicit and implicit user preferences for different items in the form of ratings and tags. This forms a quaternary relationship among users, items, tags and ratings. Existing systems have utilized only ternary relationships such as users-items-ratings, or users-items-tags to derive their recommendations. In this paper, we show that ternary relationships are insufficient to provide accurate recommendations. Instead, we model the quaternary relationship among users, items, tags and ratings as a 4-order tensor and cast the recommendation problem as a multi-way latent semantic analysis problem. A unified framework for user recommendation, item recommendation, tag recommendation and item rating prediction is proposed. The results of extensive experiments performed on a real-world dataset demonstrate that our unified framework outperforms the state-of-the-art techniques in all four recommendation tasks.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127963999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuaiqiang Wang, Byron J. Gao, Ke Wang, Hady W. Lauw
{"title":"Parallel learning to rank for information retrieval","authors":"Shuaiqiang Wang, Byron J. Gao, Ke Wang, Hady W. Lauw","doi":"10.1145/2009916.2010060","DOIUrl":"https://doi.org/10.1145/2009916.2010060","url":null,"abstract":"Learning to rank represents a category of effective ranking methods for information retrieval. While the primary concern of existing research has been accuracy, learning efficiency is becoming an important issue due to the unprecedented availability of large-scale training data and the need for continuous update of ranking functions. In this paper, we investigate parallel learning to rank, targeting simultaneous improvement in accuracy and efficiency.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115985872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-cut based tag enrichment","authors":"Xueming Qian, Xiansheng Hua","doi":"10.1145/2009916.2010074","DOIUrl":"https://doi.org/10.1145/2009916.2010074","url":null,"abstract":"In this paper, a graph-cut based tag enrichment approach is proposed. We build a graph for each image with its initial tags. The graph has two terminals. Nodes of the graph are fully connected with each other. A min-cut/max-flow algorithm is used to find the relevant tags for the image. Experiments on a Flickr dataset demonstrate the effectiveness of the proposed graph-cut based tag enrichment approach.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116799554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sample selection for dictionary-based corpus compression","authors":"C. Hoobin, S. Puglisi, J. Zobel","doi":"10.1145/2009916.2010087","DOIUrl":"https://doi.org/10.1145/2009916.2010087","url":null,"abstract":"Compression of large text corpora has the potential to drastically reduce both storage requirements and per-document access costs. Adaptive methods used for general-purpose compression are ineffective for this application, and historically the most successful methods have been based on word-based dictionaries, which allow use of global properties of the text. However, these are dependent on the text complying with assumptions about content and lead to dictionaries of unpredictable size. In recent work we have described an LZ-like approach in which sampled blocks of a corpus are used as a dictionary against which the complete corpus is compressed, giving compression twice as effective as that of zlib. Here we explore how pre-processing can be used to eliminate redundancy in our sampled dictionary. Our experiments show that dictionary size can be reduced by 50% or more (less than 0.1% of the collection size) with no significant effect on compression or access speed.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115401425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying similar people in professional social networks with discriminative probabilistic models","authors":"Suleyman Cetintas, Monica Rogati, Luo Si, Yi Fang","doi":"10.1145/2009916.2010123","DOIUrl":"https://doi.org/10.1145/2009916.2010123","url":null,"abstract":"Identifying similar professionals is an important task for many core services in professional social networks. Information about users can be obtained from heterogeneous information sources, and different sources provide different insights on user similarity. This paper proposes a discriminative probabilistic model that identifies latent content and graph classes for people with similar profile content and social graph similarity patterns, and learns a specialized similarity model for each latent class. To the best of our knowledge, this is the first work on identifying similar professionals in professional social networks, and the first work that identifies latent classes to learn a separate similarity model for each latent class. Experiments on a real-world dataset demonstrate the effectiveness of the proposed discriminative learning model.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114616971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming Ji, Jun Yan, Siyu Gu, Jiawei Han, Xiaofei He, Wei Vivian Zhang, Zheng Chen
{"title":"Learning search tasks in queries and web pages via graph regularization","authors":"Ming Ji, Jun Yan, Siyu Gu, Jiawei Han, Xiaofei He, Wei Vivian Zhang, Zheng Chen","doi":"10.1145/2009916.2009928","DOIUrl":"https://doi.org/10.1145/2009916.2009928","url":null,"abstract":"As the Internet grows explosively, search engines play a more and more important role for users in effectively accessing online information. Recently, it has been recognized that a query is often triggered by a search task that the user wants to accomplish. Similarly, many web pages are specifically designed to help accomplish a certain task. Therefore, learning hidden tasks behind queries and web pages can help search engines return the most useful web pages to users by task matching. For instance, the search task that triggers query \"thinkpad T410 broken\" is to maintain a computer, and it is desirable for a search engine to return the Lenovo troubleshooting page on the top of the list. However, existing search engine technologies mainly focus on topic detection or relevance ranking, which are not able to predict the task that triggers a query and the task a web page can accomplish. In this paper, we propose to simultaneously classify queries and web pages into the popular search tasks by exploiting their content together with click-through logs. Specifically, we construct a task-oriented heterogeneous graph among queries and web pages. Each pair of objects in the graph is linked together as long as they potentially share similar search tasks. A novel graph-based regularization algorithm is designed for search task prediction by leveraging the graph. Extensive experiments on real search log data demonstrate the effectiveness of our method over state-of-the-art classifiers, and the search performance can be significantly improved by using the task prediction results as additional information.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125350678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiaoling Liu, Eugene Agichtein, G. Dror, E. Gabrilovich, Y. Maarek, D. Pelleg, Idan Szpektor
{"title":"Predicting web searcher satisfaction with existing community-based answers","authors":"Qiaoling Liu, Eugene Agichtein, G. Dror, E. Gabrilovich, Y. Maarek, D. Pelleg, Idan Szpektor","doi":"10.1145/2009916.2009974","DOIUrl":"https://doi.org/10.1145/2009916.2009974","url":null,"abstract":"Community-based Question Answering (CQA) sites, such as Yahoo! Answers, Baidu Knows, Naver, and Quora, have been rapidly growing in popularity. The resulting archives of posted answers to questions, in Yahoo! Answers alone, already exceed 1 billion in size, and are aggressively indexed by web search engines. In fact, a large number of search engine users benefit from these archives, by finding existing answers that address their own queries. This scenario poses new challenges and opportunities for both search engines and CQA sites. To this end, we formulate a new problem of predicting the satisfaction of web searchers with CQA answers. We analyze a large number of web searches that result in a visit to a popular CQA site, and identify unique characteristics of searcher satisfaction in this setting, namely, the effects of query clarity, query-to-question match, and answer quality. We then propose and evaluate several approaches to predicting searcher satisfaction that exploit these characteristics. To the best of our knowledge, this is the first attempt to predict and validate the usefulness of CQA archives for external searchers, rather than for the original askers. Our results suggest promising directions for improving and exploiting community question answering services in pursuit of satisfying even more Web search queries.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128034239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recommending ephemeral items at web scale","authors":"Ye Chen, J. Canny","doi":"10.1145/2009916.2010051","DOIUrl":"https://doi.org/10.1145/2009916.2010051","url":null,"abstract":"We describe an innovative and scalable recommendation system successfully deployed at eBay. To build recommenders for long-tail marketplaces requires projection of volatile items into a persistent space of latent products. We first present a generative clustering model for collections of unstructured, heterogeneous, and ephemeral item data, under the assumption that items are generated from latent products. An item is represented as a vector of independently and distinctly distributed variables, while a latent product is characterized as a vector of probability distributions, respectively. The probability distributions are chosen as natural stochastic models for different types of data. The learning objective is to maximize the total intra-cluster coherence measured by the sum of log likelihoods of items under such a generative process. In the space of latent products, robust recommendations can then be derived using naive Bayes for ranking, from historical transactional data. Item-based recommendations are achieved by inferring latent products from unseen items. In particular, we develop a probabilistic scoring function of recommended items, which takes into account item-product membership, product purchase probability, and the important auction-end-time factor. With the holistic probabilistic measure of a prospective item purchase, one can further maximize the expected revenue and the more subjective user satisfaction as well. We evaluated the latent product clustering and recommendation ranking models using real-world e-commerce data from eBay, in both forms of offline simulation and online A/B testing. In the recent production launch, our system yielded a 3-5 fold improvement over the existing production system in click-through, purchase-through and gross merchandising value; it is thus now driving 100% of related recommendation traffic with billions of items at eBay. We believe that this work provides a practical yet principled framework for recommendation in domains with affluent user self-input data.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126704301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enver Kayaaslan, B. B. Cambazoglu, Roi Blanco, F. Junqueira, C. Aykanat
{"title":"Energy-price-driven query processing in multi-center web search engines","authors":"Enver Kayaaslan, B. B. Cambazoglu, Roi Blanco, F. Junqueira, C. Aykanat","doi":"10.1145/2009916.2010047","DOIUrl":"https://doi.org/10.1145/2009916.2010047","url":null,"abstract":"Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes on the challenge of reducing the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127493356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}