{"title":"Resilient Retrieval Models for Large Collection","authors":"Dipannita Podder","doi":"10.1145/3539618.3591793","DOIUrl":"https://doi.org/10.1145/3539618.3591793","url":null,"abstract":"Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker should address the problems in milliseconds. One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive. Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al. [1] propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. But this model is explicitly specific to the language model. Considering the above research gaps, we focus on the following research questions in this doctoral work. Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? 
To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121145947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
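The abstract above describes estimating query-term centrality by correlating a BERT representation of the whole query with representations of its individual terms. Below is a minimal sketch of one way such a score could be computed, assuming bert-base-uncased, mean pooling over token states, cosine similarity as the correlation measure, and whitespace term splitting; the paper's exact pooling and scoring choices are not specified here.

```python
# Hypothetical term-centrality scoring with a pre-trained BERT model.
# Assumptions (not from the paper): mean pooling, cosine similarity,
# whitespace-delimited query terms.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled contextual embedding of a piece of text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

def term_centrality(query: str) -> dict:
    """Score each term by its similarity to the whole-query vector."""
    query_vec = embed(query)
    return {
        term: torch.cosine_similarity(query_vec, embed(term), dim=0).item()
        for term in query.split()
    }

print(term_centrality("identify treatments recommended for seasonal influenza please"))
```

Such scores could then be plugged into the first-stage ranker, for instance by down-weighting the term-frequency contribution of low-centrality terms in a BM25-style function.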
{"title":"Online Conversion Rate Prediction via Neural Satellite Networks in Delayed Feedback Advertising","authors":"Qiming Liu, Haoming Li, Xiang Ao, Yuyao Guo, Zhi Dong, Ruobing Zhang, Qiong Chen, Jianfeng Tong, Qing He","doi":"10.1145/3539618.3591747","DOIUrl":"https://doi.org/10.1145/3539618.3591747","url":null,"abstract":"The delayed feedback is becoming one of the main obstacles in online advertising due to the pervasive deployment of the cost-per-conversion display strategy requesting a real-time conversion rate (CVR) prediction. It makes the observed data contain a large number of fake negatives that temporarily have no feedback but will convert later. Training on such biased data distribution would severely harm the performance of models. Prevailing approaches wait for a set period of time to see if samples convert before training on them, but solutions to guaranteeing data freshness remain under-explored by current research. In this work, we propose Delayed Feed-back modeling via neural Satellite Networks (DFSN for short) for online CVR prediction. It tackles the issue of data freshness to permit adaptive waiting windows. We first assign a long waiting window for our main model to cover most of conversions and greatly reduce fake negatives. Meanwhile, two kinds of satellite models are devised to learn from the latest data, and online transfer learning techniques are utilized to sufficiently exploit their knowledge. With information from satellites, our main model can deal with the issue of data freshness, achieving better performance than previous methods. Extensive experiments on two real-world advertising datasets demonstrate the superiority of our model.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121630772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense Passage Retrieval: Architectures and Augmentation Methods","authors":"Thilina C. Rajapakse","doi":"10.1145/3539618.3591796","DOIUrl":"https://doi.org/10.1145/3539618.3591796","url":null,"abstract":"The dual-encoder model is a dense retrieval architecture, consisting of two encoder models, that has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. But, room exists for improvement, particularly when dense retrievers are exposed to unseen passages or queries. Considering out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy may be significant. A main factor for this is the mismatch in the information available to the context encoder and the query encoder during training. Common retireval training datasets contain an overwhelming majority of passages with one query from a passage. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query from a given passage to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that have multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on (mostly) single query per passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2]. Language can be considered another domain in the context of a dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1)Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.2 Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.3) Does clustered training help multilingual DPR models to generalize to new languages (zero-shot)? I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance using the Mr. TyDi [3] dataset. Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual encoder models can surpass traditional sparse retrieval methods, they lag behind two stage retrieval pipelines in retrieval quality. 
I propose a modification to the dual encoder model where a second representation is used to rerank the passag","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126352041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
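For intuition on the dual-encoder training discussed above, here is a minimal sketch of contrastive training with in-batch negatives in which several queries may share the same positive passage, as in a multi-query-per-passage dataset. The encoders, batch construction, and dimensionality are placeholders, not the author's actual setup.

```python
# Dual-encoder (DPR-style) contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def dpr_loss(query_emb: torch.Tensor,      # (num_queries, dim)
             passage_emb: torch.Tensor,    # (num_passages, dim)
             positive_idx: torch.Tensor):  # (num_queries,) gold passage index
    """Cross-entropy over dot-product scores; multiple queries in the batch
    may point to the same positive passage."""
    scores = query_emb @ passage_emb.T
    return F.cross_entropy(scores, positive_idx)

# Example: 4 queries over 3 passages; queries 0 and 1 were generated from
# the same passage (index 0), mimicking a multi-query-per-passage dataset.
q = torch.randn(4, 768, requires_grad=True)
p = torch.randn(3, 768, requires_grad=True)
loss = dpr_loss(q, p, torch.tensor([0, 0, 1, 2]))
loss.backward()
```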
{"title":"Neural Methods for Cross-Language Information Retrieval","authors":"Eugene Yang, Dawn J Lawrie, J. Mayfield, Suraj Nair, Douglas W. Oard","doi":"10.1145/3539618.3594244","DOIUrl":"https://doi.org/10.1145/3539618.3594244","url":null,"abstract":"This half day tutorial introduces the participant to the basic concepts underlying neural Cross-Language Information Retrieval (CLIR). It discusses the most common algorithmic approaches to CLIR, focusing on modern neural methods; the history of CLIR; where to find and how to use CLIR training collections, test collections and baseline systems; how CLIR training and test collections are constructed; and open research questions in CLIR.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125630587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Retrieval System for Images and Videos based on Aesthetic Assessment of Visuals","authors":"Daniel Vera Nieto, Saikishore Kalloori, Fabio Zund, Clara Fernandez Labrador, Marc Willhaus, Severin Klingler, M. Gross","doi":"10.1145/3539618.3591817","DOIUrl":"https://doi.org/10.1145/3539618.3591817","url":null,"abstract":"Attractive images or videos are the visual backbones of journalism and social media to gain the user's attention. From trailers to teaser images to image galleries, appealing visuals have only grown in importance over the years. However, selecting eye-catching shots from a video or the perfect image from large image collections is a challenging and time-consuming task. We present our tool that can assess image and video content from an aesthetic standpoint. We discovered that it is possible to perform such an assessment by combining expert knowledge with data-driven information. We combine the relevant aesthetic features and machine learning algorithms into an aesthetics retrieval system, which enables users to sort uploaded visuals based on an aesthetic score and interact with additional photographic, cinematic, and person-specific features.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121948920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Connecting Unseen Domains: Cross-Domain Invariant Learning in Recommendation","authors":"Yang Zhang, Yue Shen, Dong Wang, Jinjie Gu, Guannan Zhang","doi":"10.1145/3539618.3591965","DOIUrl":"https://doi.org/10.1145/3539618.3591965","url":null,"abstract":"As web applications continue to expand and diversify their services, user interactions exist in different scenarios. To leverage this wealth of information, cross-domain recommendation (CDR) has gained significant attention in recent years. However, existing CDR approaches mostly focus on information transfer between observed domains, with little attention paid to generalizing to unseen domains. Although recent research on invariant learning can help for the purpose of generalization, relying only on invariant preference may be overly conservative and result in mediocre performance when the unseen domain shifts slightly. In this paper, we present a novel framework that considers both CDR and domain generalization through a united causal invariant view. We assume that user interactions are determined by domain-invariant preference and domain-specific preference. The proposed approach differentiates the invariant preference and the specific preference from observational behaviors in a way of adversarial learning. Additionally, a novel domain routing module is designed to connect unseen domains to observed domains. Extensive experiments on public and industry datasets have proved the effectiveness of the proposed approach under both CDR and domain generalization settings.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127926778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Generic Learning Framework for Sequential Recommendation with Distribution Shifts","authors":"Zhengyi Yang, Xiangnan He, Jizhi Zhang, Jiancan Wu, Xin Xin, Jiawei Chen, Xiang Wang","doi":"10.1145/3539618.3591624","DOIUrl":"https://doi.org/10.1145/3539618.3591624","url":null,"abstract":"Leading sequential recommendation (SeqRec) models adopt empirical risk minimization (ERM) as the learning framework, which inherently assumes that the training data (historical interaction sequences) and the testing data (future interactions) are drawn from the same distribution. However, such i.i.d. assumption hardly holds in practice, due to the online serving and dynamic nature of recommender system.For example, with the streaming of new data, the item popularity distribution would change, and the user preference would evolve after consuming some items. Such distribution shifts could undermine the ERM framework, hurting the model's generalization ability for future online serving. In this work, we aim to develop a generic learning framework to enhance the generalization of recommenders in the dynamic environment. Specifically, on top of ERM, we devise a Distributionally Robust Optimization mechanism for SeqRec (DROS). At its core is our carefully-designed distribution adaption paradigm, which considers the dynamics of data distribution and explores possible distribution shifts between training and testing. Through this way, we can endow the backbone recommenders with better generalization ability.It is worth mentioning that DROS is an effective model-agnostic learning framework, which is applicable to general recommendation scenarios.Theoretical analyses show that DROS enables the backbone recommenders to achieve robust performance in future testing data.Empirical studies verify the effectiveness against dynamic distribution shifts of DROS. Codes are anonymously open-sourced at https://github.com/YangZhengyi98/DROS.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130477236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SciHarvester: Searching Scientific Documents for Numerical Values","authors":"Maciej Rybiński, Stephen Wan, Sarvnaz Karimi, Cécile Paris, Brian Jin, N. Huth, Peter Thorburn, D. Holzworth","doi":"10.1145/3539618.3591808","DOIUrl":"https://doi.org/10.1145/3539618.3591808","url":null,"abstract":"A challenge for search technologies is to support scientific literature surveys that present overviews of the reported numerical values documented for specific physical properties. We present SciHarvester, a system tailored to address this problem for agronomic science. It provides an interface to search PubAg documents, allowing complex queries involving restrictions on numerical values. SciHarvester identifies relevant documents and generates overview of reported parameter values. The system allows interrogation of the results to explain the system's performance. Our evaluations demonstrate the promise of incorporating information extraction techniques with the use of neural scoring mechanisms.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134434388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facebook Content Search: Efficient and Effective Adapting Search on A Large Scale","authors":"Xiangyu Niu, Yu-Wei Wu, Xiao Lu, G. Nagpal, Philip Pronin, Kecheng Hao, Zhen Liao, Guangdeng Liao","doi":"10.1145/3539618.3591840","DOIUrl":"https://doi.org/10.1145/3539618.3591840","url":null,"abstract":"Facebook content search is a critical channel that enables people to discover the best content to deepen their engagement with friends and family, creators, and communities. Building a highly personalized search engine to serve billions of daily active users to find the best results from a large scale of candidates is a challenging task. The search engine must take multiple dimensions into consideration, including different content types, different query intents, and user social graph, etc. In this paper, we discuss the challenges of Facebook content search in depth, and then describe our novel approach to efficiently handling a massive number of documents with advanced query understanding, retrieval, and machine learning techniques. The proposed system has been fully verified and applied to the production system of Facebook Search, which serves billions of users.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"349 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131160659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HeteroCS: A Heterogeneous Community Search System With Semantic Explanation","authors":"Weibin Cai, Fanwei Zhu, Zemin Liu, Ming-hui Wu","doi":"10.1145/3539618.3591812","DOIUrl":"https://doi.org/10.1145/3539618.3591812","url":null,"abstract":"Community search, which looks for query-dependent communities in a graph, is an important task in graph analysis. Existing community search studies address the problem by finding a densely-connected subgraph containing the query. However, many real-world networks are heterogeneous with rich semantics. Queries in heterogeneous networks generally involve in multiple communities with different semantic connections, while returning a single community with mixed semantics has limited applications. In this paper, we revisit the community search problem on heterogeneous networks and introduce a novel paradigm of heterogeneous community search and ranking. We propose to automatically discover the query semantics to enable the search of different semantic communities and develop a comprehensive community evaluation model to support the ranking of results. We build HeteroCS, a heterogeneous community search system with semantic explanation, upon our semantic community model, and deploy it on two real-world graphs. We present a demonstration case to illustrate the novelty and effectiveness of the system.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115556578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}