Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval最新文献_第3页

Resilient Retrieval Models for Large Collection 大数据集的弹性检索模型

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591793

Dipannita Podder

{"title":"Resilient Retrieval Models for Large Collection","authors":"Dipannita Podder","doi":"10.1145/3539618.3591793","DOIUrl":"https://doi.org/10.1145/3539618.3591793","url":null,"abstract":"Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker should address the problems in milliseconds. One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive. Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al. [1] propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. But this model is explicitly specific to the language model. Considering the above research gaps, we focus on the following research questions in this doctoral work. Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121145947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Online Conversion Rate Prediction via Neural Satellite Networks in Delayed Feedback Advertising 基于神经卫星网络的延迟反馈广告在线转化率预测

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591747

Qiming Liu, Haoming Li, Xiang Ao, Yuyao Guo, Zhi Dong, Ruobing Zhang, Qiong Chen, Jianfeng Tong, Qing He

{"title":"Online Conversion Rate Prediction via Neural Satellite Networks in Delayed Feedback Advertising","authors":"Qiming Liu, Haoming Li, Xiang Ao, Yuyao Guo, Zhi Dong, Ruobing Zhang, Qiong Chen, Jianfeng Tong, Qing He","doi":"10.1145/3539618.3591747","DOIUrl":"https://doi.org/10.1145/3539618.3591747","url":null,"abstract":"The delayed feedback is becoming one of the main obstacles in online advertising due to the pervasive deployment of the cost-per-conversion display strategy requesting a real-time conversion rate (CVR) prediction. It makes the observed data contain a large number of fake negatives that temporarily have no feedback but will convert later. Training on such biased data distribution would severely harm the performance of models. Prevailing approaches wait for a set period of time to see if samples convert before training on them, but solutions to guaranteeing data freshness remain under-explored by current research. In this work, we propose Delayed Feed-back modeling via neural Satellite Networks (DFSN for short) for online CVR prediction. It tackles the issue of data freshness to permit adaptive waiting windows. We first assign a long waiting window for our main model to cover most of conversions and greatly reduce fake negatives. Meanwhile, two kinds of satellite models are devised to learn from the latest data, and online transfer learning techniques are utilized to sufficiently exploit their knowledge. With information from satellites, our main model can deal with the issue of data freshness, achieving better performance than previous methods. Extensive experiments on two real-world advertising datasets demonstrate the superiority of our model.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121630772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Dense Passage Retrieval: Architectures and Augmentation Methods 密集通道检索:体系结构和增强方法

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591796

Thilina C. Rajapakse

{"title":"Dense Passage Retrieval: Architectures and Augmentation Methods","authors":"Thilina C. Rajapakse","doi":"10.1145/3539618.3591796","DOIUrl":"https://doi.org/10.1145/3539618.3591796","url":null,"abstract":"The dual-encoder model is a dense retrieval architecture, consisting of two encoder models, that has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. But, room exists for improvement, particularly when dense retrievers are exposed to unseen passages or queries. Considering out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy may be significant. A main factor for this is the mismatch in the information available to the context encoder and the query encoder during training. Common retireval training datasets contain an overwhelming majority of passages with one query from a passage. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query from a given passage to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that have multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on (mostly) single query per passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2]. Language can be considered another domain in the context of a dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1)Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.2 Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.3) Does clustered training help multilingual DPR models to generalize to new languages (zero-shot)? I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance using the Mr. TyDi [3] dataset. Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual encoder models can surpass traditional sparse retrieval methods, they lag behind two stage retrieval pipelines in retrieval quality. I propose a modification to the dual encoder model where a second representation is used to rerank the passag","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126352041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Neural Methods for Cross-Language Information Retrieval 跨语言信息检索的神经方法

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3594244

Eugene Yang, Dawn J Lawrie, J. Mayfield, Suraj Nair, Douglas W. Oard

引用次数: 0

A Retrieval System for Images and Videos based on Aesthetic Assessment of Visuals 基于视觉审美评价的图像视频检索系统

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591817

Daniel Vera Nieto, Saikishore Kalloori, Fabio Zund, Clara Fernandez Labrador, Marc Willhaus, Severin Klingler, M. Gross

引用次数: 0

Connecting Unseen Domains: Cross-Domain Invariant Learning in Recommendation 连接看不见的领域:推荐中的跨领域不变学习

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591965

Yang Zhang, Yue Shen, Dong Wang, Jinjie Gu, Guannan Zhang

{"title":"Connecting Unseen Domains: Cross-Domain Invariant Learning in Recommendation","authors":"Yang Zhang, Yue Shen, Dong Wang, Jinjie Gu, Guannan Zhang","doi":"10.1145/3539618.3591965","DOIUrl":"https://doi.org/10.1145/3539618.3591965","url":null,"abstract":"As web applications continue to expand and diversify their services, user interactions exist in different scenarios. To leverage this wealth of information, cross-domain recommendation (CDR) has gained significant attention in recent years. However, existing CDR approaches mostly focus on information transfer between observed domains, with little attention paid to generalizing to unseen domains. Although recent research on invariant learning can help for the purpose of generalization, relying only on invariant preference may be overly conservative and result in mediocre performance when the unseen domain shifts slightly. In this paper, we present a novel framework that considers both CDR and domain generalization through a united causal invariant view. We assume that user interactions are determined by domain-invariant preference and domain-specific preference. The proposed approach differentiates the invariant preference and the specific preference from observational behaviors in a way of adversarial learning. Additionally, a novel domain routing module is designed to connect unseen domains to observed domains. Extensive experiments on public and industry datasets have proved the effectiveness of the proposed approach under both CDR and domain generalization settings.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127926778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Generic Learning Framework for Sequential Recommendation with Distribution Shifts 具有分布移位的顺序推荐的通用学习框架

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591624

Zhengyi Yang, Xiangnan He, Jizhi Zhang, Jiancan Wu, Xin Xin, Jiawei Chen, Xiang Wang

{"title":"A Generic Learning Framework for Sequential Recommendation with Distribution Shifts","authors":"Zhengyi Yang, Xiangnan He, Jizhi Zhang, Jiancan Wu, Xin Xin, Jiawei Chen, Xiang Wang","doi":"10.1145/3539618.3591624","DOIUrl":"https://doi.org/10.1145/3539618.3591624","url":null,"abstract":"Leading sequential recommendation (SeqRec) models adopt empirical risk minimization (ERM) as the learning framework, which inherently assumes that the training data (historical interaction sequences) and the testing data (future interactions) are drawn from the same distribution. However, such i.i.d. assumption hardly holds in practice, due to the online serving and dynamic nature of recommender system.For example, with the streaming of new data, the item popularity distribution would change, and the user preference would evolve after consuming some items. Such distribution shifts could undermine the ERM framework, hurting the model's generalization ability for future online serving. In this work, we aim to develop a generic learning framework to enhance the generalization of recommenders in the dynamic environment. Specifically, on top of ERM, we devise a Distributionally Robust Optimization mechanism for SeqRec (DROS). At its core is our carefully-designed distribution adaption paradigm, which considers the dynamics of data distribution and explores possible distribution shifts between training and testing. Through this way, we can endow the backbone recommenders with better generalization ability.It is worth mentioning that DROS is an effective model-agnostic learning framework, which is applicable to general recommendation scenarios.Theoretical analyses show that DROS enables the backbone recommenders to achieve robust performance in future testing data.Empirical studies verify the effectiveness against dynamic distribution shifts of DROS. Codes are anonymously open-sourced at https://github.com/YangZhengyi98/DROS.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130477236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

SciHarvester: Searching Scientific Documents for Numerical Values SciHarvester:搜索科学文献的数值

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591808

Maciej Rybiński, Stephen Wan, Sarvnaz Karimi, Cécile Paris, Brian Jin, N. Huth, Peter Thorburn, D. Holzworth

引用次数: 0

Facebook Content Search: Efficient and Effective Adapting Search on A Large Scale Facebook内容搜索:高效和有效的大规模适应搜索

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591840

Xiangyu Niu, Yu-Wei Wu, Xiao Lu, G. Nagpal, Philip Pronin, Kecheng Hao, Zhen Liao, Guangdeng Liao

引用次数: 0

HeteroCS: A Heterogeneous Community Search System With Semantic Explanation 异构社区搜索系统:一个具有语义解释的异构社区搜索系统

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI: 10.1145/3539618.3591812

Weibin Cai, Fanwei Zhu, Zemin Liu, Ming-hui Wu

{"title":"HeteroCS: A Heterogeneous Community Search System With Semantic Explanation","authors":"Weibin Cai, Fanwei Zhu, Zemin Liu, Ming-hui Wu","doi":"10.1145/3539618.3591812","DOIUrl":"https://doi.org/10.1145/3539618.3591812","url":null,"abstract":"Community search, which looks for query-dependent communities in a graph, is an important task in graph analysis. Existing community search studies address the problem by finding a densely-connected subgraph containing the query. However, many real-world networks are heterogeneous with rich semantics. Queries in heterogeneous networks generally involve in multiple communities with different semantic connections, while returning a single community with mixed semantics has limited applications. In this paper, we revisit the community search problem on heterogeneous networks and introduce a novel paradigm of heterogeneous community search and ranking. We propose to automatically discover the query semantics to enable the search of different semantic communities and develop a comprehensive community evaluation model to support the ranking of results. We build HeteroCS, a heterogeneous community search system with semantic explanation, upon our semantic community model, and deploy it on two real-world graphs. We present a demonstration case to illustrate the novelty and effectiveness of the system.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115556578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0