arXiv - CS - Information Retrieval: Latest Papers (Page 3)

Perceptions of Edinburgh: Capturing Neighbourhood Characteristics by Clustering Geoparsed Local News

arXiv - CS - Information Retrieval Pub Date : 2024-09-17 DOI: arxiv-2409.11505

Andreas Grivas, Claire Grover, Richard Tobin, Clare Llewellyn, Eleojo Oluwaseun Abubakar, Chunyu Zheng, Chris Dibben, Alan Marshall, Jamie Pearce, Beatrice Alex

Abstract: The communities that we live in affect our health in ways that are complex and hard to define. Moreover, our understanding of the place-based processes affecting health and inequalities is limited. This undermines the development of robust policy interventions to improve local health and well-being. News media provides social and community information that may be useful in health studies. Here we propose a methodology for characterising neighbourhoods by using local news articles. More specifically, we show how we can use Natural Language Processing (NLP) to unlock further information about neighbourhoods by analysing, geoparsing and clustering news articles. Our work is novel because we combine street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. We evaluate our outputs and show via a confluence of evidence, both from a qualitative and a quantitative perspective, that the themes we extract from news articles are sensible and reflect many characteristics of the real world. This is significant because it allows us to better understand the effects of neighbourhoods on health. Our findings on neighbourhood characterisation using news data will support a new generation of place-based research which examines a wider set of spatial processes and how they affect health, enabling new epidemiological research.

Citations: 0

TISIS: Trajectory Indexing for SImilarity Search

arXiv - CS - Information Retrieval Pub Date : 2024-09-17 DOI: arxiv-2409.11301

Sara Jarrad, Hubert Naacke, Stephane Gancarski

Abstract: Social media platforms enable users to share diverse types of information, including geolocation data that captures their movement patterns. Such geolocation data can be leveraged to reconstruct the trajectory of a user's visited Points of Interest (POIs). A key requirement in numerous applications is the ability to measure the similarity between such trajectories, as this facilitates the retrieval of trajectories that are similar to a given reference trajectory. This is the main focus of our work. Existing methods predominantly rely on applying a similarity function to each candidate trajectory to identify those that are sufficiently similar. However, this approach becomes computationally expensive when dealing with large-scale datasets. To mitigate this challenge, we propose TISIS, an efficient method that uses trajectory indexing to quickly find similar trajectories that share common POIs in the same order. Furthermore, to account for scenarios where POIs in trajectories may not exactly match but are contextually similar, we introduce TISIS*, a variant of TISIS that incorporates POI embeddings. This extension allows for more comprehensive retrieval of similar trajectories by considering semantic similarities between POIs, beyond mere exact matches. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms a baseline method based on the well-known Longest Common SubSequence (LCSS) algorithm, yielding substantial performance improvements across various real-world datasets.
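
The TISIS abstract contrasts an LCSS baseline, which scores every candidate trajectory, with an index that narrows the search to trajectories sharing POIs in the same order. The Python sketch below illustrates that general idea only and is not the authors' TISIS implementation: the function names (`lcss_length`, `build_poi_index`, `similar_trajectories`), the POI-to-trajectory inverted index, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def lcss_length(a, b):
    """Length of the longest common subsequence of two POI sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def build_poi_index(trajectories):
    """Map each POI to the set of trajectory ids that visit it (hypothetical index)."""
    index = defaultdict(set)
    for tid, pois in trajectories.items():
        for poi in pois:
            index[poi].add(tid)
    return index


def similar_trajectories(query, trajectories, index, min_common):
    """Return ids whose LCSS with the query is at least min_common.

    The index only narrows the candidate set; LCSS still checks POI order.
    """
    candidates = set()
    for poi in set(query):
        candidates |= index.get(poi, set())
    return sorted(
        tid for tid in candidates
        if lcss_length(query, trajectories[tid]) >= min_common
    )


# Toy data, invented for this example.
trajectories = {
    "t1": ["museum", "cafe", "park", "hotel"],
    "t2": ["cafe", "museum", "hotel"],
    "t3": ["beach", "bar"],
}
index = build_poi_index(trajectories)
result = similar_trajectories(["museum", "park", "hotel"], trajectories, index, 3)
print(result)
```

Here `t3` is never scored because it shares no POI with the query, which is the kind of saving an index is meant to provide on large datasets.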

Citations: 0

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10102

Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu

Abstract: Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). While much of the current research in this field focuses on performance optimization, particularly in terms of accuracy and efficiency, the trustworthiness of RAG systems remains an area still under exploration. From a positive perspective, RAG systems are promising to enhance LLMs by providing them with useful and up-to-date knowledge from vast external databases, thereby mitigating the long-standing problem of hallucination. While from a negative perspective, RAG systems are at the risk of generating undesirable contents if the retrieved information is either inappropriate or poorly utilized. To address these concerns, we propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we thoroughly review the existing literature on each dimension. Additionally, we create the evaluation benchmark regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Finally, we identify the potential challenges for future research based on our investigation results. Through this work, we aim to lay a structured foundation for future investigations and provide practical insights for enhancing the trustworthiness of RAG systems in real-world applications.

Citations: 0

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10173

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao

Citations: 0

beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10309

Vojtěch Vančura, Pavel Kordík, Milan Straka

Abstract: Recommender systems often use text-side information to improve their predictions, especially in cold-start or zero-shot recommendation scenarios, where traditional collaborative filtering approaches cannot be used. Many approaches to text-mining side information for recommender systems have been proposed over recent years, with sentence Transformers being the most prominent one. However, these models are trained to predict semantic similarity without utilizing interaction data with hidden patterns specific to recommender systems. In this paper, we propose beeFormer, a framework for training sentence Transformer models with interaction data. We demonstrate that our models trained with beeFormer can transfer knowledge between datasets while outperforming not only semantic similarity sentence Transformers but also traditional collaborative filtering methods. We also show that training on multiple datasets from different domains accumulates knowledge in a single model, unlocking the possibility of training universal, domain-agnostic sentence Transformer models to mine text representations for recommender systems. We release the source code, trained models, and additional details allowing replication of our experiments at https://github.com/recombee/beeformer.

Citations: 0

Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10343

Tianrui Song, Wenshuo Chao, Hao Liu

Abstract: Implicit feedback, often used to build recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to alleviate this by identifying noisy samples based on their diverged patterns, such as higher loss values, and mitigating the noise through sample dropping or reweighting. Despite the progress, we observe existing approaches struggle to distinguish hard samples and noise samples, as they often exhibit similar patterns, thereby limiting their effectiveness in denoising recommendations. To address this challenge, we propose a Large Language Model Enhanced Hard Sample Denoising (LLMHD) framework. Specifically, we construct an LLM-based scorer to evaluate the semantic consistency of items with the user preference, which is quantified based on summarized historical user interactions. The resulting scores are used to assess the hardness of samples for the pointwise or pairwise training objectives. To ensure efficiency, we introduce a variance-based sample pruning strategy to filter potential hard samples before scoring. Besides, we propose an iterative preference update module designed to continuously refine summarized user preference, which may be biased due to false-positive user-item interactions. Extensive experiments on three real-world datasets and four backbone recommenders demonstrate the effectiveness of our approach.
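
The abstract notes that earlier denoising methods drop or reweight samples with high loss values, and that hard samples and noisy samples look alike under that criterion. The Python sketch below illustrates only that prior loss-based dropping idea, not the LLMHD framework itself. The binary cross-entropy loss, the `drop_high_loss` helper, the drop rate, and the toy predictions are all assumptions chosen for illustration.

```python
import math


def bce_loss(pred, label):
    """Pointwise binary cross-entropy for one predicted click probability."""
    eps = 1e-12
    pred = min(max(pred, eps), 1 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))


def drop_high_loss(samples, drop_rate):
    """Keep the (1 - drop_rate) fraction of samples with the lowest loss."""
    scored = sorted(samples, key=lambda s: bce_loss(s["pred"], s["label"]))
    keep = len(scored) - int(len(scored) * drop_rate)
    return scored[:keep]


# Toy interactions, invented for this example: "hard" is a genuine but
# difficult positive, "misclick" is a noisy positive.
samples = [
    {"id": "easy", "pred": 0.9, "label": 1},
    {"id": "medium", "pred": 0.6, "label": 1},
    {"id": "hard", "pred": 0.2, "label": 1},
    {"id": "misclick", "pred": 0.15, "label": 1},
]
kept = [s["id"] for s in drop_high_loss(samples, 0.5)]
print(kept)
```

With a 50% drop rate, both the genuine hard sample and the misclick are discarded, because the loss value alone cannot tell them apart. That is the limitation the abstract says an additional semantic signal is meant to address.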

Citations: 0

Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.09913

Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, Raymond Chi-Wing Wong

Abstract: Approximate nearest neighbor (ANN) query in high-dimensional Euclidean space is a key operator in database systems. For this query, quantization is a popular family of methods developed for compressing vectors and reducing memory consumption. Recently, a method called RaBitQ achieves the state-of-the-art performance among these methods. It produces better empirical performance in both accuracy and efficiency when using the same compression rate and provides rigorous theoretical guarantees. However, the method is only designed for compressing vectors at high compression rates (32x) and lacks support for achieving higher accuracy by using more space. In this paper, we introduce a new quantization method to address this limitation by extending RaBitQ. The new method inherits the theoretical guarantees of RaBitQ and achieves the asymptotic optimality in terms of the trade-off between space and error bounds as to be proven in this study. Additionally, we present efficient implementations of the method, enabling its application to ANN queries to reduce both space and time consumption. Extensive experiments on real-world datasets confirm that our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory.
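
The abstract turns on a trade-off between space and error: a 32x compression rate leaves little room for accuracy, while spending more bits should reduce error. The Python sketch below illustrates that trade-off with a plain uniform scalar quantizer. It is not RaBitQ or the proposed method. It assumes 32-bit float coordinates, so that 1 bit per dimension corresponds to a 32x rate, and the helper names and random test vector are hypothetical.

```python
import random


def compression_rate(bits):
    """Compression relative to 32-bit floats (an assumption of this sketch)."""
    return 32 / bits


def quantize(vec, bits):
    """Uniformly quantize each coordinate to 2**bits levels over [min, max]."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate coordinates."""
    return [lo + c * scale for c in codes]


def squared_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(128)]

errors = {}
for bits in (1, 2, 4, 8):
    codes, lo, scale = quantize(vec, bits)
    errors[bits] = squared_error(vec, dequantize(codes, lo, scale))
    print(f"{bits} bit(s)/dim, {compression_rate(bits):.0f}x: error {errors[bits]:.4f}")
```

Each extra bit refines the grid of representable values, so the reconstruction error shrinks as the compression rate falls. The bit widths 1, 2, 4 and 8 are chosen so that each grid contains the previous one, which makes the error non-increasing across them.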

Citations: 0

Incorporating Classifier-Free Guidance in Diffusion Model-Based Recommendation

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10494

Noah Buchanan, Susan Gauch, Quan Mai

Abstract: This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a new innovation in diffusion models as a whole. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.
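
Classifier-free guidance, as generally used in diffusion models, blends a conditional and an unconditional noise prediction at each denoising step. The Python sketch below shows only that generic combination rule. How the paper conditions its recommender is not stated in the abstract, so the function name `cfg_combine`, the guidance scale, and the toy vectors are assumptions for illustration.

```python
def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    A guidance_scale of 0 returns the unconditional prediction, 1 returns
    the conditional one, and values above 1 extrapolate past the
    conditional prediction.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(eps_cond, eps_uncond)
    ]


# Toy noise predictions, invented for this example.
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -0.5, 1.0]
guided = cfg_combine(eps_cond, eps_uncond, 2.0)
print(guided)
```

In practice the two predictions usually come from one network trained with the conditioning signal randomly dropped, so a single model can serve both roles.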

Citations: 0

Online Learning via Memory: Retrieval-Augmented Detector Adaptation

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10716

Yanan Jian, Fuxun Yu, Qi Zhang, William Levine, Brandon Dubbs, Nikolaos Karianakis

Citations: 0

Enhancing Personalized Recipe Recommendation Through Multi-Class Classification

arXiv - CS - Information Retrieval Pub Date : 2024-09-16 DOI: arxiv-2409.10267

Harish Neelam, Koushik Sai Veerella

Citations: 0