Andreas Grivas, Claire Grover, Richard Tobin, Clare Llewellyn, Eleojo Oluwaseun Abubakar, Chunyu Zheng, Chris Dibben, Alan Marshall, Jamie Pearce, Beatrice Alex
{"title":"Perceptions of Edinburgh: Capturing Neighbourhood Characteristics by Clustering Geoparsed Local News","authors":"Andreas Grivas, Claire Grover, Richard Tobin, Clare Llewellyn, Eleojo Oluwaseun Abubakar, Chunyu Zheng, Chris Dibben, Alan Marshall, Jamie Pearce, Beatrice Alex","doi":"arxiv-2409.11505","DOIUrl":"https://doi.org/arxiv-2409.11505","url":null,"abstract":"The communities that we live in affect our health in ways that are complex and hard to define. Moreover, our understanding of the place-based processes affecting health and inequalities is limited. This undermines the development of robust policy interventions to improve local health and well-being. News media provides social and community information that may be useful in health studies. Here we propose a methodology for characterising neighbourhoods by using local news articles. More specifically, we show how we can use Natural Language Processing (NLP) to unlock further information about neighbourhoods by analysing, geoparsing and clustering news articles. Our work is novel because we combine street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. We evaluate our outputs and show via a confluence of evidence, both from a qualitative and a quantitative perspective, that the themes we extract from news articles are sensible and reflect many characteristics of the real world. This is significant because it allows us to better understand the effects of neighbourhoods on health. Our findings on neighbourhood characterisation using news data will support a new generation of place-based research which examines a wider set of spatial processes and how they affect health, enabling new epidemiological research.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TISIS : Trajectory Indexing for SImilarity Search","authors":"Sara Jarrad, Hubert Naacke, Stephane Gancarski","doi":"arxiv-2409.11301","DOIUrl":"https://doi.org/arxiv-2409.11301","url":null,"abstract":"Social media platforms enable users to share diverse types of information, including geolocation data that captures their movement patterns. Such geolocation data can be leveraged to reconstruct the trajectory of a user's visited Points of Interest (POIs). A key requirement in numerous applications is the ability to measure the similarity between such trajectories, as this facilitates the retrieval of trajectories that are similar to a given reference trajectory. This is the main focus of our work. Existing methods predominantly rely on applying a similarity function to each candidate trajectory to identify those that are sufficiently similar. However, this approach becomes computationally expensive when dealing with large-scale datasets. To mitigate this challenge, we propose TISIS, an efficient method that uses trajectory indexing to quickly find similar trajectories that share common POIs in the same order. Furthermore, to account for scenarios where POIs in trajectories may not exactly match but are contextually similar, we introduce TISIS*, a variant of TISIS that incorporates POI embeddings. This extension allows for more comprehensive retrieval of similar trajectories by considering semantic similarities between POIs, beyond mere exact matches. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms a baseline method based on the well-known Longest Common SubSequence (LCSS) algorithm, yielding substantial performance improvements across various real-world datasets.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu
{"title":"Trustworthiness in Retrieval-Augmented Generation Systems: A Survey","authors":"Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu","doi":"arxiv-2409.10102","DOIUrl":"https://doi.org/arxiv-2409.10102","url":null,"abstract":"Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). While much of the current research in this field focuses on performance optimization, particularly in terms of accuracy and efficiency, the trustworthiness of RAG systems remains an area still under exploration. From a positive perspective, RAG systems are promising to enhance LLMs by providing them with useful and up-to-date knowledge from vast external databases, thereby mitigating the long-standing problem of hallucination. While from a negative perspective, RAG systems are at the risk of generating undesirable contents if the retrieved information is either inappropriate or poorly utilized. To address these concerns, we propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we thoroughly review the existing literature on each dimension. Additionally, we create the evaluation benchmark regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Finally, we identify the potential challenges for future research based on our investigation results. Through this work, we aim to lay a structured foundation for future investigations and provide practical insights for enhancing the trustworthiness of RAG systems in real-world applications.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
{"title":"jina-embeddings-v3: Multilingual Embeddings With Task LoRA","authors":"Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao","doi":"arxiv-2409.10173","DOIUrl":"https://doi.org/arxiv-2409.10173","url":null,"abstract":"We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters, which achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems","authors":"Vojtěch Vančura, Pavel Kordík, Milan Straka","doi":"arxiv-2409.10309","DOIUrl":"https://doi.org/arxiv-2409.10309","url":null,"abstract":"Recommender systems often use text-side information to improve their predictions, especially in cold-start or zero-shot recommendation scenarios, where traditional collaborative filtering approaches cannot be used. Many approaches to text-mining side information for recommender systems have been proposed over recent years, with sentence Transformers being the most prominent one. However, these models are trained to predict semantic similarity without utilizing interaction data with hidden patterns specific to recommender systems. In this paper, we propose beeFormer, a framework for training sentence Transformer models with interaction data. We demonstrate that our models trained with beeFormer can transfer knowledge between datasets while outperforming not only semantic similarity sentence Transformers but also traditional collaborative filtering methods. We also show that training on multiple datasets from different domains accumulates knowledge in a single model, unlocking the possibility of training universal, domain-agnostic sentence Transformer models to mine text representations for recommender systems. We release the source code, trained models, and additional details allowing replication of our experiments at https://github.com/recombee/beeformer.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation","authors":"Tianrui Song, Wenshuo Chao, Hao Liu","doi":"arxiv-2409.10343","DOIUrl":"https://doi.org/arxiv-2409.10343","url":null,"abstract":"Implicit feedback, often used to build recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to alleviate this by identifying noisy samples based on their diverged patterns, such as higher loss values, and mitigating the noise through sample dropping or reweighting. Despite the progress, we observe existing approaches struggle to distinguish hard samples and noise samples, as they often exhibit similar patterns, thereby limiting their effectiveness in denoising recommendations. To address this challenge, we propose a Large Language Model Enhanced Hard Sample Denoising (LLMHD) framework. Specifically, we construct an LLM-based scorer to evaluate the semantic consistency of items with the user preference, which is quantified based on summarized historical user interactions. The resulting scores are used to assess the hardness of samples for the pointwise or pairwise training objectives. To ensure efficiency, we introduce a variance-based sample pruning strategy to filter potential hard samples before scoring. Besides, we propose an iterative preference update module designed to continuously refine summarized user preference, which may be biased due to false-positive user-item interactions. Extensive experiments on three real-world datasets and four backbone recommenders demonstrate the effectiveness of our approach.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, Raymond Chi-Wing Wong
{"title":"Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search","authors":"Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, Raymond Chi-Wing Wong","doi":"arxiv-2409.09913","DOIUrl":"https://doi.org/arxiv-2409.09913","url":null,"abstract":"Approximate nearest neighbor (ANN) query in high-dimensional Euclidean space is a key operator in database systems. For this query, quantization is a popular family of methods developed for compressing vectors and reducing memory consumption. Recently, a method called RaBitQ achieves the state-of-the-art performance among these methods. It produces better empirical performance in both accuracy and efficiency when using the same compression rate and provides rigorous theoretical guarantees. However, the method is only designed for compressing vectors at high compression rates (32x) and lacks support for achieving higher accuracy by using more space. In this paper, we introduce a new quantization method to address this limitation by extending RaBitQ. The new method inherits the theoretical guarantees of RaBitQ and achieves the asymptotic optimality in terms of the trade-off between space and error bounds as to be proven in this study. Additionally, we present efficient implementations of the method, enabling its application to ANN queries to reduce both space and time consumption. Extensive experiments on real-world datasets confirm that our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"191 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incorporating Classifier-Free Guidance in Diffusion Model-Based Recommendation","authors":"Noah Buchanan, Susan Gauch, Quan Mai","doi":"arxiv-2409.10494","DOIUrl":"https://doi.org/arxiv-2409.10494","url":null,"abstract":"This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a new innovation in diffusion models as a whole. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Learning via Memory: Retrieval-Augmented Detector Adaptation","authors":"Yanan Jian, Fuxun Yu, Qi Zhang, William Levine, Brandon Dubbs, Nikolaos Karianakis","doi":"arxiv-2409.10716","DOIUrl":"https://doi.org/arxiv-2409.10716","url":null,"abstract":"This paper presents a novel way of online adapting any off-the-shelf object detection model to a novel domain without retraining the detector model. Inspired by how humans quickly learn knowledge of a new subject (e.g., memorization), we allow the detector to look up similar object concepts from memory during test time. This is achieved through a retrieval augmented classification (RAC) module together with a memory bank that can be flexibly updated with new domain knowledge. We experimented with various off-the-shelf open-set detector and close-set detectors. With only a tiny memory bank (e.g., 10 images per category) and being training-free, our online learning method could significantly outperform baselines in adapting a detector to novel domains.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Personalized Recipe Recommendation Through Multi-Class Classification","authors":"Harish Neelam, Koushik Sai Veerella","doi":"arxiv-2409.10267","DOIUrl":"https://doi.org/arxiv-2409.10267","url":null,"abstract":"This paper intends to address the challenge of personalized recipe recommendation in the realm of diverse culinary preferences. The problem domain involves recipe recommendations, utilizing techniques such as association analysis and classification. Association analysis explores the relationships and connections between different ingredients to enhance the user experience. Meanwhile, the classification aspect involves categorizing recipes based on user-defined ingredients and preferences. A unique aspect of the paper is the consideration of recipes and ingredients belonging to multiple classes, recognizing the complexity of culinary combinations. This necessitates a sophisticated approach to classification and recommendation, ensuring the system accommodates the nature of recipe categorization. The paper seeks not only to recommend recipes but also to explore the process involved in achieving accurate and personalized recommendations.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}