{"title":"Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning","authors":"Yukang Lin, Bingchen Zhong, Shuoran Jiang, Joanna Siebert, Qingcai Chen","doi":"arxiv-2409.11147","DOIUrl":"https://doi.org/arxiv-2409.11147","url":null,"abstract":"Large language models(LLMs) have exhibited remarkable few-shot learning\u0000capabilities and unified the paradigm of NLP tasks through the in-context\u0000learning(ICL) technique. Despite the success of ICL, the quality of the\u0000exemplar demonstrations can significantly influence the LLM's performance.\u0000Existing exemplar selection methods mainly focus on the semantic similarity\u0000between queries and candidate exemplars. On the other hand, the logical\u0000connections between reasoning steps can be beneficial to depict the\u0000problem-solving process as well. In this paper, we proposes a novel method\u0000named Reasoning Graph-enhanced Exemplar Retrieval(RGER). RGER first quires LLM\u0000to generate an initial response, then expresses intermediate problem-solving\u0000steps to a graph structure. After that, it employs graph kernel to select\u0000exemplars with semantic and structural similarity. Extensive experiments\u0000demonstrate the structural relationship is helpful to the alignment of queries\u0000and candidate exemplars. The efficacy of RGER on math and logit reasoning tasks\u0000showcases its superiority over state-of-the-art retrieval-based approaches. Our\u0000code is released at https://github.com/Yukang-Lin/RGER.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do","authors":"Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong","doi":"arxiv-2409.11239","DOIUrl":"https://doi.org/arxiv-2409.11239","url":null,"abstract":"LLM-as-a-Judge and reward models are widely used alternatives of\u0000multiple-choice questions or human annotators for large language model (LLM)\u0000evaluation. Their efficacy shines in evaluating long-form responses, serving a\u0000critical role as evaluators of leaderboards and as proxies to align LLMs via\u0000reinforcement learning. However, despite their popularity, their effectiveness\u0000outside of English remains largely unexplored. In this paper, we conduct a\u0000comprehensive analysis on automated evaluators, reporting key findings on their\u0000behavior in a non-English environment. First, we discover that English\u0000evaluation capabilities significantly influence language-specific capabilities,\u0000often more than the language proficiency itself, enabling evaluators trained in\u0000English to easily transfer their skills to other languages. Second, we identify\u0000critical shortcomings, where LLMs fail to detect and penalize errors, such as\u0000factual inaccuracies, cultural misrepresentations, and the presence of unwanted\u0000language. Finally, we release Kudge, the first non-English meta-evaluation\u0000dataset containing 5,012 human annotations in Korean.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Egalitarian Language Representation in Language Models: It All Begins with Tokenizers","authors":"Menan Velayuthan, Kengatharaiyer Sarveswaran","doi":"arxiv-2409.11501","DOIUrl":"https://doi.org/arxiv-2409.11501","url":null,"abstract":"Tokenizers act as a bridge between human language and the latent space of\u0000language models, influencing how language is represented in these models. Due\u0000to the immense popularity of English-Centric Large Language Models (LLMs),\u0000efforts are being made to adapt them for other languages. However, we\u0000demonstrate that, from a tokenization standpoint, not all tokenizers offer fair\u0000representation for complex script languages such as Tamil, Sinhala, and Hindi,\u0000primarily due to the choice of pre-tokenization methods. We go further to show\u0000that pre-tokenization plays a more critical role than the tokenization\u0000algorithm itself in achieving an egalitarian representation of these complex\u0000script languages. To address this, we introduce an improvement to the Byte Pair\u0000Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme\u0000Pair Encoding (GPE). Our experiments show that grapheme-based character\u0000extraction outperforms byte-level tokenizers for complex scripts. We validate\u0000this approach through experiments on Tamil, Sinhala, and Hindi.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models","authors":"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven","doi":"arxiv-2409.11353","DOIUrl":"https://doi.org/arxiv-2409.11353","url":null,"abstract":"Hallucination, the generation of factually incorrect content, is a growing\u0000challenge in Large Language Models (LLMs). Existing detection and mitigation\u0000methods are often isolated and insufficient for domain-specific needs, lacking\u0000a standardized pipeline. This paper introduces THaMES (Tool for Hallucination\u0000Mitigations and EvaluationS), an integrated framework and library addressing\u0000this gap. THaMES offers an end-to-end solution for evaluating and mitigating\u0000hallucinations in LLMs, featuring automated test set generation, multifaceted\u0000benchmarking, and adaptable mitigation strategies. It automates test set\u0000creation from any corpus, ensuring high data quality, diversity, and\u0000cost-efficiency through techniques like batch processing, weighted sampling,\u0000and counterfactual validation. THaMES assesses a model's ability to detect and\u0000reduce hallucinations across various tasks, including text generation and\u0000binary classification, applying optimal mitigation strategies like In-Context\u0000Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\u0000Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base\u0000of academic papers, political news, and Wikipedia reveal that commercial models\u0000like GPT-4o benefit more from RAG than ICL, while open-weight models like\u0000Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\u0000significantly enhances the performance of Llama-3.1-8B-Instruct in both\u0000evaluation tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives","authors":"Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar","doi":"arxiv-2409.11261","DOIUrl":"https://doi.org/arxiv-2409.11261","url":null,"abstract":"This paper introduces the concept of an education tool that utilizes\u0000Generative Artificial Intelligence (GenAI) to enhance storytelling for\u0000children. The system combines GenAI-driven narrative co-creation,\u0000text-to-speech conversion, and text-to-video generation to produce an engaging\u0000experience for learners. We describe the co-creation process, the adaptation of\u0000narratives into spoken words using text-to-speech models, and the\u0000transformation of these narratives into contextually relevant visuals through\u0000text-to-video technology. Our evaluation covers the linguistics of the\u0000generated stories, the text-to-speech conversion quality, and the accuracy of\u0000the generated visuals.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Document Grounded Multi-Turn Synthetic Dialog Generation","authors":"Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian","doi":"arxiv-2409.11500","DOIUrl":"https://doi.org/arxiv-2409.11500","url":null,"abstract":"We introduce a technique for multi-document grounded multi-turn synthetic\u0000dialog generation that incorporates three main ideas. First, we control the\u0000overall dialog flow using taxonomy-driven user queries that are generated with\u0000Chain-of-Thought (CoT) prompting. Second, we support the generation of\u0000multi-document grounded dialogs by mimicking real-world use of retrievers to\u0000update the grounding documents after every user-turn in the dialog. Third, we\u0000apply LLM-as-a-Judge to filter out queries with incorrect answers. Human\u0000evaluation of the synthetic dialog data suggests that the data is diverse,\u0000coherent, and includes mostly correct answers. Both human and automatic\u0000evaluations of answerable queries indicate that models fine-tuned on synthetic\u0000dialogs consistently out-perform those fine-tuned on existing human generated\u0000training data across four publicly available multi-turn document grounded\u0000benchmark test sets.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Recency Bias During Training Improves Transformers' Fit to Reading Times","authors":"Christian Clark, Byung-Doh Oh, William Schuler","doi":"arxiv-2409.11250","DOIUrl":"https://doi.org/arxiv-2409.11250","url":null,"abstract":"Recent psycholinguistic research has compared human reading times to\u0000surprisal estimates from language models to study the factors shaping human\u0000sentence processing difficulty. Previous studies have shown a strong fit\u0000between surprisal values from Transformers and reading times. However, standard\u0000Transformers work with a lossless representation of the entire previous\u0000linguistic context, unlike models of human language processing that include\u0000memory decay. To bridge this gap, this paper evaluates a modification of the\u0000Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to\u0000attention scores. Surprisal estimates with ALiBi show an improved fit to human\u0000reading times compared to a standard Transformer baseline. A subsequent\u0000analysis of attention heads suggests that ALiBi's mixture of slopes -- which\u0000determine the rate of memory decay in each attention head -- may play a role in\u0000the improvement by helping models with ALiBi to track different kinds of\u0000linguistic dependencies.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1243 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection","authors":"Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven","doi":"arxiv-2409.11579","DOIUrl":"https://doi.org/arxiv-2409.11579","url":null,"abstract":"Stereotypes are generalised assumptions about societal groups, and even\u0000state-of-the-art LLMs using in-context learning struggle to identify them\u0000accurately. Due to the subjective nature of stereotypes, where what constitutes\u0000a stereotype can vary widely depending on cultural, social, and individual\u0000perspectives, robust explainability is crucial. Explainable models ensure that\u0000these nuanced judgments can be understood and validated by human users,\u0000promoting trust and accountability. We address these challenges by introducing\u0000HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text\u0000Stereotype Detection), a framework that enhances model performance, minimises\u0000carbon footprint, and provides transparent, interpretable explanations. We\u0000establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising\u000057,201 labeled texts across six groups, including under-represented\u0000demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm\u0000that BERT models fine-tuned on EMGSD outperform those trained on individual\u0000components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model\u0000using SHAP to generate token-level importance values, ensuring alignment with\u0000human understanding, and calculate explainability confidence scores by\u0000comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess\u0000stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias\u0000over time within model families.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enriching Datasets with Demographics through Large Language Models: What's in a Name?","authors":"Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat","doi":"arxiv-2409.11491","DOIUrl":"https://doi.org/arxiv-2409.11491","url":null,"abstract":"Enriching datasets with demographic information, such as gender, race, and\u0000age from names, is a critical task in fields like healthcare, public policy,\u0000and social sciences. Such demographic insights allow for more precise and\u0000effective engagement with target populations. Despite previous efforts\u0000employing hidden Markov models and recurrent neural networks to predict\u0000demographics from names, significant limitations persist: the lack of\u0000large-scale, well-curated, unbiased, publicly available datasets, and the lack\u0000of an approach robust across datasets. This scarcity has hindered the\u0000development of traditional supervised learning approaches. In this paper, we\u0000demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can\u0000perform as well as, if not better than, bespoke models trained on specialized\u0000data. We apply these LLMs to a variety of datasets, including a real-life,\u0000unlabelled dataset of licensed financial professionals in Hong Kong, and\u0000critically assess the inherent demographic biases in these models. Our work not\u0000only advances the state-of-the-art in demographic enrichment but also opens\u0000avenues for future research in mitigating biases in LLMs.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization","authors":"Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan","doi":"arxiv-2409.11212","DOIUrl":"https://doi.org/arxiv-2409.11212","url":null,"abstract":"Iterative preference optimization has recently become one of the de-facto\u0000training paradigms for large language models (LLMs), but the performance is\u0000still underwhelming due to too much noisy preference data yielded in the loop.\u0000To combat this issue, we present an textbf{U}ncertainty-enhanced\u0000textbf{P}reference textbf{O}ptimization (UPO) framework to make the LLM\u0000self-evolve with reliable feedback. The key idea is mitigating the noisy\u0000preference data derived from the current policy and reward models by performing\u0000pair-wise uncertainty estimation and judiciously reliable feedback sampling. To\u0000reach this goal, we thus introduce an estimator model, which incorporates Monte\u0000Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty\u0000estimation for the preference data derived from the LLM policy. Compared to the\u0000existing methods that directly filter generated responses based on the reward\u0000score, the estimator focuses on the model uncertainty in a pair-wise manner and\u0000effectively bypasses the confirmation bias problem of the reward model.\u0000Additionally, we also propose an uncertainty-enhanced self-evolution algorithm\u0000to improve the robustness of preference optimization and encourage the LLM to\u0000generate responses with both high reward and certainty. Extensive experiments\u0000over multiple benchmarks demonstrate that our framework substantially\u0000alleviates the noisy problem and improves the performance of iterative\u0000preference optimization.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}