{"title":"Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning","authors":"Yukang Lin, Bingchen Zhong, Shuoran Jiang, Joanna Siebert, Qingcai Chen","doi":"arxiv-2409.11147","DOIUrl":"https://doi.org/arxiv-2409.11147","url":null,"abstract":"Large language models(LLMs) have exhibited remarkable few-shot learning\u0000capabilities and unified the paradigm of NLP tasks through the in-context\u0000learning(ICL) technique. Despite the success of ICL, the quality of the\u0000exemplar demonstrations can significantly influence the LLM's performance.\u0000Existing exemplar selection methods mainly focus on the semantic similarity\u0000between queries and candidate exemplars. On the other hand, the logical\u0000connections between reasoning steps can be beneficial to depict the\u0000problem-solving process as well. In this paper, we proposes a novel method\u0000named Reasoning Graph-enhanced Exemplar Retrieval(RGER). RGER first quires LLM\u0000to generate an initial response, then expresses intermediate problem-solving\u0000steps to a graph structure. After that, it employs graph kernel to select\u0000exemplars with semantic and structural similarity. Extensive experiments\u0000demonstrate the structural relationship is helpful to the alignment of queries\u0000and candidate exemplars. The efficacy of RGER on math and logit reasoning tasks\u0000showcases its superiority over state-of-the-art retrieval-based approaches. Our\u0000code is released at https://github.com/Yukang-Lin/RGER.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do","authors":"Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong","doi":"arxiv-2409.11239","DOIUrl":"https://doi.org/arxiv-2409.11239","url":null,"abstract":"LLM-as-a-Judge and reward models are widely used alternatives of\u0000multiple-choice questions or human annotators for large language model (LLM)\u0000evaluation. Their efficacy shines in evaluating long-form responses, serving a\u0000critical role as evaluators of leaderboards and as proxies to align LLMs via\u0000reinforcement learning. However, despite their popularity, their effectiveness\u0000outside of English remains largely unexplored. In this paper, we conduct a\u0000comprehensive analysis on automated evaluators, reporting key findings on their\u0000behavior in a non-English environment. First, we discover that English\u0000evaluation capabilities significantly influence language-specific capabilities,\u0000often more than the language proficiency itself, enabling evaluators trained in\u0000English to easily transfer their skills to other languages. Second, we identify\u0000critical shortcomings, where LLMs fail to detect and penalize errors, such as\u0000factual inaccuracies, cultural misrepresentations, and the presence of unwanted\u0000language. Finally, we release Kudge, the first non-English meta-evaluation\u0000dataset containing 5,012 human annotations in Korean.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Egalitarian Language Representation in Language Models: It All Begins with Tokenizers","authors":"Menan Velayuthan, Kengatharaiyer Sarveswaran","doi":"arxiv-2409.11501","DOIUrl":"https://doi.org/arxiv-2409.11501","url":null,"abstract":"Tokenizers act as a bridge between human language and the latent space of\u0000language models, influencing how language is represented in these models. Due\u0000to the immense popularity of English-Centric Large Language Models (LLMs),\u0000efforts are being made to adapt them for other languages. However, we\u0000demonstrate that, from a tokenization standpoint, not all tokenizers offer fair\u0000representation for complex script languages such as Tamil, Sinhala, and Hindi,\u0000primarily due to the choice of pre-tokenization methods. We go further to show\u0000that pre-tokenization plays a more critical role than the tokenization\u0000algorithm itself in achieving an egalitarian representation of these complex\u0000script languages. To address this, we introduce an improvement to the Byte Pair\u0000Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme\u0000Pair Encoding (GPE). Our experiments show that grapheme-based character\u0000extraction outperforms byte-level tokenizers for complex scripts. We validate\u0000this approach through experiments on Tamil, Sinhala, and Hindi.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models","authors":"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven","doi":"arxiv-2409.11353","DOIUrl":"https://doi.org/arxiv-2409.11353","url":null,"abstract":"Hallucination, the generation of factually incorrect content, is a growing\u0000challenge in Large Language Models (LLMs). Existing detection and mitigation\u0000methods are often isolated and insufficient for domain-specific needs, lacking\u0000a standardized pipeline. This paper introduces THaMES (Tool for Hallucination\u0000Mitigations and EvaluationS), an integrated framework and library addressing\u0000this gap. THaMES offers an end-to-end solution for evaluating and mitigating\u0000hallucinations in LLMs, featuring automated test set generation, multifaceted\u0000benchmarking, and adaptable mitigation strategies. It automates test set\u0000creation from any corpus, ensuring high data quality, diversity, and\u0000cost-efficiency through techniques like batch processing, weighted sampling,\u0000and counterfactual validation. THaMES assesses a model's ability to detect and\u0000reduce hallucinations across various tasks, including text generation and\u0000binary classification, applying optimal mitigation strategies like In-Context\u0000Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\u0000Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base\u0000of academic papers, political news, and Wikipedia reveal that commercial models\u0000like GPT-4o benefit more from RAG than ICL, while open-weight models like\u0000Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\u0000significantly enhances the performance of Llama-3.1-8B-Instruct in both\u0000evaluation tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives","authors":"Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar","doi":"arxiv-2409.11261","DOIUrl":"https://doi.org/arxiv-2409.11261","url":null,"abstract":"This paper introduces the concept of an education tool that utilizes\u0000Generative Artificial Intelligence (GenAI) to enhance storytelling for\u0000children. The system combines GenAI-driven narrative co-creation,\u0000text-to-speech conversion, and text-to-video generation to produce an engaging\u0000experience for learners. We describe the co-creation process, the adaptation of\u0000narratives into spoken words using text-to-speech models, and the\u0000transformation of these narratives into contextually relevant visuals through\u0000text-to-video technology. Our evaluation covers the linguistics of the\u0000generated stories, the text-to-speech conversion quality, and the accuracy of\u0000the generated visuals.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Document Grounded Multi-Turn Synthetic Dialog Generation","authors":"Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian","doi":"arxiv-2409.11500","DOIUrl":"https://doi.org/arxiv-2409.11500","url":null,"abstract":"We introduce a technique for multi-document grounded multi-turn synthetic\u0000dialog generation that incorporates three main ideas. First, we control the\u0000overall dialog flow using taxonomy-driven user queries that are generated with\u0000Chain-of-Thought (CoT) prompting. Second, we support the generation of\u0000multi-document grounded dialogs by mimicking real-world use of retrievers to\u0000update the grounding documents after every user-turn in the dialog. Third, we\u0000apply LLM-as-a-Judge to filter out queries with incorrect answers. Human\u0000evaluation of the synthetic dialog data suggests that the data is diverse,\u0000coherent, and includes mostly correct answers. Both human and automatic\u0000evaluations of answerable queries indicate that models fine-tuned on synthetic\u0000dialogs consistently out-perform those fine-tuned on existing human generated\u0000training data across four publicly available multi-turn document grounded\u0000benchmark test sets.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Recency Bias During Training Improves Transformers' Fit to Reading Times","authors":"Christian Clark, Byung-Doh Oh, William Schuler","doi":"arxiv-2409.11250","DOIUrl":"https://doi.org/arxiv-2409.11250","url":null,"abstract":"Recent psycholinguistic research has compared human reading times to\u0000surprisal estimates from language models to study the factors shaping human\u0000sentence processing difficulty. Previous studies have shown a strong fit\u0000between surprisal values from Transformers and reading times. However, standard\u0000Transformers work with a lossless representation of the entire previous\u0000linguistic context, unlike models of human language processing that include\u0000memory decay. To bridge this gap, this paper evaluates a modification of the\u0000Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to\u0000attention scores. Surprisal estimates with ALiBi show an improved fit to human\u0000reading times compared to a standard Transformer baseline. A subsequent\u0000analysis of attention heads suggests that ALiBi's mixture of slopes -- which\u0000determine the rate of memory decay in each attention head -- may play a role in\u0000the improvement by helping models with ALiBi to track different kinds of\u0000linguistic dependencies.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1243 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection","authors":"Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven","doi":"arxiv-2409.11579","DOIUrl":"https://doi.org/arxiv-2409.11579","url":null,"abstract":"Stereotypes are generalised assumptions about societal groups, and even\u0000state-of-the-art LLMs using in-context learning struggle to identify them\u0000accurately. Due to the subjective nature of stereotypes, where what constitutes\u0000a stereotype can vary widely depending on cultural, social, and individual\u0000perspectives, robust explainability is crucial. Explainable models ensure that\u0000these nuanced judgments can be understood and validated by human users,\u0000promoting trust and accountability. We address these challenges by introducing\u0000HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text\u0000Stereotype Detection), a framework that enhances model performance, minimises\u0000carbon footprint, and provides transparent, interpretable explanations. We\u0000establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising\u000057,201 labeled texts across six groups, including under-represented\u0000demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm\u0000that BERT models fine-tuned on EMGSD outperform those trained on individual\u0000components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model\u0000using SHAP to generate token-level importance values, ensuring alignment with\u0000human understanding, and calculate explainability confidence scores by\u0000comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess\u0000stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias\u0000over time within model families.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enriching Datasets with Demographics through Large Language Models: What's in a Name?","authors":"Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat","doi":"arxiv-2409.11491","DOIUrl":"https://doi.org/arxiv-2409.11491","url":null,"abstract":"Enriching datasets with demographic information, such as gender, race, and\u0000age from names, is a critical task in fields like healthcare, public policy,\u0000and social sciences. Such demographic insights allow for more precise and\u0000effective engagement with target populations. Despite previous efforts\u0000employing hidden Markov models and recurrent neural networks to predict\u0000demographics from names, significant limitations persist: the lack of\u0000large-scale, well-curated, unbiased, publicly available datasets, and the lack\u0000of an approach robust across datasets. This scarcity has hindered the\u0000development of traditional supervised learning approaches. In this paper, we\u0000demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can\u0000perform as well as, if not better than, bespoke models trained on specialized\u0000data. We apply these LLMs to a variety of datasets, including a real-life,\u0000unlabelled dataset of licensed financial professionals in Hong Kong, and\u0000critically assess the inherent demographic biases in these models. Our work not\u0000only advances the state-of-the-art in demographic enrichment but also opens\u0000avenues for future research in mitigating biases in LLMs.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization","authors":"Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan","doi":"arxiv-2409.11212","DOIUrl":"https://doi.org/arxiv-2409.11212","url":null,"abstract":"Iterative preference optimization has recently become one of the de-facto\u0000training paradigms for large language models (LLMs), but the performance is\u0000still underwhelming due to too much noisy preference data yielded in the loop.\u0000To combat this issue, we present an textbf{U}ncertainty-enhanced\u0000textbf{P}reference textbf{O}ptimization (UPO) framework to make the LLM\u0000self-evolve with reliable feedback. The key idea is mitigating the noisy\u0000preference data derived from the current policy and reward models by performing\u0000pair-wise uncertainty estimation and judiciously reliable feedback sampling. To\u0000reach this goal, we thus introduce an estimator model, which incorporates Monte\u0000Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty\u0000estimation for the preference data derived from the LLM policy. Compared to the\u0000existing methods that directly filter generated responses based on the reward\u0000score, the estimator focuses on the model uncertainty in a pair-wise manner and\u0000effectively bypasses the confirmation bias problem of the reward model.\u0000Additionally, we also propose an uncertainty-enhanced self-evolution algorithm\u0000to improve the robustness of preference optimization and encourage the LLM to\u0000generate responses with both high reward and certainty. Extensive experiments\u0000over multiple benchmarks demonstrate that our framework substantially\u0000alleviates the noisy problem and improves the performance of iterative\u0000preference optimization.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}