Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
{"title":"THaMES:大型语言模型中减少和评估幻觉的端到端工具","authors":"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven","doi":"arxiv-2409.11353","DOIUrl":null,"url":null,"abstract":"Hallucination, the generation of factually incorrect content, is a growing\nchallenge in Large Language Models (LLMs). Existing detection and mitigation\nmethods are often isolated and insufficient for domain-specific needs, lacking\na standardized pipeline. This paper introduces THaMES (Tool for Hallucination\nMitigations and EvaluationS), an integrated framework and library addressing\nthis gap. THaMES offers an end-to-end solution for evaluating and mitigating\nhallucinations in LLMs, featuring automated test set generation, multifaceted\nbenchmarking, and adaptable mitigation strategies. It automates test set\ncreation from any corpus, ensuring high data quality, diversity, and\ncost-efficiency through techniques like batch processing, weighted sampling,\nand counterfactual validation. THaMES assesses a model's ability to detect and\nreduce hallucinations across various tasks, including text generation and\nbinary classification, applying optimal mitigation strategies like In-Context\nLearning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\nFine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base\nof academic papers, political news, and Wikipedia reveal that commercial models\nlike GPT-4o benefit more from RAG than ICL, while open-weight models like\nLlama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\nsignificantly enhances the performance of Llama-3.1-8B-Instruct in both\nevaluation tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models\",\"authors\":\"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven\",\"doi\":\"arxiv-2409.11353\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hallucination, the generation of factually incorrect content, is a growing\\nchallenge in Large Language Models (LLMs). Existing detection and mitigation\\nmethods are often isolated and insufficient for domain-specific needs, lacking\\na standardized pipeline. This paper introduces THaMES (Tool for Hallucination\\nMitigations and EvaluationS), an integrated framework and library addressing\\nthis gap. THaMES offers an end-to-end solution for evaluating and mitigating\\nhallucinations in LLMs, featuring automated test set generation, multifaceted\\nbenchmarking, and adaptable mitigation strategies. It automates test set\\ncreation from any corpus, ensuring high data quality, diversity, and\\ncost-efficiency through techniques like batch processing, weighted sampling,\\nand counterfactual validation. THaMES assesses a model's ability to detect and\\nreduce hallucinations across various tasks, including text generation and\\nbinary classification, applying optimal mitigation strategies like In-Context\\nLearning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\\nFine-tuning (PEFT). 
Evaluations of state-of-the-art LLMs using a knowledge base\\nof academic papers, political news, and Wikipedia reveal that commercial models\\nlike GPT-4o benefit more from RAG than ICL, while open-weight models like\\nLlama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\\nsignificantly enhances the performance of Llama-3.1-8B-Instruct in both\\nevaluation tasks.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"27 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11353\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11353","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
Hallucination, the generation of factually incorrect content, is a growing
challenge in Large Language Models (LLMs). Existing detection and mitigation
methods are often isolated and insufficient for domain-specific needs, lacking
a standardized pipeline. This paper introduces THaMES (Tool for Hallucination
Mitigations and EvaluationS), an integrated framework and library addressing
this gap. THaMES offers an end-to-end solution for evaluating and mitigating
hallucinations in LLMs, featuring automated test set generation, multifaceted
benchmarking, and adaptable mitigation strategies. It automates test set
creation from any corpus, ensuring high data quality, diversity, and
cost-efficiency through techniques like batch processing, weighted sampling,
and counterfactual validation. THaMES assesses a model's ability to detect and
reduce hallucinations across various tasks, including text generation and
binary classification, applying optimal mitigation strategies like In-Context
Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient
Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base
of academic papers, political news, and Wikipedia reveal that commercial models
like GPT-4o benefit more from RAG than ICL, while open-weight models like
Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT
significantly enhances the performance of Llama-3.1-8B-Instruct in both
evaluation tasks.
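
The test-set generation pipeline described above combines weighted sampling over a source corpus with counterfactual validation of candidate items. The following minimal Python sketch illustrates those two steps only; the function and field names are hypothetical and do not reflect the actual THaMES API, and the LLM-backed generation and validation calls are left as pluggable callables.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Candidate:
    passage: str
    question: str
    answer: str          # answer grounded in the passage
    hallucination: str   # deliberately incorrect answer

def weighted_sample(passages: Sequence[str], weights: Sequence[float], k: int) -> List[str]:
    """Sample k source passages, favouring high-priority or under-represented sections."""
    return random.choices(passages, weights=weights, k=k)

def build_test_set(
    passages: Sequence[str],
    weights: Sequence[float],
    generate: Callable[[str], Candidate],            # e.g. an LLM call that drafts a QA pair plus distractor
    is_counterfactual: Callable[[Candidate], bool],  # checks the distractor really contradicts the passage
    k: int = 100,
) -> List[Candidate]:
    candidates = [generate(p) for p in weighted_sample(passages, weights, k)]
    # Counterfactual validation: keep only items whose "hallucinated" answer
    # genuinely conflicts with the source passage.
    return [c for c in candidates if is_counterfactual(c)]
```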
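ICL and RAG, two of the mitigation strategies compared in the paper, differ mainly in what is prepended to the prompt: worked exemplars versus retrieved evidence. A minimal sketch of that distinction, assuming generic `model` and `retrieve` callables rather than the THaMES interface:

```python
from typing import Callable, List, Sequence, Tuple

def answer_with_icl(
    model: Callable[[str], str],
    question: str,
    exemplars: Sequence[Tuple[str, str]],
) -> str:
    """In-Context Learning: prepend worked question/answer pairs to the prompt."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return model(f"{shots}\n\nQ: {question}\nA:")

def answer_with_rag(
    model: Callable[[str], str],
    question: str,
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> str:
    """Retrieval-Augmented Generation: ground the answer in retrieved passages."""
    context = "\n".join(retrieve(question, top_k))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )
    return model(prompt)
```

Keeping the same `model` callable in both functions means only the prompting strategy changes between runs, which is what a benchmark comparing ICL against RAG needs to isolate.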
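The PEFT gains reported for Llama-3.1-8B-Instruct are typically obtained with adapter methods such as LoRA. A hedged sketch using the Hugging Face `peft` library follows; the hyperparameters are illustrative assumptions, not the settings reported in the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical hyperparameters for illustration; not the values used in the paper.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)  # only the LoRA adapter weights are trainable
model.print_trainable_parameters()
# The adapted model can then be fine-tuned on the generated training split with a standard Trainer.
```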