Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
SeongYeub Chu, JongWoo Kim, MunYong Yi
arXiv - CS - Computation and Language, published 2024-09-11
arXiv:2409.07355 (https://doi.org/arxiv-2409.07355)
Abstract
This study introduces \textbf{InteractEval}, a framework that integrates
human expertise and Large Language Models (LLMs) using the Think-Aloud (TA)
method to generate attributes for checklist-based text evaluation. By combining
human flexibility and reasoning with LLM consistency, InteractEval outperforms
traditional non-LLM-based and LLM-based baselines across four distinct
dimensions: Coherence, Fluency, Consistency, and Relevance. The
experiment also investigates the effectiveness of the TA method, showing that
it promotes divergent thinking in both humans and LLMs, leading to the
generation of a wider range of relevant attributes and enhancing text evaluation
performance. Comparative analysis reveals that humans excel at identifying
attributes related to internal quality (Coherence and Fluency), whereas LLMs
perform better at attributes related to external alignment (Consistency
and Relevance). Consequently, leveraging both humans and LLMs together produces
the best evaluation outcomes. Overall, this study emphasizes the
necessity of effectively combining humans and LLMs in an automated
checklist-based text evaluation framework. The code is available at
\textbf{\url{https://github.com/BBeeChu/InteractEval.git}}.
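The checklist-based evaluation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; all attribute strings, function names, and the toy judge below are hypothetical stand-ins. It merges human- and LLM-generated attributes into per-dimension checklists (dropping duplicates) and scores a text as the fraction of checklist items an item-level judge marks as satisfied — in practice that judge would be an LLM call.

```python
# Illustrative sketch (not the InteractEval code): merge human- and
# LLM-generated attributes into per-dimension checklists, then score
# a text by the fraction of checklist items judged satisfied.
from typing import Callable, Dict, List

# Hypothetical attributes elicited via the Think-Aloud method.
HUMAN_ATTRIBUTES: Dict[str, List[str]] = {
    "Coherence": ["Ideas follow a logical order",
                  "Transitions connect paragraphs"],
    "Fluency": ["Sentences are grammatical"],
}
LLM_ATTRIBUTES: Dict[str, List[str]] = {
    "Coherence": ["Ideas follow a logical order"],  # overlaps with human list
    "Consistency": ["Claims match the source text"],
    "Relevance": ["Content addresses the prompt"],
}

def merge_checklists(*attribute_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Union the attribute lists per dimension, keeping order, dropping duplicates."""
    merged: Dict[str, List[str]] = {}
    for attrs in attribute_sets:
        for dim, items in attrs.items():
            bucket = merged.setdefault(dim, [])
            for item in items:
                if item not in bucket:
                    bucket.append(item)
    return merged

def score_text(text: str, checklist: Dict[str, List[str]],
               judge: Callable[[str, str], bool]) -> Dict[str, float]:
    """Score each dimension as the fraction of checklist items satisfied.
    `judge` stands in for a per-item yes/no LLM query."""
    return {dim: sum(judge(text, item) for item in items) / len(items)
            for dim, items in checklist.items()}

checklist = merge_checklists(HUMAN_ATTRIBUTES, LLM_ATTRIBUTES)
# Toy judge: marks an item satisfied if the text is non-trivially long.
scores = score_text("A short essay about evaluation.", checklist,
                    lambda text, item: len(text.split()) > 3)
```

The merge step reflects the paper's central claim: humans and LLMs contribute complementary attributes (internal quality vs. external alignment), so the union checklist covers more dimensions than either source alone.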