LLM-as-a-Judge: automated evaluation of search query parsing using large language models.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers in Big Data. Pub Date: 2025-07-21. eCollection Date: 2025-01-01. DOI: 10.3389/fdata.2025.1611389

Mehmet Selman Baysan, Serkan Uysal, İrem İşlek, Çağla Çığ Karaman, Tunga Güngör


Citations: 0

Abstract

Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches.

Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations.
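The three assessment modes and the idea of routing to a context-specific prompt can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not give the prompts, the scoring scale, or the routing rule, so the 1-to-5 scale, the `RUBRICS` table, the category names, and the `Judge` callable (any function that sends a prompt to an LLM and returns its reply) are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical judge interface: takes a prompt, returns the model's raw reply.
Judge = Callable[[str], str]

# Hypothetical context-specific rubrics; not taken from the paper.
RUBRICS: Dict[str, str] = {
    "vehicles": "Check brand, model, year and price filters.",
    "real_estate": "Check location, room count and price filters.",
}
DEFAULT_RUBRIC = "Check that every filter in the parse is supported by the query."


def route_rubric(category: str) -> str:
    """Pick the rubric for the query's context (illustrative routing rule)."""
    return RUBRICS.get(category, DEFAULT_RUBRIC)


def _header(query: str, category: str) -> str:
    return f"{route_rubric(category)}\nQuery: {query}\n"


def pointwise(judge: Judge, query: str, parse: str, category: str) -> int:
    """Score a single parse on an assumed 1-5 scale."""
    prompt = _header(query, category) + (
        f"Parse: {parse}\n"
        "Rate the parse from 1 (wrong) to 5 (fully correct). "
        "Reply with the number only."
    )
    score = int(judge(prompt).strip())  # raises ValueError on a non-numeric reply
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score


def pairwise(judge: Judge, query: str, parse_a: str, parse_b: str, category: str) -> str:
    """Ask which of two candidate parses is better; returns 'A' or 'B'."""
    prompt = _header(query, category) + (
        f"Parse A: {parse_a}\nParse B: {parse_b}\n"
        "Which parse is better? Reply with A or B only."
    )
    choice = judge(prompt).strip().upper()
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected reply: {choice!r}")
    return choice


def pass_fail(judge: Judge, query: str, parse: str, category: str) -> bool:
    """Binary verdict on a single parse."""
    prompt = _header(query, category) + (
        f"Parse: {parse}\n"
        "Is the parse acceptable? Reply with PASS or FAIL only."
    )
    verdict = judge(prompt).strip().upper()
    if verdict not in ("PASS", "FAIL"):
        raise ValueError(f"unexpected reply: {verdict!r}")
    return verdict == "PASS"


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if "number only" in prompt:
        return "4"
    if "A or B only" in prompt:
        return "A"
    return "PASS"


query = "2018 honda civic under 20000"
parse = '{"brand": "Honda", "model": "Civic", "year": 2018, "max_price": 20000}'
print(pointwise(stub_judge, query, parse, "vehicles"))
print(pairwise(stub_judge, query, parse, "{}", "vehicles"))
print(pass_fail(stub_judge, query, parse, "vehicles"))
```

Validating the reply and raising on anything unexpected is one simple way to keep malformed judge output from being counted as a verdict; a real system would swap `stub_judge` for an actual model call.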

Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments.
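As a rough illustration of what an agreement figure of this kind measures, the sketch below computes raw percent agreement between LLM and human labels. The abstract does not say which agreement statistic the authors used, so raw agreement is an assumption, and Cohen's kappa is added only as a common chance-corrected companion. The labels are made-up toy data, not results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(llm: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Fraction of items on which the LLM judge and the human label match."""
    if len(llm) != len(human) or not llm:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(llm, human)) / len(llm)


def cohens_kappa(llm: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters."""
    n = len(llm)
    p_observed = percent_agreement(llm, human)
    llm_counts, human_counts = Counter(llm), Counter(human)
    # Expected agreement if both raters labeled independently
    # with their own observed label frequencies.
    p_expected = sum(llm_counts[k] * human_counts[k] for k in llm_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels (illustrative only): the two raters disagree on the last item.
llm_labels = ["pass"] * 6 + ["fail"] * 4
human_labels = ["pass"] * 6 + ["fail"] * 3 + ["pass"]

print(percent_agreement(llm_labels, human_labels))  # 0.9
print(cohens_kappa(llm_labels, human_labels))
```

Raw agreement is easy to read but can look high when one label dominates, which is why a chance-corrected statistic is often reported alongside it.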

Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, providing robust query parsing for real-world search systems.

Source journal

Frontiers in Big Data Multiple-

CiteScore: 5.20

Self-citation rate: 3.20%

Articles published: 122

Review time: 13 weeks