LLM-as-a-Judge: automated evaluation of search query parsing using large language models.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers in Big Data Pub Date : 2025-07-21 eCollection Date: 2025-01-01 DOI:10.3389/fdata.2025.1611389

Mehmet Selman Baysan, Serkan Uysal, İrem İşlek, Çağla Çığ Karaman, Tunga Güngör

{"title":"LLM-as-a-Judge: automated evaluation of search query parsing using large language models.","authors":"Mehmet Selman Baysan, Serkan Uysal, İrem İşlek, Çağla Çığ Karaman, Tunga Güngör","doi":"10.3389/fdata.2025.1611389","DOIUrl":null,"url":null,"abstract":"Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches.Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations.Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments.Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, providing robust query parsing for real-world search systems.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"8 ","pages":"1611389"},"PeriodicalIF":2.4000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12319771/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdata.2025.1611389","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches.

Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations.

Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments.

Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, providing robust query parsing for real-world search systems.

Abstract Image

查看原文本刊更多论文

LLM-as-a-Judge：使用大型语言模型自动评估搜索查询解析。

引言：在搜索系统中采用大型语言模型（llm）需要新的评估方法，而不是传统的基于规则或手动方法。方法：我们提出了一个使用llm评估结构化输出的通用框架，重点关注在线分类平台内的搜索查询解析。我们的方法通过三种评估方法利用法学硕士的上下文推理能力：点对评估、两两评估和通过/不通过评估。此外，我们引入了上下文评估提示路由策略，以提高可靠性和减少幻觉。结果：在小型和大型数据集上进行的实验表明，基于llm的评估与人类判断的一致性约为90%。讨论：这些结果验证了llm驱动的评估是传统评估方法的可伸缩、可解释和有效的替代方法，为现实世界的搜索系统提供了健壮的查询解析。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊