APPLS：评估简单语言总结的评估指标。

Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing Pub Date : 2024-11-01 DOI:10.18653/v1/2024.emnlp-main.519

Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang

{"title":"APPLS：评估简单语言总结的评估指标。","authors":"Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang","doi":"10.18653/v1/2024.emnlp-main.519","DOIUrl":null,"url":null,"abstract":"While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work-informativeness, simplification, coherence, and faithfulness-and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"9194-9211"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938995/pdf/","citationCount":"0","resultStr":"{\"title\":\"APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.\",\"authors\":\"Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang\",\"doi\":\"10.18653/v1/2024.emnlp-main.519\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work-informativeness, simplification, coherence, and faithfulness-and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.\",\"PeriodicalId\":74540,\"journal\":{\"name\":\"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing\",\"volume\":\"2024 \",\"pages\":\"9194-9211\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938995/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2024.emnlp-main.519\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2024.emnlp-main.519","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

虽然对于简单语言总结（PLS）的模型已经有了很大的发展，但是评估仍然是一个挑战。PLS缺乏专门的评估指标，并且由于所涉及的独特转换（例如，添加背景解释，删除术语），文本生成评估指标的适用性不明确。为了解决这些问题，我们的研究引入了一个颗粒元评估测试平台，APPLS，旨在评估PLS的指标。我们从以前的工作中确定了四个PLS标准——信息性、简化性、一致性和忠实性——并定义了一组与这些标准相对应的扰动，敏感指标应该能够检测到这些标准。我们将这些扰动应用于两个PLS数据集的文本来创建我们的测试平台。使用APPLS，我们评估了14个指标的性能，包括自动评分、词汇特征和基于LLM提示的评估。我们的分析表明，虽然当前的一些指标显示出对特定标准的敏感性，但没有一种方法可以同时捕获所有四个标准。因此，我们建议使用一套自动化的度量标准来捕获所有相关标准的PLS质量。这项工作为PLS提供了第一个元评估测试平台，并对现有指标进行了全面评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work-informativeness, simplification, coherence, and faithfulness-and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing

自引率

0.00%

发文量