细节决定成败:论事件抽取评估的陷阱

Annual Meeting of the Association for Computational Linguistics Pub Date : 2023-06-12 DOI:10.48550/arXiv.2306.06918

Hao Peng, Xiaozhi Wang, Feng Yao, Kaisheng Zeng, Lei Hou, Juanzi Li, Zhiyuan Liu, Weixing Shen

{"title":"细节决定成败:论事件抽取评估的陷阱","authors":"Hao Peng, Xiaozhi Wang, Feng Yao, Kaisheng Zeng, Lei Hou, Juanzi Li, Zhiyuan Liu, Weixing Shen","doi":"10.48550/arXiv.2306.06918","DOIUrl":null,"url":null,"abstract":"Event extraction (EE) is a crucial task aiming at extracting events from texts, which includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we check the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes the evaluation results on the same dataset not directly comparable, but the data preprocessing details are not widely noted and specified in papers. (2) The output space discrepancy of different model paradigms makes different-paradigm EE models lack grounds for comparison and also leads to unclear mapping issues between predictions and annotations. (3) The absence of pipeline evaluation of many EAE-only works makes them hard to be directly compared with EE works and may not well reflect the model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.","PeriodicalId":352845,"journal":{"name":"Annual Meeting of the Association for Computational Linguistics","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation\",\"authors\":\"Hao Peng, Xiaozhi Wang, Feng Yao, Kaisheng Zeng, Lei Hou, Juanzi Li, Zhiyuan Liu, Weixing Shen\",\"doi\":\"10.48550/arXiv.2306.06918\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Event extraction (EE) is a crucial task aiming at extracting events from texts, which includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we check the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes the evaluation results on the same dataset not directly comparable, but the data preprocessing details are not widely noted and specified in papers. (2) The output space discrepancy of different model paradigms makes different-paradigm EE models lack grounds for comparison and also leads to unclear mapping issues between predictions and annotations. (3) The absence of pipeline evaluation of many EAE-only works makes them hard to be directly compared with EE works and may not well reflect the model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.\",\"PeriodicalId\":352845,\"journal\":{\"name\":\"Annual Meeting of the Association for Computational Linguistics\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annual Meeting of the Association for Computational Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2306.06918\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annual Meeting of the Association for Computational Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2306.06918","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

事件提取(EE)是一项旨在从文本中提取事件的关键任务，它包括两个子任务:事件检测(ED)和事件参数提取(EAE)。本文对评价的可靠性进行了检验，发现了三个主要缺陷:(1)数据预处理差异使得同一数据集上的评价结果不具有直接可比性，但数据预处理细节在论文中没有得到广泛的关注和说明。(2)不同模型范式的输出空间差异使得不同范式的EE模型缺乏比较依据，也导致预测与注释之间的映射问题不明确。(3)许多纯eae作品缺乏管道评价，难以与EE作品进行直接比较，不能很好地反映模型在真实管道场景中的性能。我们通过对近期论文和实证实验的综合元分析，证明了这些陷阱的重大影响。为了避免这些陷阱，我们提出了一系列补救措施，包括指定数据预处理、标准化输出和提供管道评估结果。为了帮助实现这些补救措施，我们开发了一个一致的评估框架OMNIEVENT，可以从https://github.com/THU-KEG/OmniEvent获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation

Event extraction (EE) is a crucial task aiming at extracting events from texts, which includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we check the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes the evaluation results on the same dataset not directly comparable, but the data preprocessing details are not widely noted and specified in papers. (2) The output space discrepancy of different model paradigms makes different-paradigm EE models lack grounds for comparison and also leads to unclear mapping issues between predictions and annotations. (3) The absence of pipeline evaluation of many EAE-only works makes them hard to be directly compared with EE works and may not well reflect the model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Annual Meeting of the Association for Computational Linguistics

自引率

0.00%

发文量