Assessing the effectiveness of ROUGE as unbiased metric in Extractive vs. Abstractive summarization techniques

IF 3.7 · Zone 3, Computer Science · Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science Pub Date : 2025-03-18 DOI:10.1016/j.jocs.2025.102571

Alessia Auriemma Citarella, Marcello Barbella, Madalina G. Ciobanu, Fabiola De Marco, Luigi Di Biasi, Genoveffa Tortora

Citations: 0

Abstract

Approaches to Automatic Text Summarization (TS) try to extract key information from one or more input texts and generate summaries while preserving the meaning of the content. These approaches fall into two groups, Extractive and Abstractive, which differ in how they work. Extractive summarization takes sentences directly from the document text, whereas abstractive summarization creates a summary by interpreting the text and rewriting sentences, often with new words. It is important to assess how similar a summary is to the original text independently of the particular TS algorithm adopted. The literature proposes various metrics and scores for evaluating text summarization results, and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most widely used. In this study, our main objective is to evaluate how the ROUGE metric performs when applied to both Extractive and Abstractive summarization algorithms. We aim to understand its effectiveness and reliability as an independent and unbiased metric for assessing the quality of summaries generated by these different approaches. We conducted a first experiment to compare the metric efficiency (ROUGE-1, ROUGE-2 and ROUGE-L) for evaluating Abstractive (word2vec, doc2vec, and glove) versus Extractive text summarization algorithms (textRank, lsa, luhn, lexRank), and a second one to compare the scores obtained by two different summarization approaches: a single execution of one summarization algorithm versus a multiple execution of different algorithms on the same text. Our evaluation showed that ROUGE reaches similar results for Abstractive and Extractive algorithms. Moreover, our findings indicate that multiple executions, in which two text summarization algorithms are run sequentially on the same text, generally outperform single executions of a single text summarization algorithm.
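To make the three ROUGE variants named in the abstract concrete, the following is a minimal sketch of how ROUGE-1, ROUGE-2 and ROUGE-L can be computed from word overlap. It is not the authors' implementation: the function names (`rouge_n`, `rouge_l`), the lowercase whitespace tokenization, and the example sentences are all assumptions chosen for illustration, and published ROUGE toolkits usually add stemming and other preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Turn an overlap count into (precision, recall, F1)."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    # Assumption: naive lowercase + whitespace tokenization, no stemming.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: precision/recall/F1 based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical example texts, not taken from the study.
reference = "the cat sat on the mat"
near_copy = "the cat lay on the mat"        # reuses most reference words
paraphrase = "a feline rested upon the rug"  # same idea, mostly new words

for name, text in [("near copy", near_copy), ("paraphrase", paraphrase)]:
    r1 = rouge_n(text, reference, 1)
    r2 = rouge_n(text, reference, 2)
    rl = rouge_l(text, reference)
    print(f"{name}: R1 recall={r1[1]:.2f}  R2 recall={r2[1]:.2f}  RL recall={rl[1]:.2f}")
```

Because every score here depends on surface word overlap, a summary that rewrites the source with new words can score lower than one that copies it, even when both convey the same content. That is the kind of potential bias the study examines when it compares ROUGE across Extractive and Abstractive algorithms.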

Source journal

Journal of Computational Science (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS)

CiteScore: 5.50
Self-citation rate: 3.00%
Articles published: 227
Review time: 41 days

About the journal: Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory. Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation. This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods. Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g., numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g., problem solving environments).