Extractive Summarization using Cohesion Network Analysis and Submodular Set Functions

2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). Pub Date: 2020-09-01. DOI: 10.1109/SYNASC51798.2020.00035

Valentin Sergiu Cioaca, M. Dascalu, D. McNamara

Citation count: 3

Abstract

Numerous approaches have been introduced to automate text summarization, but only a few can be easily adapted to multiple languages. This paper introduces a multilingual text processing pipeline integrated in the open-source ReaderBench framework, which can be retrofitted to cover more than 50 languages. Given the need for extensibility and the lack of labeled training data for languages other than English, an unsupervised algorithm was preferred for extractive summarization (i.e., selecting the most representative sentences from the original document). Specifically, two approaches relying on text cohesion were implemented: a) a graph-based text representation derived from Cohesion Network Analysis that extends TextRank, and b) a class of submodular set functions. Evaluations were performed on the DUC dataset, using the Gensim implementation of TextRank as the baseline. Our results using the submodular set functions outperform the baseline. In addition, two use cases in English and Romanian are presented, with corresponding graphical representations for the two methods.
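
The two families of methods named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the ReaderBench or Gensim implementation: the bag-of-words cosine similarity stands in for the cohesion measures of Cohesion Network Analysis, the saturated-coverage objective is one common choice of monotone submodular function rather than the specific class evaluated in the paper, and every function name and parameter (`textrank_scores`, `greedy_summary`, `alpha`, `damping`) is a hypothetical chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercased alphabetic tokens; a real pipeline would lemmatize per language.
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    # Assumption: lexical overlap stands in for a cohesion score.
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    return [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]


def textrank_scores(sim, damping=0.85, iterations=50):
    # Weighted PageRank over the sentence graph, computed by power iteration.
    n = len(sim)
    scores = [1.0 / n] * n
    # Total outgoing edge weight of each sentence, ignoring self-loops.
    out = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(
                sim[j][i] / out[j] * scores[j]
                for j in range(n)
                if j != i and out[j] > 0
            )
            for i in range(n)
        ]
    return scores


def coverage(sim, selected, alpha=0.3):
    # Saturated coverage: each sentence i contributes its similarity to the
    # summary, capped at a fraction alpha of its similarity to the whole text.
    # With non-negative similarities this is monotone and submodular.
    return sum(
        min(sum(row[j] for j in selected), alpha * sum(row)) for row in sim
    )


def greedy_summary(sim, budget, alpha=0.3):
    # Greedy maximization: repeatedly add the sentence with the largest
    # marginal gain until the budget is reached or no sentence adds value.
    selected = []
    while len(selected) < budget:
        current = coverage(sim, selected, alpha)
        best, best_gain = None, 0.0
        for c in range(len(sim)):
            if c in selected:
                continue
            gain = coverage(sim, selected + [c], alpha) - current
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            break
        selected.append(best)
    return sorted(selected)


if __name__ == "__main__":
    doc = [
        "The cat sat on the mat.",
        "A cat was sitting on a mat.",
        "Stock markets fell sharply today.",
        "Markets dropped as stocks fell.",
    ]
    sim = similarity_matrix(doc)
    print("TextRank scores:", textrank_scores(sim))
    print("Submodular picks:", [doc[i] for i in greedy_summary(sim, budget=2)])
```

The cap `alpha` is what produces diminishing returns: once a region of the text is already well represented, adding another near-duplicate sentence yields little gain, which pushes the greedy step toward sentences that cover other parts of the document. A ranking-only method such as TextRank has no such built-in redundancy penalty.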
