Contextual Word Embedding based Clustering for Extractive Summarization

2022 International Conference on Frontiers of Information Technology (FIT). Pub Date: 2022-12-01. DOI: 10.1109/FIT57066.2022.00039

Shah Faisal, Atif Khan, S. Yousaf, Muhammad Umair



Abstract

The volume of content on the internet is growing rapidly. One reason is that numerous online resources cover similar themes, which poses both challenges and opportunities for natural language processing (NLP). Manually summarizing thousands of documents on the same topic is impractical, so automatic multi-document summarization is desirable. This work proposes a contextual word embedding-based clustering technique for extractive summarization. First, documents are split into sentences, and each word in every sentence is assigned an embedding based on its context using the FastText embedding method. The word embeddings in each sentence are then averaged to produce a sentence embedding (vector). The Fuzzy C-Means clustering algorithm is applied to the set of sentence embeddings to form clusters of semantically similar sentences. The sentences within each cluster are ranked according to text features, and the final extractive summary is composed of representative sentences drawn from the highest-ranked sentences in each cluster. The method is evaluated on the Document Understanding Conference (DUC) 2002 data set using the ROUGE evaluation metric. Experimental results show that the presented technique outperforms the benchmark summarization techniques on ROUGE measures.
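
The pipeline in the abstract (average word vectors into sentence vectors, cluster them with Fuzzy C-Means, pick one representative per cluster) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, and several parts are assumptions: the tiny hand-made `WORD_VECTORS` table stands in for real FastText embeddings, the fuzzifier `m = 2.0` and the stopping rule are conventional defaults rather than values from the paper, and all function names are hypothetical. The abstract does not say which text features are used for ranking, so the sketch picks the sentence with the highest membership degree in each cluster as a stand-in.

```python
import math
import random

# Toy 2-D word vectors; an ASSUMPTION standing in for real FastText embeddings.
WORD_VECTORS = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "stock": [0.0, 1.0], "market": [0.1, 0.9],
}
SENTENCES = ["cat dog", "dog cat dog", "stock market", "market stock market"]

def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means; returns (centers, memberships[n][c])."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([x / total for x in row])
    centers = []
    for _ in range(iters):
        # Centers are means of the points weighted by membership**m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )
        # Memberships follow from relative distances to the centers.
        new_u = []
        for p in points:
            dist = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            new_u.append(
                [1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                           for k in range(c))
                 for j in range(c)]
            )
        shift = max(abs(new_u[i][j] - u[i][j])
                    for i in range(n) for j in range(c))
        u = new_u
        if shift < tol:
            break
    return centers, u

def summarize(sentences, word_vectors, dim, c):
    """Pick one sentence per cluster, returned in document order.

    ASSUMPTION: highest membership replaces the paper's feature-based ranking.
    """
    vectors = [sentence_vector(s.lower().split(), word_vectors, dim)
               for s in sentences]
    _, u = fuzzy_c_means(vectors, c)
    picked = {max(range(len(sentences)), key=lambda i: u[i][j])
              for j in range(c)}
    return [sentences[i] for i in sorted(picked)]

if __name__ == "__main__":
    print(summarize(SENTENCES, WORD_VECTORS, dim=2, c=2))
```

On this toy input the two animal sentences and the two finance sentences fall into separate clusters, so the summary contains one sentence from each group. Because memberships are soft, a sentence that sits between topics contributes to several clusters, which is the main practical difference from hard k-means.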

