Contextual Word Embedding based Clustering for Extractive Summarization
Shah Faisal, Atif Khan, S. Yousaf, Muhammad Umair
2022 International Conference on Frontiers of Information Technology (FIT), published 2022-12-01
DOI: 10.1109/FIT57066.2022.00039 (https://doi.org/10.1109/FIT57066.2022.00039)
Citation count: 0
Abstract
The amount of content on the internet is growing rapidly. One reason is that numerous online resources cover similar themes, which poses both challenges and opportunities for natural language processing (NLP). Manually summarizing thousands of documents on the same topic is impractical, so automatic multi-document summarization is desirable. This work proposes a contextual word embedding-based clustering technique for extractive summarization. First, documents are split into sentences, and each word in every sentence is assigned an embedding based on its context using the FastText embedding method. The word embeddings of each sentence are then averaged to form a sentence embedding (vector). The Fuzzy C-Means clustering algorithm is applied to the collection of sentence embeddings to form clusters of semantically similar sentences. The sentences inside each cluster are ranked based on text features, and the final extractive summary comprises representative sentences taken from the highest-ranked sentences within each cluster. The methodology is evaluated on the Document Understanding Conference (DUC) 2002 data set using the ROUGE evaluation metric. Experimental results show that the presented technique outperforms the benchmark summarization techniques in terms of ROUGE measures.
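The pipeline described in the abstract (sentence splitting, averaged word embeddings, Fuzzy C-Means clustering, per-cluster selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: `toy_word_vector` is a hypothetical stand-in for a FastText lookup, the regex sentence splitter is naive, the fuzzifier `m=2.0` and the cluster count are arbitrary, and sentences are ranked by cluster membership degree because the abstract does not specify which text features are used.

```python
import re
import numpy as np

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def toy_word_vector(word, dim=16):
    # ASSUMPTION: hypothetical stand-in for a FastText embedding lookup.
    # Produces a deterministic pseudo-random vector per word.
    seed = sum(ord(c) * (i + 1) for i, c in enumerate(word.lower()))
    return np.random.default_rng(seed).standard_normal(dim)

def sentence_embedding(sentence, dim=16):
    # Sentence vector = mean of its word vectors, as in the abstract.
    words = re.findall(r"[A-Za-z0-9']+", sentence)
    if not words:
        return np.zeros(dim)
    return np.mean([toy_word_vector(w, dim) for w in words], axis=0)

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    # Standard Fuzzy C-Means: alternate center and membership updates.
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)            # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def summarize(text, n_clusters=2):
    sentences = split_sentences(text)
    X = np.vstack([sentence_embedding(s) for s in sentences])
    _, U = fuzzy_c_means(X, n_clusters)
    labels = U.argmax(axis=1)                  # hard assignment per sentence
    picked = []
    for j in range(n_clusters):
        members = np.flatnonzero(labels == j)
        if members.size:
            # ASSUMPTION: rank by membership degree, not the paper's features.
            picked.append(int(members[U[members, j].argmax()]))
    return [sentences[i] for i in sorted(picked)]  # keep document order

if __name__ == "__main__":
    doc = ("Cats chase mice. Cats hunt mice at night. "
           "Stocks fell sharply today. Markets dropped as stocks fell.")
    for line in summarize(doc, n_clusters=2):
        print(line)
```

With real FastText vectors in place of `toy_word_vector`, the clusters would reflect semantic similarity. The toy vectors only exercise the mechanics of the averaging, clustering, and selection steps.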