Automatic Extractive Text Summarization using Multiple Linguistic Features

IF 1.8 4区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Pooja Gupta, Swati Nigam, Rajiv Singh
{"title":"Automatic Extractive Text Summarization using Multiple Linguistic Features","authors":"Pooja Gupta, Swati Nigam, Rajiv Singh","doi":"10.1145/3656471","DOIUrl":null,"url":null,"abstract":"<p>Automatic text summarization (ATS) provides a summary of distinct categories of information using natural language processing (NLP). Low-resource languages like Hindi have restricted applications of these techniques. This study proposes a method for automatically generating summaries of Hindi documents using extractive technique. The approach retrieves pertinent sentences from the source documents by employing multiple linguistic features and machine learning (ML) using maximum likelihood estimation (MLE) and maximum entropy (ME). We conducted pre-processing on the input documents, such as eliminating Hindi stop words and stemming. We have obtained 15 linguistic feature scores from each document to identify the phrases with high scores for summary generation. We have performed experiments over BBC News articles, CNN News, DUC 2004, Hindi Text Short Summarization Corpus, Indian Language News Text Summarization Corpus, and Wikipedia Articles for the proposed text summarizer. The Hindi Text Short Summarization Corpus and Indian Language News Text Summarization Corpus datasets are in Hindi, whereas BBC News articles, CNN News, and the DUC 2004 datasets have been translated into Hindi using Google, Microsoft Bing, and Systran translators for experiments. The summarization results have been calculated and shown for Hindi as well as for English to compare the performance of a low and rich-resource language. Multiple ROUGE metrics, along with precision, recall, and F-measure, have been used for the evaluation, which shows the better performance of the proposed method with multiple ROUGE scores. We compare the proposed method with the supervised and unsupervised machine learning methodologies, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and it was found that the proposed method outperforms these methods.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"32 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3656471","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Automatic text summarization (ATS) provides a summary of distinct categories of information using natural language processing (NLP). Low-resource languages like Hindi have restricted applications of these techniques. This study proposes a method for automatically generating summaries of Hindi documents using extractive technique. The approach retrieves pertinent sentences from the source documents by employing multiple linguistic features and machine learning (ML) using maximum likelihood estimation (MLE) and maximum entropy (ME). We conducted pre-processing on the input documents, such as eliminating Hindi stop words and stemming. We have obtained 15 linguistic feature scores from each document to identify the phrases with high scores for summary generation. We have performed experiments over BBC News articles, CNN News, DUC 2004, Hindi Text Short Summarization Corpus, Indian Language News Text Summarization Corpus, and Wikipedia Articles for the proposed text summarizer. The Hindi Text Short Summarization Corpus and Indian Language News Text Summarization Corpus datasets are in Hindi, whereas BBC News articles, CNN News, and the DUC 2004 datasets have been translated into Hindi using Google, Microsoft Bing, and Systran translators for experiments. The summarization results have been calculated and shown for Hindi as well as for English to compare the performance of a low and rich-resource language. Multiple ROUGE metrics, along with precision, recall, and F-measure, have been used for the evaluation, which shows the better performance of the proposed method with multiple ROUGE scores. We compare the proposed method with the supervised and unsupervised machine learning methodologies, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and it was found that the proposed method outperforms these methods.

利用多种语言特征自动提取文本摘要
自动文本摘要(ATS)利用自然语言处理(NLP)对不同类别的信息进行摘要。印地语等低资源语言限制了这些技术的应用。本研究提出了一种使用提取技术自动生成印地语文档摘要的方法。该方法通过使用多种语言特征以及最大似然估计(MLE)和最大熵(ME)的机器学习(ML),从源文档中检索相关句子。我们对输入文档进行了预处理,如消除印地语停滞词和词干。我们从每篇文档中获取了 15 个语言特征分数,以识别出分数较高的短语,从而生成摘要。我们对 BBC 新闻报道、CNN 新闻、DUC 2004、印地语文本简短摘要语料库、印度语新闻文本摘要语料库和维基百科文章进行了实验,以验证所提议的文本摘要器。印地语文本简短摘要语料库和印度语新闻文本摘要语料库的数据集是印地语的,而 BBC News 文章、CNN News 和 DUC 2004 数据集已使用 Google、Microsoft Bing 和 Systran 翻译器翻译成印地语进行实验。计算并显示了印地语和英语的摘要结果,以比较低资源语言和丰富资源语言的性能。在评估中使用了多个 ROUGE 指标以及精确度、召回率和 F-measure,结果表明使用多个 ROUGE 分数的拟议方法性能更佳。我们将提出的方法与支持向量机 (SVM)、奈夫贝叶斯 (NB)、决策树 (DT)、潜在语义分析 (LSA)、潜在 Dirichlet 分配 (LDA) 和 K-means 聚类等有监督和无监督机器学习方法进行了比较,发现提出的方法优于这些方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
3.60
自引率
15.00%
发文量
241
期刊介绍: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to: -Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc. -Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc. -Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition. -Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc. -Machine Translation involving Asian or low-resource languages. -Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc. -Information Extraction and Filtering: including automatic abstraction, user profiling, etc. -Speech processing: including text-to-speech synthesis and automatic speech recognition. -Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc. -Cross-lingual information processing involving Asian or low-resource languages. -Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信