Extractive Text Summarization Using Formality of Language

IEEE Open Journal of the Computer Society. Pub Date: 2025-08-20. DOI: 10.1109/OJCS.2025.3600632

Harsh Mehta;Santosh Kumar Bharti;Nishant Doshi

Volume 6, pages 1414-1425. Available at https://ieeexplore.ieee.org/document/11130639/

Citations: 0

Abstract



Automatic text summarization, which aims to distill concise summaries from long documents, has been a prominent research topic for over a decade. This study introduces an approach that addresses the intricacies of morphologically rich Indo-Iranian languages by using linguistic formality to guide summary generation. Building on an existing formality formula designed for English, we adapt and extend it to the structural characteristics of Indo-Iranian languages, which follow Subject-Object-Verb (SOV) order. Our refined formula demonstrates a 7.28% improvement in formality scores compared to informal texts, validated through statistical significance testing. To assess sentences, we combine our custom formality formula with additional features such as Shannon entropy scores and numeric token presence into a single sentence evaluation metric. Using this framework, we generate extractive summaries of Gujarati texts. Comparative evaluations at 20% and 30% compression ratios show that our method outperforms existing baselines, with ROUGE-1 improvements of 14.63% at 30% compression and 28.60% at 20% compression. For reproducibility and further exploration, all experimental data and source code are publicly available.
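The sentence-scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the adapted formality formula, the feature weights, or the tokenizer, so the `formality` callable, the equal `weights`, whitespace tokenization, and reading "compression ratio" as the fraction of sentences kept are all assumptions made here for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def shannon_entropy(tokens: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the token frequency distribution."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_numeric_token(tokens: Sequence[str]) -> bool:
    """True if any token contains a digit (str.isdigit also covers Gujarati digits)."""
    return any(any(ch.isdigit() for ch in tok) for tok in tokens)


def summarize(
    sentences: List[str],
    formality: Callable[[List[str]], float],  # hypothetical stand-in for the paper's formula
    ratio: float = 0.3,                       # assumed: fraction of sentences to keep
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # assumed equal weights, not from the paper
) -> List[str]:
    """Score each sentence and keep the top-scoring ones in document order."""
    scored = []
    for idx, sentence in enumerate(sentences):
        tokens = sentence.split()  # assumed whitespace tokenization
        score = (
            weights[0] * formality(tokens)
            + weights[1] * shannon_entropy(tokens)
            + weights[2] * float(has_numeric_token(tokens))
        )
        scored.append((score, idx))
    keep = max(1, math.ceil(ratio * len(sentences)))
    # Highest score first; ties broken by earlier position in the document.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:keep]
    return [sentences[idx] for _, idx in sorted(top, key=lambda pair: pair[1])]
```

A caller would pass its own formality scorer, for example `summarize(sentences, formality=my_formality_score, ratio=0.2)`, where `my_formality_score` is a hypothetical function implementing whichever formality measure is being studied. Selected sentences are returned in their original order so that the extract reads as a coherent excerpt.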


Source journal

IEEE Open Journal of the Computer Society

CiteScore

12.60

Self-citation rate

0.00%

Articles published