Extractive Text Summarization Using Formality of Language

Harsh Mehta; Santosh Kumar Bharti; Nishant Doshi
IEEE Open Journal of the Computer Society, vol. 6, pp. 1414-1425, published 2025-08-20. DOI: 10.1109/OJCS.2025.3600632
Full text: https://ieeexplore.ieee.org/document/11130639/
Citations: 0

Abstract

Automatic text summarization has been a prominent research topic for over a decade, aiming to distill concise summaries from extensive textual documents. This study introduces a novel approach addressing the intricacies of morphologically rich Indo-Iranian languages. We propose a unique method that leverages linguistic formality to guide summary generation. Building on an existing formality formula designed for English, we adapt and extend it for the structural characteristics of Indo-Iranian languages, which follow the Subject-Object-Verb (SOV) order. Our refined formula demonstrates a 7.28% improvement in formality scores compared to informal texts, validated through statistical significance testing. To assess sentence formality, we use our custom formula alongside additional features such as Shannon entropy scores and numeric token presence, combining these into a comprehensive sentence evaluation metric. Using this framework, we generate extractive summaries of Gujarati texts. Comparative evaluations at 20% and 30% compression ratios reveal that our method outperforms existing baselines, with ROUGE-1 score improvements of 14.63% at 30% and 28.60% at 20% compression. For reproducibility and further exploration, all experimental data and source code are made publicly available.
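The scoring and selection pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the adapted formality formula, the feature weights, or the tokenizer, so the `formality` callable, the weights `w_formality`, `w_entropy`, and `w_numeric`, the whitespace tokenization, and the reading of "compression ratio" as the fraction of sentences kept are all assumptions made here for illustration.

```python
import math
from collections import Counter
from typing import Callable, List


def shannon_entropy(tokens: List[str]) -> float:
    """Shannon entropy (in bits) of the token distribution within one sentence."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_numeric_token(tokens: List[str]) -> bool:
    """True if any token contains a digit; str.isdigit also accepts Gujarati digits."""
    return any(ch.isdigit() for tok in tokens for ch in tok)


def score_sentence(
    tokens: List[str],
    formality: Callable[[List[str]], float],  # hypothetical stand-in for the paper's formula
    w_formality: float = 1.0,  # assumed weight, not taken from the paper
    w_entropy: float = 1.0,    # assumed weight, not taken from the paper
    w_numeric: float = 1.0,    # assumed weight, not taken from the paper
) -> float:
    """Combine the three features named in the abstract into a single score."""
    return (
        w_formality * formality(tokens)
        + w_entropy * shannon_entropy(tokens)
        + w_numeric * float(has_numeric_token(tokens))
    )


def summarize(
    sentences: List[str],
    formality: Callable[[List[str]], float],
    ratio: float = 0.3,
) -> List[str]:
    """Keep the top-scoring fraction of sentences, returned in document order."""
    if not sentences:
        return []
    k = max(1, round(len(sentences) * ratio))
    # Whitespace tokenization is a simplification; a real system would use
    # a language-aware tokenizer for Gujarati.
    scores = [score_sentence(s.split(), formality) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Passing the formality score in as a callable keeps the sketch independent of any particular formula: a part-of-speech-based scorer for an SOV language could be plugged in without changing the selection logic.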