Extractive Text Summarization Using Formality of Language
Harsh Mehta; Santosh Kumar Bharti; Nishant Doshi
IEEE Open Journal of the Computer Society, vol. 6, pp. 1414-1425. Published 2025-08-20.
DOI: 10.1109/OJCS.2025.3600632
https://ieeexplore.ieee.org/document/11130639/
Citations: 0
Abstract
Automatic text summarization has been a prominent research topic for over a decade, aiming to distill concise summaries from extensive textual documents. This study introduces a novel approach addressing the intricacies of morphologically rich Indo-Iranian languages. We propose a unique method that leverages linguistic formality to guide summary generation. Building on an existing formality formula designed for English, we adapt and extend it for the structural characteristics of Indo-Iranian languages, which follow the Subject-Object-Verb (SOV) order. Our refined formula demonstrates a 7.28% improvement in formality scores compared to informal texts, validated through statistical significance testing. To assess sentence formality, we use our custom formula alongside additional features such as Shannon entropy scores and numeric token presence, combining these into a comprehensive sentence evaluation metric. Using this framework, we generate extractive summaries of Gujarati texts. Comparative evaluations at 20% and 30% compression ratios reveal that our method outperforms existing baselines, with ROUGE-1 score improvements of 14.63% at 30% and 28.60% at 20% compression. For reproducibility and further exploration, all experimental data and source code are made publicly available.
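The sentence-scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the formality term is omitted because the abstract does not give the formula, the weights `w_entropy` and `w_numeric` are hypothetical, whitespace tokenization stands in for a proper Gujarati tokenizer, and the compression ratio is assumed to mean the fraction of sentences kept. The `rouge1_recall` helper shows only the recall variant of ROUGE-1; the abstract does not say which variant the paper reports.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution within one sentence."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_numeric_token(tokens):
    """True if any token contains a digit; str.isdigit also accepts Gujarati digits."""
    return any(ch.isdigit() for tok in tokens for ch in tok)


def score_sentence(tokens, w_entropy=1.0, w_numeric=0.5):
    """Weighted sum of entropy and numeric-token presence.

    The weights are illustrative assumptions, and the paper's formality
    score (not specified in the abstract) is deliberately left out.
    """
    return w_entropy * shannon_entropy(tokens) + w_numeric * float(has_numeric_token(tokens))


def summarize(sentences, ratio=0.3):
    """Keep the top-scoring sentences (about ratio * n of them) in original order."""
    tokenized = [s.split() for s in sentences]
    k = max(1, round(ratio * len(sentences)))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(tokenized[i]),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


def rouge1_recall(candidate_tokens, reference_tokens):
    """Clipped unigram overlap divided by the number of reference unigrams."""
    ref = Counter(reference_tokens)
    if not ref:
        return 0.0
    cand = Counter(candidate_tokens)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())
```

Calling `summarize(sentences, ratio=0.2)` or `summarize(sentences, ratio=0.3)` mirrors the two compression settings mentioned above. Selected sentences are returned in document order so the extract stays readable.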