An innovative approach to advanced voice classification of sacred Quranic recitations through multimodal fusion

IF 4.3 · CAS Zone 3 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Egyptian Informatics Journal · Pub Date: 2025-03-18 · DOI: 10.1016/j.eij.2025.100640

Esraa Hassan, Abeer Saber, Omar Alqahtani, Nora El-Rashidy, Samar Elbedwehy

Volume 30, Article 100640. Full text: https://www.sciencedirect.com/science/article/pii/S1110866525000337

Citations: 0

Abstract

The Quran is the most important book we have ever read or recited, and perfecting its recitation is challenging. In this paper, we use multimodal fusion for advanced voice classification of sacred Quranic recitations. The proposed architecture, Voice Shortcut Connection Fusion (VSCF), integrates a Residual Neural Network (ResNet50) with a Fusion Layer for voice classification. Unlike traditional voice classification strategies, VSCF targets the limitations imposed by dataset size and by variation among reciters. The architecture is designed to approximate high-level features across a wide range of acoustic signals. The Fusion Layer combines the ResNet50 model’s final layer with the Global Average Pooling of the raw MFCC features of the audio. This fusion greatly enhances the model’s ability to identify the stylistic features inherent in each reciter’s performance. The dataset is a Quranic Recitation Dataset of 7144 WAV audio files from 12 Quran reciters. The experimental results show the effectiveness of the VSCF architecture, which achieves an accuracy of 0.97683 (97.683%), with sensitivity of 0.9752, specificity of 0.9785, precision of 0.9875, and an F1 score of 0.9813.
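The Fusion Layer described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the two branches are joined by simple concatenation, which the abstract does not state, and it uses toy numbers in place of real ResNet50 outputs and MFCC frames. The function names `global_average_pool` and `fuse` are hypothetical.

```python
from statistics import fmean


def global_average_pool(mfcc_frames):
    """Average each MFCC coefficient over time.

    mfcc_frames: list of frames, each a list of n_mfcc floats.
    Returns one pooled value per coefficient.
    """
    if not mfcc_frames:
        raise ValueError("need at least one MFCC frame")
    # zip(*frames) groups the values of each coefficient across all frames.
    return [fmean(coeff_over_time) for coeff_over_time in zip(*mfcc_frames)]


def fuse(cnn_embedding, mfcc_frames):
    """Join a CNN embedding with pooled MFCC features.

    Concatenation is an assumption made for this sketch; the abstract
    only says the two representations are combined.
    """
    return list(cnn_embedding) + global_average_pool(mfcc_frames)


# Toy inputs (hypothetical values): 3 frames x 2 MFCC coefficients,
# and a 4-dimensional stand-in for the ResNet50 final-layer output.
mfcc = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
embedding = [0.1, 0.2, 0.3, 0.4]

fused = fuse(embedding, mfcc)
print(fused)  # [0.1, 0.2, 0.3, 0.4, 2.0, 5.0]
```

In a real pipeline the fused vector would feed a classification head with one output per reciter (12 in the dataset described above). Pooling over time gives a fixed-length MFCC summary regardless of clip duration, which is what makes it possible to join it with a fixed-size network output.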




Source journal

Egyptian Informatics Journal · Decision Sciences: Management Science and Operations Research

CiteScore: 11.10

Self-citation rate: 1.90%

Articles published:

Review time: 110 days

About the journal: The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The journal provides a forum for state-of-the-art research and development in computing, including computer science, information technology, information systems, operations research, and decision support. Innovative, previously unpublished work on subjects covered by the journal is encouraged, whether from academic, research, or commercial sources.