Enhancing spoken dialect identification with stacked generalization of deep learning models

IF 3; Zone 4 (Computer Science); Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Multimedia Tools and Applications Pub Date : 2024-09-04 DOI:10.1007/s11042-024-20143-9

Khaled Lounnas, Mohamed Lichouri, Mourad Abbas

Citations: 0

Abstract

As dialects are widely used in many countries, there is growing interest in incorporating them into various applications, including conversational systems. Processing spoken dialects is an important module in such systems, yet it remains a challenging task due to the lack of resources and the inherent ambiguity and complexity of dialects. This paper presents a comparison of two approaches for identifying spoken Maghrebi dialects, tested on an in-house corpus composed of four dialects: Algerian Arabic Dialect (AAD), Algerian Berber Dialect (ABD), Moroccan Arabic Dialect (MAD), and Moroccan Berber Dialect (MBD), as well as two variants of Modern Standard Arabic (MSA): MSA_ALG and MSA_MAR. The first approach uses a fully connected neural network (NN2) to retrain several Transfer Learning (TL) models with varying numbers of layers, including Residual Networks (ResNet50, ResNet101), Visual Geometry Group networks (VGG16, VGG19), Dense Convolutional Networks (DenseNet121, DenseNet169), and Efficient Convolutional Neural Networks for Mobile Vision Applications (MobileNet, MobileNetV2). These models were chosen based on their proven ability to capture different levels of feature abstraction: deeper models like ResNet and DenseNet are capable of capturing more complex and nuanced patterns, which is critical for distinguishing subtle differences in dialects, while VGG and MobileNet models offer computational efficiency, making them suitable for applications with limited resources. The second approach employs a “stacked generalization” strategy, which merges predictions from the previously trained models to enhance the final classification performance. Our results show that this cascade strategy improves the overall performance of the Language/Dialect Identification system, with an accuracy increase of up to 5% for specific dialect pairs. Notably, the best performance was achieved with DenseNet and ResNet models, reaching an accuracy of 99.11% for distinguishing between Algerian Berber Dialect and Moroccan Berber Dialect. These findings indicate that despite the limited size of the employed dataset, the cascade strategy and the selection of robust TL models significantly enhance the system’s performance in dialect identification. By leveraging the unique strengths of each model, our approach demonstrates a robust and efficient solution to the challenge of spoken dialect processing.
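
The stacked generalization idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say which meta-learner combines the base predictions, so a plain logistic regression is used; the base models are not real CNNs but simulated score generators with invented noise levels; and the task is reduced to one dialect pair (ABD = 0, MBD = 1). The numbers it produces say nothing about the paper's reported results.

```python
import math
import random

# Hypothetical stand-ins for fine-tuned backbones; the noise levels are invented.
BASE_MODELS = {"resnet": 0.30, "vgg": 0.45, "densenet": 0.35, "mobilenet": 0.40}


def simulate_base_model(labels, noise, rng):
    """Fake one base model: return a noisy P(MBD) score for each utterance."""
    scores = []
    for y in labels:
        p = (0.75 if y == 1 else 0.25) + rng.uniform(-noise, noise)
        scores.append(min(max(p, 0.0), 1.0))  # clip to a valid probability
    return scores


def stack(per_model_scores):
    """Build one meta-feature row per utterance from the base-model scores."""
    return [list(row) for row in zip(*per_model_scores)]


def logit(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(X, y, epochs=600, lr=0.5):
    """Assumed meta-learner: logistic regression, batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(logit(w, b, xi)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def run_demo(seed=0, n_train=200, n_test=200):
    rng = random.Random(seed)
    y_train = [i % 2 for i in range(n_train)]  # 0 = ABD, 1 = MBD
    y_test = [i % 2 for i in range(n_test)]
    # In a real stack, the meta-learner is fitted on held-out (out-of-fold)
    # base predictions, not on data the base models were trained on.
    train_scores = [simulate_base_model(y_train, nz, rng) for nz in BASE_MODELS.values()]
    test_scores = [simulate_base_model(y_test, nz, rng) for nz in BASE_MODELS.values()]
    w, b = fit_meta_learner(stack(train_scores), y_train)
    stacked_pred = [1 if logit(w, b, x) >= 0 else 0 for x in stack(test_scores)]
    base_acc = {
        name: accuracy([1 if s >= 0.5 else 0 for s in scores], y_test)
        for name, scores in zip(BASE_MODELS, test_scores)
    }
    return {"base_accuracy": base_acc, "stacked_accuracy": accuracy(stacked_pred, y_test)}


if __name__ == "__main__":
    print(run_demo())
```

Calling `run_demo()` returns the per-model accuracies next to the stacked accuracy, which makes it easy to see how a learned combination of several imperfect scorers compares with each scorer alone. With real models, the simulated scores would be replaced by class probabilities from each fine-tuned network.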

Source journal

Multimedia Tools and Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.20
Self-citation rate: 16.70%
Articles published: 2439
Review time: 9.2 months

About the journal: Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists and engineers who are involved in multimedia system research, design and applications. All papers are peer reviewed. Specific areas of interest include:
- Multimedia Tools:
- Multimedia Applications:
- Prototype multimedia systems and platforms