Multimodal Integration of Mel Spectrograms and Text Transcripts for Enhanced Automatic Speech Recognition: Leveraging Extractive Transformer-Based Approaches and Late Fusion Strategies
Impact Factor 1.8 · CAS Zone 4 (Computer Science) · JCR Q3, Computer Science, Artificial Intelligence
{"title":"Multimodal Integration of Mel Spectrograms and Text Transcripts for Enhanced Automatic Speech Recognition: Leveraging Extractive Transformer-Based Approaches and Late Fusion Strategies","authors":"Sunakshi Mehra, Virender Ranga, Ritu Agarwal","doi":"10.1111/coin.70012","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>This research endeavor aims to advance the field of Automatic Speech Recognition (ASR) by innovatively integrating multimodal data, specifically textual transcripts and Mel Spectrograms (2D images) obtained from raw audio. This study explores the less-explored potential of spectrograms and linguistic information in enhancing spoken word recognition accuracy. To elevate ASR performance, we propose two distinct transformer-based approaches: First, for the audio-centric approach, we leverage RegNet and ConvNeXt architectures, initially trained on a massive dataset of 14 million annotated images from ImageNet, to process Mel Spectrograms as image inputs. Second, we harness the Speech2Text transformer to decouple text transcript acquisition from raw audio. We pre-process Mel Spectrogram images, resizing them to 224 × 224 pixels to create two-dimensional audio representations. ImageNet, RegNet, and ConvNeXt individually categorize these images. The first channel generates the embeddings for visual modalities (RegNet and ConvNeXt) on 2D Mel Spectrograms. Additionally, we employ Sentence-BERT embeddings via Siamese BERT networks to transform Speech2Text transcripts into vectors. These image embeddings, along with Sentence-BERT embeddings from speech transcription, are subsequently fine-tuned within a deep dense model with five layers and batch normalization for spoken word classification. Our experiments focus on the Google Speech Command Dataset (GSCD) version 2, encompassing 35-word categories. To gauge the impact of spectrograms and linguistic features, we conducted an ablation analysis. Our novel late fusion strategy unites word embeddings and image embeddings, resulting in remarkable test accuracy rates of 95.87% for ConvNeXt, 99.95% for RegNet, and 85.93% for text transcripts across the 35-word categories, as processed by the deep dense layered model with Batch Normalization. We obtained a test accuracy of 99.96% for 35-word categories after using the late fusion of ConvNeXt + RegNet + SBERT, demonstrating superior results compared to other state-of-the-art methods.</p>\n </div>","PeriodicalId":55228,"journal":{"name":"Computational Intelligence","volume":"40 6","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/coin.70012","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
This research aims to advance Automatic Speech Recognition (ASR) by integrating multimodal data, specifically textual transcripts and Mel Spectrograms (2D images) derived from raw audio. The study explores the comparatively under-explored potential of spectrograms and linguistic information for improving spoken word recognition accuracy. To improve ASR performance, we propose two distinct transformer-based approaches. First, for the audio-centric approach, we leverage RegNet and ConvNeXt architectures, pre-trained on ImageNet's roughly 14 million annotated images, to process Mel Spectrograms as image inputs. Second, we harness the Speech2Text transformer to decouple text transcript acquisition from raw audio. We pre-process the Mel Spectrogram images, resizing them to 224 × 224 pixels, to create two-dimensional audio representations, and the ImageNet-pretrained RegNet and ConvNeXt models individually categorize these images. The first channel generates embeddings for the visual modality by applying RegNet and ConvNeXt to the 2D Mel Spectrograms. Additionally, we employ Sentence-BERT embeddings, obtained via Siamese BERT networks, to transform the Speech2Text transcripts into vectors. These image embeddings, along with the Sentence-BERT embeddings of the speech transcripts, are subsequently fine-tuned within a five-layer deep dense model with batch normalization for spoken word classification. Our experiments focus on the Google Speech Command Dataset (GSCD) version 2, which encompasses 35 word categories. To gauge the impact of spectrograms and linguistic features, we conducted an ablation analysis. Our late fusion strategy unites word embeddings and image embeddings, yielding test accuracies of 95.87% for ConvNeXt, 99.95% for RegNet, and 85.93% for text transcripts across the 35 word categories when processed by the deep dense layered model with batch normalization. Using the late fusion of ConvNeXt + RegNet + SBERT, we obtain a test accuracy of 99.96% on the 35 word categories, surpassing other state-of-the-art methods.
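The abstract describes a concrete pipeline: raw audio is turned into a 224 × 224 Mel Spectrogram "image" fed to ImageNet-pretrained RegNet and ConvNeXt backbones, a Speech2Text transformer produces a transcript that Sentence-BERT turns into a vector, and the embeddings are late-fused in a five-layer dense classifier with batch normalization over the 35 GSCD classes. The sketch below illustrates one way to wire this up in Python with PyTorch, torchvision, torchaudio, transformers, and sentence-transformers; the specific checkpoints (regnet_y_16gf, convnext_base, facebook/s2t-small-librispeech-asr, all-MiniLM-L6-v2), feature dimensions, and classifier widths are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the multimodal late-fusion pipeline described in the abstract.
# Backbone choices and layer widths are assumptions for illustration only.
import torch
import torch.nn as nn
import torchaudio
from torchvision import models, transforms
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from sentence_transformers import SentenceTransformer

# --- 1) Raw audio -> 2D Mel Spectrogram resized to 224 x 224 (image modality) ---
def mel_spectrogram_image(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=128)(waveform)
    mel_db = torchaudio.transforms.AmplitudeToDB()(mel)          # (1, n_mels, time)
    img = mel_db.repeat(3, 1, 1)                                 # 3 channels for ImageNet backbones
    img = transforms.Resize((224, 224), antialias=True)(img)     # 224 x 224 "image"
    return img.unsqueeze(0)                                      # (1, 3, 224, 224)

# --- 2) ImageNet-pretrained backbones used as feature extractors ---
regnet = models.regnet_y_16gf(weights="IMAGENET1K_V1")
convnext = models.convnext_base(weights="IMAGENET1K_V1")
regnet.fc = nn.Identity()                 # drop the 1000-way head -> 3024-dim features
convnext.classifier[-1] = nn.Identity()   # -> 1024-dim features
regnet.eval(); convnext.eval()

# --- 3) Speech2Text transcript -> Sentence-BERT embedding (text modality) ---
s2t_proc = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
s2t = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
sbert = SentenceTransformer("all-MiniLM-L6-v2")                  # 384-dim sentence embeddings

def transcript_embedding(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    inputs = s2t_proc(waveform.squeeze().numpy(), sampling_rate=sample_rate, return_tensors="pt")
    ids = s2t.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
    text = s2t_proc.batch_decode(ids, skip_special_tokens=True)[0]
    return torch.tensor(sbert.encode(text)).unsqueeze(0)         # (1, 384)

# --- 4) Late fusion: concatenated embeddings -> 5-layer dense classifier with BatchNorm ---
class LateFusionClassifier(nn.Module):
    def __init__(self, in_dim: int, num_classes: int = 35):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 512),    nn.BatchNorm1d(512),  nn.ReLU(),
            nn.Linear(512, 256),     nn.BatchNorm1d(256),  nn.ReLU(),
            nn.Linear(256, 128),     nn.BatchNorm1d(128),  nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

@torch.no_grad()
def fused_features(waveform: torch.Tensor) -> torch.Tensor:
    img = mel_spectrogram_image(waveform)
    feats = [regnet(img), convnext(img), transcript_embedding(waveform)]
    return torch.cat(feats, dim=1)       # ConvNeXt + RegNet + SBERT late fusion
```

With these assumed backbones the fused vector has 3024 + 1024 + 384 = 4432 dimensions, so the classifier would be instantiated as LateFusionClassifier(in_dim=4432) and trained on fused features computed for each GSCD utterance; the dimensionality in the paper depends on the exact backbone variants used.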
Journal Overview
This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.