IEEE/ACM Transactions on Audio, Speech, and Language Processing — Latest Publications

Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-18 DOI: 10.1109/TASLP.2024.3463503
Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee
Abstract: Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with an appropriate language background. Given the unmet demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on the detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. A phonological error is identified by contrasting a test speech segment with reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to the disorder. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves an unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach achieves a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard information related to the speaker's personal identity.
Volume 32, pp. 4355-4368.
Citations: 0
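The microscopic system in the entry above contrasts each test consonant segment with reference segments and aggregates phone-level similarity scores into a speaker-level decision. The following is a minimal sketch of that scoring-and-aggregation step only, assuming segment embeddings from the trained DNN are already available; the function names, cosine-similarity scoring, and thresholds are illustrative assumptions, not the authors' implementation.

import numpy as np

def phone_error_scores(test_embs, ref_embs_by_phone, target_phones):
    # Score each test consonant segment against reference segments of the
    # intended phone via cosine similarity; a low best-match score suggests
    # a phonological error for that segment.
    scores = []
    for emb, phone in zip(test_embs, target_phones):
        refs = ref_embs_by_phone[phone]                                    # (R, D)
        sims = refs @ emb / (np.linalg.norm(refs, axis=1) * np.linalg.norm(emb) + 1e-8)
        scores.append(sims.max())
    return np.array(scores)

def speaker_level_decision(scores, segment_threshold=0.5, speaker_threshold=0.3):
    # Aggregate phone-level scores: the fraction of flagged segments is
    # compared with a speaker-level threshold (both thresholds are made up here).
    error_rate = float(np.mean(scores < segment_threshold))
    return error_rate, error_rate > speaker_threshold

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
refs = {"s": rng.normal(size=(5, 16)), "k": rng.normal(size=(5, 16))}
test = rng.normal(size=(8, 16))
print(speaker_level_decision(phone_error_scores(test, refs, ["s", "k"] * 4)))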
Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-18 DOI: 10.1109/TASLP.2024.3463411
Debang Liu;Tianqi Zhang;Mads Græsbøll Christensen;Chen Yi;Zeliang An
Abstract: Currently, audio-visual speech separation methods utilize the correlation between the speaker's audio and visual information to help separate the speech of the target speaker. However, these methods commonly obtain the fused audio-visual features through feature concatenation with a linear mapping, which prompts a deeper exploration of audio-visual fusion. Therefore, in this paper, building on the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with a temporal convolutional attention network for speech separation (AVTCA). In this method, we design a temporal convolutional attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct the sequence learning and fusion network. In the whole deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then use TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. Afterwards, the fused audio-visual feature sequences are used as input to the separation network to predict masks and separate the source of each speaker. Finally, comparative experiments on the Vox2, GRID, LRS2 and TCD-TIMIT datasets indicate that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it exhibits greater efficiency in computational performance and model size.
Volume 32, pp. 4647-4660.
Citations: 0
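In the entry above, cross attention lets audio frames attend to visual (mouth-landmark) frames before TCANet fusion. Below is a minimal PyTorch sketch of such a cross-modal attention step, assuming precomputed audio and visual feature sequences; the class name, feature dimensions, and concatenate-then-project fusion are assumptions for illustration, and the TCANet itself is not reproduced.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Toy audio-visual fusion: audio frames attend to lip-landmark frames,
    # then the attended visual context is concatenated back and projected.
    def __init__(self, d_audio=256, d_visual=64, d_model=256, n_heads=4):
        super().__init__()
        self.a_proj = nn.Linear(d_audio, d_model)
        self.v_proj = nn.Linear(d_visual, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, audio, visual):
        # audio: (B, Ta, d_audio), visual: (B, Tv, d_visual)
        q = self.a_proj(audio)
        kv = self.v_proj(visual)
        ctx, _ = self.cross_attn(q, kv, kv)        # audio queries, visual keys/values
        return self.out(torch.cat([q, ctx], dim=-1))

fusion = CrossModalFusion()
audio = torch.randn(2, 200, 256)    # e.g. 200 audio frames
visual = torch.randn(2, 50, 64)     # e.g. 50 video frames of mouth landmarks
print(fusion(audio, visual).shape)  # torch.Size([2, 200, 256])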
Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-18 DOI: 10.1109/TASLP.2024.3463491
Jeong-Hwan Choi;Joon-Young Yang;Joon-Hyuk Chang
Abstract: Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical implementation of automatic speaker verification (ASV) systems. To this end, we recently introduced broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within a hybrid CNN-Transformer architecture to keep the number of model parameters small. We proposed BC-CMT-based SEEs in three different sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend the previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational cost and latency of BC-CMT, the two-dimensional (2D) SA operations, which calculate SA maps in the frequency-time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of BC-CMT, the group convolution layers in the SA block are adjusted to have a smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used in the KD to guide those of the smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps of different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model, with 271.44K parameters, achieved 36.8% and 9.3% reductions in floating-point operations on 1-s signals and in equal error rate (EER) on the VoxCeleb 1 test set, respectively, compared to the conventional BC-CMT-Tiny. The CPU and GPU running times of the proposed BC-CMT-Tiny for 1 to 10 s signals were 29.07 to 146.32 ms and 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% with improved attention capability.
Volume 32, pp. 4580-4595.
Citations: 0
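The KD strategy described above uses the self-attention maps of a pretrained BC-CMT-Base to guide those of BC-CMT-Tiny; because the modified SA is purely temporal, teacher and student maps share the same time-by-time shape even when head counts and channel widths differ. The snippet below is a hedged sketch of one plausible map-matching loss (head-averaged KL divergence between attention maps); the exact distillation objective used in the paper may differ.

import torch
import torch.nn.functional as F

def attention_map_kd_loss(student_attn, teacher_attn, eps=1e-8):
    # student_attn: (B, Hs, T, T), teacher_attn: (B, Ht, T, T), both row-stochastic.
    # Head counts may differ, so maps are averaged over heads before comparison.
    s = student_attn.mean(dim=1)             # (B, T, T)
    t = teacher_attn.mean(dim=1)             # (B, T, T)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div((s + eps).log(), t, reduction="batchmean")

B, T = 4, 100
student = torch.softmax(torch.randn(B, 2, T, T), dim=-1)   # small student, 2 heads
teacher = torch.softmax(torch.randn(B, 8, T, T), dim=-1)   # larger teacher, 8 heads
print(attention_map_kd_loss(student, teacher))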
PHAIN: Audio Inpainting via Phase-Aware Optimization With Instantaneous Frequency
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-18 DOI: 10.1109/TASLP.2024.3463415
Tomoro Tanaka;Kohei Yatabe;Yasuhiro Oikawa
Abstract: Audio inpainting restores locally corrupted parts of digital audio signals. Sparsity-based methods achieve this by promoting sparsity in the time-frequency (T-F) domain, assuming short-time audio segments consist of a few sinusoids. However, such sparsity promotion reduces the magnitudes of the resulting waveforms; moreover, it often ignores the temporal connections of sinusoidal components. To address these problems, we propose a novel phase-aware audio inpainting method. Our method minimizes the time variations of a particular T-F representation calculated using the time derivative of the phase. This promotes sinusoidal components that coherently fit in the corrupted parts without directly suppressing the magnitudes. Both objective and subjective experiments confirmed the superiority of the proposed method compared with state-of-the-art methods.
Volume 32, pp. 4471-4485.
Citations: 0
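PHAIN's objective penalizes the time variation of a T-F representation built from the time derivative of the phase, so that reconstructed content behaves like coherent sinusoids without shrinking magnitudes. The snippet below is only a loose illustration of that idea: it measures how much the per-bin instantaneous frequency changes over time, with an assumed magnitude weighting. It is not the paper's actual functional or optimization scheme; STFT settings and the weighting are arbitrary assumptions.

import numpy as np
from scipy.signal import stft

def instantaneous_frequency_variation(x, fs=16000, nperseg=512, noverlap=384):
    # STFT phase, unwrapped along the time axis.
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    phase = np.unwrap(np.angle(X), axis=1)
    inst_freq = np.diff(phase, axis=1)        # time derivative of the phase, per T-F bin
    variation = np.diff(inst_freq, axis=1)    # how much that derivative changes over time
    weight = np.abs(X)[:, 2:]                 # magnitude weighting (an assumption)
    return np.sum(weight * variation ** 2)    # small for steady sinusoidal components

# A near-stationary sinusoid yields a small penalty; noisy content yields a larger one.
t = np.arange(16000) / 16000
x = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(16000)
print(instantaneous_frequency_variation(x))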
AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-17 DOI: 10.1109/TASLP.2024.3446232
Sagar Dutta;Vipul Arora
Abstract: This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field and highlighting its efficacy and effectiveness compared to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise losses along with hash-code balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method shows promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.
Volume 32, pp. 4526-4536.
Citations: 0
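AudioNet combines weighted contrastive and pairwise losses with hash-code balancing and propagates gradients through the discrete codes. The sketch below shows two generic ingredients that this description suggests: a straight-through sign binarization and a simple contrastive loss with a bit-balance regularizer. The per-pair weighting and the exact loss form of the paper are omitted; all names and hyperparameters are assumptions.

import torch

class STEBinarize(torch.autograd.Function):
    # Sign binarization with a straight-through gradient: one common way of
    # letting gradients pass through discrete hash codes.
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def hashing_loss(za, zb, same_event, margin=8.0, balance_weight=0.1):
    # za, zb: (B, K) real-valued network outputs for a pair of audio clips;
    # same_event: (B,) 1 if the pair shares an event label, else 0.
    ha, hb = STEBinarize.apply(za), STEBinarize.apply(zb)
    dist = (ha - hb).pow(2).sum(dim=1) / 4.0         # Hamming distance for +-1 codes
    contrastive = (same_event * dist +
                   (1 - same_event) * torch.clamp(margin - dist, min=0)).mean()
    balance = ha.mean(dim=0).pow(2).sum()            # push each bit toward a 50/50 split
    return contrastive + balance_weight * balance

za = torch.randn(16, 32, requires_grad=True)
zb = torch.randn(16, 32, requires_grad=True)
y = torch.randint(0, 2, (16,)).float()
loss = hashing_loss(za, zb, y)
loss.backward()
print(loss.item(), za.grad.shape)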
Multi-Task Multi-Attention Transformer for Generative Named Entity Recognition
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-12 DOI: 10.1109/TASLP.2024.3458796
Ying Mo;Jiahao Liu;Hongyin Tang;Qifan Wang;Zenglin Xu;Jingang Wang;Xiaojun Quan;Wei Wu;Zhoujun Li
Abstract: Most previous sequence labeling models are task-specific, while recent years have witnessed the rise of generative models owing to their advantage of unifying all named entity recognition (NER) tasks into the encoder-decoder framework. Although achieving promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. In this paper, we propose a multi-task Transformer, which incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate this information into the model via the self- and cross-attention mechanisms. We perform experiments on extensive NER benchmarks, including flat, nested, and discontinuous NER datasets involving long entities. Our approach substantially increases F1 scores, by nearly +0.3 to +1.5, across a broad spectrum of benchmarks, or performs close to the best generative NER model. Experimental results show that our approach improves the performance of the generative NER model considerably.
Volume 32, pp. 4171-4183.
Citations: 0
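The multi-task model above realizes boundary detection by classifying relations between token pairs within a sentence. Below is a small, generic PyTorch sketch of such a token-pair scorer (a bilinear head over every (i, j) pair); the dimensions, relation inventory, and module names are hypothetical, and the knowledge-base type priors injected through attention are not shown.

import torch
import torch.nn as nn

class TokenPairBoundaryClassifier(nn.Module):
    # Toy boundary detection as token-pair relation classification: each (i, j)
    # pair of encoder states is scored over a small relation set, e.g.
    # {no-relation, same-entity-span}. Sizes are illustrative only.
    def __init__(self, d_model=256, d_pair=128, n_relations=2):
        super().__init__()
        self.head = nn.Linear(d_model, d_pair)
        self.tail = nn.Linear(d_model, d_pair)
        self.scorer = nn.Bilinear(d_pair, d_pair, n_relations)

    def forward(self, h):
        # h: (B, T, d_model) encoder states -> (B, T, T, n_relations) pair logits
        B, T, _ = h.shape
        u = self.head(h).unsqueeze(2).expand(B, T, T, -1)   # representation of token i
        v = self.tail(h).unsqueeze(1).expand(B, T, T, -1)   # representation of token j
        logits = self.scorer(u.reshape(-1, u.size(-1)), v.reshape(-1, v.size(-1)))
        return logits.view(B, T, T, -1)

clf = TokenPairBoundaryClassifier()
states = torch.randn(2, 20, 256)
print(clf(states).shape)   # torch.Size([2, 20, 20, 2])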
Filtered-X Quasi Affine Projection Algorithm for Active Noise Control Networks
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-12 DOI: 10.1109/TASLP.2024.3458806
Miguel Ferrer;María de Diego;Alberto Gonzalez
Abstract: The affine projection (AP) algorithm enhances the performance of gradient-based adaptive algorithms when dealing with colored reference signals, which is typically the case with filtered-X type algorithms. This enhancement is achieved by using various delayed versions of the reference signal data vector, which are appropriately orthogonalized and normalized to optimize convergence performance. The number of these vectors, known as the projection order of the AP, increases the computational requirements, mainly due to the calculation of a matrix inversion whose dimensions are proportional to this projection order. When used in distributed systems, the AP algorithm typically requires each acoustic node in the system to compute the complete matrix inversion, even though it only needs a specific subblock of it. This means that the AP does not offer much advantage in terms of computational savings when used in distributed collaborative networks. To address this issue, an approximate version of the filtered-X affine projection (FXAP) algorithm is introduced in this work. This approximate version avoids the matrix inversion computation in each iteration by using a precalculated inverse matrix. This strategy provides computational savings and enables easy distribution of the algorithm. Additionally, a variable step-size approach is proposed to mitigate the deviation caused by the precalculated matrix, which provides good performance, high robustness, and cost-effective distribution.
Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10679717
Volume 32, pp. 4237-4252.
Citations: 0
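For reference, the sketch below shows a plain affine projection update in NumPy on a toy system-identification problem, making explicit the P x P matrix inversion (here a solve) that grows with the projection order; this is the term the proposed quasi-AP variant replaces with a precalculated inverse matrix. The filtered-X secondary-path filtering of ANC and the distributed variable step size are not modeled, and all sizes and constants are arbitrary.

import numpy as np

def affine_projection_step(w, X, d, mu=0.5, delta=1e-3):
    # One affine projection update with projection order P = X.shape[1].
    # w: (L,) adaptive filter, X: (L, P) matrix of the P most recent reference
    # vectors, d: (P,) desired samples.
    e = d - X.T @ w                               # a-priori errors
    R = X.T @ X + delta * np.eye(X.shape[1])      # P x P correlation matrix
    w = w + mu * X @ np.linalg.solve(R, e)        # the costly P x P inversion/solve
    return w, e

# Toy identification of a short impulse response from a colored reference.
rng = np.random.default_rng(1)
L, P, N = 16, 4, 5000
h_true = rng.normal(size=L)
x = np.convolve(rng.normal(size=N), np.ones(8) / 8, mode="same")   # colored reference
w = np.zeros(L)
for n in range(L + P, N):
    X = np.stack([x[n - p - L:n - p][::-1] for p in range(P)], axis=1)   # (L, P)
    d = np.array([h_true @ x[n - p - L:n - p][::-1] for p in range(P)])
    w, _ = affine_projection_step(w, X, d)
print(np.linalg.norm(w - h_true))   # should be small after convergence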
Deep Kronecker Product Beamforming for Large-Scale Microphone Arrays
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-12 DOI: 10.1109/TASLP.2024.3459430
Weixin Meng;Xiaoyu Li;Andong Li;Xiaoxue Luo;Shefeng Yan;Xiaodong Li;Chengshi Zheng
Abstract: Although deep learning based beamformers have achieved promising performance using small microphone arrays, they suffer from performance degradation in very challenging environments, such as extremely low signal-to-noise ratio (SNR) environments, e.g., SNR ≤ −10 dB. A large-scale microphone array with dozens or hundreds of microphones can improve the performance of beamformers in these challenging scenarios because of its high spatial resolution. However, a dramatic increase in the number of microphones leads to feature redundancy, causing difficulties in feature extraction and network training. As an attempt to improve the performance of deep beamformers for speech extraction in very challenging scenarios, this paper proposes a novel all-neural Kronecker product beamformer, denoted ANKP-BF, for large-scale microphone arrays, taking the following two aspects into account. First, a larger microphone array provides better spatial filtering than a small microphone array, and deep neural networks are introduced for their powerful non-linear modeling capability in the speech extraction task. Second, the feature redundancy problem is solved by introducing the Kronecker product rule to decompose the original high-dimensional weight vector into the Kronecker product of two much lower-dimensional weight vectors. The proposed ANKP-BF is designed to operate in an end-to-end manner. Extensive experiments are conducted on simulated large-scale microphone-array signals using the DNS-Challenge corpus and the WSJ0-SI84 corpus, and real recordings in a semi-anechoic room and outdoor scenes are also used to evaluate and compare the performance of different methods. Quantitative results demonstrate that the proposed method outperforms existing advanced baselines in terms of multiple objective metrics, especially in very low SNR environments.
Volume 32, pp. 4537-4553.
Citations: 0
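The core trick in the entry above is to constrain the large beamformer weight vector to be a Kronecker product of two much shorter vectors, which removes redundant parameters for large arrays. The NumPy lines below illustrate only that algebraic structure and the resulting parameter saving on a single array snapshot; the deep network (ANKP-BF) that learns the sub-vectors is not shown, and the array sizes are arbitrary.

import numpy as np

# A 64-channel beamformer weight vector factored as the Kronecker product of
# two 8-element sub-vectors: 64 free parameters become 8 + 8 = 16.
M1, M2 = 8, 8
rng = np.random.default_rng(0)
w1 = rng.normal(size=M1) + 1j * rng.normal(size=M1)
w2 = rng.normal(size=M2) + 1j * rng.normal(size=M2)
w_full = np.kron(w1, w2)                                        # (M1*M2,) weights

x = rng.normal(size=M1 * M2) + 1j * rng.normal(size=M1 * M2)    # one array snapshot
y_full = np.vdot(w_full, x)                                     # beamformer output w^H x

# The same output from the two short vectors only, after reshaping the snapshot:
X = x.reshape(M1, M2)
y_factored = np.conj(w1) @ X @ np.conj(w2)
print(np.allclose(y_full, y_factored))                          # True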
Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-12 DOI: 10.1109/TASLP.2024.3451977
Shen-sian Syu;Juncheng Xie;Hung-yi Lee
Abstract: Non-autoregressive approaches, especially those that generate output in a one-pass forward manner, have shown great potential in improving the inference speed of translation models. However, these approaches often suffer from a significant drop in translation quality compared to autoregressive (AT) models. To tackle this challenge, this paper introduces a series of techniques to enhance the translation quality of non-autoregressive neural machine translation (NAT) models while still maintaining a substantial acceleration in inference speed. Specifically, we propose a method called CTCPMLM, which involves fine-tuning pretrained multilingual language models (PMLMs) with the connectionist temporal classification (CTC) loss to effectively train NAT models. Additionally, we adopt a MASK-insertion scheme instead of token duplication for up-sampling and present an embedding distillation method to further enhance the performance of NAT models. In our experiments, CTCPMLM surpasses the performance of the baseline autoregressive model (Transformer base) on various datasets, including WMT'14 DE↔EN, WMT'16 RO↔EN, and IWSLT'14 DE↔EN. Moreover, CTCPMLM represents the current state of the art among NAT models. Notably, our model achieves superior results compared to the baseline autoregressive model on the IWSLT'14 EN↔DE and WMT'16 EN↔RO datasets, even without using distillation data during training. In particular, on the IWSLT'14 DE→EN dataset, our model achieves an impressive BLEU score of 39.93, surpassing AT models and establishing a new state of the art. Additionally, our model exhibits a remarkable speed improvement of 16.35 times over the autoregressive model.
Volume 32, pp. 4121-4133.
Citations: 0
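CTCPMLM trains the non-autoregressive model with a CTC loss over an input that is lengthened by inserting MASK tokens rather than duplicating tokens. The snippet below sketches only those two mechanics with a random stand-in for the PMLM encoder; the interleaving pattern, up-sampling factor, and special-token ids are assumptions, and embedding distillation is not shown.

import torch
import torch.nn as nn

def upsample_with_masks(src_ids, mask_id, factor=3):
    # MASK-insertion up-sampling: interleave [MASK] ids with the source ids so
    # the encoder output is long enough for CTC decoding.
    batch, src_len = src_ids.shape
    out = torch.full((batch, src_len * factor), mask_id, dtype=src_ids.dtype)
    out[:, ::factor] = src_ids
    return out

batch, src_len, vocab, mask_id, blank_id = 2, 6, 100, 99, 0
src = torch.randint(1, 90, (batch, src_len))
upsampled = upsample_with_masks(src, mask_id)                  # (2, 18)

# Stand-in for the PMLM encoder output over the up-sampled input.
log_probs = torch.randn(upsampled.size(1), batch, vocab).log_softmax(-1)
targets = torch.randint(1, 90, (batch, 8))

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((batch,), upsampled.size(1)),
           target_lengths=torch.full((batch,), targets.size(1)))
print(loss.item())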
TriSAT: Trimodal Representation Learning for Multimodal Sentiment Analysis
IF 4.1, CAS Q2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-09-11 DOI: 10.1109/TASLP.2024.3458812
Ruohong Huan;Guowei Zhong;Peng Chen;Ronghua Liang
Abstract: Transformer-based multimodal sentiment analysis frameworks commonly facilitate cross-modal interactions between two modalities through the attention mechanism. However, such interactions prove inadequate when dealing with three or more modalities, leading to increased computational complexity and network redundancy. To address this challenge, this paper introduces a novel framework, Trimodal representations for Sentiment Analysis from Transformers (TriSAT), tailored for multimodal sentiment analysis. TriSAT incorporates a trimodal transformer featuring a module called Trimodal Multi-Head Attention (TMHA). TMHA considers language as the primary modality, combines information from language, video, and audio using a single computation, and analyzes sentiment from a trimodal perspective. This approach significantly reduces computational complexity while delivering high performance. Moreover, we propose Attraction-Repulsion (AR) loss and Trimodal Supervised Contrastive (TSC) loss to further enhance sentiment analysis performance. We conduct experiments on three public datasets to evaluate TriSAT's performance, which consistently demonstrates its competitiveness compared to state-of-the-art approaches.
Volume 32, pp. 4105-4120.
Citations: 0
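TMHA is described above as treating language as the primary modality and mixing language, video, and audio in a single attention computation. One plausible reading is sketched below: language states form the queries, while keys and values come from all three projected sequences concatenated along time. The projections, dimensions, and class name are assumptions rather than TriSAT's actual definition.

import torch
import torch.nn as nn

class TrimodalAttentionSketch(nn.Module):
    # Toy trimodal attention: language queries attend over the concatenated
    # language, audio, and video sequences in one multi-head attention call.
    def __init__(self, d_l=128, d_a=64, d_v=64, d_model=128, n_heads=4):
        super().__init__()
        self.pl = nn.Linear(d_l, d_model)
        self.pa = nn.Linear(d_a, d_model)
        self.pv = nn.Linear(d_v, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, lang, audio, video):
        q = self.pl(lang)
        kv = torch.cat([self.pl(lang), self.pa(audio), self.pv(video)], dim=1)
        out, _ = self.attn(q, kv, kv)
        return out

m = TrimodalAttentionSketch()
lang = torch.randn(2, 30, 128)
audio = torch.randn(2, 80, 64)
video = torch.randn(2, 40, 64)
print(m(lang, audio, video).shape)   # torch.Size([2, 30, 128])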