IEEE/ACM Transactions on Audio, Speech, and Language Processing: Latest Articles

A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-30 DOI: 10.1109/TASLP.2024.3426303
Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng
{"title":"A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism","authors":"Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng","doi":"10.1109/TASLP.2024.3426303","DOIUrl":"10.1109/TASLP.2024.3426303","url":null,"abstract":"Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet there is a small number of studies based on audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3618-3630"},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10614622","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
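The second-stage fusion described in the abstract above is built on cross-attention between audio and visual feature sequences. As a rough illustration only, here is a minimal PyTorch sketch of a cross-attention fusion block; the class name `CrossModalFusion`, the dimensions, and the residual/feed-forward layout are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-attention block: audio frames attend to visual frames."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, d_model); visual_feats: (batch, T_video, d_model)
        attended, _ = self.attn(query=audio_feats, key=visual_feats, value=visual_feats)
        x = self.norm1(audio_feats + attended)   # residual connection + layer norm
        return self.norm2(x + self.ff(x))        # position-wise feed-forward

# toy usage: 2 clips, 100 audio frames, 50 video frames, 256-dim features
fusion = CrossModalFusion()
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 50, 256)
print(fusion(audio, video).shape)  # torch.Size([2, 100, 256])
```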
Improving Mispronunciation Detection Using Speech Reconstruction
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-29 DOI: 10.1109/TASLP.2024.3434497
Anurag Das;Ricardo Gutierrez-Osuna
{"title":"Improving Mispronunciation Detection Using Speech Reconstruction","authors":"Anurag Das;Ricardo Gutierrez-Osuna","doi":"10.1109/TASLP.2024.3434497","DOIUrl":"10.1109/TASLP.2024.3434497","url":null,"abstract":"Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text- to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate using phonetic information and we wanted to examine whether a boost in MDD performance can be by two tasks. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed it to the MDD system along with the text input. The MDD system then predicts the target annotated phones and then synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss as well as an MDD loss. Comparing the proposed systems against an identical system but without speech reconstruction and another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the and speaker verification simultaneously can lead to improve proposed system achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones have mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than phones without any mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4420-4433"},"PeriodicalIF":4.1,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
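The core idea above is to train the MDD network with a reconstruction loss on speech synthesized from its predicted phones, in addition to the usual MDD loss. A minimal sketch of such a combined objective is given below; the function name, the choice of an L1 spectrogram loss, and the weight `lambda_rec` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_mdd_loss(phone_logits, phone_targets, recon_spec, ref_spec, lambda_rec=0.5):
    """Combined objective: phone-level MDD loss + speech reconstruction loss.

    phone_logits: (batch, T, n_phones) raw scores from the MDD head
    phone_targets: (batch, T) annotated phone indices
    recon_spec, ref_spec: (batch, T_spec, n_mels) reconstructed / reference spectrograms
    """
    mdd = F.cross_entropy(phone_logits.transpose(1, 2), phone_targets)
    rec = F.l1_loss(recon_spec, ref_spec)   # hypothetical choice of reconstruction loss
    return mdd + lambda_rec * rec

# toy example with random tensors
logits = torch.randn(2, 50, 40, requires_grad=True)
targets = torch.randint(0, 40, (2, 50))
recon = torch.randn(2, 200, 80, requires_grad=True)
ref = torch.randn(2, 200, 80)
loss = joint_mdd_loss(logits, targets, recon, ref)
loss.backward()
```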
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-29 DOI: 10.1109/TASLP.2024.3434425
Tianrui Wang;Long Zhou;Ziqiang Zhang;Yu Wu;Shujie Liu;Yashesh Gaur;Zhuo Chen;Jinyu Li;Furu Wei
{"title":"VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation","authors":"Tianrui Wang;Long Zhou;Ziqiang Zhang;Yu Wu;Shujie Liu;Yashesh Gaur;Zhuo Chen;Jinyu Li;Furu Wei","doi":"10.1109/TASLP.2024.3434425","DOIUrl":"10.1109/TASLP.2024.3434425","url":null,"abstract":"Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose \u0000<sc><b>VioLA</b></small>\u0000, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language model task via multi-task learning framework. To accomplish this, we first convert the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In such a way, all these tasks are converted to token-based sequence prediction problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID), language IDs (LID), and LSTM-based acoustic embedding into the proposed model to enhance the modeling capability of handling different languages and tasks. Experimental results demonstrate that the proposed \u0000<sc>VioLA</small>\u0000 model can support both single-modal and cross-modal tasks well, and the decoder-only model achieves a comparable and even better performance than the strong baselines.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3709-3716"},"PeriodicalIF":4.1,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
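VioLA treats every task as conditional language modeling over discrete tokens, with task IDs and language IDs steering a single decoder. The snippet below is only a toy illustration of how such a multi-task token sequence might be assembled; the special-ID values and sequence layout are invented for illustration and do not reproduce the paper's actual format.

```python
# Hypothetical token-sequence assembly for a speech-to-text example.
# Speech is first discretized by an offline neural codec; here we fake codec IDs.

TASK_IDS = {"asr": 0, "mt": 1, "tts": 2, "s2st": 3}   # task IDs (TID), illustrative
LANG_IDS = {"en": 10, "zh": 11}                       # language IDs (LID), illustrative

def build_sequence(task, src_lang, tgt_lang, source_tokens, target_tokens):
    """Concatenate IDs and token streams into one decoder-only training sequence."""
    return ([TASK_IDS[task], LANG_IDS[src_lang]]
            + source_tokens
            + [LANG_IDS[tgt_lang]]
            + target_tokens)

codec_tokens = [503, 78, 911, 402]   # stand-in for codec-encoded speech
text_tokens = [5, 17, 42]            # stand-in for text token IDs
print(build_sequence("asr", "en", "en", codec_tokens, text_tokens))
```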
Heterogeneous-Graph Reasoning With Context Paraphrase for Commonsense Question Answering
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-26 DOI: 10.1109/TASLP.2024.3434469
Yujie Wang;Hu Zhang;Jiye Liang;Ru Li
{"title":"Heterogeneous-Graph Reasoning With Context Paraphrase for Commonsense Question Answering","authors":"Yujie Wang;Hu Zhang;Jiye Liang;Ru Li","doi":"10.1109/TASLP.2024.3434469","DOIUrl":"10.1109/TASLP.2024.3434469","url":null,"abstract":"Commonsense question answering (CQA) generally means that the machine uses its mastered commonsense to answer questions without relevant background material, which is a challenging task in natural language processing. Existing methods focus on retrieving relevant subgraphs from knowledge graphs based on key entities and designing complex graph neural networks to perform reasoning over the subgraphs. However, they have the following problems: i) the nested entities in key entities lead to the introduction of irrelevant knowledge; ii) the QA context is not well integrated with the subgraphs; and iii) insufficient context knowledge hinders subgraph nodes understanding. In this paper, we present a heterogeneous-graph reasoning with context paraphrase method (HCP), which introduces the paraphrase knowledge from the dictionary into key entity recognition and subgraphs construction, and effectively fuses QA context and subgraphs during the encoding phase of the pre-trained language model (PTLM). Specifically, HCP filters the nested entities through the dictionary's vocabulary and constructs the Heterogeneous Path-Paraphrase (HPP) graph by connecting the paraphrase descriptions\u0000<xref><sup>1</sup></xref>\u0000<fn><label><sup>1</sup></label><p>The paraphrase descriptions are English explanations of words or phrases in WordNet and Wiktionary.</p></fn>\u0000 with the key entity nodes in the subgraphs. Then, by constructing the visible matrices in the PTLM encoding phase, we fuse the QA context representation into the HPP graph. Finally, to get the answer, we perform reasoning on the HPP graph by Mask Self-Attention. Experimental results on CommonsenseQA and OpenBookQA show that fusing QA context with HPP graph in the encoding stage and enhancing the HPP graph representation by using context paraphrase can improve the machine's commonsense reasoning ability.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3759-3770"},"PeriodicalIF":4.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141774173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
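The fusion step above hinges on visible matrices that control which tokens and paraphrase nodes may attend to each other during PTLM encoding. The sketch below shows generic masked self-attention driven by such a visibility matrix; the toy graph layout and the 0/1 masking convention are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, visible):
    """Self-attention where visible[i, j] == 1 means token i may attend to token j.

    x: (n_tokens, d) token/node embeddings
    visible: (n_tokens, n_tokens) 0/1 visibility matrix
    """
    d = x.size(-1)
    scores = x @ x.T / d ** 0.5
    scores = scores.masked_fill(visible == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

# toy example: 3 QA-context tokens + 2 paraphrase nodes;
# the paraphrase nodes are visible only to the entity token they describe (index 1).
x = torch.randn(5, 16)
visible = torch.ones(5, 5)
visible[:, 3:] = 0                    # hide paraphrase nodes from everyone ...
visible[1, 3:] = 1                    # ... except the key entity token
visible[3:, :] = 0
visible[3, 3] = visible[3, 1] = 1     # paraphrase nodes see themselves and their entity
visible[4, 4] = visible[4, 1] = 1
print(masked_self_attention(x, visible).shape)  # torch.Size([5, 16])
```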
Masked Graph Learning With Recurrent Alignment for Multimodal Emotion Recognition in Conversation
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-26 DOI: 10.1109/TASLP.2024.3434495
Tao Meng;Fuchen Zhang;Yuntao Shou;Hongen Shao;Wei Ai;Keqin Li
{"title":"Masked Graph Learning With Recurrent Alignment for Multimodal Emotion Recognition in Conversation","authors":"Tao Meng;Fuchen Zhang;Yuntao Shou;Hongen Shao;Wei Ai;Keqin Li","doi":"10.1109/TASLP.2024.3434495","DOIUrl":"10.1109/TASLP.2024.3434495","url":null,"abstract":"Since Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields, it has received extensive research attention in recent years. Unlike traditional unimodal emotion recognition, MERC can fuse complementary semantic information between multiple modalities (e.g., text, audio, and vision) to improve emotion recognition. However, previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion but directly fuses multimodal features, which will hinder the model for representation learning. In this study, we have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem, which uses a recurrent iterative module with memory to align multimodal features, and then uses the masked GCN for multimodal feature fusion. First, we employ LSTM to capture contextual information and use a graph attention-filtering mechanism to eliminate noise effectively within the modality. Second, we build a recurrent iteration module with a memory function, which can use communication between different modalities to eliminate the gap between modalities and achieve the preliminary alignment of features between modalities. Then, a cross-modal multi-head attention mechanism is introduced to achieve feature alignment between modalities and construct a masked GCN for multimodal feature fusion, which can perform random mask reconstruction on the nodes in the graph to obtain better node feature representation. Finally, we utilize a multilayer perceptron (MLP) for emotion recognition. Extensive experiments on two benchmark datasets (i.e., IEMOCAP and MELD) demonstrate that MGLRA outperforms state-of-the-art methods.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4298-4312"},"PeriodicalIF":4.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141774174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
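One component described above is a masked GCN that randomly masks node features and reconstructs them to obtain more robust node representations. Below is a small, self-contained sketch of that general idea with a single graph-convolution layer; the layer sizes, mask ratio, and row-normalized propagation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MaskedGCNLayer(nn.Module):
    """One GCN layer plus random node masking and feature reconstruction."""
    def __init__(self, in_dim=64, hidden_dim=64, mask_ratio=0.3):
        super().__init__()
        self.weight = nn.Linear(in_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, in_dim)
        self.mask_token = nn.Parameter(torch.zeros(in_dim))
        self.mask_ratio = mask_ratio

    def forward(self, x, adj):
        # x: (n_nodes, in_dim); adj: (n_nodes, n_nodes) with self-loops
        mask = torch.rand(x.size(0)) < self.mask_ratio
        x_masked = x.clone()
        x_masked[mask] = self.mask_token                  # replace masked node features
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # simple row normalization
        h = torch.relu(self.weight((adj / deg) @ x_masked))
        recon_loss = (((self.decoder(h)[mask] - x[mask]) ** 2).mean()
                      if mask.any() else h.new_zeros(()))
        return h, recon_loss

n = 6
rand = (torch.rand(n, n) > 0.5).float()
adj = ((rand + rand.T) > 0).float()
adj.fill_diagonal_(1.0)                  # add self-loops
layer = MaskedGCNLayer()
h, loss = layer(torch.randn(n, 64), adj)
print(h.shape, float(loss))
```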
ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-26 DOI: 10.1109/TASLP.2024.3434445
Nakamasa Inoue;Shinta Otake;Takumi Hirose;Masanari Ohi;Rei Kawakami
{"title":"ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks","authors":"Nakamasa Inoue;Shinta Otake;Takumi Hirose;Masanari Ohi;Rei Kawakami","doi":"10.1109/TASLP.2024.3434445","DOIUrl":"10.1109/TASLP.2024.3434445","url":null,"abstract":"Self-supervised learning has emerged as a key approach for learning generic representations from speech data. Despite promising results in downstream tasks such as speech recognition, speaker verification, and emotion recognition, a significant number of parameters is required, which makes fine-tuning for each task memory-inefficient. To address this limitation, we introduce ELP-adapter tuning, a novel method for parameter-efficient fine-tuning using three types of adapter, namely encoder adapters (E-adapters), layer adapters (L-adapters), and a prompt adapter (P-adapter). The E-adapters are integrated into transformer-based encoder layers and help to learn fine-grained speech representations that are effective for speech recognition. The L-adapters create paths from each encoder layer to the downstream head and help to extract non-linguistic features from lower encoder layers that are effective for speaker verification and emotion recognition. The P-adapter appends pseudo features to CNN features to further improve effectiveness and efficiency. With these adapters, models can be quickly adapted to various speech processing tasks. Our evaluation across four downstream tasks using five backbone models demonstrated the effectiveness of the proposed method. With the WavLM backbone, its performance was comparable to or better than that of full fine-tuning on all tasks while requiring 90% fewer learnable parameters.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3867-3880"},"PeriodicalIF":4.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141774175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
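E-adapters, like most adapter methods, are small bottleneck modules inserted into a frozen backbone so that only the adapter weights are trained. The sketch below shows a generic bottleneck adapter of this kind; the bottleneck width and zero-initialized up-projection are common conventions assumed for illustration and are not claimed to match the exact ELP-adapter design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project, residual; only these weights are trained."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# freeze a (stand-in) backbone layer and train only the adapter
backbone_layer = nn.Linear(768, 768)
for p in backbone_layer.parameters():
    p.requires_grad = False
adapter = BottleneckAdapter()
x = torch.randn(4, 100, 768)
out = adapter(backbone_layer(x))
print(sum(p.numel() for p in adapter.parameters() if p.requires_grad))  # trainable params
```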
Learning Multi-Dimensional Speaker Localization: Axis Partitioning, Unbiased Label Distribution, and Data Augmentation
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-25 DOI: 10.1109/TASLP.2024.3426309
Linfeng Feng;Yijun Gong;Zhi Liu;Xiao-Lei Zhang;Xuelong Li
{"title":"Learning Multi-Dimensional Speaker Localization: Axis Partitioning, Unbiased Label Distribution, and Data Augmentation","authors":"Linfeng Feng;Yijun Gong;Zhi Liu;Xiao-Lei Zhang;Xuelong Li","doi":"10.1109/TASLP.2024.3426309","DOIUrl":"10.1109/TASLP.2024.3426309","url":null,"abstract":"Multi-dimensional speaker localization (SL) aims to estimate the two- or three-dimensional locations of speakers. A recent advancement in multi-dimensional SL is the end-to-end deep neural networks (DNNs) with ad-hoc microphone arrays. This method transforms the SL problem into a classification problem, i.e. a problem of identifying the grids where speakers are located. However, the classification formulation has two closely connected weaknesses. Firstly, this approach introduces quantization error, which needs a large number of grids to mitigate the error. However, increasing the number of grids leads to the curse of dimensionality. To address the problems, we propose an efficient multi-dimensional SL algorithm, which has the following three novel contributions. First, we decouple the high-dimensional grid partitioning into \u0000<italic>axis partitioning</i>\u0000, which substantially mitigates the curse-of-dimensionality. Particularly, for the multi-speaker localization problem, we employ a separator to circumvent the permutation ambiguity of the axis partitioning in the inference stage. Second, we introduce a comprehensive \u0000<italic>unbiased label distribution</i>\u0000 scheme to further eliminate quantization errors. Finally, a set of data augmentation techniques are proposed, including coordinate transformation, stochastic node selection, and mixed training, to alleviate overfitting and sample imbalance problems. The proposed methods were evaluated on both simulated and real-world data, and the experimental results confirm the effectiveness.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4013-4025"},"PeriodicalIF":4.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141774282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
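Axis partitioning replaces a single classifier over D-dimensional grid cells with one classifier per coordinate axis, so the output size grows linearly rather than multiplicatively with the number of bins. The toy sketch below illustrates that difference; the bin counts and the simple linear heads are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AxisPartitionHead(nn.Module):
    """One classification head per coordinate axis instead of one head per grid cell."""
    def __init__(self, feat_dim=128, bins_per_axis=(100, 100, 50)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feat_dim, b) for b in bins_per_axis])

    def forward(self, feats):
        # feats: (batch, feat_dim) -> list of per-axis logits
        return [head(feats) for head in self.heads]

head = AxisPartitionHead()
logits = head(torch.randn(8, 128))
joint_cells = 100 * 100 * 50                      # full 3-D grid classification: 500,000 classes
axis_outputs = sum(l.size(-1) for l in logits)    # axis partitioning: 250 outputs
print(joint_cells, axis_outputs)
```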
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-22 DOI: 10.1109/TASLP.2024.3426924
Tsubasa Ochiai;Kazuma Iwamoto;Marc Delcroix;Rintaro Ikeshita;Hiroshi Sato;Shoko Araki;Shigeru Katagiri
{"title":"Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance","authors":"Tsubasa Ochiai;Kazuma Iwamoto;Marc Delcroix;Rintaro Ikeshita;Hiroshi Sato;Shoko Araki;Shigeru Katagiri","doi":"10.1109/TASLP.2024.3426924","DOIUrl":"10.1109/TASLP.2024.3426924","url":null,"abstract":"It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3589-3602"},"PeriodicalIF":4.1,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141785199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
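The analysis relies on an orthogonal projection-based decomposition that splits an enhanced signal into a target component plus interference, noise, and artifact errors (in the spirit of the BSS Eval error decomposition). Below is a minimal time-domain sketch using least-squares projections over a single full-length frame; the framing and the toy mixing model are simplifying assumptions.

```python
import numpy as np

def decompose_enhanced(enhanced, target, interference, noise):
    """Split `enhanced` into a target part plus interference, noise, and artifact errors
    via successive least-squares projections (single full-length frame for simplicity)."""
    def project(y, basis):
        A = np.stack(basis, axis=1)                      # (n_samples, n_basis)
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        return A @ coeffs

    s_target = project(enhanced, [target])
    e_interf = project(enhanced, [target, interference]) - s_target
    e_noise = project(enhanced, [target, interference, noise]) - s_target - e_interf
    e_artif = enhanced - s_target - e_interf - e_noise
    return s_target, e_interf, e_noise, e_artif

rng = np.random.default_rng(0)
t, i, n = rng.standard_normal((3, 16000))                # toy target, interference, noise
enhanced = 0.9 * t + 0.2 * i + 0.1 * n + 0.05 * rng.standard_normal(16000)
parts = decompose_enhanced(enhanced, t, i, n)
print([f"{np.linalg.norm(p):.2f}" for p in parts])       # energy of each component
```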
Multimodal Consistency-Based Teacher for Semi-Supervised Multimodal Sentiment Analysis
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-18 DOI: 10.1109/TASLP.2024.3430543
Ziqi Yuan;Jingliang Fang;Hua Xu;Kai Gao
{"title":"Multimodal Consistency-Based Teacher for Semi-Supervised Multimodal Sentiment Analysis","authors":"Ziqi Yuan;Jingliang Fang;Hua Xu;Kai Gao","doi":"10.1109/TASLP.2024.3430543","DOIUrl":"10.1109/TASLP.2024.3430543","url":null,"abstract":"Multimodal sentiment analysis holds significant importance within the realm of human-computer interaction. Due to the ease of collecting unlabeled online resources compared to the high costs associated with annotation, it becomes imperative for researchers to develop semi-supervised methods that leverage unlabeled data to enhance model performance. Existing semi-supervised approaches, particularly those applied to trivial image classification tasks, are not suitable for multimodal regression tasks due to their reliance on task-specific augmentation and thresholds designed for classification tasks. To address this limitation, we propose the Multimodal Consistency-based Teacher (MC-Teacher), which incorporates consistency-based pseudo-label technique into semi-supervised multimodal sentiment analysis. In our approach, we first propose synergistic consistency assumption which focus on the consistency among bimodal representation. Building upon this assumption, we develop a learnable filter network that autonomously learns how to identify misleading instances instead of threshold-based methods. This is achieved by leveraging both the implicit discriminant consistency on unlabeled instances and the explicit guidance on constructed training data with labeled instances. Additionally, we design the self-adaptive exponential moving average strategy to decouple the student and teacher networks, utilizing a heuristic momentum coefficient. Through both quantitative and qualitative experiments on two benchmark datasets, we demonstrate the outstanding performances of the proposed MC-Teacher approach. Furthermore, detailed analysis experiments and case studies are provided for each crucial component to intuitively elucidate the inner mechanism and further validate their effectiveness.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3669-3683"},"PeriodicalIF":4.1,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
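The teacher network above is decoupled from the student through an exponential moving average (EMA) of parameters with a self-adaptive momentum. The snippet below sketches a generic EMA teacher update; the warm-up momentum schedule shown is an assumption for illustration, not the paper's specific heuristic.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum):
    """teacher <- momentum * teacher + (1 - momentum) * student, parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = nn.Linear(16, 4)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False             # the teacher is never updated by gradients

for step in range(1, 101):
    # ... student training step would go here ...
    momentum = min(0.999, 1.0 - 1.0 / (step + 1))   # illustrative warm-up schedule
    ema_update(teacher, student, momentum)
```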
Exploring Universal Intrinsic Task Subspace for Few-Shot Learning via Prompt Tuning
IF 4.1 · Tier 2 · Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-18 DOI: 10.1109/TASLP.2024.3430545
Yujia Qin;Xiaozhi Wang;Yusheng Su;Yankai Lin;Ning Ding;Jing Yi;Weize Chen;Zhiyuan Liu;Juanzi Li;Lei Hou;Peng Li;Maosong Sun;Jie Zhou
{"title":"Exploring Universal Intrinsic Task Subspace for Few-Shot Learning via Prompt Tuning","authors":"Yujia Qin;Xiaozhi Wang;Yusheng Su;Yankai Lin;Ning Ding;Jing Yi;Weize Chen;Zhiyuan Liu;Juanzi Li;Lei Hou;Peng Li;Maosong Sun;Jie Zhou","doi":"10.1109/TASLP.2024.3430545","DOIUrl":"10.1109/TASLP.2024.3430545","url":null,"abstract":"Why can pre-trained language models (PLMs) learn universal representations and effectively adapt to broad NLP tasks differing a lot superficially? In this work, we empirically find evidence indicating that the adaptations of PLMs to various few-shot tasks can be reparameterized as optimizing only a few free parameters in a unified low-dimensional \u0000<italic>intrinsic task subspace</i>\u0000, which may help us understand why PLMs could easily adapt to various NLP tasks with small-scale data. To find such a subspace and examine its universality, we propose an analysis pipeline called \u0000<italic>intrinsic prompt tuning</i>\u0000 (IPT). Specifically, we resort to the recent success of prompt tuning and decompose the soft prompts of multiple NLP tasks into the same low-dimensional nonlinear subspace, then we learn to adapt the PLM to unseen data or tasks by only tuning parameters in this subspace. In the experiments, we study diverse few-shot NLP tasks and surprisingly find that in a 250-dimensional subspace found with 100 tasks, by only tuning 250 free parameters, we can recover 97% and 83% of the full prompt tuning performance for 100 seen tasks (using different training data) and 20 unseen tasks, respectively, showing great generalization ability of the found intrinsic task subspace. Besides being an analysis tool, IPTcould further help us improve the prompt tuning stability.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3631-3643"},"PeriodicalIF":4.1,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10603438","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
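IPT reparameterizes per-task soft prompts as a few free parameters in a shared low-dimensional subspace that are mapped back to full prompt embeddings through a learned projection. The sketch below shows the backbone of that reparameterization; the 250-dimensional intrinsic vector follows the abstract, while the prompt length, hidden width, and two-layer projector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntrinsicPromptProjector(nn.Module):
    """Map a low-dimensional intrinsic vector to full soft-prompt embeddings."""
    def __init__(self, intrinsic_dim=250, prompt_len=20, d_model=768):
        super().__init__()
        self.proj = nn.Sequential(                  # shared nonlinear back-projection
            nn.Linear(intrinsic_dim, 512),
            nn.Tanh(),
            nn.Linear(512, prompt_len * d_model),
        )
        self.prompt_len, self.d_model = prompt_len, d_model

    def forward(self, intrinsic_vec):
        # intrinsic_vec: (intrinsic_dim,) -> soft prompt of shape (prompt_len, d_model)
        return self.proj(intrinsic_vec).view(self.prompt_len, self.d_model)

projector = IntrinsicPromptProjector()
for p in projector.parameters():        # the projector is learned once, then frozen
    p.requires_grad = False
intrinsic = nn.Parameter(torch.zeros(250))          # the only per-task tunable parameters
soft_prompt = projector(intrinsic)
print(soft_prompt.shape, intrinsic.numel())          # torch.Size([20, 768]) 250
```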