Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models

Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh

arXiv:2409.10788 · arXiv - EE - Audio and Speech Processing · 2024-09-16
Abstract
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech for various downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework can influence performance on downstream tasks. For example, targets that encode prosody are beneficial for speaker-related tasks, while targets that encode phonetics are more suited for content-related tasks. Additionally, prediction targets can vary in the level of detail they encode; targets that encode fine-grained acoustic details are beneficial for denoising tasks, while targets that encode higher-level abstractions are more suited for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores these design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose novel approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
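To make the masked prediction objective described in the abstract concrete, below is a minimal PyTorch sketch of the general HuBERT-style setup: frames at masked positions are replaced with a learned mask embedding, a Transformer encoder processes the sequence, and a cross-entropy loss is computed against discrete cluster labels (the "prediction targets") only at the masked positions. All names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's proposed targets or the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedPredictionSketch(nn.Module):
    """Minimal HuBERT-style masked prediction sketch (illustrative only):
    predict discrete cluster targets (e.g., k-means labels) at masked frames."""

    def __init__(self, feat_dim=80, hidden_dim=256, num_targets=100, num_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, hidden_dim)
        # Learned embedding that replaces the input at masked positions.
        self.mask_emb = nn.Parameter(torch.randn(hidden_dim))
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(hidden_dim, num_targets)

    def forward(self, feats, targets, mask):
        # feats:   (B, T, feat_dim) acoustic features, e.g. log-mel frames
        # targets: (B, T) integer cluster labels serving as prediction targets
        # mask:    (B, T) bool, True where frames are masked out
        x = self.proj_in(feats)
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        x = self.encoder(x)
        logits = self.proj_out(x)
        # Loss is computed only over masked positions, so the model must
        # infer the masked targets from the unmasked context.
        return F.cross_entropy(logits[mask], targets[mask])


# Toy usage with random features and random targets; roughly half the frames masked.
model = MaskedPredictionSketch()
feats = torch.randn(2, 50, 80)
targets = torch.randint(0, 100, (2, 50))
mask = torch.rand(2, 50) < 0.5
loss = model(feats, targets, mask)
loss.backward()
```

In this framing, the design choices studied in the paper correspond to how the `targets` tensor is produced (which features are clustered, at what level of abstraction, and how much acoustic detail they retain), while the masked prediction loss itself stays fixed.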