Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics

Impact factor 4.1; Zone 2, Computer Science; JCR Q1, Acoustics

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-12 DOI:10.1109/TASLP.2024.3426996

Syu-Siang Wang;Jia-Yang Chen;Bo-Ren Bai;Shih-Hau Fang;Yu Tsao

Volume 32, pages 3826-3837. Not open access. Full text: https://ieeexplore.ieee.org/document/10596684/

Citations: 0

Abstract

Face masks are an essential healthcare measure, particularly during pandemics, yet they can hinder everyday communication. To address this problem, we propose the human-in-the-loop StarGAN (HL–StarGAN) face-masked speech enhancement method. HL–StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a “human-in-the-loop” module during the learning process of HL–StarGAN. The overall HL–StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on reconstructing the original clean speech and optimizing human perception. To implement HL–StarGAN, we created a face-masked speech database named “FMVD,” which comprises recordings from 34 speakers in three distinct face-masked scenarios and a clean condition. We conducted subjective and objective tests on the proposed HL–StarGAN using this database. The results are as follows: (1) MaskQSS successfully predicted the quality scores of face-masked voices, outperforming several existing speech assessment methods. (2) Integrating the MaskQSS predictor enhanced the ability of HL–StarGAN to transform face-masked voices into high-quality speech; this enhancement is evident in both objective and subjective tests, where HL–StarGAN outperformed conventional StarGAN and CycleGAN-based systems.
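The abstract names four components but does not give the training objective. As a rough illustration only, the sketch below shows one way a generator loss could combine an adversarial term, a domain-classification term, a cycle-style reconstruction term, and a penalty driven by a MaskQSS-like quality predictor. Everything here is an assumption rather than the authors' formulation: the function and class names, the weights, the L1 reconstruction distance, and the 5-point quality scale are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Hypothetical weights; the abstract does not report any values.
    adversarial: float = 1.0
    classification: float = 1.0
    reconstruction: float = 10.0
    quality: float = 1.0


def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_objective(adv_loss, cls_loss, source, cycled,
                        predicted_quality, max_quality=5.0, weights=None):
    """Combine four training signals into one scalar to minimise.

    adv_loss          -- how poorly the generator fools the discriminator
    cls_loss          -- how poorly the output matches the target condition
    source, cycled    -- input features and the same features after a round
                         trip through the generator (no paired target needed)
    predicted_quality -- score from a MaskQSS-like predictor, higher is better
    max_quality       -- assumed top of the rating scale (hypothetical)
    """
    if weights is None:
        weights = LossWeights()
    reconstruction = l1_distance(source, cycled)
    # Penalise the gap between the predicted score and the best possible score.
    quality_gap = max_quality - predicted_quality
    return (weights.adversarial * adv_loss
            + weights.classification * cls_loss
            + weights.reconstruction * reconstruction
            + weights.quality * quality_gap)


if __name__ == "__main__":
    loss = generator_objective(0.5, 0.25, [1.0, 2.0], [1.0, 2.5], 4.0)
    print(loss)
```

In this sketch the quality term is what makes the predictor act as a human-in-the-loop signal: a higher predicted score lowers the loss, so the generator is pushed toward outputs the predictor rates well while the reconstruction term keeps the content close to the input.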

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing
Categories: Acoustics; Engineering, Electrical & Electronic

CiteScore: 11.30
Self-citation rate: 11.10%
Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering, and document indexing and retrieval, as well as general language modeling.