Speaker Identity Recognition by Acoustic and Visual Data Fusion through Personal Privacy for Smart Care and Service Applications

Impact factor 0.5 · Zone 4 (Computer Science) · JCR Q4 (Imaging Science & Photographic Technology)

Journal of Imaging Science and Technology Pub Date : 2020-07-01 DOI:10.2352/j.imagingsci.technol.2020.64.4.040404

I. Ding, C.-M. Ruan

Abstract

With rapid developments in Internet of Things technologies, smart service applications such as voice-command speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and may become a requirement in smart home and office environments. In such intelligent applications, recognizing the identity of a specific person in an indoor space is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. Visual information obtained from face detection is incorporated into the acoustic Gaussian likelihood calculations used to construct speaker classification trees, which enhances the Gaussian mixture model (GMM)-based speaker recognition method. The study takes the privacy of the monitored person into account and reduces the degree of surveillance. A Kinect sensor, which contains a microphone array, is used to capture the person's voice. The approach deploys only two cameras in the indoor space to perform face detection and quickly determine the total number of people present. This head count is then used to regulate the design of the GMM speaker classification tree. Two face-detection-regulated schemes are presented: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). GMM-BT and GMM-NBT achieve identity recognition rates of 84.28% and 83%, respectively, both higher than the 80.5% of the conventional GMM approach (gains of 3.78 and 2.5 percentage points). Moreover, because the approach needs only face detection and not the far more expensive face recognition used in general audio-visual speaker recognition, it remains fast, adding only 0.051 s to the average recognition time.
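The abstract does not say how the face count shapes the GMM-BT or GMM-NBT trees, so the following is only a minimal sketch of the two ingredients it names: scoring an utterance against per-speaker GMMs, and using a detected head count to limit the candidates. The function names (`gmm_avg_loglik`, `identify`, `toy_model`), the shortlist rule, and all model parameters are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                        # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)    # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)    # (T, K)
    log_weighted = np.log(weights) + log_comp                            # (T, K)
    # Numerically stable log-sum-exp over the K mixture components.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(frame_ll.mean())


def identify(frames, speaker_models, face_count=None):
    """Rank enrolled speakers by GMM score.

    If face_count is given, keep only that many top-ranked candidates.
    This pruning rule is a hypothetical stand-in, not the paper's tree design.
    """
    scores = {name: gmm_avg_loglik(frames, *params)
              for name, params in speaker_models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    if face_count is not None:
        ranked = ranked[:max(1, face_count)]
    return ranked[0], ranked


def toy_model(center):
    """Two-component, unit-variance GMM near `center` (made-up parameters)."""
    weights = np.array([0.5, 0.5])
    means = np.array([[center, center], [center + 0.5, center + 0.5]])
    variances = np.ones((2, 2))
    return weights, means, variances


# Synthetic stand-ins for enrolled speakers and for acoustic feature frames.
speaker_models = {"A": toy_model(0.0), "B": toy_model(3.0), "C": toy_model(-3.0)}
rng = np.random.default_rng(0)
frames = rng.normal(loc=3.2, scale=1.0, size=(200, 2))

best, shortlist = identify(frames, speaker_models, face_count=2)
print(best, shortlist)
```

With `face_count=None` the function reduces to the conventional GMM decision, which picks the enrolled speaker whose model gives the highest likelihood. A real system would train each speaker's GMM on acoustic features such as MFCCs instead of using hand-set parameters.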

Source journal

Journal of Imaging Science and Technology (Engineering and Technology: Imaging Science and Photographic Technology)

CiteScore: 2.00

Self-citation rate: 10.00%

Review time: >12 weeks

About the journal: Typical issues include research papers and/or comprehensive reviews from a variety of topical areas. In the spirit of fostering constructive scientific dialog, the Journal accepts Letters to the Editor commenting on previously published articles. Periodically the Journal features a Special Section containing a group of related (usually invited) papers introduced by a Guest Editor. Imaging research topics covered in JIST include: digital fabrication and biofabrication; digital printing technologies; 3D imaging (capture, display, and print); augmented and virtual reality systems; mobile imaging; computational and digital photography; machine vision and learning; data visualization and analysis; image and video quality evaluation; color image science; image archiving, permanence, and security; and imaging applications including astronomy, medicine, sports, and autonomous vehicles.