Speaker Identity Recognition by Acoustic and Visual Data Fusion through Personal Privacy for Smart Care and Service Applications

Impact factor: 0.6 · Zone 4 (Computer Science) · JCR Q4 (Imaging Science & Photographic Technology)
I. Ding, C.-M. Ruan
Journal of Imaging Science and Technology, vol. 64, pp. 40404-1 to 40404-16. Published 2020-07-01. Journal Article.
DOI: https://doi.org/10.2352/j.imagingsci.technol.2020.64.4.040404

Abstract

With rapid developments in internet of things techniques, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and may become a requirement in smart home or office environments. In such intelligent applications, recognizing the identity of a specific member in an indoor space is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. Visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations to construct speaker classification trees, which enhance the Gaussian mixture model (GMM)-based speaker recognition method. The study considered the privacy of the monitored person and reduced the degree of surveillance. A Kinect sensor device containing a microphone array was adopted to obtain acoustic voice data from the person. The proposed approach deploys only two cameras in a specific indoor space to perform face detection and quickly determine the total number of people in that space. This people count was then used to regulate the design of the GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively; both are higher than the rate of the conventional GMM approach (80.5%), by 3.78 and 2.5 percentage points. Moreover, because the proposed approach does not require the extremely complex face recognition calculations used in general audio-visual speaker recognition tasks, it is rapid and efficient, with only a slight increment of 0.051 s in the average recognition time.
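To make the "conventional GMM approach" mentioned in the abstract concrete, the following is a minimal sketch of maximum-likelihood speaker identification over per-speaker GMMs: each enrolled speaker has a mixture model, an utterance is scored by summing the per-frame log-likelihoods, and the highest-scoring speaker is chosen. It is an illustration under stated assumptions, not the authors' implementation. The speaker names, the toy 2-D feature vectors (standing in for real acoustic features), and the hand-set diagonal-covariance parameters are all hypothetical. The abstract does not describe how the face-detection people count shapes the GMM-BT and GMM-NBT trees, so that step is not reproduced here.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Sum of per-frame log-likelihoods under one speaker's GMM.

    gmm is a list of (weight, mean, var) components; weights sum to 1.
    """
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def identify_speaker(frames, speaker_models):
    """Return the speaker whose GMM gives the highest log-likelihood."""
    scores = {
        name: gmm_log_likelihood(frames, gmm)
        for name, gmm in speaker_models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical, hand-set models: two components per speaker, 2-D features.
speaker_models = {
    "alice": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "bob": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}

# Toy utterance whose frames lie near the "alice" components.
frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
best, scores = identify_speaker(frames, speaker_models)
print(best, scores)
```

In a real system the model parameters would be estimated from enrollment speech rather than set by hand, and the decision would run over all enrolled speakers. The tree schemes in the paper restructure that decision, using the number of people detected in the room.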
Source journal
Journal of Imaging Science and Technology (Engineering and Technology: Imaging Science and Photographic Technology)
CiteScore: 2.00
Self-citation rate: 10.00%
Articles published: 45
Review time: >12 weeks
About the journal: Typical issues include research papers and/or comprehensive reviews from a variety of topical areas. In the spirit of fostering constructive scientific dialog, the Journal accepts Letters to the Editor commenting on previously published articles. Periodically the Journal features a Special Section containing a group of related (usually invited) papers introduced by a Guest Editor. Imaging research topics that have coverage in JIST include: Digital fabrication and biofabrication; Digital printing technologies; 3D imaging: capture, display, and print; Augmented and virtual reality systems; Mobile imaging; Computational and digital photography; Machine vision and learning; Data visualization and analysis; Image and video quality evaluation; Color image science; Image archiving, permanence, and security; Imaging applications including astronomy, medicine, sports, and autonomous vehicles.