Speech Section Extraction Method Using Image and Voice Information

Etsuro Nakamura, Y. Kageyama, S. Hirose
{"title":"Speech Section Extraction Method Using Image and Voice Information","authors":"Etsuro Nakamura, Y. Kageyama, S. Hirose","doi":"10.12792/ICIAE2021.008","DOIUrl":null,"url":null,"abstract":"Meeting minutes are useful for efficient operations and meetings. The labor cost and time for taking the minutes can be minimized using a system that can assign a speaker to the minutes. An automatic speaker identification method using lip movements and speech data obtained by an omnidirectional camera was developed. To improve the accuracy of speaker identification, it is necessary to use the extracted data of speech sections. The proposed speech section extraction method was studied as a preprocessor for speaker identification. The proposed method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voices, and iii) discrimination of speech sections using these extraction results. In the extraction of speech sections using lip movements, the nose width was used for a threshold for automatic calculation. The speech sections can be extracted, even when the distance between the camera and the subject changes, by using a threshold based on the width of the nose. Finally, 11 sentences of speech video data (154 data) of 14 subjects were used to evaluate the usefulness of the method. The evaluation result obtained was a high F-measure of 0.96 on average. The results reveal that the proposed method can extract speech sections, even when the distance between the camera and the subject changes.","PeriodicalId":161085,"journal":{"name":"The Proceedings of The 9th IIAE International Conference on Industrial Application Engineering 2020","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Proceedings of The 9th IIAE International Conference on Industrial Application Engineering 2020","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12792/ICIAE2021.008","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Meeting minutes are useful for running meetings and operations efficiently. The labor cost and time required to take minutes can be minimized by a system that assigns each statement in the minutes to its speaker. An automatic speaker identification method using lip movements and speech data obtained with an omnidirectional camera was developed. To improve the accuracy of speaker identification, it is necessary to work on data extracted from speech sections, so the proposed speech section extraction method was studied as a preprocessing step for speaker identification. The method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voice, and iii) discrimination of speech sections from these two extraction results. In the extraction of speaking frames from lip movements, the nose width is used to calculate the threshold automatically. Because the threshold is based on the nose width, speech sections can be extracted even when the distance between the camera and the subject changes. Finally, speech video data of 11 sentences spoken by 14 subjects (154 recordings) were used to evaluate the usefulness of the method. The evaluation yielded a high F-measure of 0.96 on average. The results reveal that the proposed method can extract speech sections even when the distance between the camera and the subject changes.
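The abstract describes the pipeline only at a high level. As a minimal sketch (not the authors' implementation), the Python below illustrates how the three processes could fit together: a lip-opening threshold scaled by the nose width, a simple energy-based voice-activity check standing in for the paper's voice extraction, and an AND-combination that merges surviving frames into sections. The coefficient ALPHA, the minimum section length, and all function names are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical sketch of the three-step pipeline from the abstract.
# ALPHA, min_len, and the energy-based voice check are assumptions,
# not values or methods taken from the paper.

ALPHA = 0.6  # assumed: a lip opening above ALPHA * nose width counts as "speaking"

def speaking_frames_from_lips(lip_heights, nose_widths):
    """Step i: flag frames whose lip opening exceeds a nose-width-scaled
    threshold, making the decision robust to camera-subject distance."""
    lip_heights = np.asarray(lip_heights, dtype=float)
    nose_widths = np.asarray(nose_widths, dtype=float)
    return lip_heights > ALPHA * nose_widths

def speaking_frames_from_voice(frame_energies, noise_floor):
    """Step ii: flag frames whose short-time audio energy exceeds an
    assumed noise floor (a simple stand-in for the voice extraction)."""
    return np.asarray(frame_energies, dtype=float) > noise_floor

def speech_sections(lip_flags, voice_flags, min_len=5):
    """Step iii: keep frames supported by both cues and merge runs of at
    least min_len consecutive frames into (start, end) sections."""
    both = np.logical_and(lip_flags, voice_flags)
    sections, start = [], None
    for i, flag in enumerate(both):
        if flag and start is None:
            start = i                      # a run of speaking frames begins
        elif not flag and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                sections.append((start, i - 1))
            start = None
    if start is not None and len(both) - start >= min_len:
        sections.append((start, len(both) - 1))
    return sections
```

In practice the per-frame lip heights and nose widths would come from facial landmarks in the omnidirectional video, and the frame energies from the synchronized audio track; ALPHA and min_len would need tuning on labeled data.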