Speech Section Extraction Method Using Image and Voice Information

Etsuro Nakamura, Y. Kageyama, S. Hirose
{"title":"Speech Section Extraction Method Using Image and Voice Information","authors":"Etsuro Nakamura, Y. Kageyama, S. Hirose","doi":"10.12792/ICIAE2021.008","DOIUrl":null,"url":null,"abstract":"Meeting minutes are useful for efficient operations and meetings. The labor cost and time for taking the minutes can be minimized using a system that can assign a speaker to the minutes. An automatic speaker identification method using lip movements and speech data obtained by an omnidirectional camera was developed. To improve the accuracy of speaker identification, it is necessary to use the extracted data of speech sections. The proposed speech section extraction method was studied as a preprocessor for speaker identification. The proposed method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voices, and iii) discrimination of speech sections using these extraction results. In the extraction of speech sections using lip movements, the nose width was used for a threshold for automatic calculation. The speech sections can be extracted, even when the distance between the camera and the subject changes, by using a threshold based on the width of the nose. Finally, 11 sentences of speech video data (154 data) of 14 subjects were used to evaluate the usefulness of the method. The evaluation result obtained was a high F-measure of 0.96 on average. The results reveal that the proposed method can extract speech sections, even when the distance between the camera and the subject changes.","PeriodicalId":161085,"journal":{"name":"The Proceedings of The 9th IIAE International Conference on Industrial Application Engineering 2020","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Proceedings of The 9th IIAE International Conference on Industrial Application Engineering 2020","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12792/ICIAE2021.008","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Meeting minutes are useful for running meetings and operations efficiently. The labor cost and time required to take minutes can be minimized by a system that assigns each statement in the minutes to its speaker. An automatic speaker identification method using lip movements and speech data obtained with an omnidirectional camera was developed. To improve the accuracy of speaker identification, it is necessary to work on data extracted from speech sections, so the proposed speech section extraction method was studied as a preprocessing step for speaker identification. The method consists of three processes: i) extraction of speaking frames using lip movements, ii) extraction of speaking frames using voice, and iii) discrimination of speech sections from these two extraction results. In the extraction of speaking frames from lip movements, the nose width is used to calculate the threshold automatically. Because the threshold is based on the nose width, speech sections can be extracted even when the distance between the camera and the subject changes. Finally, speech video data of 11 sentences spoken by 14 subjects (154 recordings) were used to evaluate the usefulness of the method. The evaluation yielded a high F-measure of 0.96 on average. The results reveal that the proposed method can extract speech sections even when the distance between the camera and the subject changes.
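The abstract describes the pipeline only at a high level. As a minimal sketch (not the authors' implementation), the Python below illustrates how the three processes could fit together: a lip-opening threshold scaled by the nose width, a simple energy-based voice-activity check standing in for the paper's voice extraction, and an AND-combination that merges surviving frames into sections. The coefficient ALPHA, the minimum section length, and all function names are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical sketch of the three-step pipeline from the abstract.
# ALPHA, min_len, and the energy-based voice check are assumptions,
# not values or methods taken from the paper.

ALPHA = 0.6  # assumed: a lip opening above ALPHA * nose width counts as "speaking"

def speaking_frames_from_lips(lip_heights, nose_widths):
    """Step i: flag frames whose lip opening exceeds a nose-width-scaled
    threshold, making the decision robust to camera-subject distance."""
    lip_heights = np.asarray(lip_heights, dtype=float)
    nose_widths = np.asarray(nose_widths, dtype=float)
    return lip_heights > ALPHA * nose_widths

def speaking_frames_from_voice(frame_energies, noise_floor):
    """Step ii: flag frames whose short-time audio energy exceeds an
    assumed noise floor (a simple stand-in for the voice extraction)."""
    return np.asarray(frame_energies, dtype=float) > noise_floor

def speech_sections(lip_flags, voice_flags, min_len=5):
    """Step iii: keep frames supported by both cues and merge runs of at
    least min_len consecutive frames into (start, end) sections."""
    both = np.logical_and(lip_flags, voice_flags)
    sections, start = [], None
    for i, flag in enumerate(both):
        if flag and start is None:
            start = i                      # a run of speaking frames begins
        elif not flag and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                sections.append((start, i - 1))
            start = None
    if start is not None and len(both) - start >= min_len:
        sections.append((start, len(both) - 1))
    return sections
```

In practice the per-frame lip heights and nose widths would come from facial landmarks in the omnidirectional video, and the frame energies from the synchronized audio track; ALPHA and min_len would need tuning on labeled data.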