An Active Speaker Detection Method in Videos using Standard Deviations of Color Histogram

2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG) Pub Date : 2023-04-05 DOI:10.1109/SEB-SDG57117.2023.10124488

A. Akinrinmade, E. Adetiba, J. Badejo, Oluwadamilola Oshin



Abstract



Active Speaker Detection (ASD) predicts which of the people whose faces appear in a video, if any, is speaking at any given point in time. This work presents a novel algorithm that detects the active speakers in a video using the standard deviations of Color Histograms (CHs) computed over the mouth region from one frame to the next. The method relies on the assumption that the lips of an active speaker are in motion: they open and close, revealing the inner parts of the mouth, such as the tongue, the teeth, and the vocal cavity, which differ in color. Existing algorithms can be used to detect the mouth region. This region can then be analyzed for changes in color activity to predict whether a person is speaking. If a person is not speaking, the lips are at rest and the CH of that person's mouth region remains stable, so the standard deviations for that region are negligible. A threshold that separates speaking from non-speaking can therefore be determined experimentally. This paper uses 53 online videos from the Channels TV station, from which 250 video clips were created. Each clip is between 15 and 60 seconds long, for a total of 3.6 hours. Each video contains the faces of at most two speakers, in no particular order: sometimes only one speaker's face appears, and at other times both appear. Whether each speaker was active was manually labeled for use in evaluating the proposed algorithm. The method predicted the active speakers with an accuracy of 99.19%.
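The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the color space, the number of histogram bins, how the per-bin standard deviations are combined, or the threshold value, so the function names, the 16-bin RGB histograms, the mean over bins, and the threshold below are all assumptions chosen for illustration. The mouth crops are assumed to come from an existing mouth-region detector.

```python
import numpy as np


def color_histogram(mouth_roi, bins=16):
    """Per-channel histogram of an HxWx3 uint8 mouth crop.

    Each channel's histogram is normalized to sum to 1 so that crops
    of different sizes are comparable (an assumption, not from the paper).
    """
    channels = []
    for c in range(mouth_roi.shape[2]):
        counts, _ = np.histogram(mouth_roi[..., c], bins=bins, range=(0, 256))
        channels.append(counts / max(counts.sum(), 1))
    return np.concatenate(channels)


def histogram_std_score(mouth_rois, bins=16):
    """Standard deviation of each histogram bin across frames, averaged over bins."""
    hists = np.stack([color_histogram(roi, bins) for roi in mouth_rois])
    return float(hists.std(axis=0).mean())


def is_speaking(mouth_rois, threshold, bins=16):
    """Flag a speaker as active when the score exceeds a hypothetical threshold."""
    return histogram_std_score(mouth_rois, bins) > threshold


# Synthetic data: a mouth at rest versus one that alternately opens and closes.
closed = np.full((20, 40, 3), 120, dtype=np.uint8)
opened = closed.copy()
opened[5:15, 10:30] = 30  # darker patch standing in for the open mouth

still_frames = [closed] * 10
moving_frames = [closed, opened] * 5

print(is_speaking(still_frames, threshold=1e-3))
print(is_speaking(moving_frames, threshold=1e-3))
```

In this toy data the resting mouth produces identical histograms in every frame, so its score is zero, while the alternating frames shift mass between histogram bins and produce a positive score. In practice the threshold would have to be tuned on labeled clips, as the abstract indicates.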
