Salient Face Prediction without Bells and Whistles

2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA) Pub Date: 2022-11-30 DOI: 10.1109/DICTA56598.2022.10034571


Citations: 0

Abstract


Salient Face Prediction without Bells and Whistles

Salient face prediction in multiple-face videos is a fundamental task in machine vision, with applications such as video editing and human-machine interaction. The field has progressed significantly in recent years, backed by large datasets consisting specifically of multi-face videos. As our first contribution, we show that a visual-only baseline is promising, achieving state-of-the-art results for salient face prediction. This result motivates reconsidering the need for sophisticated multimodal, multi-stream architectures. We further show that a simple upstream task such as active speaker detection gives a reasonable baseline and matches prior models tailored to detecting salient faces. Moreover, we bring to light inconsistencies in evaluation strategies, highlighting the need for standardization, and we propose a ranking-based evaluation for the task. Overall, our work motivates a fundamental course correction before the search for novel architectures and frameworks resumes.
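The abstract proposes a ranking-based evaluation but does not say which metric is used. As a rough illustration only, the sketch below shows one way such an evaluation could look for a single frame: faces are ranked by predicted saliency score and compared with a ground-truth ranking using Spearman rank correlation and a top-1 match. The function names, the choice of Spearman correlation, and the example scores are all assumptions made for illustration, not the paper's protocol.

```python
from typing import Sequence


def rank_descending(scores: Sequence[float]) -> list[float]:
    """Return 1-based ranks (1 = highest score); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all faces tied with the face at `pos`.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def spearman(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Spearman rank correlation between predicted and ground-truth face scores."""
    rp, rt = rank_descending(pred), rank_descending(truth)
    n = len(rp)
    mean_p, mean_t = sum(rp) / n, sum(rt) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rp, rt))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_t = sum((b - mean_t) ** 2 for b in rt)
    if var_p == 0 or var_t == 0:
        return 0.0  # Assumed convention: correlation is undefined when all ranks tie.
    return cov / (var_p * var_t) ** 0.5


def top1_match(pred: Sequence[float], truth: Sequence[float]) -> bool:
    """True if the highest-scored predicted face is also the most salient in the ground truth."""
    best_pred = max(range(len(pred)), key=pred.__getitem__)
    best_truth = max(range(len(truth)), key=truth.__getitem__)
    return best_pred == best_truth


# Hypothetical per-face scores for one frame with three faces.
predicted = [0.7, 0.2, 0.1]
ground_truth = [0.6, 0.3, 0.1]
print(spearman(predicted, ground_truth), top1_match(predicted, ground_truth))
```

A rank-based comparison of this kind depends only on the ordering of faces within a frame, not on how the scores are scaled, which is one plausible reason to prefer it when methods normalize their outputs differently.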


Source journal

2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)

Self-citation rate

0.00%

Number of publications