Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

M. Tellamekala, M. Valstar, Michael P. Pound, T. Giesbrecht
{"title":"Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning","authors":"M. Tellamekala, M. Valstar, Michael P. Pound, T. Giesbrecht","doi":"10.1109/ICPR48806.2021.9413295","DOIUrl":null,"url":null,"abstract":"Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit only either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences separately. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce ‘Audio-Visual Permutative Predictive Coding’ (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, AV-PPC visual encoder achieves 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"89 1","pages":"9912-9919"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 25th International Conference on Pattern Recognition (ICPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR48806.2021.9413295","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Self-supervised learning has emerged as a candidate approach to learning semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both kinds of correspondence may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce ‘Audio-Visual Permutative Predictive Coding’ (AV-PPC), a multi-task learning framework designed to fully leverage temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. Using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task learns higher-quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further fine-tuning, the AV-PPC visual encoder achieves an 80.30% spoken-word classification rate on the LRW dataset, performing on par with directly supervised visual encoders learned from large amounts of labeled data.
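The abstract describes a multi-task objective that combines intra-modal (temporal) and cross-modal (audio-visual) predictive coding, but does not spell out its formulation. The sketch below is one plausible illustration of such a combined objective, assuming InfoNCE-style contrastive losses; the names `info_nce` and `av_ppc_style_loss`, the linear `predictor`, and the alpha/beta weights are hypothetical placeholders, not the authors' AV-PPC implementation, and the permutative sub-task design is omitted.

```python
# Minimal sketch of a multi-task predictive-coding objective combining an
# intra-modal (temporal) term and a cross-modal (audio-visual) term.
# NOTE: illustrative assumption only; not the authors' AV-PPC code.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss: each query should match the key at the same batch index."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)

def av_ppc_style_loss(visual_feats, audio_feats, predictor, alpha=1.0, beta=1.0):
    """
    visual_feats: (B, T, D) per-frame visual embeddings (encoder outputs)
    audio_feats:  (B, T, D) per-frame audio embeddings
    predictor:    module predicting the next visual embedding from the current one
    """
    # Intra-modal (temporal) term: predict the embedding of frame t+1 from frame t.
    pred_next = predictor(visual_feats[:, :-1])           # (B, T-1, D)
    intra = info_nce(pred_next.reshape(-1, pred_next.size(-1)),
                     visual_feats[:, 1:].reshape(-1, visual_feats.size(-1)))

    # Cross-modal term: match clip-level visual and audio summaries.
    cross = info_nce(visual_feats.mean(dim=1), audio_feats.mean(dim=1))

    return alpha * intra + beta * cross

# Usage with random tensors standing in for encoder outputs.
B, T, D = 8, 16, 128
visual = torch.randn(B, T, D, requires_grad=True)
audio = torch.randn(B, T, D, requires_grad=True)
predictor = torch.nn.Linear(D, D)
loss = av_ppc_style_loss(visual, audio, predictor)
loss.backward()
```

The relative weighting of the two terms (alpha and beta here) is one place where, as the abstract notes, jointly learning intra- and cross-modal correspondences becomes non-trivial in practice.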