Ching-Yung Lin, M. Naphade, A. Natsev, C. Neti, John R. Smith, Belle L. Tseng, H. Nock, W. H. Adams
{"title":"User-trainable video annotation using multimodal cues","authors":"Ching-Yung Lin, M. Naphade, A. Natsev, C. Neti, John R. Smith, Belle L. Tseng, H. Nock, W. H. Adams","doi":"10.1145/860435.860522","DOIUrl":null,"url":null,"abstract":"This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result shows speech-only models give performance comparable to our best video-only models for detecting visual concepts such as \"outdoors\", \"face\" and \"cityscape\".","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"110 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/860435.860522","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result shows speech-only models give performance comparable to our best video-only models for detecting visual concepts such as "outdoors", "face" and "cityscape".
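The core idea in the abstract — representing each video shot by the confidence scores of a pre-deployed set of multimodal concept models, then training a classifier for a new user-defined concept in that score space — can be sketched as below. This is a minimal illustration, not the paper's actual system: the data is synthetic, the number of pre-deployed models is arbitrary, and the logistic-regression classifier trained by gradient descent is an illustrative stand-in for whichever classifier the framework actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: each video shot is represented by the confidence
# scores of K pre-deployed multimodal concept models (synthetic data here).
K = 10   # number of pre-deployed models (assumed for illustration)
N = 200  # number of annotated training shots

X = rng.random((N, K))  # score-space features in [0, 1]

# Hypothetical ground truth for a new user-defined concept: here the
# concept happens to correlate with the first two pre-deployed scores.
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# Train a logistic-regression classifier in the score space via plain
# gradient descent (a stand-in for the paper's unspecified classifier).
w = np.zeros(K)
b = 0.0
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= lr * (X.T @ (p - y)) / N
    b -= lr * np.mean(p - y)

# Annotate a new shot from its vector of pre-deployed model scores.
new_shot_scores = rng.random(K)
prob = 1.0 / (1.0 + np.exp(-(new_shot_scores @ w + b)))
print(f"P(user-defined concept | scores) = {prob:.3f}")
```

The appeal of the score-space formulation is that annotating a new concept needs only a lightweight classifier over K existing model outputs, rather than retraining any of the underlying audio, speech, or visual models.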