Ching-Yung Lin, M. Naphade, A. Natsev, C. Neti, John R. Smith, Belle L. Tseng, H. Nock, W. H. Adams
{"title":"User-trainable video annotation using multimodal cues","authors":"Ching-Yung Lin, M. Naphade, A. Natsev, C. Neti, John R. Smith, Belle L. Tseng, H. Nock, W. H. Adams","doi":"10.1145/860435.860522","DOIUrl":null,"url":null,"abstract":"This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result shows speech-only models give performance comparable to our best video-only models for detecting visual concepts such as \"outdoors\", \"face\" and \"cityscape\".","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"110 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/860435.860522","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result shows speech-only models give performance comparable to our best video-only models for detecting visual concepts such as "outdoors", "face" and "cityscape".
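The core idea in the abstract — representing each video shot by the confidence scores of a pre-deployed set of multimodal concept models, then training a classifier for a new user-defined concept in that score space — can be sketched as below. This is a minimal illustration, not the paper's actual system: the data is synthetic, the number of pre-deployed models is arbitrary, and the logistic-regression classifier trained by gradient descent is an illustrative stand-in for whichever classifier the framework actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: each video shot is represented by the confidence
# scores of K pre-deployed multimodal concept models (synthetic data here).
K = 10   # number of pre-deployed models (assumed for illustration)
N = 200  # number of annotated training shots

X = rng.random((N, K))  # score-space features in [0, 1]

# Hypothetical ground truth for a new user-defined concept: here the
# concept happens to correlate with the first two pre-deployed scores.
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# Train a logistic-regression classifier in the score space via plain
# gradient descent (a stand-in for the paper's unspecified classifier).
w = np.zeros(K)
b = 0.0
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= lr * (X.T @ (p - y)) / N
    b -= lr * np.mean(p - y)

# Annotate a new shot from its vector of pre-deployed model scores.
new_shot_scores = rng.random(K)
prob = 1.0 / (1.0 + np.exp(-(new_shot_scores @ w + b)))
print(f"P(user-defined concept | scores) = {prob:.3f}")
```

The appeal of the score-space formulation is that annotating a new concept needs only a lightweight classifier over K existing model outputs, rather than retraining any of the underlying audio, speech, or visual models.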