{"title":"基于语音的视觉概念学习","authors":"Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun","doi":"10.1109/ICME.2005.1521627","DOIUrl":null,"url":null,"abstract":"Modeling visual concepts using supervised or unsupervised machine learning approaches are becoming increasing important for video semantic indexing, retrieval, and filtering applications. Naturally, videos include multimodality data such as audio, speech, visual and text, which are combined to infer therein the overall semantic concepts. However, in the literature, most researches were conducted within only one single domain. In this paper we propose an unsupervised technique that builds context-independent keyword lists for desired visual concept modeling using WordNet. Furthermore, we propose an extended speech-based visual concept (ESVC) model to reorder and extend the above keyword lists by supervised learning based on multimodality annotation. Experimental results show that the context-independent models can achieve comparable performance compared to conventional supervised learning algorithms, and the ESVC model achieves about 53% and 28.4% improvement in two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Speech-Based Visual Concept Learning Using Wordnet\",\"authors\":\"Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun\",\"doi\":\"10.1109/ICME.2005.1521627\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modeling visual concepts using supervised or unsupervised machine learning approaches are becoming increasing important for video semantic indexing, retrieval, and filtering applications. Naturally, videos include multimodality data such as audio, speech, visual and text, which are combined to infer therein the overall semantic concepts. However, in the literature, most researches were conducted within only one single domain. In this paper we propose an unsupervised technique that builds context-independent keyword lists for desired visual concept modeling using WordNet. Furthermore, we propose an extended speech-based visual concept (ESVC) model to reorder and extend the above keyword lists by supervised learning based on multimodality annotation. 
Experimental results show that the context-independent models can achieve comparable performance compared to conventional supervised learning algorithms, and the ESVC model achieves about 53% and 28.4% improvement in two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm\",\"PeriodicalId\":244360,\"journal\":{\"name\":\"2005 IEEE International Conference on Multimedia and Expo\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2005 IEEE International Conference on Multimedia and Expo\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME.2005.1521627\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2005 IEEE International Conference on Multimedia and Expo","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME.2005.1521627","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Speech-Based Visual Concept Learning Using Wordnet
Modeling visual concepts with supervised or unsupervised machine learning is becoming increasingly important for video semantic indexing, retrieval, and filtering applications. Videos naturally contain multimodal data such as audio, speech, visual content, and text, which can be combined to infer the overall semantic concepts. In the literature, however, most research has been conducted within a single modality. In this paper we propose an unsupervised technique that uses WordNet to build context-independent keyword lists for modeling desired visual concepts. Furthermore, we propose an extended speech-based visual concept (ESVC) model that reorders and extends these keyword lists through supervised learning based on multimodal annotation. Experimental results show that the context-independent models achieve performance comparable to conventional supervised learning algorithms, and that the ESVC model achieves improvements of about 53% and 28.4% on two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm.
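The abstract only outlines how a context-independent keyword list is built from WordNet. As a rough, hedged illustration (not the authors' exact procedure), the sketch below uses NLTK's WordNet interface to expand a seed concept into a keyword list from its synonym and hyponym lemmas; the depth limit, the choice of NLTK, and the seed concept "basketball" are all assumptions for illustration. Such a list could then be matched against speech transcripts to score shots for the concept.

# Minimal sketch, assuming NLTK with the WordNet corpus downloaded
# (nltk.download('wordnet')); illustrative only, not the paper's method.
from nltk.corpus import wordnet as wn

def build_keyword_list(concept, max_depth=2):
    """Collect lemma names of the concept's noun synsets and their hyponyms."""
    keywords = set()
    frontier = wn.synsets(concept, pos=wn.NOUN)
    for _ in range(max_depth):
        next_frontier = []
        for synset in frontier:
            # Add every surface form (lemma) of the current synset.
            keywords.update(l.replace('_', ' ') for l in synset.lemma_names())
            # Descend to more specific concepts for the next pass.
            next_frontier.extend(synset.hyponyms())
        frontier = next_frontier
    return sorted(keywords)

if __name__ == "__main__":
    # Hypothetical seed concept; prints the expanded keyword list.
    print(build_keyword_list("basketball"))

In the paper's pipeline, a list like this would serve as the unsupervised, context-independent model, while the ESVC model would then reweight and extend the entries using supervised learning over multimodal annotations.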