{"title":"基于语音的视觉概念学习","authors":"Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun","doi":"10.1109/ICME.2005.1521627","DOIUrl":null,"url":null,"abstract":"Modeling visual concepts using supervised or unsupervised machine learning approaches are becoming increasing important for video semantic indexing, retrieval, and filtering applications. Naturally, videos include multimodality data such as audio, speech, visual and text, which are combined to infer therein the overall semantic concepts. However, in the literature, most researches were conducted within only one single domain. In this paper we propose an unsupervised technique that builds context-independent keyword lists for desired visual concept modeling using WordNet. Furthermore, we propose an extended speech-based visual concept (ESVC) model to reorder and extend the above keyword lists by supervised learning based on multimodality annotation. Experimental results show that the context-independent models can achieve comparable performance compared to conventional supervised learning algorithms, and the ESVC model achieves about 53% and 28.4% improvement in two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Speech-Based Visual Concept Learning Using Wordnet\",\"authors\":\"Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun\",\"doi\":\"10.1109/ICME.2005.1521627\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modeling visual concepts using supervised or unsupervised machine learning approaches are becoming increasing important for video semantic indexing, retrieval, and filtering applications. Naturally, videos include multimodality data such as audio, speech, visual and text, which are combined to infer therein the overall semantic concepts. However, in the literature, most researches were conducted within only one single domain. In this paper we propose an unsupervised technique that builds context-independent keyword lists for desired visual concept modeling using WordNet. Furthermore, we propose an extended speech-based visual concept (ESVC) model to reorder and extend the above keyword lists by supervised learning based on multimodality annotation. 
Experimental results show that the context-independent models can achieve comparable performance compared to conventional supervised learning algorithms, and the ESVC model achieves about 53% and 28.4% improvement in two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm\",\"PeriodicalId\":244360,\"journal\":{\"name\":\"2005 IEEE International Conference on Multimedia and Expo\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2005 IEEE International Conference on Multimedia and Expo\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME.2005.1521627\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2005 IEEE International Conference on Multimedia and Expo","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME.2005.1521627","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Speech-Based Visual Concept Learning Using Wordnet
Modeling visual concepts with supervised or unsupervised machine learning is becoming increasingly important for video semantic indexing, retrieval, and filtering applications. Videos naturally contain multimodal data such as audio, speech, visual content, and text, which can be combined to infer the overall semantic concepts. In the literature, however, most research has been conducted within a single modality. In this paper we propose an unsupervised technique that uses WordNet to build context-independent keyword lists for modeling desired visual concepts. Furthermore, we propose an extended speech-based visual concept (ESVC) model that reorders and extends these keyword lists through supervised learning based on multimodal annotation. Experimental results show that the context-independent models achieve performance comparable to conventional supervised learning algorithms, and that the ESVC model achieves improvements of about 53% and 28.4% on two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm.
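The abstract only outlines how a context-independent keyword list is built from WordNet. As a rough, hedged illustration (not the authors' exact procedure), the sketch below uses NLTK's WordNet interface to expand a seed concept into a keyword list from its synonym and hyponym lemmas; the depth limit, the choice of NLTK, and the seed concept "basketball" are all assumptions for illustration. Such a list could then be matched against speech transcripts to score shots for the concept.

# Minimal sketch, assuming NLTK with the WordNet corpus downloaded
# (nltk.download('wordnet')); illustrative only, not the paper's method.
from nltk.corpus import wordnet as wn

def build_keyword_list(concept, max_depth=2):
    """Collect lemma names of the concept's noun synsets and their hyponyms."""
    keywords = set()
    frontier = wn.synsets(concept, pos=wn.NOUN)
    for _ in range(max_depth):
        next_frontier = []
        for synset in frontier:
            # Add every surface form (lemma) of the current synset.
            keywords.update(l.replace('_', ' ') for l in synset.lemma_names())
            # Descend to more specific concepts for the next pass.
            next_frontier.extend(synset.hyponyms())
        frontier = next_frontier
    return sorted(keywords)

if __name__ == "__main__":
    # Hypothetical seed concept; prints the expanded keyword list.
    print(build_keyword_list("basketball"))

In the paper's pipeline, a list like this would serve as the unsupervised, context-independent model, while the ESVC model would then reweight and extend the entries using supervised learning over multimodal annotations.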