{"title":"Web-scale Multimedia Search for Internet Video Content","authors":"Lu Jiang","doi":"10.1145/2835776.2855081","DOIUrl":null,"url":null,"abstract":"The Internet has been witnessing an explosion of video content. According to a Cisco study, video content is estimated to account for 80% of all the entire world's internet traffic by 2019. Video data are becoming one of the most valuable sources to assess information and knowledge. However, existing video search solutions are still based on text matching (text-to-text search), and could fail for the huge volumes of videos that have little relevant metadata or no metadata at all. The need for large-scale and intelligent video search, which bridges the gap between the user's information need and the video content, seems to be urgent. In this thesis, we propose an accurate, efficient and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding, and allows for intelligent and flexible search paradigms over the video content, including text-to-video and text&video-to-video search. Suppose our goal is to search the videos about birthday party. In traditional text-to-text queries, we have to search the keywords in the user-generated metadata (titles or descriptions). In a text-to-video query, however, we might look for visual clues in the video content such as \"cake\", \"gift\" and \"kids\", audio clues like \"birthday song\" and \"cheering sound\", or visible text like \"happy birthday\". Text-to-video queries are flexible and can be further refined by Boolean and temporal operators. After watching the retrieved videos, the user may select a few interesting videos to find more relevant videos like these. This can be achieved by issuing a text&video-to-video query which adds the selected video examples to the query. The proposed method provides a new dimension of looking at content-based video search, from finding a simple concept like \"puppy\" to searching a complex incident like \"a scene in urban area where people running away after an explosion\". To achieve this ambitious goal, we propose several novel methods focusing on accuracy, efficiency and scalability in the novel search paradigm. First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a novel and scalable approach to index semantic concepts that can significantly improve the search efficiency with minimum accuracy loss. Third, we design a novel video reranking algorithm that can boost accuracy for video retrieval. The extensive experiments demonstrate that the proposed methods are able to surpass state-of-the-art accuracy on multiple datasets. In addition, our method can efficiently scale up the search to hundreds of millions videos, and only takes about 0.2 second to search a semantic query on a collection of 100 million videos, 1 second to process a hybrid query over 1 million videos. Based on the proposed methods, we implement E-Lamp Lite, the first of its kind large-scale semantic search engine for Internet videos. According to National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) 2013, 2014 and 2015, the most representative task for content-based video search. 
To the best of our knowledge, E-Lamp Lite is the first content-based semantic search engine that is capable of indexing and searching a collection of 100 million videos.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2835776.2855081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The Internet has been witnessing an explosion of video content. According to a Cisco study, video is estimated to account for 80% of the world's internet traffic by 2019. Video data are becoming one of the most valuable sources for accessing information and knowledge. However, existing video search solutions are still based on text matching (text-to-text search) and can fail for the huge volumes of videos that have little relevant metadata or no metadata at all. Large-scale, intelligent video search that bridges the gap between the user's information need and the video content is therefore urgently needed. In this thesis, we propose an accurate, efficient and scalable search method for video content. Rather than matching text, the proposed method relies on automatic video content understanding and allows for intelligent and flexible search paradigms over the video content, including text-to-video and text&video-to-video search.

Suppose our goal is to search for videos about a birthday party. In a traditional text-to-text query, we have to match keywords against user-generated metadata (titles or descriptions). In a text-to-video query, however, we can look for visual clues in the video content such as "cake", "gift" and "kids", audio clues like "birthday song" and "cheering sound", or visible text like "happy birthday". Text-to-video queries are flexible and can be further refined by Boolean and temporal operators. After watching the retrieved videos, the user may select a few interesting ones in order to find more videos like them. This can be achieved by issuing a text&video-to-video query, which adds the selected video examples to the query. The proposed method offers a new perspective on content-based video search, from finding a simple concept like "puppy" to searching for a complex incident like "a scene in an urban area where people are running away after an explosion".

To achieve this ambitious goal, we propose several novel methods addressing accuracy, efficiency and scalability in this new search paradigm. First, we introduce a self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a scalable approach to indexing semantic concepts that significantly improves search efficiency with minimal accuracy loss. Third, we design a video reranking algorithm that boosts retrieval accuracy. Extensive experiments demonstrate that the proposed methods surpass state-of-the-art accuracy on multiple datasets. In addition, our method efficiently scales the search to hundreds of millions of videos: it takes only about 0.2 seconds to answer a semantic query on a collection of 100 million videos, and about 1 second to process a hybrid query over 1 million videos. Based on the proposed methods, we implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) evaluations of 2013, 2014 and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search engine that is capable of indexing and searching a collection of 100 million videos.
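To make the text-to-video query paradigm concrete, below is a minimal sketch, not the thesis implementation, of how such a query could be scored against pre-computed semantic concept detections. It assumes each video carries detector confidences for visual, audio and OCR concepts and that Boolean operators are evaluated with soft (fuzzy) logic; all names here (concept, AND, OR, NOT, search, the toy corpus) are illustrative.

```python
# Hypothetical sketch of scoring a text-to-video query over concept detections.
# Each video is a dict mapping concept names to confidences in [0, 1];
# a query is a tree of soft Boolean operators over those concepts.

from typing import Callable, Dict, List, Tuple

ConceptScores = Dict[str, float]                  # concept name -> detector confidence
QueryNode = Callable[[ConceptScores], float]      # evaluates one video to a relevance score


def concept(name: str) -> QueryNode:
    """Leaf node: look up one semantic concept's confidence (0 if absent)."""
    return lambda v: v.get(name, 0.0)


def AND(*nodes: QueryNode) -> QueryNode:
    """Soft conjunction: every clue should be present (minimum score)."""
    return lambda v: min(n(v) for n in nodes)


def OR(*nodes: QueryNode) -> QueryNode:
    """Soft disjunction: any clue counts (maximum score)."""
    return lambda v: max(n(v) for n in nodes)


def NOT(node: QueryNode) -> QueryNode:
    """Soft negation: penalize videos where the concept fires."""
    return lambda v: 1.0 - node(v)


def search(query: QueryNode, corpus: Dict[str, ConceptScores],
           top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank videos by the query's soft Boolean score."""
    ranked = sorted(((vid, query(scores)) for vid, scores in corpus.items()),
                    key=lambda p: p[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy corpus: per-video concept confidences from visual, audio and OCR channels.
    corpus = {
        "vid_001": {"cake": 0.9, "kids": 0.8, "birthday_song": 0.7},
        "vid_002": {"puppy": 0.95, "park": 0.6},
        "vid_003": {"cake": 0.4, "gift": 0.5, "cheering_sound": 0.6},
    }
    # "Birthday party" query: a visual clue AND an audio/text clue, no explosion.
    birthday = AND(OR(concept("cake"), concept("gift"), concept("kids")),
                   OR(concept("birthday_song"), concept("cheering_sound"),
                      concept("text_happy_birthday")),
                   NOT(concept("explosion")))
    print(search(birthday, corpus))
```

Min/max are used here simply as a common fuzzy-logic choice for AND/OR; the thesis may combine evidence differently.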
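Likewise, a minimal sketch of the efficiency idea, under the assumption that semantic concepts can be indexed much like terms in a text inverted index: each concept keeps a posting list of (video_id, confidence) pairs, so a query scans only the postings of its own concepts rather than every video. This illustrates the general idea only and is not the indexing scheme proposed in the thesis; the class and method names are hypothetical.

```python
# Hypothetical concept-based inverted index for semantic video search.

from collections import defaultdict
from typing import Dict, List, Tuple


class ConceptIndex:
    def __init__(self, min_score: float = 0.1):
        # Dropping low-confidence detections keeps posting lists short,
        # trading a little accuracy for a large gain in search efficiency.
        self.min_score = min_score
        self.postings: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def add_video(self, video_id: str, concept_scores: Dict[str, float]) -> None:
        """Index one video's concept detections above the confidence threshold."""
        for concept, score in concept_scores.items():
            if score >= self.min_score:
                self.postings[concept].append((video_id, score))

    def query(self, concepts: List[str], top_k: int = 10) -> List[Tuple[str, float]]:
        """Accumulate confidences per video over the queried concepts' postings only."""
        accumulator: Dict[str, float] = defaultdict(float)
        for c in concepts:
            for video_id, score in self.postings.get(c, []):
                accumulator[video_id] += score
        return sorted(accumulator.items(), key=lambda p: p[1], reverse=True)[:top_k]


if __name__ == "__main__":
    index = ConceptIndex()
    index.add_video("vid_001", {"cake": 0.9, "kids": 0.8})
    index.add_video("vid_002", {"puppy": 0.95})
    index.add_video("vid_003", {"cake": 0.4, "cheering_sound": 0.6})
    print(index.query(["cake", "kids", "birthday_song"]))
```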