Joint Saliency Estimation and Matching using Image Regions for Geo-Localization of Online Video

Freda Shi, Jia Chen, Alexander Hauptmann
{"title":"Joint Saliency Estimation and Matching using Image Regions for Geo-Localization of Online Video","authors":"Freda Shi, Jia Chen, Alexander Hauptmann","doi":"10.1145/3078971.3078996","DOIUrl":null,"url":null,"abstract":"In this paper, we study automatic geo-localization of online event videos. Different from general image localization task through matching, the appearance of an environment during significant events varies greatly from its daily appearance, since there are usually crowds, decorations or even destruction when a major event happens. This introduces a major challenge: matching the event environment to the daily environment, e.g. as recorded by Google Street View. We observe that some regions in the image, as part of the environment, still preserve the daily appearance even though the whole image (environment) looks quite different. Based on this observation, we formulate the problem as joint saliency estimation and matching at the image region level, as opposed to the key point or whole-image level. As image-level labels of daily environment are easily generated with GPS information, we treat region based saliency estimation and matching as a weakly labeled learning problem over the training data. Our solution is to iteratively optimize saliency and the region-matching model. For saliency optimization, we derive a closed form solution, which has an intuitive explanation. For region matching model optimization, we use self-paced learning to learn from the pseudo labels generated by (sub-optimal) saliency values. We conduct extensive experiments on two challenging public datasets: Boston Marathon 2013 and Tokyo Time Machine. Experimental results show that our solution significantly improves over matching on whole images and the automatically learned saliency is a strong predictor of distinctive building areas.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"104 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3078971.3078996","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

In this paper, we study automatic geo-localization of online event videos. Unlike the general image-localization-by-matching task, the appearance of an environment during a significant event differs greatly from its daily appearance, since crowds, decorations, or even destruction usually accompany a major event. This introduces a major challenge: matching the event environment to the daily environment, e.g., as recorded by Google Street View. We observe that some regions in the image, as part of the environment, still preserve their daily appearance even though the image (environment) as a whole looks quite different. Based on this observation, we formulate the problem as joint saliency estimation and matching at the image-region level, as opposed to the keypoint or whole-image level. Since image-level labels for the daily environment are easily generated from GPS information, we treat region-based saliency estimation and matching as a weakly labeled learning problem over the training data. Our solution iteratively optimizes the saliency values and the region-matching model. For saliency optimization, we derive a closed-form solution with an intuitive interpretation. For region-matching model optimization, we use self-paced learning to learn from the pseudo labels generated by the (sub-optimal) saliency values. We conduct extensive experiments on two challenging public datasets: Boston Marathon 2013 and Tokyo Time Machine. Experimental results show that our solution significantly improves over whole-image matching, and the automatically learned saliency is a strong predictor of distinctive building areas.
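The alternating scheme described above (update region saliency, then refine the region matcher via self-paced learning on pseudo labels) can be illustrated with a minimal sketch. The code below is not the authors' implementation: the cosine-similarity matcher, the proportional saliency update, and the loss-threshold pacing rule are simplified stand-ins assumed only for demonstration.

```python
# Minimal, illustrative sketch of the alternating optimization described in the
# abstract. The closed-form saliency update and the self-paced selection rule
# used here are simplified stand-ins, NOT the paper's exact formulation.
import numpy as np

def cosine_match(query_regions, reference_regions):
    """Best cosine-similarity match score for each query region against the
    reference (daily / street-view) regions."""
    q = query_regions / (np.linalg.norm(query_regions, axis=1, keepdims=True) + 1e-8)
    r = reference_regions / (np.linalg.norm(reference_regions, axis=1, keepdims=True) + 1e-8)
    return (q @ r.T).max(axis=1)

def update_saliency(match_scores):
    """Hypothetical closed-form-style update: saliency is taken proportional to
    how well a region matches the daily environment."""
    s = np.clip(match_scores, 0.0, None)
    return s / (s.sum() + 1e-8)

def self_paced_select(losses, pace):
    """Self-paced learning rule: keep only the 'easy' pseudo-labelled regions
    whose loss is below the current pace threshold."""
    return losses <= pace

def alternate_optimization(query_regions, reference_regions,
                           n_iters=5, pace=0.5, pace_growth=1.2):
    saliency = np.full(len(query_regions), 1.0 / len(query_regions))
    selected = np.ones(len(query_regions), dtype=bool)
    for _ in range(n_iters):
        # Step 1: match event regions against daily (street-view) regions.
        scores = cosine_match(query_regions, reference_regions)
        # Step 2: saliency update from match quality (stand-in for the closed form).
        saliency = update_saliency(scores)
        # Step 3: self-paced selection of reliable pseudo labels for the matcher.
        losses = 1.0 - scores
        selected = self_paced_select(losses, pace)
        pace *= pace_growth  # admit harder examples in later rounds
        # A real region-matching model would be retrained here on the selection.
    return saliency, selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sal, kept = alternate_optimization(rng.normal(size=(20, 128)),
                                       rng.normal(size=(50, 128)))
    print(sal.round(3), int(kept.sum()), "regions kept as pseudo labels")
```

In the paper's formulation the saliency step has a closed-form solution and the matching model itself is retrained on the self-paced selection; here a fixed cosine matcher stands in for that model to keep the sketch self-contained.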