A Domain Robust Approach For Image Dataset Construction
Yazhou Yao, Xiansheng Hua, Fumin Shen, Jian Zhang, Zhenmin Tang
Proceedings of the 24th ACM international conference on Multimedia, 2016-10-01
DOI: https://doi.org/10.1145/2964284.2967213
Citations: 38
Abstract
There has been growing research interest in automatically constructing image datasets by collecting images from the Internet. However, existing methods tend to have weak domain adaptation ability, known as the "dataset bias problem". To address this issue, we propose a novel image dataset construction framework that generalizes well to unseen target domains. Specifically, the given queries are first expanded by searching the Google Books Ngrams Corpora (GBNC) to obtain a richer semantic description, and the noisy query expansions are then filtered out. By treating each expansion as a "bag" and the images retrieved for it as "instances", we formulate image filtering as a multi-instance learning (MIL) problem with constrained positive bags. With this approach, images from different data distributions are kept while noisy images are filtered out. Comprehensive experiments on two challenging tasks demonstrate the effectiveness of the proposed approach.
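The bag-and-instance formulation in the abstract can be illustrated with a minimal sketch. The abstract does not specify the MIL objective or how the constraint on positive bags is enforced, so everything below is an assumption made for illustration: the `Bag` type, the `filter_bag` helper, the per-image relevance `scores`, the `threshold`, and the `min_positive_ratio` rule (read here as "every positive bag keeps at least a fixed fraction of its instances") are hypothetical and are not the authors' method.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bag:
    """One query expansion (the bag) and the images retrieved for it (the instances)."""
    expansion: str
    instances: List[str] = field(default_factory=list)  # image identifiers


def filter_bag(bag: Bag, scores: Dict[str, float],
               threshold: float = 0.5,
               min_positive_ratio: float = 0.3) -> List[str]:
    """Keep instances scoring at or above `threshold`, but never fewer than
    ceil(min_positive_ratio * bag size) of the top-scoring instances.

    `scores` maps image id -> relevance score from some classifier.
    The scorer, threshold, and ratio are hypothetical placeholders,
    not values taken from the paper.
    """
    # Rank instances from most to least relevant.
    ranked = sorted(bag.instances, key=lambda img: scores[img], reverse=True)
    min_keep = math.ceil(min_positive_ratio * len(ranked))
    kept = [img for img in ranked if scores[img] >= threshold]
    # Assumed constraint: a positive bag must retain a minimum share of instances.
    if len(kept) < min_keep:
        kept = ranked[:min_keep]
    return kept


# Hypothetical expansion of the query "dog" and made-up image scores.
bag = Bag("police dog", ["a", "b", "c", "d"])
scores = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.1}
kept = filter_bag(bag, scores)
```

Filtering each bag separately, rather than pooling all images under one query, is what would let images from visually different expansions survive, which matches the abstract's stated goal of keeping images from different data distributions.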