Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data

ACM/IMS Transactions on Data Science, vol. 2, no. 1, pp. 1-26. Pub Date: 2021-04-02. DOI: 10.1145/3420038

Yu Liu, Yangtao Wang, Lianli Gao, Chan Guo, Yanzhao Xie, Zhili Xiao


Citations: 2

Abstract

Data mining constantly faces a problem it can hardly solve: a dataset may contain little information that is meaningful for a given requirement. When faced with multiple unknown datasets, allocating data mining resources so as to acquire more of the desired data requires a data quality assessment framework based on the relevance between each dataset and the requirements. Such a framework helps the user judge the potential benefits in advance and thus optimize resource allocation across the candidate datasets. However, unstructured data (e.g., image data) often exists as dark data, which makes it difficult for the user to understand, in real time, how relevant a dataset's content is. Even if all data have label descriptions, efficiently measuring the relevance between data under semantic propagation remains an urgent problem. To address this, we propose a Deep Hash-based Relevance-aware Data Quality Assessment framework, which consists of off-line learning and relevance mining parts and an on-line assessing part. In the off-line part, we first design a Graph Convolution Network (GCN)-AutoEncoder hash (GAH) algorithm to recognize the data (i.e., lighten the dark data), then construct a graph with restricted Hamming distance, and finally design a Cluster PageRank (CPR) algorithm to calculate an importance score for each node (image), obtaining a relevance representation based on semantic propagation. In the on-line part, we first retrieve the importance score by hash code and then quickly reach the assessment conclusion from the importance list. On the one hand, introducing GCN and co-occurrence probability into GAH improves the perception of dark data. On the other hand, CPR exploits hash collisions to reduce the scale of the graph and the iteration matrix, which greatly decreases space and computing costs. We conduct extensive experiments on both single-label and multi-label datasets to assess the relevance between data and requirements and to test resource allocation. Experimental results show that our framework acquires the most desired data with the same mining resources. In addition, test results on the Tencent1M dataset demonstrate that the framework completes the assessment stably across different requirements.
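
The off-line graph and scoring steps described in the abstract can be illustrated with a small Python sketch. It is not the authors' CPR algorithm, whose details the abstract does not give. It assumes hash codes are already available as integers, merges images with identical codes into one node (the hash-collision idea), links nodes whose Hamming distance is at most a hypothetical `max_dist`, and runs a standard damped PageRank with transitions weighted by cluster size. The names `hamming` and `cluster_pagerank`, the size weighting, and all parameter defaults are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def cluster_pagerank(codes, max_dist=1, damping=0.85, iters=100):
    """Score each distinct hash code. Illustrative only, not the paper's CPR.

    codes: one integer hash code per image (duplicates allowed).
    Returns a dict mapping each distinct code to a score; scores sum to 1.
    """
    sizes = Counter(codes)  # images sharing a code collapse into one node
    nodes = sorted(sizes)

    # Undirected edges between distinct codes within the Hamming radius.
    neighbors = {c: [] for c in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= max_dist:
            neighbors[a].append(b)
            neighbors[b].append(a)

    total = sum(sizes.values())
    # Teleport distribution proportional to cluster size (an assumption).
    teleport = {c: sizes[c] / total for c in nodes}
    # Total cluster size of each node's neighbors, used to split its score.
    out_weight = {c: sum(sizes[m] for m in neighbors[c]) for c in nodes}

    score = dict(teleport)
    for _ in range(iters):
        # Score held by isolated nodes is redistributed via the teleport vector.
        dangling = sum(score[c] for c in nodes if not neighbors[c])
        new = {}
        for c in nodes:
            # Each neighbor n passes on its score in proportion to cluster size.
            incoming = sum(score[n] * sizes[c] / out_weight[n] for n in neighbors[c])
            new[c] = (1 - damping) * teleport[c] + damping * (
                incoming + dangling * teleport[c]
            )
        score = new
    return score
```

Under these assumptions, the on-line step reduces to a dictionary lookup: hash the query image, then read `scores[code]`. Because colliding images share a node, the graph has one node per distinct code rather than one per image, which is the kind of size reduction the abstract attributes to hash collisions.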

