Investigating the Interaction between Data and Algorithms

Daniel Pototzky, Azhar Sultan, L. Schmidt-Thieme
DOI: 10.11159/mvml22.106
Journal: World Congress on Electrical Engineering and Computer Systems and Science
Published: 2022-07-01 (Journal Article)
Citations: 0

Abstract

Research in computer vision is centered on algorithmic improvements, for example by developing better models; the data is treated as fixed. This is in contrast to many real-world applications of computer vision systems, in which algorithms and data co-evolve. To address this shortcoming of previous research, we study the properties of the data and their interaction with deep learning algorithms. Specifically, we investigate the size of the dataset, the share of mislabels, class imbalance, and the presence of unlabeled data, which can be leveraged using semi-supervised learning. In experiments on 100 classes from ImageNet, we show that a tiny network architecture outperforms a much more powerful one if it has access to only slightly more data. Only when so much data is available that adding further images has little effect on performance do large architectures dominate smaller ones. If little data is provided, adding a few labeled images has a large effect on accuracy; once accuracy saturates, massive amounts of additional data are needed to achieve even small improvements. Furthermore, we find that mislabels severely reduce performance. To address this, we propose a cost-efficient way of identifying mislabels, which is especially beneficial if many images are already available. Conversely, if little data is available, labeling more images is more advantageous than cleaning existing annotations. In the case of imbalanced data, we illustrate that labeling more instances from rare classes has a much greater effect on performance than merely increasing dataset size. Moreover, we show that leveraging unlabeled images via semi-supervised learning offers a consistent benefit even if the labeled subset contains significant label noise.
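The abstract does not detail the proposed mislabel-identification method, but one common, cost-efficient family of approaches flags examples where a trained model disagrees with the stored label and sends only those for human review. The sketch below is a hypothetical illustration of that idea, not the authors' actual method: label noise is injected into a toy dataset, model predictions are simulated by an oracle with fixed accuracy, and flagged examples turn out to be strongly enriched in true mislabels compared with the base noise rate.

```python
import random

random.seed(0)

NUM_CLASSES = 10
NUM_EXAMPLES = 1000
NOISE_RATE = 0.2  # share of mislabels, as studied in the paper


def inject_label_noise(labels, noise_rate, num_classes):
    """Flip a random fraction of labels to a different class (simulated mislabels)."""
    noisy = list(labels)
    flipped = random.sample(range(len(labels)), int(noise_rate * len(labels)))
    for i in flipped:
        wrong_classes = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = random.choice(wrong_classes)
    return noisy, set(flipped)


def flag_suspects(noisy_labels, model_preds):
    """Flag examples where the model's prediction disagrees with the stored label."""
    return {i for i, (y, p) in enumerate(zip(noisy_labels, model_preds)) if y != p}


true_labels = [random.randrange(NUM_CLASSES) for _ in range(NUM_EXAMPLES)]
noisy_labels, flipped = inject_label_noise(true_labels, NOISE_RATE, NUM_CLASSES)

# Stand-in for a trained model: predicts the true class 90% of the time
# (an assumption for this sketch, not a number from the paper).
model_preds = [
    y if random.random() < 0.9 else random.randrange(NUM_CLASSES)
    for y in true_labels
]

suspects = flag_suspects(noisy_labels, model_preds)
precision = len(suspects & flipped) / len(suspects)
print(f"flagged {len(suspects)} of {NUM_EXAMPLES}; "
      f"precision on true mislabels: {precision:.2f}")
```

Because reviewers only inspect the flagged subset, which is several times richer in mislabels than the full dataset, annotation-cleaning effort concentrates where it pays off most, matching the paper's observation that cleaning is especially worthwhile once many images are available.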