{"title":"研究数据和算法之间的相互作用","authors":"Daniel Pototzky, Azhar Sultan, L. Schmidt-Thieme","doi":"10.11159/mvml22.106","DOIUrl":null,"url":null,"abstract":"– Research in computer vision is centered on algorithmic improvements, for example, by developing better models. Thereby, the data is considered fixed. This is in contrast to many real-world applications of computer vision systems in which algorithms and data co-evolve. To address this shortcoming of previous research, we study the properties of the data and their interaction with deep learning algorithms. Thereby, we investigate the size of the data, the share of mislabels, class imbalance and the presence of unlabeled data which can be leveraged using semi-supervised learning. In experiments on 100 classes from ImageNet, we show that a tiny network architecture outperforms a much more powerful one it if has access to only a little bit more data. Only if vast amounts of data are available so that adding even more images has little effect on performance, large architectures dominate smaller ones. If little data is provided, adding a few labeled images has a huge effect on accuracy. Once accuracy saturates, massive amounts of additional data are needed to achieve even small improvements. Furthermore, we find that mislabels severely reduce performance. To fix that, we propose a cost-efficient way of identifying mislabels which is especially beneficial if many images are already available. Conversely, if little data is available, labeling more images is more advantageous than cleaning existing annotations. In the case of imbalanced data, we illustrate that labeling more instances from rare classes has a much greater effect on performance than only increasing dataset size. Moreover, we show that leveraging unlabeled images by semi-supervised learning offers a consistent benefit even if the labeled subset contains significant label noise.","PeriodicalId":294100,"journal":{"name":"World Congress on Electrical Engineering and Computer Systems and Science","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Investigating the Interaction between Data and Algorithms\",\"authors\":\"Daniel Pototzky, Azhar Sultan, L. Schmidt-Thieme\",\"doi\":\"10.11159/mvml22.106\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"– Research in computer vision is centered on algorithmic improvements, for example, by developing better models. Thereby, the data is considered fixed. This is in contrast to many real-world applications of computer vision systems in which algorithms and data co-evolve. To address this shortcoming of previous research, we study the properties of the data and their interaction with deep learning algorithms. Thereby, we investigate the size of the data, the share of mislabels, class imbalance and the presence of unlabeled data which can be leveraged using semi-supervised learning. In experiments on 100 classes from ImageNet, we show that a tiny network architecture outperforms a much more powerful one it if has access to only a little bit more data. Only if vast amounts of data are available so that adding even more images has little effect on performance, large architectures dominate smaller ones. If little data is provided, adding a few labeled images has a huge effect on accuracy. 
Once accuracy saturates, massive amounts of additional data are needed to achieve even small improvements. Furthermore, we find that mislabels severely reduce performance. To fix that, we propose a cost-efficient way of identifying mislabels which is especially beneficial if many images are already available. Conversely, if little data is available, labeling more images is more advantageous than cleaning existing annotations. In the case of imbalanced data, we illustrate that labeling more instances from rare classes has a much greater effect on performance than only increasing dataset size. Moreover, we show that leveraging unlabeled images by semi-supervised learning offers a consistent benefit even if the labeled subset contains significant label noise.\",\"PeriodicalId\":294100,\"journal\":{\"name\":\"World Congress on Electrical Engineering and Computer Systems and Science\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"World Congress on Electrical Engineering and Computer Systems and Science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.11159/mvml22.106\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Congress on Electrical Engineering and Computer Systems and Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11159/mvml22.106","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Investigating the Interaction between Data and Algorithms
– Research in computer vision is centered on algorithmic improvements, for example, developing better models; in doing so, the data is treated as fixed. This is in contrast to many real-world applications of computer vision systems, in which algorithms and data co-evolve. To address this shortcoming of previous research, we study the properties of the data and their interaction with deep learning algorithms. Specifically, we investigate the size of the dataset, the share of mislabels, class imbalance, and the presence of unlabeled data, which can be leveraged using semi-supervised learning. In experiments on 100 classes from ImageNet, we show that a tiny network architecture outperforms a much more powerful one if it has access to only slightly more data. Only when vast amounts of data are available, so that adding even more images has little effect on performance, do large architectures dominate smaller ones. If little data is provided, adding a few labeled images has a huge effect on accuracy. Once accuracy saturates, massive amounts of additional data are needed to achieve even small improvements. Furthermore, we find that mislabels severely reduce performance. To address this, we propose a cost-efficient way of identifying mislabels, which is especially beneficial if many images are already available. Conversely, if little data is available, labeling more images is more advantageous than cleaning existing annotations. In the case of imbalanced data, we illustrate that labeling more instances from rare classes has a much greater effect on performance than merely increasing dataset size. Moreover, we show that leveraging unlabeled images via semi-supervised learning offers a consistent benefit even if the labeled subset contains significant label noise.
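The abstract does not spell out the authors' mislabel-identification procedure. As a point of reference, one standard cost-efficient baseline is to rank training examples by how little out-of-fold model confidence their given label receives and review only the top of that list. The sketch below illustrates this idea on synthetic data; the dataset, model, and 5% noise rate are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: flag likely mislabels via cross-validated confidence.
# Not the paper's method; a common loss/confidence-ranking baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for image features and labels.
X, y = make_classification(n_samples=2000, n_features=64, n_informative=32,
                           n_classes=4, random_state=0)

# Corrupt 5% of the labels to simulate annotation noise.
rng = np.random.default_rng(0)
noisy = rng.choice(len(y), size=len(y) // 20, replace=False)
y_noisy = y.copy()
y_noisy[noisy] = (y_noisy[noisy] + 1) % 4

# Out-of-fold probabilities prevent the model from memorizing its own labels.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")
conf_in_given_label = proba[np.arange(len(y_noisy)), y_noisy]

# Human review is spent only on the least label-consistent examples.
suspects = np.argsort(conf_in_given_label)[:100]
hit_rate = np.isin(suspects, noisy).mean()
print(f"fraction of flagged examples that are truly mislabeled: {hit_rate:.2f}")
```

Reviewing a short, ranked suspect list is what makes this cost-efficient: annotation effort is concentrated where the model and the given labels disagree most, which matches the abstract's observation that cleaning pays off once many images are already available.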
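The finding on imbalanced data implies directing the labeling budget toward rare classes rather than sampling uniformly. A minimal sketch of one such allocation rule follows; the greedy smallest-class-first strategy and the example counts are assumptions for illustration only.

```python
# Hypothetical sketch: spend a labeling budget on the rarest classes first.
import numpy as np

def allocate_budget(class_counts, budget):
    """Greedily assign each new label to whichever class is currently smallest."""
    counts = np.asarray(class_counts, dtype=int).copy()
    added = np.zeros_like(counts)
    for _ in range(budget):
        c = counts.argmin()      # index of the rarest class right now
        counts[c] += 1
        added[c] += 1
    return added

# Long-tailed counts: class 0 dominates, classes 2 and 3 are rare.
print(allocate_budget([900, 80, 15, 5], budget=100))
# All 100 new labels go to the two rare classes.
```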
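Finally, the abstract only states that unlabeled images are leveraged via semi-supervised learning. Self-training with confidence-thresholded pseudo-labels is one standard instance of that family; the sketch below shows it under the same synthetic-data assumptions as above, with the threshold of 0.95 and three rounds chosen arbitrarily.

```python
# Hypothetical sketch: pseudo-labeling (self-training) as one example of
# semi-supervised learning; not necessarily the method used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=64, n_informative=32,
                           n_classes=4, random_state=1)
labeled = np.arange(300)            # small labeled subset
unlabeled = np.arange(300, 3000)    # the rest is treated as unlabeled

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

for _ in range(3):  # a few self-training rounds
    proba = model.predict_proba(X[unlabeled])
    confident = proba.max(axis=1) > 0.95        # keep high-confidence predictions
    pseudo_y = proba.argmax(axis=1)[confident]  # use them as pseudo-labels
    X_train = np.concatenate([X[labeled], X[unlabeled][confident]])
    y_train = np.concatenate([y[labeled], pseudo_y])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("accuracy on all data:", model.score(X, y))
```

The confidence threshold is the knob that makes such schemes tolerant of label noise in the seed set: only predictions the model is already sure about propagate into training, which is consistent with the abstract's report that the benefit persists even under significant label noise.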