{"title":"AdaEmb-Encoder:自适应嵌入基于空间编码器的重复数据删除备份分类器训练数据","authors":"Yaobin Qin, D. Lilja","doi":"10.1109/IPCCC50635.2020.9391523","DOIUrl":null,"url":null,"abstract":"The advent of the AI era has made it increasingly important to have an efficient backup system to protect training data from loss. Furthermore, a backup of the training data makes it possible to update or retrain the learned model as more data are collected. However, a huge backup overhead will result if a complete copy of all daily collected training data is always made to backup storage, especially because the data typically contain highly redundant information that makes no contribution to model learning. Deduplication is a common technique in modern backup systems to reduce data redundancy. However, existing deduplication methods are invalid for training data. Hence, this paper proposes a novel deduplication strategy for the training data used for learning in a deep neural network classifier. Experimental results showed that the proposed deduplication strategy achieved 93% backup storage space reduction with only 1.3% loss of classification accuracy.","PeriodicalId":226034,"journal":{"name":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"AdaEmb-Encoder: Adaptive Embedding Spatial Encoder-Based Deduplication for Backing Up Classifier Training Data\",\"authors\":\"Yaobin Qin, D. Lilja\",\"doi\":\"10.1109/IPCCC50635.2020.9391523\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The advent of the AI era has made it increasingly important to have an efficient backup system to protect training data from loss. Furthermore, a backup of the training data makes it possible to update or retrain the learned model as more data are collected. 
However, a huge backup overhead will result if a complete copy of all daily collected training data is always made to backup storage, especially because the data typically contain highly redundant information that makes no contribution to model learning. Deduplication is a common technique in modern backup systems to reduce data redundancy. However, existing deduplication methods are invalid for training data. Hence, this paper proposes a novel deduplication strategy for the training data used for learning in a deep neural network classifier. Experimental results showed that the proposed deduplication strategy achieved 93% backup storage space reduction with only 1.3% loss of classification accuracy.\",\"PeriodicalId\":226034,\"journal\":{\"name\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPCCC50635.2020.9391523\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPCCC50635.2020.9391523","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
AdaEmb-Encoder: Adaptive Embedding Spatial Encoder-Based Deduplication for Backing Up Classifier Training Data
The advent of the AI era has made it increasingly important to have an efficient backup system that protects training data from loss. A backup of the training data also makes it possible to update or retrain the learned model as more data are collected. However, always copying all daily collected training data in full to backup storage incurs a huge overhead, especially because the data typically contain highly redundant information that contributes nothing to model learning. Deduplication is a common technique in modern backup systems for reducing data redundancy, but existing deduplication methods are ineffective for training data. Hence, this paper proposes a novel deduplication strategy for the training data used to learn a deep neural network classifier. Experimental results showed that the proposed strategy reduced backup storage space by 93% with only a 1.3% loss of classification accuracy.
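The abstract does not detail the AdaEmb-Encoder algorithm itself, but the general idea it names (embedding-based deduplication, as opposed to byte-level chunk hashing) can be illustrated with a minimal sketch: map each sample to an embedding vector, then greedily drop samples whose cosine similarity to an already-kept sample exceeds a threshold. The `dedup_by_embedding` function, the `threshold` value, and the toy vectors below are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def dedup_by_embedding(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    """Greedy near-duplicate filter: keep a sample only if its cosine
    similarity to every previously kept sample is below `threshold`.
    (Illustrative sketch, not the AdaEmb-Encoder algorithm.)"""
    # L2-normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list = []
    for i, vec in enumerate(normed):
        # Keep the first sample unconditionally; afterwards, keep only
        # samples sufficiently dissimilar from everything kept so far.
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy example: rows 0 and 1 are near-duplicates; row 2 is distinct.
embs = np.array([[1.0, 0.0],
                 [0.999, 0.01],
                 [0.0, 1.0]])
print(dedup_by_embedding(embs))  # [0, 2]
```

In a real backup pipeline, the embeddings would come from an encoder network rather than raw features, so that "duplicate" means redundant for model learning rather than byte-identical — which is why conventional chunk-hash deduplication fails on training data that is semantically redundant but bitwise distinct.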