Saurabh Goyal, Anamitra R. Choudhury, Vivek Sharma
{"title":"结合剪枝和低秩分解的深度神经网络压缩","authors":"Saurabh Goyal, Anamitra R. Choudhury, Vivek Sharma","doi":"10.1109/IPDPSW.2019.00162","DOIUrl":null,"url":null,"abstract":"Large number of weights in deep neural networks make the models difficult to be deployed in low memory environments such as, mobile phones, IOT edge devices as well as \"inferencing as a service\" environments on the cloud. Prior work has considered reduction in the size of the models, through compression techniques like weight pruning, filter pruning, etc. or through low-rank decomposition of the convolution layers. In this paper, we demonstrate the use of multiple techniques to achieve not only higher model compression but also reduce the compute resources required during inferencing. We do filter pruning followed by low-rank decomposition using Tucker decomposition for model compression. We show that our approach achieves up to 57% higher model compression when compared to either Tucker Decomposition or Filter pruning alone at similar accuracy for GoogleNet. Also, it reduces the Flops by up to 48% thereby making the inferencing faster.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Compression of Deep Neural Networks by Combining Pruning and Low Rank Decomposition\",\"authors\":\"Saurabh Goyal, Anamitra R. Choudhury, Vivek Sharma\",\"doi\":\"10.1109/IPDPSW.2019.00162\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large number of weights in deep neural networks make the models difficult to be deployed in low memory environments such as, mobile phones, IOT edge devices as well as \\\"inferencing as a service\\\" environments on the cloud. 
Prior work has considered reduction in the size of the models, through compression techniques like weight pruning, filter pruning, etc. or through low-rank decomposition of the convolution layers. In this paper, we demonstrate the use of multiple techniques to achieve not only higher model compression but also reduce the compute resources required during inferencing. We do filter pruning followed by low-rank decomposition using Tucker decomposition for model compression. We show that our approach achieves up to 57% higher model compression when compared to either Tucker Decomposition or Filter pruning alone at similar accuracy for GoogleNet. Also, it reduces the Flops by up to 48% thereby making the inferencing faster.\",\"PeriodicalId\":292054,\"journal\":{\"name\":\"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"volume\":\"78 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2019.00162\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2019.00162","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Compression of Deep Neural Networks by Combining Pruning and Low Rank Decomposition
Deep neural networks contain large numbers of weights, which makes the models difficult to deploy in low-memory environments such as mobile phones, IoT edge devices, and "inference as a service" environments in the cloud. Prior work has reduced model size through compression techniques such as weight pruning and filter pruning, or through low-rank decomposition of the convolution layers. In this paper, we demonstrate that combining multiple techniques achieves not only higher model compression but also a reduction in the compute resources required during inference. We perform filter pruning followed by low-rank decomposition, using Tucker decomposition, for model compression. We show that for GoogLeNet our approach achieves up to 57% higher model compression than either Tucker decomposition or filter pruning alone at similar accuracy. It also reduces FLOPs by up to 48%, making inference faster.
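The pipeline the abstract describes — pruning unimportant filters from a convolution layer, then applying Tucker decomposition to the remaining kernel tensor — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the L1-norm filter ranking, the HOSVD-based Tucker-2 routine, and all function names are assumptions made for the example.

```python
import numpy as np

def prune_filters(weights, keep_ratio=0.5):
    """Filter pruning (illustrative): rank output filters by L1 norm
    and keep only the top fraction.
    weights: (out_channels, in_channels, kh, kw)"""
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(weights.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])  # indices of kept filters
    return weights[keep], keep

def unfold(tensor, mode):
    """Mode-n unfolding: move the given mode to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def tucker2(weights, rank_out, rank_in):
    """Tucker-2 decomposition via HOSVD: factor the output- and
    input-channel modes, leaving the spatial modes in the core."""
    U_out = np.linalg.svd(unfold(weights, 0), full_matrices=False)[0][:, :rank_out]
    U_in = np.linalg.svd(unfold(weights, 1), full_matrices=False)[0][:, :rank_in]
    # core = weights ×_0 U_out^T ×_1 U_in^T
    core = np.einsum('oikl,or,is->rskl', weights, U_out, U_in)
    return core, U_out, U_in

def reconstruct(core, U_out, U_in):
    """Rebuild the (approximate) kernel tensor from its Tucker-2 factors."""
    return np.einsum('rskl,or,is->oikl', core, U_out, U_in)
```

In a real network each pruned-then-decomposed layer would be replaced by three smaller convolutions (1x1, spatial core, 1x1). For a layer pruned from 16 to 8 filters with 8 input channels and 3x3 kernels, full storage is 8*8*9 = 576 weights; a rank-(4, 4) Tucker-2 factorization stores 4*4*9 + 8*4 + 8*4 = 208, illustrating where the compound compression comes from.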