A Case for Two-stage Inference with Knowledge Caching

Geonha Park, Changho Hwang, KyoungSoo Park
{"title":"A Case for Two-stage Inference with Knowledge Caching","authors":"Geonha Park, Changho Hwang, KyoungSoo Park","doi":"10.1145/3325413.3329789","DOIUrl":null,"url":null,"abstract":"Real-world intelligent services employing deep learning technology typically take a two-tier system architecture -- a dumb front-end device and smart back-end cloud servers. The front-end device simply forwards a human query while the back-end servers run a complex deep model to resolve the query and respond to the front-end device. While simple and effective, the current architecture not only increases the load at servers but also runs the risk of harming user privacy. In this paper, we present knowledge caching, which exploits the front-end device as a smart cache of a generalized deep model. The cache locally resolves a subset of popular or privacy-sensitive queries while it forwards the rest of them to back-end cloud servers. We discuss the feasibility of knowledge caching as well as technical challenges around deep model specialization and compression. We show our prototype two-stage inference system that populates a front-end cache with 10 voice commands out of 35 commands. We demonstrate that our specialization and compression techniques reduce the cached model size by 17.4x from the original model with 1.8x improvement on the inference accuracy.","PeriodicalId":164793,"journal":{"name":"The 3rd International Workshop on Deep Learning for Mobile Systems and Applications - EMDL '19","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The 3rd International Workshop on Deep Learning for Mobile Systems and Applications - EMDL '19","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3325413.3329789","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Real-world intelligent services employing deep learning technology typically take a two-tier system architecture -- a dumb front-end device and smart back-end cloud servers. The front-end device simply forwards a human query while the back-end servers run a complex deep model to resolve the query and respond to the front-end device. While simple and effective, the current architecture not only increases the load at servers but also runs the risk of harming user privacy. In this paper, we present knowledge caching, which exploits the front-end device as a smart cache of a generalized deep model. The cache locally resolves a subset of popular or privacy-sensitive queries while it forwards the rest of them to back-end cloud servers. We discuss the feasibility of knowledge caching as well as technical challenges around deep model specialization and compression. We show our prototype two-stage inference system that populates a front-end cache with 10 voice commands out of 35 commands. We demonstrate that our specialization and compression techniques reduce the cached model size by 17.4x from the original model with 1.8x improvement on the inference accuracy.
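The paper's abstract describes the two-stage flow but not its implementation details. Below is a minimal sketch, assuming the cached front-end model is a small classifier over the 10 popular commands and that a softmax-confidence threshold decides whether a query is resolved locally or forwarded to the back-end cloud model; the threshold value and all names are hypothetical, not taken from the paper.

```python
# Sketch of two-stage inference with a front-end knowledge cache.
# Assumption: a confidence threshold on the cached model's softmax output
# decides between answering locally and forwarding to the cloud model.
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # assumed tunable knob, not a value from the paper

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cached_infer(features, cached_model):
    """Run the small, specialized front-end model (e.g. 10 popular commands)."""
    probs = softmax(cached_model(features))
    return int(np.argmax(probs)), float(np.max(probs))

def cloud_infer(features, cloud_model):
    """Fall back to the full generalized back-end model (e.g. 35 commands)."""
    probs = softmax(cloud_model(features))
    return int(np.argmax(probs))

def two_stage_infer(features, cached_model, cloud_model):
    label, confidence = cached_infer(features, cached_model)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "cache"   # resolved on-device; the query never leaves it
    return cloud_infer(features, cloud_model), "cloud"
```

In this sketch, queries the specialized model answers confidently stay on the device (helping both server load and privacy), while low-confidence queries fall through to the generalized cloud model, mirroring the cache-hit/cache-miss behavior the abstract describes.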