"Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design"
Xingyao Zhang, S. Song, Chenhao Xie, Jing Wang, Wei-gong Zhang, Xin Fu. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: https://doi.org/10.1109/HPCA47549.2020.00051

Abstract: In recent years, CNNs have achieved great success in image processing tasks such as image recognition and object detection. Unfortunately, traditional CNN classification is easily misled by increasingly complex image features because its pooling operations fail to preserve accurate position and pose information about objects. To address this challenge, a novel neural network structure called the Capsule Network (CapsNet) has been proposed, which introduces equivariance through capsules to significantly enhance learning for image segmentation and object detection. Because CapsNets require a high volume of matrix operations, they are generally accelerated on modern GPU platforms, which provide highly optimized software libraries for common deep learning tasks. However, our performance characterization on modern GPUs shows that CapsNets exhibit low efficiency due to the special program and execution features of their routing procedure, including massive unshareable intermediate variables and intensive synchronizations, which are very difficult to optimize at the software level. To address these challenges, we propose a hybrid computing architecture design named PIM-CapsNet. It preserves the GPU's on-chip computing capability for accelerating the CNN-type layers of a CapsNet, while pipelining with an off-chip in-memory acceleration solution that effectively tackles the routing procedure's inefficiency by leveraging the processing-in-memory capability of today's 3D stacked memory. Using the routing procedure's inherent parallelization, our design enables hierarchical improvements in CapsNet inference efficiency by minimizing data movement and maximizing parallel processing in memory. Evaluation results demonstrate that our proposed design achieves substantial improvements in both performance and energy savings for CapsNet inference, with almost zero accuracy loss. The results also suggest good performance scalability in optimizing the routing procedure as network size increases.
{"title":"PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units","authors":"Yujeong Choi, Minsoo Rhu","doi":"10.1109/HPCA47549.2020.00027","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00027","url":null,"abstract":"To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a \"preemptible\" neural processing unit (NPU) and a \"predictive\" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. We evaluate both the mechanisms that enable NPUs to be preemptible and the policies that utilize them to meet scheduling objectives. We show that preemptive NPU multi-tasking can achieve an average 7.8×, 1.4×, and 4.8× improvement in latency, throughput, and SLA satisfaction, respectively.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"282 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124240081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"The Architectural Implications of Facebook's DNN-Based Personalized Recommendation"
Udit Gupta, Xiaodong Wang, M. Naumov, Carole-Jean Wu, Brandon Reagen, D. Brooks, Bradford Cottel, K. Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, M. Smelyanskiy, Liang Xiong, Xuan Zhang. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). DOI: https://doi.org/10.1109/HPCA47549.2020.00047

Abstract: The widespread application of deep learning has changed the landscape of computation in data centers. In particular, personalized recommendation for content ranking is now largely accomplished using deep neural networks. However, despite their importance and the amount of compute cycles they consume, relatively little research attention has been devoted to recommendation systems. To facilitate research and advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation, coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct an in-depth analysis that underpins future system design and optimization for at-scale recommendation: inference latency varies by 60% across three Intel server generations, batching and co-location of inference jobs can drastically improve latency-bounded throughput, and diversity across recommendation models leads to different optimization strategies.
{"title":"Leaking Information Through Cache LRU States","authors":"Wenjie Xiong, Jakub Szefer","doi":"10.1109/HPCA47549.2020.00021","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00021","url":null,"abstract":"The Least-Recently Used cache replacement policy and its variants are widely deployed in modern processors. This paper shows for the first time in detail that the LRU states of caches can be used to leak information: any access to a cache by a sender will modify the LRU state, and the receiver is able to observe this through a timing measurement. This paper presents LRU timing-based channels both when the sender and the receiver have shared memory, e.g., shared library data pages, and when they are separate processes without shared memory. In addition, the new LRU timing-based channels are demonstrated on both Intel and AMD processors in scenarios where the sender and the receiver are sharing the cache in both hyper-threaded setting and time-sliced setting. The transmission rate of the LRU channels can be up to 600Kbps per cache set in the hyper-threaded setting. Different from the majority of existing cache channels which require the sender to trigger cache misses, the new LRU channels work with the sender only having cache hits, making the channel faster and more stealthy. This paper also demonstrates that the new LRU channels can be used in transient execution attacks, e.g., Spectre. Further, this paper shows that the LRU channels pose threats to existing secure cache designs, and this work demonstrates the LRU channels affect the secure PL cache. The paper finishes by discussing and evaluating possible defenses.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116741580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRONHIDE: A Secure Multicore that Efficiently Mitigates Microarchitecture State Attacks for Interactive Applications","authors":"H. Omar, O. Khan","doi":"10.1109/HPCA47549.2020.00019","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00019","url":null,"abstract":"Microprocessors enable aggressive hardware virtualization by means of which multiple processes temporally execute on the system. These security-critical and ordinary processes interact with each other to assure application progress. However, temporal sharing of hardware resources exposes the processor to various microarchitecture state attacks. State-of-the-art secure processors, such as MI6 adopt Intel's SGX enclave execution model. MI6 architects strong isolation by statically isolating shared memory state, and purging the microarchitecture state of private core, cache, and TLB resources on every enclave entry and exit. The purging overhead significantly impacts performance as the interactivity across the secure and insecure processes increases. This paper proposes IRONHIDE that implements strong isolation in the context of multicores to form spatially isolated secure and insecure clusters of cores. For an interactive application comprising of secure and insecure processes, IRONHIDE pins the secure process(es) to the secure cluster, where they execute and interact with the insecure process(es) without incurring the microarchitecture state purging overheads on every interaction event. IRONHIDE improves performance by 2.1x over the MI6 baseline for a set of user and OS interactive applications. Moreover, IRONHIDE improves performance by 20% over an SGX-like baseline, while also ensuring strong isolation guarantees against microarchitecture state attacks.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127773490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}