PuDianNao: A Polyvalent Machine Learning Accelerator

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2015-03-14 DOI:10.1145/2694344.2694358

Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, O. Temam, Xiaobing Feng, Xuehai Zhou, Yunji Chen

{"title":"PuDianNao: A Polyvalent Machine Learning Accelerator","authors":"Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, O. Temam, Xiaobing Feng, Xuehai Zhou, Yunji Chen","doi":"10.1145/2694344.2694358","DOIUrl":null,"url":null,"abstract":"Machine Learning (ML) techniques are pervasive tools in various emerging commercial applications, but have to be accommodated by powerful computer systems to process very large data. Although general-purpose CPUs and GPUs have provided straightforward solutions, their energy-efficiencies are limited due to their excessive supports for flexibility. Hardware accelerators may achieve better energy-efficiencies, but each accelerator often accommodates only a single ML technique (family). According to the famous No-Free-Lunch theorem in the ML domain, however, an ML technique performs well on a dataset may perform poorly on another dataset, which implies that such accelerator may sometimes lead to poor learning accuracy. Even if regardless of the learning accuracy, such accelerator can still become inapplicable simply because the concrete ML task is altered, or the user chooses another ML technique. In this study, we present an ML accelerator called PuDianNao, which accommodates seven representative ML techniques, including k-means, k-nearest neighbors, naive bayes, support vector machine, linear regression, classification tree, and deep neural network. Benefited from our thorough analysis on computational primitives and locality properties of different ML techniques, PuDianNao can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm^2, and consumes 596 mW only. Compared with the NVIDIA K20M GPU (28nm process), PuDianNao (65nm process) is 1.20x faster, and can reduce the energy by 128.41x.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"287","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2694344.2694358","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 287

Abstract

Machine Learning (ML) techniques are pervasive tools in various emerging commercial applications, but have to be accommodated by powerful computer systems to process very large data. Although general-purpose CPUs and GPUs have provided straightforward solutions, their energy-efficiencies are limited due to their excessive supports for flexibility. Hardware accelerators may achieve better energy-efficiencies, but each accelerator often accommodates only a single ML technique (family). According to the famous No-Free-Lunch theorem in the ML domain, however, an ML technique performs well on a dataset may perform poorly on another dataset, which implies that such accelerator may sometimes lead to poor learning accuracy. Even if regardless of the learning accuracy, such accelerator can still become inapplicable simply because the concrete ML task is altered, or the user chooses another ML technique. In this study, we present an ML accelerator called PuDianNao, which accommodates seven representative ML techniques, including k-means, k-nearest neighbors, naive bayes, support vector machine, linear regression, classification tree, and deep neural network. Benefited from our thorough analysis on computational primitives and locality properties of different ML techniques, PuDianNao can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm^2, and consumes 596 mW only. Compared with the NVIDIA K20M GPU (28nm process), PuDianNao (65nm process) is 1.20x faster, and can reduce the energy by 128.41x.

查看原文本刊更多论文

机器学习(ML)技术是各种新兴商业应用中普遍存在的工具，但必须由强大的计算机系统来处理非常大的数据。尽管通用cpu和gpu提供了直接的解决方案，但由于对灵活性的过度支持，它们的能效受到限制。硬件加速器可以实现更好的能源效率，但每个加速器通常只能容纳一种ML技术(家族)。然而，根据ML领域著名的No-Free-Lunch定理，ML技术在一个数据集上表现良好，可能在另一个数据集上表现不佳，这意味着这种加速器有时可能导致较差的学习准确性。即使不考虑学习的准确性，这种加速器仍然可能因为具体的机器学习任务被改变，或者用户选择了另一种机器学习技术而变得不适用。得益于我们对不同ML技术的计算基元和局域特性的深入分析，PuDianNao可以在3.51 mm^2的面积内执行高达1056 GOP/s(例如加法和乘法)，仅消耗596 mW。与NVIDIA K20M GPU (28nm制程)相比，PuDianNao (65nm制程)速度快1.20倍，能耗降低128.41倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

自引率

0.00%

发文量