An Instruction Set Architecture for Machine Learning

ACM Transactions on Computer Systems (TOCS) Pub Date : 2019-08-13 DOI:10.1145/3331469

Yunji Chen, Huiying Lan, Zidong Du, Shaoli Liu, Jinhua Tao, D. Han, Tao Luo, Qi Guo, Ling Li, Yuan Xie, Tianshi Chen


Citations: 9

Abstract

Machine learning (ML) is a family of models that learn from data to improve performance on a given task. ML techniques, especially the recently revived neural networks (deep neural networks), have proven effective across a broad range of applications. ML techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy efficient because they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators have recently been proposed to improve energy efficiency. However, such accelerators were designed for a small set of ML techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) that correspond directly to high-level functional blocks of an ML technique (such as layers in neural networks) or even to an ML technique as a whole. Although such instruction sets are straightforward and easy to implement for a limited set of similar ML techniques, their lack of agility prevents these accelerator designs from supporting a variety of different ML techniques with sufficient flexibility and efficiency. In this article, we first propose a novel domain-specific Instruction Set Architecture (ISA) for neural network (NN) accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. We then extend the application scope of Cambricon from NN to ML techniques. We also propose an assembly language, an assembler, and a runtime to support programming with Cambricon, especially targeting large-scale ML problems. Our evaluation over a total of 16 representative yet distinct ML techniques demonstrates that Cambricon exhibits strong descriptive capacity over a broad range of ML techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [7] (which can only accommodate three types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks and 7 other ML benchmarks. Compared to the recent prevalent ML accelerator PuDianNao, our Cambricon-based accelerator is able to support all the ML techniques as well as the 10 NNs, with only approximately 5.1% performance loss.


