E. Nurvitadhi, Asit K. Mishra, Yu Wang, Ganesh Venkatesh, Debbie Marr
{"title":"用于稀疏数据分析的硬件加速器","authors":"E. Nurvitadhi, Asit K. Mishra, Yu Wang, Ganesh Venkatesh, Debbie Marr","doi":"10.3850/9783981537079_0766","DOIUrl":null,"url":null,"abstract":"Rapid growth of Internet led to web applications that produce large unstructured sparse datasets (e.g., texts, ratings). Machine learning (ML) algorithms are the basis for many important analytics workloads that extract knowledge from these datasets. This paper characterizes such workloads on a high-end server for real-world datasets and shows that a set of sparse matrix operations dominates runtime. Further, they run inefficiently due to low compute-per-byte and challenging thread scaling behavior. As such, we propose a hardware accelerator to perform these operations with extreme efficiency. Simulations and RTL synthesis to 14nm ASIC demonstrate significant performance and performance/Watt improvements over conventional processors, with only a small area overhead.","PeriodicalId":311352,"journal":{"name":"2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"Hardware accelerator for analytics of sparse data\",\"authors\":\"E. Nurvitadhi, Asit K. Mishra, Yu Wang, Ganesh Venkatesh, Debbie Marr\",\"doi\":\"10.3850/9783981537079_0766\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Rapid growth of Internet led to web applications that produce large unstructured sparse datasets (e.g., texts, ratings). Machine learning (ML) algorithms are the basis for many important analytics workloads that extract knowledge from these datasets. This paper characterizes such workloads on a high-end server for real-world datasets and shows that a set of sparse matrix operations dominates runtime. Further, they run inefficiently due to low compute-per-byte and challenging thread scaling behavior. As such, we propose a hardware accelerator to perform these operations with extreme efficiency. Simulations and RTL synthesis to 14nm ASIC demonstrate significant performance and performance/Watt improvements over conventional processors, with only a small area overhead.\",\"PeriodicalId\":311352,\"journal\":{\"name\":\"2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-03-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3850/9783981537079_0766\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3850/9783981537079_0766","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Rapid growth of Internet led to web applications that produce large unstructured sparse datasets (e.g., texts, ratings). Machine learning (ML) algorithms are the basis for many important analytics workloads that extract knowledge from these datasets. This paper characterizes such workloads on a high-end server for real-world datasets and shows that a set of sparse matrix operations dominates runtime. Further, they run inefficiently due to low compute-per-byte and challenging thread scaling behavior. As such, we propose a hardware accelerator to perform these operations with extreme efficiency. Simulations and RTL synthesis to 14nm ASIC demonstrate significant performance and performance/Watt improvements over conventional processors, with only a small area overhead.