Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators
Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann
arXiv:2409.08595 (arXiv - CS - Performance, 2024-09-13)
Abstract
Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models that accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, a Plasticine-derived architecture, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which, in the best case, enables us to evaluate only 154 loop kernel iterations to estimate the performance of 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
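
The core idea of the abstract, estimating the latency of billions of executed instructions by evaluating only a handful of representative loop-kernel iterations and extrapolating via their iteration counts, can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified illustration: the kernel names, latencies, and iteration counts are invented, and the paper's actual dependency-graph analysis is considerably more involved. It scales each evaluated per-iteration latency by the kernel's total iteration count and then computes the mean absolute percentage error (MAPE) of the estimate against a reference latency, the metric the authors report against simulation results.

```python
from dataclasses import dataclass


@dataclass
class LoopKernel:
    """A representative loop kernel identified in a DNN/hardware dependency graph (hypothetical)."""
    name: str
    iteration_latency_cycles: int  # latency of one evaluated iteration
    total_iterations: int          # how often this kernel iterates for the full mapping


def estimate_total_latency(kernels):
    """Extrapolate full-network latency from a few evaluated iterations.

    Instead of simulating every instruction, each kernel's single-iteration
    latency is scaled by its total iteration count and the results are summed.
    """
    return sum(k.iteration_latency_cycles * k.total_iterations for k in kernels)


def mape(estimates, references):
    """Mean absolute percentage error between estimated and reference latencies."""
    assert len(estimates) == len(references)
    return 100.0 / len(references) * sum(
        abs(est - ref) / abs(ref) for est, ref in zip(estimates, references)
    )


if __name__ == "__main__":
    # Invented kernels for one layer mapped onto an accelerator (illustrative values only).
    kernels = [
        LoopKernel("output_channel_loop", iteration_latency_cycles=120, total_iterations=64),
        LoopKernel("spatial_tile_loop", iteration_latency_cycles=95, total_iterations=196),
        LoopKernel("weight_load_loop", iteration_latency_cycles=40, total_iterations=64),
    ]
    estimated = estimate_total_latency(kernels)  # cheap analytical estimate
    reference = 28_500                           # e.g., cycles obtained from an RTL simulation
    print(f"estimated latency: {estimated} cycles")
    print(f"MAPE vs. reference: {mape([estimated], [reference]):.2f} %")
```

In the paper's best case, this kind of extrapolation lets 154 evaluated loop-kernel iterations stand in for 4.19 billion executed instructions, which is where the reported speedup over RTL simulation comes from.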