Boosting Domain-Specific Debug Through Inter-frame Compression

Zakary Nafziger, Martin Chua, D. H. Noronha, S. Wilton
2022 International Conference on Field-Programmable Technology (ICFPT), December 5, 2022. DOI: 10.1109/ICFPT56656.2022.9974385

Abstract

Acceleration of machine learning models is proving to be an important application for FPGAs. Unfortunately, debugging such models during training or inference is difficult. Software simulations of a machine learning system may be of insufficient detail to provide meaningful debug insight, or may require infeasibly long run-times. Thus, it is often desirable to debug the accelerated model while it is running on real hardware. Effective on-chip debug often requires instrumenting a design with additional circuitry to store run-time data, consuming valuable chip resources. Previous work has developed methods to perform lossy compression of signals by exploiting machine-learning-specific knowledge, thereby increasing the amount of debug context that can be stored in an on-chip trace buffer. However, all prior work compresses each successive element in a signal of interest independently. Since debug signals may have temporal similarity in many machine learning applications, there is an opportunity to further increase trace buffer utilization. In this paper, we present an architecture to perform lossless temporal compression in addition to the existing lossy element-wise compression. We show that, when applied to a typical machine learning algorithm in realistic debug scenarios, we are able to store twice as much information in an on-chip buffer while increasing the total area of the debug instrument by approximately 25%. The impact is that, for a given instrumentation budget, a significantly larger trace window is available during debug, possibly allowing a designer to narrow down the root cause of a bug faster.
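To give a feel for the idea the abstract describes, the sketch below shows one simple form of lossless temporal (inter-frame) compression in software: each trace frame after the first is stored as an element-wise XOR against its predecessor, and runs of zeros in the resulting delta are run-length encoded. This is only an illustrative analogue of the general principle, not the paper's hardware architecture; the frame values, word widths, and run-length scheme are all assumptions.

```python
def delta_encode(frames):
    """Store the first frame verbatim; XOR each later frame with its
    predecessor. Temporally similar frames yield deltas that are mostly zero."""
    encoded = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        encoded.append([p ^ c for p, c in zip(prev, cur)])
    return encoded

def rle_zeros(delta):
    """Run-length encode zero runs as (0, run_length); nonzero words pass
    through as (1, value)."""
    out, i = [], 0
    while i < len(delta):
        if delta[i] == 0:
            j = i
            while j < len(delta) and delta[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append((1, delta[i]))
            i += 1
    return out

def compress(frames):
    return [rle_zeros(d) for d in delta_encode(frames)]

def decompress(blocks):
    """Invert the run-length coding, then undo the XOR chain."""
    frames, prev = [], None
    for block in blocks:
        delta = []
        for tag, val in block:
            delta.extend([0] * val if tag == 0 else [val])
        cur = delta if prev is None else [p ^ d for p, d in zip(prev, delta)]
        frames.append(cur)
        prev = cur
    return frames

# Two temporally similar "activation" frames: only one element differs,
# so the second frame's delta collapses to three tokens.
frames = [[3, 7, 7, 7, 0, 0], [3, 7, 7, 5, 0, 0]]
blocks = compress(frames)
assert decompress(blocks) == frames
```

The round-trip is lossless, which mirrors the paper's design point: the temporal stage adds no error on top of the existing lossy element-wise stage, it only packs more debug context into the same trace buffer when consecutive frames are similar.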