Exploiting cache coherence for effective on-the-fly data tracing in multicores

2016 IEEE 34th International Conference on Computer Design (ICCD). Pub Date: 2016-10-01. DOI: 10.1109/ICCD.2016.7753295

Mounika Ponugoti, A. Milenković


Citations: 3

Abstract

Software testing and debugging of modern embedded computer systems is becoming an increasingly challenging task due to growing hardware and software complexity, increased integration and miniaturization, and ever-tightening time-to-market. To find software bugs faster, developers often rely on on-chip trace and debug resources. However, these resources offer limited visibility into the system, increase system cost, and do not scale well with a growing number of processor cores. This paper introduces a new hardware/software mechanism for capturing and filtering load data value traces in multicores that enables a complete reconstruction of a parallel program execution. The proposed mechanism exploits data caches and cache coherence protocol states to minimize the number of trace events that must be streamed out of the target platform to the software debugger. The mechanism relies on a single trace bit per data cache block, thus minimizing the cost of hardware implementation. Our experimental evaluation explores the effectiveness of the proposed technique by measuring the trace port bandwidth as a function of the cache size and the number of processor cores. The results show that the proposed mechanism significantly reduces the required trace port bandwidth when compared to Nexus-like load data value tracing. Depending on data cache size, the improvements range from 9.9 to 23.5 times for single cores and from 18.6 to 37.3 times for octa cores.
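The abstract does not spell out how the per-block trace bit is used, so the following is only a minimal sketch of one plausible reading, not the authors' design. It assumes a direct-mapped cache, assumes that a load is reported on the trace port only when the block's trace bit is clear, and assumes that a fill or a coherence invalidation clears the bit. The class `TracedCache` and all of its method and parameter names are hypothetical.

```python
class TracedCache:
    """Toy direct-mapped data cache with one trace bit per block (assumed model)."""

    def __init__(self, num_blocks, block_size=32):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.tags = [None] * num_blocks      # tag stored in each block frame
        self.traced = [False] * num_blocks   # the single trace bit per block
        self.loads = 0                       # total loads seen
        self.emitted = 0                     # loads reported on the trace port

    def _locate(self, addr):
        block_addr = addr // self.block_size
        return block_addr % self.num_blocks, block_addr // self.num_blocks

    def load(self, addr):
        """Return True if this load must be streamed out as a trace event."""
        self.loads += 1
        index, tag = self._locate(addr)
        if self.tags[index] != tag:
            # Miss: fill the frame; the new block has not been reported yet.
            self.tags[index] = tag
            self.traced[index] = False
        if self.traced[index]:
            # Assumption: the debugger can reproduce this value on its own.
            return False
        self.traced[index] = True
        self.emitted += 1
        return True

    def invalidate(self, addr):
        """Assumed coherence hook: another core wrote this block."""
        index, tag = self._locate(addr)
        if self.tags[index] == tag:
            self.tags[index] = None
            self.traced[index] = False


cache = TracedCache(num_blocks=4, block_size=32)
reported = [cache.load(a) for a in (0, 4, 8, 0)]  # four loads to one block
cache.invalidate(0)                                # remote write to that block
reported.append(cache.load(0))                     # must be reported again
print(reported, cache.emitted, "of", cache.loads)
```

In this toy model only the first load to a block, and the first load after an invalidation or eviction, produces a trace event. That is the kind of filtering the abstract credits for the bandwidth reduction, though the actual event format and protocol rules are not given in the abstract.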

