International Conference on Compilers, Architecture, and Synthesis for Embedded Systems: Latest Publications

Resource recycling: putting idle resources to work on a composable accelerator
Yongjun Park, Hyunchul Park, S. Mahlke, Sukjin Kim
{"title":"Resource recycling: putting idle resources to work on a composable accelerator","authors":"Yongjun Park, Hyunchul Park, S. Mahlke, Sukjin Kim","doi":"10.1145/1878921.1878925","DOIUrl":"https://doi.org/10.1145/1878921.1878925","url":null,"abstract":"Mobile computing platforms in the form of smart phones, netbooks, and personal digital assistants have become an integral part of our everyday lives. Moving ahead to the future, mobile multimedia support will become a key differentiating factor for customers. Features such as high-definition audio and video, video conferencing, 3D graphics, and image projection will lead to the adoption of one phone over another. However, in contrast to wireless signal processing which is dominated by vectorizable computation, mobile multimedia applications often contain complex control flow and variable computational requirements. Moreover, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. To handle these complexities, composable accelerators such as the Polymorphic Pipeline Array, or PPA, present an appealing hardware platform by adding a degree of hardware configurability over existing accelerators. Hardware resources can be both statically as well as dynamically partitioned among executing tasks to maximize execution efficiency. However, an effective compilation framework is essential to partition and assign resources to make intelligent use of the available hardware. In this paper, a compilation framework is introduced that maximizes application throughput with hybrid resource partitioning of a PPA system. Static partitioning handles part of the resource assignment, but this is followed up by dynamic partitioning to identify idle resources and put them to use -- resource recycling. Experimental results show that real-time media applications can take advantage of the static and dynamic configurability of the PPA for increase.\u0000 throughput.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129277366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
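To make the recycling step concrete, here is a minimal C sketch of the runtime idea the abstract describes: after static partitioning, each scheduling epoch scans for processing elements (PEs) left idle by their owner task and lends them to the task with the most pending work. All names, the epoch granularity, and the "neediest task" heuristic are illustrative assumptions, not the paper's actual compilation framework.

/* Hypothetical sketch of "resource recycling": statically assigned PEs
 * that sit idle in the current epoch are lent to the busiest task. */
#include <stdio.h>

#define NUM_PES   8
#define NUM_TASKS 3

typedef struct {
    int owner;   /* task the PE was statically assigned to */
    int busy;    /* did the owner issue work to this PE this epoch? */
} PE;

static int neediest_task(const int pending[NUM_TASKS]) {
    int best = 0;
    for (int t = 1; t < NUM_TASKS; t++)
        if (pending[t] > pending[best]) best = t;
    return best;
}

/* Reassign idle PEs for one scheduling epoch (the "recycling" pass). */
static void recycle(PE pes[NUM_PES], const int pending[NUM_TASKS]) {
    for (int i = 0; i < NUM_PES; i++)
        if (!pes[i].busy)
            printf("PE %d idle (owner %d) -> lent to task %d\n",
                   i, pes[i].owner, neediest_task(pending));
}

int main(void) {
    PE pes[NUM_PES] = {
        {0,1},{0,1},{0,0},{1,1},{1,0},{2,1},{2,1},{2,0}
    };
    int pending[NUM_TASKS] = {4, 9, 2}; /* work queued per task */
    recycle(pes, pending);
    return 0;
}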
Characterization and exploitation of narrow-width loads: the narrow-width cache approach
M. Islam, P. Stenström
{"title":"Characterization and exploitation of narrow-width loads: the narrow-width cache approach","authors":"M. Islam, P. Stenström","doi":"10.1145/1878921.1878955","DOIUrl":"https://doi.org/10.1145/1878921.1878955","url":null,"abstract":"This paper exploits small-value locality to accelerate the execution of memory instructions. We find that narrow-width loads (NWLDs) --- loads with small-value operands of 8 bits or less --- comprise 26% of all executed loads across 40 applications of the SPEC benchmark suites. We establish that the frequency of NWLDs are almost independent of compiler and input data. We introduce narrow-width caches (NWC) to cache small-value memory words. NWCs provide a significant speedup for several memory-intensive applications with a negligible chip-area overhead. NWCs also reduce the overall energy dissipation and memory traffic.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129224586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
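A minimal sketch of the 8-bit narrow-width criterion applied to a synthetic value trace; treating small negative two's-complement values as narrow (upper bits all ones) is our assumption of one plausible reading, not necessarily the paper's exact definition.

/* Count loads whose 32-bit value fits in 8 bits ("narrow-width loads"). */
#include <stdint.h>
#include <stdio.h>

/* Narrow if the upper 24 bits are all zero (small unsigned value)
 * or all ones (small negative two's-complement value). */
static int is_narrow8(uint32_t v) {
    uint32_t upper = v >> 8;
    return upper == 0 || upper == 0x00FFFFFF;
}

int main(void) {
    uint32_t trace[] = {0, 5, 200, 0xFFFFFF80u, 70000, 0x12345678u};
    size_t n = sizeof trace / sizeof trace[0], narrow = 0;
    for (size_t i = 0; i < n; i++)
        narrow += is_narrow8(trace[i]);
    printf("narrow-width loads: %zu of %zu\n", narrow, n);
    return 0;
}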
Erbium: a deterministic, concurrent intermediate representation to map data-flow tasks to scalable, persistent streaming processes
Cupertino Miranda, Antoniu Pop, Philippe Dumont, Albert Cohen, M. Duranton
{"title":"Erbium: a deterministic, concurrent intermediate representation to map data-flow tasks to scalable, persistent streaming processes","authors":"Cupertino Miranda, Antoniu Pop, Philippe Dumont, Albert Cohen, M. Duranton","doi":"10.1145/1878921.1878924","DOIUrl":"https://doi.org/10.1145/1878921.1878924","url":null,"abstract":"Tuning applications for multicore systems involve subtle concurrency concepts and target-dependent optimizations. This paper advocates for a streaming execution model, called ER, where persistent processes communicate and synchronize through a multi-consumer processing applications, we demonstrate the scalability and efficiency advantages of streaming compared to data-driven scheduling. To exploit these benefits in compilers for parallel languages, we propose an intermediate representation enabling the compilation of data-flow tasks into streaming processes. This intermediate representation also facilitates the application of classical compiler optimizations to concurrent programs.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129269800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
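A deterministic, single-threaded C sketch of the sliding-window discipline that persistent streaming processes of this kind typically rely on: a producer may reuse a buffer slot only after the slowest consumer has passed it. The ring-buffer layout and the consumers' pacing are illustrative assumptions; this is not the ER intermediate representation itself.

/* Single-producer, multi-consumer sliding window, simulated in one thread. */
#include <stdio.h>

#define WIN   4          /* window (ring buffer) size   */
#define NCONS 2          /* number of consumer processes */

static int buf[WIN];
static long wr = 0;              /* items produced so far      */
static long rd[NCONS] = {0, 0};  /* items consumed, per reader */

static long min_rd(void) {       /* position of the slowest consumer */
    long m = rd[0];
    for (int c = 1; c < NCONS; c++) if (rd[c] < m) m = rd[c];
    return m;
}

int main(void) {
    for (int step = 0; step < 10; step++) {
        if (wr - min_rd() < WIN) {          /* window not full: produce */
            buf[wr % WIN] = step;
            printf("produce item %ld = %d\n", wr, step);
            wr++;
        }
        for (int c = 0; c < NCONS; c++)     /* consumers at uneven pace */
            if (rd[c] < wr && step % (c + 1) == 0) {
                printf("consumer %d reads item %ld = %d\n",
                       c, rd[c], buf[rd[c] % WIN]);
                rd[c]++;
            }
    }
    return 0;
}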
Design space exploration of the turbo decoding algorithm on GPUs
Dongwon Lee, M. Wolf, Hyesoon Kim
{"title":"Design space exploration of the turbo decoding algorithm on GPUs","authors":"Dongwon Lee, M. Wolf, Hyesoon Kim","doi":"10.1145/1878921.1878953","DOIUrl":"https://doi.org/10.1145/1878921.1878953","url":null,"abstract":"In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and find a performance bottleneck. We consider three axes for the design space exploration: a radix degree, a parallelization method, and the number of sub-frames per thread block. In Turbo decoding, a degree of radix affects computational complexity and memory access patterns in both algorithmic and implementation viewpoints. Second, computations of branch metrics (BMs) and state metrics (SMs) have a different degree of parallelism, which affects the mapping method of computational tasks to GPU threads. Finally, we can easily adjust the number of sub-frames per thread block to balance the occupancy and memory access traffic. Experimental results show that the radix-4 algorithm with the SM-centric mapping method shows the best performance at four sub-frames per thread block. According to our analysis, two factors -- the occupancy and shared memory bank conflicts -- differentiate the performance of different cases in the design space. We show further performance improvements by optimizing a kernel operation (max*) and applying the MAX-Log-Maximum A Posteriori (MAP) algorithm. A performance bottleneck at the finally optimized case is global memory access latency.\u0000 Since the most optimized performance is comparable to that of the other programmable platforms, the GPU can be considered as another type of coprocessor for Turbo decoding implementations in mobile devices.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125946136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 22
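The max* kernel operation named in the abstract is the Jacobian logarithm, and the Max-Log-MAP approximation simply drops its correction term. A small C illustration (compile with -lm); the example values are arbitrary:

/* max*(a,b) = log(e^a + e^b) = max(a,b) + log(1 + e^-|a-b|).
 * Max-Log-MAP approximates this by omitting the correction term. */
#include <math.h>
#include <stdio.h>

static double max_star(double a, double b) {
    return fmax(a, b) + log1p(exp(-fabs(a - b)));
}

static double max_log(double a, double b) {
    return fmax(a, b);   /* cheaper, slightly less accurate */
}

int main(void) {
    double a = 1.3, b = 0.9;
    printf("exact max* = %f, max-log approximation = %f\n",
           max_star(a, b), max_log(a, b));
    return 0;
}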
Improved procedure placement for set associative caches
Yun Liang, T. Mitra
{"title":"Improved procedure placement for set associative caches","authors":"Yun Liang, T. Mitra","doi":"10.1145/1878921.1878944","DOIUrl":"https://doi.org/10.1145/1878921.1878944","url":null,"abstract":"The performance of most embedded systems is critically dependent on the memory hierarchy performance. In particular, higher cache hit rate can provide significant performance boost to an embedded application. Procedure placement is a popular technique that aims to improve instruction cache hit rate by reducing conflicts in the cache through compile/link time reordering of procedures. However, existing procedure placement techniques make reordering decisions based on imprecise conflict information. This imprecision leads to limited and sometimes negative performance gain, specially for set-associative caches. In this paper, we introduce intermediate blocks profile (IBP) to accurately but compactly model cost-benefit of procedure placement for both direct mapped and set associative caches. We propose an efficient algorithm that exploits IBP to place procedures in memory such that cache conflicts are minimized. Experimental results demonstrate that our approach provides substantial improvement in cache performance over existing procedure placement techniques. Furthermore, we observe that the code layout for a specific cache configuration is not portable across different cache configurations. To solve this problem, we propose an algorithm that exploits IBP to place procedures in memory such that the average cache miss rate across a set of cache configurations is minimized.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124458715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
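A back-of-the-envelope illustration of why placement matters: the toy model below merely counts cache sets shared by two procedures under two candidate layouts, which is far cruder than the paper's IBP cost-benefit model; the cache parameters and addresses are arbitrary.

/* Two procedures conflict when their code maps to the same cache sets;
 * moving one of them can eliminate the overlap entirely. */
#include <stdio.h>

#define LINE 32       /* bytes per cache line */
#define SETS 64       /* sets in the cache    */

/* Mark which sets the code region [start, start+size) touches. */
static void footprint(unsigned start, unsigned size, int sets[SETS]) {
    for (unsigned a = start; a < start + size; a += LINE)
        sets[(a / LINE) % SETS] = 1;
}

static int overlap(unsigned s1, unsigned n1, unsigned s2, unsigned n2) {
    int f1[SETS] = {0}, f2[SETS] = {0}, cnt = 0;
    footprint(s1, n1, f1);
    footprint(s2, n2, f2);
    for (int i = 0; i < SETS; i++) cnt += f1[i] & f2[i];
    return cnt;
}

int main(void) {
    /* Same two 1 KB procedures, two candidate layouts. */
    printf("layout A: %d conflicting sets\n", overlap(0x0000, 1024, 0x0800, 1024));
    printf("layout B: %d conflicting sets\n", overlap(0x0000, 1024, 0x0400, 1024));
    return 0;
}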
Implementing dynamic implied addressing mode for multi-output instructions
Jonghee M. Youn, Jongwon Lee, Y. Paek, Jongwung Kim, Jeonghun Cho
{"title":"Implementing dynamic implied addressing mode for multi-output instructions","authors":"Jonghee M. Youn, Jongwon Lee, Y. Paek, Jongwung Kim, Jeonghun Cho","doi":"10.1145/1878921.1878937","DOIUrl":"https://doi.org/10.1145/1878921.1878937","url":null,"abstract":"The ever-increasing demand for faster execution time, smaller resource usage and lower energy consumption has compelled architects of embedded processors to adopt more specialized hardware features with irregular data paths and heterogeneous registers that are customized to the needs of their target applications. These processors consequently provide a rich set of specialized instructions in order to enable programmers to access these features. Such an instruction is typically a multi-output instruction (MOI), which outputs multiple results parallely in order to exploit inherent underlying hardware parallelism. Earlier study has exhibited that MOIs help to enhance performance in aspect of instruction counts and code size. However, as MOIs require more operands, they tend to increase not only the size of the instruction set but also the size of individual instructions. This can be a serious setback for embedded processors, which are mostly subject to strong resource limitations (particularly in this case, limited instruction encoding space). For this reason, these processors are often allowed to include only a very small subset of the total desired MOIs in their instruction sets, despite there can be sufficient silicon real estate to accommodate these specialized MOIs. To attack this problem, we introduce a novel instruction encoding scheme based on the dynamic implied addressing mode (DIAM). In this paper, we will discuss how we have overcome the encoding space problem for our target embedded processor whose instruction set has been augmented with a variety of MOIs. Our DIAM-based encoding scheme employs a small on-chip buffer to supplement extra encoding information for MOIs at run time. The empirical results are promising: the scheme allows us to encode many more MOIs for our processor; thereby helping us to achieve considerable reduction of code size as well as running time after the DIAM is additively implemented in the original architecture.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125500622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
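A hypothetical sketch of the DIAM mechanism as the abstract describes it: one destination is encoded explicitly, and the remaining destinations are implied, fetched at decode time from a small buffer that software programs beforehand. The encoding details, buffer size, and function names here are invented for illustration.

/* Toy decoder: a multi-output instruction (MOI) names one destination
 * explicitly and pulls its second destination from a DIAM buffer slot. */
#include <stdio.h>

#define DIAM_SLOTS 4

static int diam_buf[DIAM_SLOTS];   /* extra destination register numbers */

/* Program the buffer before issuing MOIs that rely on implied operands. */
static void diam_set(int slot, int reg) { diam_buf[slot] = reg; }

/* Decode an MOI: explicit first destination, implied second destination. */
static void decode_moi(int explicit_dest, int diam_slot) {
    printf("MOI -> r%d (explicit), r%d (implied via DIAM slot %d)\n",
           explicit_dest, diam_buf[diam_slot], diam_slot);
}

int main(void) {
    diam_set(0, 7);      /* implied second destination: r7 */
    decode_moi(3, 0);    /* e.g. an operation producing two results */
    return 0;
}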
Hardware-based data value and address trace filtering techniques
Vladimir Uzelac, A. Milenković
{"title":"Hardware-based data value and address trace filtering techniques","authors":"Vladimir Uzelac, A. Milenković","doi":"10.1145/1878921.1878940","DOIUrl":"https://doi.org/10.1145/1878921.1878940","url":null,"abstract":"Capturing program and data traces during program execution unobtrusively in real-time is crucial in debugging and testing of cyber-physical systems. However, tracing a complete program unobtrusively is often cost-prohibitive, requiring large on-chip trace buffers and wide trace ports. Whereas program execution traces can be efficiently compressed in hardware, compression of data address and data value traces is much more challenging due to limited redundancy. In this paper we describe two hardware-based filtering techniques for data traces: cache first-access tracking for load data values and data address filtering using partial register-file replay. The results of our experimental analysis indicate that the proposed filtering techniques can significantly reduce the size of the data traces (~5 20 times for the load data value trace, depending on the data cache size; and ~5 times for the data address trace) at the cost of rather small hardware structures in the trace module.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131210012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
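A simplified C model of cache first-access filtering for load values: a value is emitted to the trace only on the first read of a cache line after a fill, because a debugger replaying the program with its own cache model can reconstruct later hits itself. The direct-mapped cache, per-line (rather than per-word) granularity, and trace format are simplifying assumptions.

/* Trace a load value only on the first access to a (re)filled line. */
#include <stdint.h>
#include <stdio.h>

#define LINES   16
#define LINE_SZ 32

typedef struct { uint32_t tag; int valid, first_done; } Line;
static Line cache[LINES];

static void load(uint32_t addr, uint32_t value) {
    uint32_t idx = (addr / LINE_SZ) % LINES;
    uint32_t tag = addr / LINE_SZ / LINES;
    Line *l = &cache[idx];
    if (!l->valid || l->tag != tag) {        /* miss: (re)fill the line */
        l->valid = 1; l->tag = tag; l->first_done = 0;
    }
    if (!l->first_done) {                    /* first access: emit trace */
        printf("TRACE addr=0x%08x value=%u\n", (unsigned)addr, (unsigned)value);
        l->first_done = 1;
    }                                        /* later hits: filtered out */
}

int main(void) {
    load(0x1000, 42);   /* traced                    */
    load(0x1004, 43);   /* same line: filtered       */
    load(0x2000, 99);   /* new line: traced          */
    return 0;
}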
Enabling large decoded instruction loop caching for energy-aware embedded processors
Ji Gu, Hui Guo
{"title":"Enabling large decoded instruction loop caching for energy-aware embedded processors","authors":"Ji Gu, Hui Guo","doi":"10.1145/1878921.1878957","DOIUrl":"https://doi.org/10.1145/1878921.1878957","url":null,"abstract":"Low energy consumption in embedded processors is increasingly important in step with the system complexity. The on-chip instruction cache (I-cache) is usually a most energy consuming component on the processor chip due to its large size and frequent access operations. To reduce such energy consumption, the existing loop cache approaches use a tiny decoded cache to filter the I-cache access and instruction decode activity for repeated loop iterations. However, such designs are effective to small and simple loops, and only suitable for DSP kernel-like applications. They are not effectual to many embedded applications where complex loops are common.\u0000 In this paper, we propose a decoded loop instruction cache (DLIC) that is small, hence energy efficient, yet can capture most loops, including large, nested ones with branch executions, so that a significant amount of I-cache accesses and instruction decoding can be eradicated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87%. On average, 66% energy can be saved on instruction fetching and decoding, at a performance overhead of only 1.4%.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114418898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
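A toy controller in the spirit of a decoded loop cache: a short backward branch triggers a fill pass over the next iteration, after which iterations are served from the decoded cache, skipping both I-cache access and decode. Nested loops and internal branches, which the DLIC itself handles, are omitted here; the threshold and names are assumptions.

/* Three-state loop cache controller: IDLE -> FILLING -> ACTIVE. */
#include <stdio.h>

#define LC_SIZE 32

static enum { IDLE, FILLING, ACTIVE } state = IDLE;
static int lc[LC_SIZE];           /* "decoded" instructions */
static int loop_start, loop_end;

static void fetch(int pc, int decoded, int backward_branch_to) {
    if (state == ACTIVE && pc >= loop_start && pc <= loop_end) {
        printf("pc %d: served from loop cache (%d)\n", pc, lc[pc - loop_start]);
        return;                                  /* no I-cache, no decode */
    }
    printf("pc %d: I-cache fetch + decode\n", pc);
    if (state == FILLING && pc >= loop_start && pc <= loop_end) {
        lc[pc - loop_start] = decoded;           /* record decoded form */
        if (pc == loop_end) state = ACTIVE;
    } else if (backward_branch_to >= 0 && pc - backward_branch_to < LC_SIZE) {
        loop_start = backward_branch_to;         /* short backward branch */
        loop_end = pc;
        state = FILLING;
    }
}

int main(void) {
    /* Three iterations of a 3-instruction loop at pc 10..12. */
    for (int it = 0; it < 3; it++)
        for (int pc = 10; pc <= 12; pc++)
            fetch(pc, 1000 + pc, pc == 12 ? 10 : -1);
    return 0;
}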
Eliminating false phase interactions to reduce optimization phase order search space
Michael R. Jantz, P. Kulkarni
{"title":"Eliminating false phase interactions to reduce optimization phase order search space","authors":"Michael R. Jantz, P. Kulkarni","doi":"10.1145/1878921.1878950","DOIUrl":"https://doi.org/10.1145/1878921.1878950","url":null,"abstract":"Compiler optimization phase ordering is a long-standing problem, and is of particular relevance to the performance-oriented and cost constrained domain of embedded systems applications. Optimization phases are known to interact with each other, enabling and disabling opportunities for successive phases. Therefore, varying the order of applying these phases often generates distinct output codes, with different speed, code-size and power consumption characteristics. Most current approaches to address this issue focus on developing innovative methods to selectively evaluate the vast phase order search space to produce a good (but, potentially suboptimal) representation for each program.\u0000 In contrast, the goal of this work is to study and identify common causes of optimization phase interactions across all phases, and then devise techniques to eliminate them, if and when possible. We observe that several phase interactions are caused by false register dependence during many optimization phases. We further find that depending on the implementation of optimization phases, even an increased availability of registers may not be able to significantly reduce such false register dependences. We explore the potential of cleanup phases, such as register remapping and copy propagation, at reducing false dependences. We show that innovative implementation and application of these phases to reduce false register dependences not only reduces the size of the phase order search space substantially, but can also improve the quality of code generated by optimizing compilers.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133251686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
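A toy illustration of the false register dependence the paper targets: the second definition of r1 below merely reuses the register name, and remapping that live range to a free register removes the write-after-read/write dependences that can block other phases. The three-address representation and the remapping helper are invented for illustration.

/* Rename a register's second live range to break false dependences. */
#include <stdio.h>

typedef struct { int dst, src1, src2; } Insn;   /* dst = src1 + src2 */

static void remap(Insn *code, int n, int old_reg, int new_reg, int from) {
    for (int i = from; i < n; i++) {            /* rename from index on */
        if (code[i].dst  == old_reg) code[i].dst  = new_reg;
        if (code[i].src1 == old_reg) code[i].src1 = new_reg;
        if (code[i].src2 == old_reg) code[i].src2 = new_reg;
    }
}

static void dump(const Insn *code, int n) {
    for (int i = 0; i < n; i++)
        printf("r%d = r%d + r%d\n", code[i].dst, code[i].src1, code[i].src2);
}

int main(void) {
    Insn code[] = {
        {1, 4, 5},   /* r1 = r4 + r5                            */
        {2, 1, 1},   /* r2 = r1 + r1 (last use of first range)  */
        {1, 6, 7},   /* r1 redefined: false WAR/WAW dependence  */
        {0, 1, 2},
    };
    dump(code, 4);
    remap(code, 4, 1, 3, 2);     /* rename second live range of r1 to r3 */
    printf("-- after register remapping --\n");
    dump(code, 4);
    return 0;
}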
Implementing virtual secure circuit using a custom-instruction approach
Zhimin Chen, A. Sinha, P. Schaumont
{"title":"Implementing virtual secure circuit using a custom-instruction approach","authors":"Zhimin Chen, A. Sinha, P. Schaumont","doi":"10.1145/1878921.1878933","DOIUrl":"https://doi.org/10.1145/1878921.1878933","url":null,"abstract":"Although cryptographic algorithms are designed to resist at least thousands of years of cryptoanalysis, implementing them with either software or hardware usually leaks additional information which may enable the attackers to break the cryptographic systems within days. A Side Channel Attack (SCA) is such a kind of attack that breaks a security system at a low cost within a short time. SCA uses side-channel leakage, such as the cryptographic implementations' execution time, power dissipation and magnetic radiation. This paper presents a countermeasure to protect software-based cryptography from SCA by emulating the behavior of the secure hardware circuits. The emulation is done by introducing two simple complementary instructions to the processor and applying a secure programming style. We call the resulting secure software program a Virtual Secure Circuit (VSC). VSC inherits the idea of a secure logic circuit, a hardware SCA countermeasure. It not only maintains the secure circuits' generality without limitation to a specific algorithm, but also increases its flexibility. Experiments on a prototype implementation demonstrated that the new countermeasure considerably increases the difficulty of the attacks by 20 times, which is in the same order as the improvement achieved by the dedicated secure hardware circuits. Therefore, we conclude that VSC is an efficient way to protect cryptographic software.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128353044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 18
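A C sketch of the complementary-execution idea behind such a countermeasure: every operation runs on the data and, in parallel, on its bitwise complement, so the combined Hamming weight of the two rails stays constant regardless of the key-dependent values. The paper realizes this with two custom processor instructions; the plain C stand-ins below (where AND on one rail becomes OR on the complement rail, by De Morgan's law) are illustrative only.

/* Dual-rail style computation: weight(r) + weight(rc) is constant. */
#include <stdint.h>
#include <stdio.h>

static int hamming(uint8_t v) {
    int c = 0;
    while (v) { c += v & 1; v >>= 1; }
    return c;
}

/* Complementary AND pair: ~(a & b) == ~a | ~b keeps the rails consistent. */
static void and_pair(uint8_t a, uint8_t b, uint8_t ac, uint8_t bc,
                     uint8_t *r, uint8_t *rc) {
    *r  = a & b;      /* true rail        */
    *rc = ac | bc;    /* complement rail  */
}

int main(void) {
    uint8_t key = 0x5A, in = 0xC3, r, rc;
    and_pair(key, in, (uint8_t)~key, (uint8_t)~in, &r, &rc);
    /* Per-rail weights vary with the data; their sum is always 8. */
    printf("weight(r)=%d weight(rc)=%d sum=%d\n",
           hamming(r), hamming(rc), hamming(r) + hamming(rc));
    return 0;
}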