{"title":"Resource recycling: putting idle resources to work on a composable accelerator","authors":"Yongjun Park, Hyunchul Park, S. Mahlke, Sukjin Kim","doi":"10.1145/1878921.1878925","DOIUrl":"https://doi.org/10.1145/1878921.1878925","url":null,"abstract":"Mobile computing platforms in the form of smart phones, netbooks, and personal digital assistants have become an integral part of our everyday lives. Moving ahead to the future, mobile multimedia support will become a key differentiating factor for customers. Features such as high-definition audio and video, video conferencing, 3D graphics, and image projection will lead to the adoption of one phone over another. However, in contrast to wireless signal processing which is dominated by vectorizable computation, mobile multimedia applications often contain complex control flow and variable computational requirements. Moreover, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. To handle these complexities, composable accelerators such as the Polymorphic Pipeline Array, or PPA, present an appealing hardware platform by adding a degree of hardware configurability over existing accelerators. Hardware resources can be both statically as well as dynamically partitioned among executing tasks to maximize execution efficiency. However, an effective compilation framework is essential to partition and assign resources to make intelligent use of the available hardware. In this paper, a compilation framework is introduced that maximizes application throughput with hybrid resource partitioning of a PPA system. Static partitioning handles part of the resource assignment, but this is followed up by dynamic partitioning to identify idle resources and put them to use -- resource recycling. Experimental results show that real-time media applications can take advantage of the static and dynamic configurability of the PPA for increase.\u0000 throughput.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129277366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and exploitation of narrow-width loads: the narrow-width cache approach","authors":"M. Islam, P. Stenström","doi":"10.1145/1878921.1878955","DOIUrl":"https://doi.org/10.1145/1878921.1878955","url":null,"abstract":"This paper exploits small-value locality to accelerate the execution of memory instructions. We find that narrow-width loads (NWLDs) --- loads with small-value operands of 8 bits or less --- comprise 26% of all executed loads across 40 applications of the SPEC benchmark suites. We establish that the frequency of NWLDs are almost independent of compiler and input data. We introduce narrow-width caches (NWC) to cache small-value memory words. NWCs provide a significant speedup for several memory-intensive applications with a negligible chip-area overhead. NWCs also reduce the overall energy dissipation and memory traffic.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129224586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Erbium: a deterministic, concurrent intermediate representation to map data-flow tasks to scalable, persistent streaming processes","authors":"Cupertino Miranda, Antoniu Pop, Philippe Dumont, Albert Cohen, M. Duranton","doi":"10.1145/1878921.1878924","DOIUrl":"https://doi.org/10.1145/1878921.1878924","url":null,"abstract":"Tuning applications for multicore systems involve subtle concurrency concepts and target-dependent optimizations. This paper advocates for a streaming execution model, called ER, where persistent processes communicate and synchronize through a multi-consumer processing applications, we demonstrate the scalability and efficiency advantages of streaming compared to data-driven scheduling. To exploit these benefits in compilers for parallel languages, we propose an intermediate representation enabling the compilation of data-flow tasks into streaming processes. This intermediate representation also facilitates the application of classical compiler optimizations to concurrent programs.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129269800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design space exploration of the turbo decoding algorithm on GPUs","authors":"Dongwon Lee, M. Wolf, Hyesoon Kim","doi":"10.1145/1878921.1878953","DOIUrl":"https://doi.org/10.1145/1878921.1878953","url":null,"abstract":"In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and find a performance bottleneck. We consider three axes for the design space exploration: a radix degree, a parallelization method, and the number of sub-frames per thread block. In Turbo decoding, a degree of radix affects computational complexity and memory access patterns in both algorithmic and implementation viewpoints. Second, computations of branch metrics (BMs) and state metrics (SMs) have a different degree of parallelism, which affects the mapping method of computational tasks to GPU threads. Finally, we can easily adjust the number of sub-frames per thread block to balance the occupancy and memory access traffic. Experimental results show that the radix-4 algorithm with the SM-centric mapping method shows the best performance at four sub-frames per thread block. According to our analysis, two factors -- the occupancy and shared memory bank conflicts -- differentiate the performance of different cases in the design space. We show further performance improvements by optimizing a kernel operation (max*) and applying the MAX-Log-Maximum A Posteriori (MAP) algorithm. A performance bottleneck at the finally optimized case is global memory access latency.\u0000 Since the most optimized performance is comparable to that of the other programmable platforms, the GPU can be considered as another type of coprocessor for Turbo decoding implementations in mobile devices.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125946136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved procedure placement for set associative caches","authors":"Yun Liang, T. Mitra","doi":"10.1145/1878921.1878944","DOIUrl":"https://doi.org/10.1145/1878921.1878944","url":null,"abstract":"The performance of most embedded systems is critically dependent on the memory hierarchy performance. In particular, higher cache hit rate can provide significant performance boost to an embedded application. Procedure placement is a popular technique that aims to improve instruction cache hit rate by reducing conflicts in the cache through compile/link time reordering of procedures. However, existing procedure placement techniques make reordering decisions based on imprecise conflict information. This imprecision leads to limited and sometimes negative performance gain, specially for set-associative caches. In this paper, we introduce intermediate blocks profile (IBP) to accurately but compactly model cost-benefit of procedure placement for both direct mapped and set associative caches. We propose an efficient algorithm that exploits IBP to place procedures in memory such that cache conflicts are minimized. Experimental results demonstrate that our approach provides substantial improvement in cache performance over existing procedure placement techniques. Furthermore, we observe that the code layout for a specific cache configuration is not portable across different cache configurations. To solve this problem, we propose an algorithm that exploits IBP to place procedures in memory such that the average cache miss rate across a set of cache configurations is minimized.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124458715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing dynamic implied addressing mode for multi-output instructions","authors":"Jonghee M. Youn, Jongwon Lee, Y. Paek, Jongwung Kim, Jeonghun Cho","doi":"10.1145/1878921.1878937","DOIUrl":"https://doi.org/10.1145/1878921.1878937","url":null,"abstract":"The ever-increasing demand for faster execution time, smaller resource usage and lower energy consumption has compelled architects of embedded processors to adopt more specialized hardware features with irregular data paths and heterogeneous registers that are customized to the needs of their target applications. These processors consequently provide a rich set of specialized instructions in order to enable programmers to access these features. Such an instruction is typically a multi-output instruction (MOI), which outputs multiple results parallely in order to exploit inherent underlying hardware parallelism. Earlier study has exhibited that MOIs help to enhance performance in aspect of instruction counts and code size. However, as MOIs require more operands, they tend to increase not only the size of the instruction set but also the size of individual instructions. This can be a serious setback for embedded processors, which are mostly subject to strong resource limitations (particularly in this case, limited instruction encoding space). For this reason, these processors are often allowed to include only a very small subset of the total desired MOIs in their instruction sets, despite there can be sufficient silicon real estate to accommodate these specialized MOIs. To attack this problem, we introduce a novel instruction encoding scheme based on the dynamic implied addressing mode (DIAM). In this paper, we will discuss how we have overcome the encoding space problem for our target embedded processor whose instruction set has been augmented with a variety of MOIs. Our DIAM-based encoding scheme employs a small on-chip buffer to supplement extra encoding information for MOIs at run time. The empirical results are promising: the scheme allows us to encode many more MOIs for our processor; thereby helping us to achieve considerable reduction of code size as well as running time after the DIAM is additively implemented in the original architecture.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125500622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-based data value and address trace filtering techniques","authors":"Vladimir Uzelac, A. Milenković","doi":"10.1145/1878921.1878940","DOIUrl":"https://doi.org/10.1145/1878921.1878940","url":null,"abstract":"Capturing program and data traces during program execution unobtrusively in real-time is crucial in debugging and testing of cyber-physical systems. However, tracing a complete program unobtrusively is often cost-prohibitive, requiring large on-chip trace buffers and wide trace ports. Whereas program execution traces can be efficiently compressed in hardware, compression of data address and data value traces is much more challenging due to limited redundancy. In this paper we describe two hardware-based filtering techniques for data traces: cache first-access tracking for load data values and data address filtering using partial register-file replay. The results of our experimental analysis indicate that the proposed filtering techniques can significantly reduce the size of the data traces (~5 20 times for the load data value trace, depending on the data cache size; and ~5 times for the data address trace) at the cost of rather small hardware structures in the trace module.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131210012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling large decoded instruction loop caching for energy-aware embedded processors","authors":"Ji Gu, Hui Guo","doi":"10.1145/1878921.1878957","DOIUrl":"https://doi.org/10.1145/1878921.1878957","url":null,"abstract":"Low energy consumption in embedded processors is increasingly important in step with the system complexity. The on-chip instruction cache (I-cache) is usually a most energy consuming component on the processor chip due to its large size and frequent access operations. To reduce such energy consumption, the existing loop cache approaches use a tiny decoded cache to filter the I-cache access and instruction decode activity for repeated loop iterations. However, such designs are effective to small and simple loops, and only suitable for DSP kernel-like applications. They are not effectual to many embedded applications where complex loops are common.\u0000 In this paper, we propose a decoded loop instruction cache (DLIC) that is small, hence energy efficient, yet can capture most loops, including large, nested ones with branch executions, so that a significant amount of I-cache accesses and instruction decoding can be eradicated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87%. On average, 66% energy can be saved on instruction fetching and decoding, at a performance overhead of only 1.4%.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114418898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eliminating false phase interactions to reduce optimization phase order search space","authors":"Michael R. Jantz, P. Kulkarni","doi":"10.1145/1878921.1878950","DOIUrl":"https://doi.org/10.1145/1878921.1878950","url":null,"abstract":"Compiler optimization phase ordering is a long-standing problem, and is of particular relevance to the performance-oriented and cost constrained domain of embedded systems applications. Optimization phases are known to interact with each other, enabling and disabling opportunities for successive phases. Therefore, varying the order of applying these phases often generates distinct output codes, with different speed, code-size and power consumption characteristics. Most current approaches to address this issue focus on developing innovative methods to selectively evaluate the vast phase order search space to produce a good (but, potentially suboptimal) representation for each program.\u0000 In contrast, the goal of this work is to study and identify common causes of optimization phase interactions across all phases, and then devise techniques to eliminate them, if and when possible. We observe that several phase interactions are caused by false register dependence during many optimization phases. We further find that depending on the implementation of optimization phases, even an increased availability of registers may not be able to significantly reduce such false register dependences. We explore the potential of cleanup phases, such as register remapping and copy propagation, at reducing false dependences. We show that innovative implementation and application of these phases to reduce false register dependences not only reduces the size of the phase order search space substantially, but can also improve the quality of code generated by optimizing compilers.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133251686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing virtual secure circuit using a custom-instruction approach","authors":"Zhimin Chen, A. Sinha, P. Schaumont","doi":"10.1145/1878921.1878933","DOIUrl":"https://doi.org/10.1145/1878921.1878933","url":null,"abstract":"Although cryptographic algorithms are designed to resist at least thousands of years of cryptoanalysis, implementing them with either software or hardware usually leaks additional information which may enable the attackers to break the cryptographic systems within days. A Side Channel Attack (SCA) is such a kind of attack that breaks a security system at a low cost within a short time. SCA uses side-channel leakage, such as the cryptographic implementations' execution time, power dissipation and magnetic radiation. This paper presents a countermeasure to protect software-based cryptography from SCA by emulating the behavior of the secure hardware circuits. The emulation is done by introducing two simple complementary instructions to the processor and applying a secure programming style. We call the resulting secure software program a Virtual Secure Circuit (VSC). VSC inherits the idea of a secure logic circuit, a hardware SCA countermeasure. It not only maintains the secure circuits' generality without limitation to a specific algorithm, but also increases its flexibility. Experiments on a prototype implementation demonstrated that the new countermeasure considerably increases the difficulty of the attacks by 20 times, which is in the same order as the improvement achieved by the dedicated secure hardware circuits. Therefore, we conclude that VSC is an efficient way to protect cryptographic software.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128353044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}