{"title":"Resource recycling: putting idle resources to work on a composable accelerator","authors":"Yongjun Park, Hyunchul Park, S. Mahlke, Sukjin Kim","doi":"10.1145/1878921.1878925","DOIUrl":"https://doi.org/10.1145/1878921.1878925","url":null,"abstract":"Mobile computing platforms in the form of smart phones, netbooks, and personal digital assistants have become an integral part of our everyday lives. Moving ahead to the future, mobile multimedia support will become a key differentiating factor for customers. Features such as high-definition audio and video, video conferencing, 3D graphics, and image projection will lead to the adoption of one phone over another. However, in contrast to wireless signal processing which is dominated by vectorizable computation, mobile multimedia applications often contain complex control flow and variable computational requirements. Moreover, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. To handle these complexities, composable accelerators such as the Polymorphic Pipeline Array, or PPA, present an appealing hardware platform by adding a degree of hardware configurability over existing accelerators. Hardware resources can be both statically as well as dynamically partitioned among executing tasks to maximize execution efficiency. However, an effective compilation framework is essential to partition and assign resources to make intelligent use of the available hardware. In this paper, a compilation framework is introduced that maximizes application throughput with hybrid resource partitioning of a PPA system. Static partitioning handles part of the resource assignment, but this is followed up by dynamic partitioning to identify idle resources and put them to use -- resource recycling. Experimental results show that real-time media applications can take advantage of the static and dynamic configurability of the PPA for increase.\u0000 throughput.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129277366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and exploitation of narrow-width loads: the narrow-width cache approach","authors":"M. Islam, P. Stenström","doi":"10.1145/1878921.1878955","DOIUrl":"https://doi.org/10.1145/1878921.1878955","url":null,"abstract":"This paper exploits small-value locality to accelerate the execution of memory instructions. We find that narrow-width loads (NWLDs) --- loads with small-value operands of 8 bits or less --- comprise 26% of all executed loads across 40 applications of the SPEC benchmark suites. We establish that the frequency of NWLDs are almost independent of compiler and input data. We introduce narrow-width caches (NWC) to cache small-value memory words. NWCs provide a significant speedup for several memory-intensive applications with a negligible chip-area overhead. NWCs also reduce the overall energy dissipation and memory traffic.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129224586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Erbium: a deterministic, concurrent intermediate representation to map data-flow tasks to scalable, persistent streaming processes","authors":"Cupertino Miranda, Antoniu Pop, Philippe Dumont, Albert Cohen, M. Duranton","doi":"10.1145/1878921.1878924","DOIUrl":"https://doi.org/10.1145/1878921.1878924","url":null,"abstract":"Tuning applications for multicore systems involve subtle concurrency concepts and target-dependent optimizations. This paper advocates for a streaming execution model, called ER, where persistent processes communicate and synchronize through a multi-consumer processing applications, we demonstrate the scalability and efficiency advantages of streaming compared to data-driven scheduling. To exploit these benefits in compilers for parallel languages, we propose an intermediate representation enabling the compilation of data-flow tasks into streaming processes. This intermediate representation also facilitates the application of classical compiler optimizations to concurrent programs.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129269800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design space exploration of the turbo decoding algorithm on GPUs","authors":"Dongwon Lee, M. Wolf, Hyesoon Kim","doi":"10.1145/1878921.1878953","DOIUrl":"https://doi.org/10.1145/1878921.1878953","url":null,"abstract":"In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and find a performance bottleneck. We consider three axes for the design space exploration: a radix degree, a parallelization method, and the number of sub-frames per thread block. In Turbo decoding, a degree of radix affects computational complexity and memory access patterns in both algorithmic and implementation viewpoints. Second, computations of branch metrics (BMs) and state metrics (SMs) have a different degree of parallelism, which affects the mapping method of computational tasks to GPU threads. Finally, we can easily adjust the number of sub-frames per thread block to balance the occupancy and memory access traffic. Experimental results show that the radix-4 algorithm with the SM-centric mapping method shows the best performance at four sub-frames per thread block. According to our analysis, two factors -- the occupancy and shared memory bank conflicts -- differentiate the performance of different cases in the design space. We show further performance improvements by optimizing a kernel operation (max*) and applying the MAX-Log-Maximum A Posteriori (MAP) algorithm. A performance bottleneck at the finally optimized case is global memory access latency.\u0000 Since the most optimized performance is comparable to that of the other programmable platforms, the GPU can be considered as another type of coprocessor for Turbo decoding implementations in mobile devices.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125946136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved procedure placement for set associative caches","authors":"Yun Liang, T. Mitra","doi":"10.1145/1878921.1878944","DOIUrl":"https://doi.org/10.1145/1878921.1878944","url":null,"abstract":"The performance of most embedded systems is critically dependent on the memory hierarchy performance. In particular, higher cache hit rate can provide significant performance boost to an embedded application. Procedure placement is a popular technique that aims to improve instruction cache hit rate by reducing conflicts in the cache through compile/link time reordering of procedures. However, existing procedure placement techniques make reordering decisions based on imprecise conflict information. This imprecision leads to limited and sometimes negative performance gain, specially for set-associative caches. In this paper, we introduce intermediate blocks profile (IBP) to accurately but compactly model cost-benefit of procedure placement for both direct mapped and set associative caches. We propose an efficient algorithm that exploits IBP to place procedures in memory such that cache conflicts are minimized. Experimental results demonstrate that our approach provides substantial improvement in cache performance over existing procedure placement techniques. Furthermore, we observe that the code layout for a specific cache configuration is not portable across different cache configurations. To solve this problem, we propose an algorithm that exploits IBP to place procedures in memory such that the average cache miss rate across a set of cache configurations is minimized.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124458715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing dynamic implied addressing mode for multi-output instructions","authors":"Jonghee M. Youn, Jongwon Lee, Y. Paek, Jongwung Kim, Jeonghun Cho","doi":"10.1145/1878921.1878937","DOIUrl":"https://doi.org/10.1145/1878921.1878937","url":null,"abstract":"The ever-increasing demand for faster execution time, smaller resource usage and lower energy consumption has compelled architects of embedded processors to adopt more specialized hardware features with irregular data paths and heterogeneous registers that are customized to the needs of their target applications. These processors consequently provide a rich set of specialized instructions in order to enable programmers to access these features. Such an instruction is typically a multi-output instruction (MOI), which outputs multiple results parallely in order to exploit inherent underlying hardware parallelism. Earlier study has exhibited that MOIs help to enhance performance in aspect of instruction counts and code size. However, as MOIs require more operands, they tend to increase not only the size of the instruction set but also the size of individual instructions. This can be a serious setback for embedded processors, which are mostly subject to strong resource limitations (particularly in this case, limited instruction encoding space). For this reason, these processors are often allowed to include only a very small subset of the total desired MOIs in their instruction sets, despite there can be sufficient silicon real estate to accommodate these specialized MOIs. To attack this problem, we introduce a novel instruction encoding scheme based on the dynamic implied addressing mode (DIAM). In this paper, we will discuss how we have overcome the encoding space problem for our target embedded processor whose instruction set has been augmented with a variety of MOIs. Our DIAM-based encoding scheme employs a small on-chip buffer to supplement extra encoding information for MOIs at run time. The empirical results are promising: the scheme allows us to encode many more MOIs for our processor; thereby helping us to achieve considerable reduction of code size as well as running time after the DIAM is additively implemented in the original architecture.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125500622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-based data value and address trace filtering techniques","authors":"Vladimir Uzelac, A. Milenković","doi":"10.1145/1878921.1878940","DOIUrl":"https://doi.org/10.1145/1878921.1878940","url":null,"abstract":"Capturing program and data traces during program execution unobtrusively in real-time is crucial in debugging and testing of cyber-physical systems. However, tracing a complete program unobtrusively is often cost-prohibitive, requiring large on-chip trace buffers and wide trace ports. Whereas program execution traces can be efficiently compressed in hardware, compression of data address and data value traces is much more challenging due to limited redundancy. In this paper we describe two hardware-based filtering techniques for data traces: cache first-access tracking for load data values and data address filtering using partial register-file replay. The results of our experimental analysis indicate that the proposed filtering techniques can significantly reduce the size of the data traces (~5 20 times for the load data value trace, depending on the data cache size; and ~5 times for the data address trace) at the cost of rather small hardware structures in the trace module.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131210012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling large decoded instruction loop caching for energy-aware embedded processors","authors":"Ji Gu, Hui Guo","doi":"10.1145/1878921.1878957","DOIUrl":"https://doi.org/10.1145/1878921.1878957","url":null,"abstract":"Low energy consumption in embedded processors is increasingly important in step with the system complexity. The on-chip instruction cache (I-cache) is usually a most energy consuming component on the processor chip due to its large size and frequent access operations. To reduce such energy consumption, the existing loop cache approaches use a tiny decoded cache to filter the I-cache access and instruction decode activity for repeated loop iterations. However, such designs are effective to small and simple loops, and only suitable for DSP kernel-like applications. They are not effectual to many embedded applications where complex loops are common.\u0000 In this paper, we propose a decoded loop instruction cache (DLIC) that is small, hence energy efficient, yet can capture most loops, including large, nested ones with branch executions, so that a significant amount of I-cache accesses and instruction decoding can be eradicated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87%. On average, 66% energy can be saved on instruction fetching and decoding, at a performance overhead of only 1.4%.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114418898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eliminating false phase interactions to reduce optimization phase order search space","authors":"Michael R. Jantz, P. Kulkarni","doi":"10.1145/1878921.1878950","DOIUrl":"https://doi.org/10.1145/1878921.1878950","url":null,"abstract":"Compiler optimization phase ordering is a long-standing problem, and is of particular relevance to the performance-oriented and cost constrained domain of embedded systems applications. Optimization phases are known to interact with each other, enabling and disabling opportunities for successive phases. Therefore, varying the order of applying these phases often generates distinct output codes, with different speed, code-size and power consumption characteristics. Most current approaches to address this issue focus on developing innovative methods to selectively evaluate the vast phase order search space to produce a good (but, potentially suboptimal) representation for each program.\u0000 In contrast, the goal of this work is to study and identify common causes of optimization phase interactions across all phases, and then devise techniques to eliminate them, if and when possible. We observe that several phase interactions are caused by false register dependence during many optimization phases. We further find that depending on the implementation of optimization phases, even an increased availability of registers may not be able to significantly reduce such false register dependences. We explore the potential of cleanup phases, such as register remapping and copy propagation, at reducing false dependences. We show that innovative implementation and application of these phases to reduce false register dependences not only reduces the size of the phase order search space substantially, but can also improve the quality of code generated by optimizing compilers.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133251686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing virtual secure circuit using a custom-instruction approach","authors":"Zhimin Chen, A. Sinha, P. Schaumont","doi":"10.1145/1878921.1878933","DOIUrl":"https://doi.org/10.1145/1878921.1878933","url":null,"abstract":"Although cryptographic algorithms are designed to resist at least thousands of years of cryptoanalysis, implementing them with either software or hardware usually leaks additional information which may enable the attackers to break the cryptographic systems within days. A Side Channel Attack (SCA) is such a kind of attack that breaks a security system at a low cost within a short time. SCA uses side-channel leakage, such as the cryptographic implementations' execution time, power dissipation and magnetic radiation. This paper presents a countermeasure to protect software-based cryptography from SCA by emulating the behavior of the secure hardware circuits. The emulation is done by introducing two simple complementary instructions to the processor and applying a secure programming style. We call the resulting secure software program a Virtual Secure Circuit (VSC). VSC inherits the idea of a secure logic circuit, a hardware SCA countermeasure. It not only maintains the secure circuits' generality without limitation to a specific algorithm, but also increases its flexibility. Experiments on a prototype implementation demonstrated that the new countermeasure considerably increases the difficulty of the attacks by 20 times, which is in the same order as the improvement achieved by the dedicated secure hardware circuits. Therefore, we conclude that VSC is an efficient way to protect cryptographic software.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128353044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}