{"title":"Multi-granularity sampling for simulating concurrent heterogeneous applications","authors":"Melhem Tawk, K. Ibrahim, S. Niar","doi":"10.1145/1450095.1450127","DOIUrl":"https://doi.org/10.1145/1450095.1450127","url":null,"abstract":"Detailed or cycle-accurate/bit-accurate (CABA) simulation is a critical phase in the design flow of embedded systems. However, with increasing system complexity, full detailed simulation is prohibitively slower than the hardware being simulated. In this paper, we present an approach that uses the sampling technique to speed up the design flow of Multiprocessor System-on-Chip (MPSoC) systems. Based on the dynamic behavior of the applications running concurrently, our method dynamically chooses between multiple granularities of the sampling phase. The similarities of the execution phases for all possible granularities are first analyzed, then transitions between phase overlaps are discretized. To facilitate the detection of repetitions, one phase, with an appropriate granularity, is chosen per process. Unlike most other proposals, the associated performance is usually accurate enough not to need repeated resampling. The use of checkpointing in conjunction with our approach is simplified because the amount of the needed disk space is significantly reduced. Experimental results show that the simulation of concurrent heterogeneous applications can be accelerated by a factor of up to 60x, while maintaining an average performance estimation error lower than 5%.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126577772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution context optimization for disk energy","authors":"Jerry Hom, U. Kremer","doi":"10.1145/1450095.1450132","DOIUrl":"https://doi.org/10.1145/1450095.1450132","url":null,"abstract":"Power, energy, and thermal concerns have constrained embedded systems designs. Computing capability and storage density have increased dramatically, enabling the emergence of handheld devices from special to general purpose computing. In many mobile systems, the disk is among the top energy consumers. Many previous optimizations for disk energy have assumed uniprogramming environments. However, many optimizations degrade in multiprogramming because programs are unaware of other programs (execution context). We introduce a framework to make programs aware of and adapt to their runtime execution context.\u0000 We evaluated real workloads by collecting user activity traces and characterizing the execution contexts. The study confirms that many users run a limited number of programs concurrently. We applied execution context optimizations to eight programs and tested ten combinations. The programs ran concurrently while the disk's power was measured. Our measurement infrastructure allows interactive sessions to be scripted, recorded, and replayed to compare the optimizations' effects against the baseline. Our experiments covered two write cache policies. For write-through, energy savings was in the range 3-63% with an average of 21%. For write-back, energy savings was in the range -33-61% with an average of 8%. In all cases, our optimizations incurred less than 1% performance penalty.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123719020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZebraNet and beyond: applications and systems support for mobile, dynamic networks","authors":"M. Martonosi","doi":"10.1145/1450095.1450096","DOIUrl":"https://doi.org/10.1145/1450095.1450096","url":null,"abstract":"Mobile and wireless computing has the potential to offer the next big revolution in how we relate to and make use of our computing devices. In addition to untethered operation, mobile and distributed systems offer the opportunity to consider new computational models in which dynamic, sparsely-connected confederations of embedded computing devices collaborate across wide areas to gather information and solve problems. These systems, however, face significant energy, form-factor and connectivity constraints that influence how they should be designed. In this talk, I will describe our experiences building the ZebraNet system for wildlife tracking, based on sparse, mobile collections of GPS-based sensing devices. Drawing from ZebraNet and other systems experiences, my most recent work seeks to provide dynamic, optimizable systems layers for expressing mobile node relationships and for allowing distributed confederations of nodes to collaborate. I will discuss both the technical challenges of dynamic networks, as well as the broader opportunities for deploying such networks, particularly in low-infrastructure developing regions.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130645483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active control and digital rights management of integrated circuit IP cores","authors":"Y. Alkabani, F. Koushanfar","doi":"10.1145/1450095.1450129","DOIUrl":"https://doi.org/10.1145/1450095.1450129","url":null,"abstract":"We introduce the first approach that can actively control multiple hardware intellectual property (IP) cores used in an integrated circuit (IC). The IP rights owner(s) can remotely monitor, control, enable, or disable each individual IP on each chip. The approach introduces a paradigm shift in the microelectronic business model, nurturing smaller businesses, and supporting the design-reuse paradigm. The IPs can be controlled by the original designer or by the designers who reuse them. Each IP has a built-in functional lock that pertains to the unique unclonable ID of the chip. A control structure that coordinates the locking and unlocking of the IPs is embedded within the IC. We introduce a trusted third party approach for issuing certificates of authenticity, in case it is required for the applications. We present methods for safeguarding the approach against two attack sources: the foundry (fab), and the reuser. Experimental results show that our approach can be implemented with low area, power, and delay overheads making it suitable for embedded systems. The introduced control method is also low overhead in terms of the added steps to the current design and manufacturing flow.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134400238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic coprocessor management for FPGA-enhanced compute platforms","authors":"Chen-Chun Huang, F. Vahid","doi":"10.1145/1450095.1450108","DOIUrl":"https://doi.org/10.1145/1450095.1450108","url":null,"abstract":"Various commercial programmable compute platforms have their processor architecture enhanced with field-programmable gate arrays (FPGAs). In a common usage scenario, an application loads custom processors into the FPGA to speed up application execution compared to processor-only execution. Transient applications, changing application workloads, and limited FPGA capacity have led to a new problem of operating-system-controlled dynamic management of the loading of coprocessors into the FPGAs for best overall performance or energy. We define the Dynamic Coprocessor Management problem and provide a mapping to an online optimization problem known as Metrical Task Systems. We introduce a robust heuristic, called the fading cumulative benefit (FCBenefit) heuristic, that outperforms other heuristics, including a previously developed one for MTS. For two distinct application sets, we generate numerous workloads and show that the FCBenefit heuristic provides best results across all considered workloads. In our simulations, the heuristic's results were within 9% of the offline optimal for performance, and within 3% for energy. The heuristic may be applicable to a wide variety of dynamic architecture management problems.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"51 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134457141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient code caching to improve performance and energy consumption for java applications","authors":"Yu Sun, Wei Zhang","doi":"10.1145/1450095.1450115","DOIUrl":"https://doi.org/10.1145/1450095.1450115","url":null,"abstract":"Java applications rely on Just-In-Time (JIT) compilers or adaptive compilers to generate and optimize binary code at runtime to boost performance. In conventional Java Virtual Machines (JVM), however, the binary code is typically written into the data cache, and then is loaded into the instruction cache through the shared L2 cache or memory, which is not efficient in terms of both time and energy. In this paper, we study three hardware-based code caching strategies to write and read the dynamically generated code faster and more energy-efficiently. Our experimental results indicate that writing code directly into the instruction cache can improve the performance of a variety of Java applications by 9.6% on average, and up to 42.9%. Also, the overall energy dissipation of these Java programs can be reduced by 6% on average.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128321087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiency and scalability of barrier synchronization on NoC based many-core architectures","authors":"Oreste Villa, G. Palermo, C. Silvano","doi":"10.1145/1450095.1450110","DOIUrl":"https://doi.org/10.1145/1450095.1450110","url":null,"abstract":"Interconnects based on Networks-on-Chip are an appealing solution to address future microprocessor designs where, very likely, hundreds of cores will be connected on a single chip. A fundamental role in highly parallelized applications running on many-core architectures will be played by barrier primitives used to synchronize the execution of parallel processes. This paper focuses on the analysis of the efficiency and scalability of different barrier implementations in many-core architectures based on NoCs. Several message passing barrier implementations based on four algorithms (all-to-all, master-slave, butterfly and tree) have been implemented and evaluated for a single-chip target architecture composed of a variable number of cores (from 4 to 128) and different network topologies (mesh, torus, ring, clustered-ring and fat-tree). Using a cycle-accurate simulator, we show the scalability of each barrier for every NoC topology, analyzing and comparing theoretical with real behaviors. We observed that some barrier algorithms, when implemented in hardware or software, show a different scaling behavior with respect to those theoretically expected. We evaluate the efficiency of each combination topology-barrier, demonstrating that, in many cases, simple network topologies can be more efficient than complex and highly connected topologies.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125163671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
José Baiocchi, B. Childers, J. Davidson, Jason Hiser
{"title":"Reducing pressure in bounded DBT code caches","authors":"José Baiocchi, B. Childers, J. Davidson, Jason Hiser","doi":"10.1145/1450095.1450114","DOIUrl":"https://doi.org/10.1145/1450095.1450114","url":null,"abstract":"Dynamic binary translators (DBT) have recently attracted much attention for embedded systems. The effective implementation of DBT in these systems is challenging due to tight constraints on memory and performance. A DBT uses a software-managed code cache to hold blocks of translated code. To minimize overhead, the code cache is usually large so blocks are translated once and never discarded. However, an embedded system may lack the resources for a large code cache. This constraint leads to significant slowdowns due to the retranslation of blocks prematurely discarded from a small code cache. This paper addresses the problem and shows how to impose a tight size bound on the code cache without performance loss. We show that about 70% of the code cache is consumed by instructions that the DBT introduces for its own purposes. Based on this observation, we propose novel techniques that reduce the amount of space required by DBT-injected code, leaving more room for actual application code and improving the miss ratio. We experimentally demonstrate that a bounded code cache can have performance on-par with an unbounded one.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132178682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power on demand for mobile computing devices","authors":"Bill Athas","doi":"10.1145/1450095.1450097","DOIUrl":"https://doi.org/10.1145/1450095.1450097","url":null,"abstract":"","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"689 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115471256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Control flow optimization in loops using interval analysis","authors":"M. Ghodrat, T. Givargis, A. Nicolau","doi":"10.1145/1450095.1450120","DOIUrl":"https://doi.org/10.1145/1450095.1450120","url":null,"abstract":"We present a novel loop transformation technique, particularly well suited for optimizing embedded compilers, where an increase in compilation time is acceptable in exchange for significant performance increase. The transformation technique optimizes loops containing nested conditional blocks. Specifically, the transformation takes advantage of the fact that the Boolean value of the conditional expression, determining the true/false paths, can be statically analyzed using a novel interval analysis technique that can evaluate conditional expressions in the general polynomial form. Results from interval analysis combined with loop dependency information is used to partition the iteration space of the nested loop. In such cases, the loop nest is decomposed such as to eliminate the conditional test, thus substantially reducing the execution time. Our technique completely eliminates the conditional from the loops (unlike previous techniques) thus further facilitating the application of other optimizations and improving the overall speedup. Applying the proposed transformation technique on loop kernels taken from Mediabench, SPEC-2000, mpeg4, qsdpcm and gimp, on average we measured a 175% (1.75X) improvement of execution time when running on a SPARC processor, a 336% (4.36X) improvement of execution time when running on an Intel Core Duo processor and a 198.9% (2.98X) improvement of execution time when running on a PowerPC G5 processor.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132970951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}