Y. Wen, Ziyi Liu, W. Shi, Yifei Jiang, A. Cheng, Khoa N. Le
{"title":"Energy efficient hybrid display and predictive models for embedded and mobile systems","authors":"Y. Wen, Ziyi Liu, W. Shi, Yifei Jiang, A. Cheng, Khoa N. Le","doi":"10.1145/2380403.2380429","DOIUrl":"https://doi.org/10.1145/2380403.2380429","url":null,"abstract":"Electrophoretic displays (EPDs) and organic light emitting diode (OLEDs) are two key technologies used in mobile de-vices. In this paper, we propose the design of an integrated hybrid display combining a transparent OLED (TOLED) and a low power EPD, which is adaptive to show contents of a frame partially on either the TOLED or the EPD. A windows-based predictive model and a calibration algorithm on TOLED are introduced to decide how frame contents can be split between the two displays for achieving the best tradeoff between power reduction and user experiences. A simulation environment that can estimate both the energy consumption and optical properties of the proposed hybrid display is set up based on actual physical measurements. Simulation results show that the predictive model can make right decisions on choosing proper displays in over 90% of the test cases, and this new display design can save over 70% power under many mobile application contexts and still sup-port contents that require fast update rates.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133755710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When less is more (LIMO):controlled parallelism forimproved efficiency","authors":"Gaurav Chadha, S. Mahlke, S. Narayanasamy","doi":"10.1145/2380403.2380431","DOIUrl":"https://doi.org/10.1145/2380403.2380431","url":null,"abstract":"While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive contention for shared resources, wasting energy, which is of primary concern for embedded devices. Furthermore, thermal and power constraints prevent us from operating all the processor cores at the highest possible frequency, favoring fewer threads. The best number of threads to run depends on the application, user input and hardware resources available. It can also change at runtime making it infeasible for the programmer to determine this number.\u0000 To address this problem, we propose LIMO, a runtime system that dynamically manages the number of running threads of an application for maximizing peformance and energy-efficiency. LIMO monitors threads' progress along with the usage of shared hardware resources to determine the best number of threads to run and the voltage and frequency level. With dynamic adaptation, LIMO provides an average of 21% performance improvement and a 2x improvement in energy-efficiency on a 32-core system over the default configuration of 32 threads for a set of concurrent applications from the PARSEC suite, the Apache web server, and the Sphinx speech recognition system.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132574865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lars Schor, Iuliana Bacivarov, Devendra Rai, Hoeseok Yang, Shin-Haeng Kang, L. Thiele
{"title":"Scenario-based design flow for mapping streaming applications onto on-chip many-core systems","authors":"Lars Schor, Iuliana Bacivarov, Devendra Rai, Hoeseok Yang, Shin-Haeng Kang, L. Thiele","doi":"10.1145/2380403.2380422","DOIUrl":"https://doi.org/10.1145/2380403.2380422","url":null,"abstract":"The next generation of embedded software has high performance requirements and is increasingly dynamic. Multiple applications are typically sharing the system, running in parallel in different combinations, starting and stopping their individual execution at different moments in time. The different combinations of applications are forming system execution scenarios. In this paper, we present the distributed application layer, a scenario-based design flow for mapping a set of applications onto heterogeneous on-chip many-core systems. Applications are specified as Kahn process networks and the execution scenarios are combined into a finite state machine. Transitions between scenarios are triggered by behavioral events generated by either running applications or the run-time system. A set of optimal mappings are precalculated during design-time analysis. Later, at run-time, hierarchically organized controllers monitor behavioral events, and apply the precalculated mappings when starting new applications. To handle architectural failures, spare cores are allocated at design-time. At run-time, the controllers have the ability to move all processes assigned to a faulty physical core to a spare core. Finally, we apply the proposed design flow to design and optimize a picture-in-picture software.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132839779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A low-overhead interconnect architecture for virtual reconfigurable fabrics","authors":"Aaron Landy, G. Stitt","doi":"10.1145/2380403.2380427","DOIUrl":"https://doi.org/10.1145/2380403.2380427","url":null,"abstract":"Field-programmable gate arrays (FPGAs) have been widely shown to have significant performance and power advantages compared to microprocessors and graphics-processing units (GPUs), but remain a niche technology due in part to productivity challenges. Although such challenges have numerous causes, previous work has shown two significant contributing factors: 1) prohibitive place-and-route times preventing mainstream design methodologies, and 2) limited application portability preventing design reuse. Virtual reconfigurable architectures, referred to as intermediate fabrics (IFs), were recently introduced as a potential solution to these problems, providing 100x-1000x place-and-route speedup, while also enabling application portability across potentially any physical FPGA. However, one significant limitation of existing intermediate fabrics is area overhead incurred from virtualized interconnect resources. In this paper, we perform design-space exploration of virtual interconnect architectures and introduce an optimized virtual interconnect that reduces area overhead by 48% to 54% compared to previous work, while also improving clock frequencies by 24% with a modest routability overhead of 16%.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114849362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anthony Gutierrez, Joseph Pusdesris, R. Dreslinski, T. Mudge
{"title":"Lazy cache invalidation for self-modifying codes","authors":"Anthony Gutierrez, Joseph Pusdesris, R. Dreslinski, T. Mudge","doi":"10.1145/2380403.2380433","DOIUrl":"https://doi.org/10.1145/2380403.2380433","url":null,"abstract":"Just-in-time compilation with dynamic code optimization is often used to help improve the performance of applications that utilize high-level languages and virtual run-time environments, such as those found in smartphones. Just-in-time compilation introduces additional overhead into the instruction fetch stage of a processor that is particularly problematic for user applications-instruction cache invalidation due to the use of self-modifying code. This software-assisted cache coherence serializes cache line invalidations, or causes a costly invalidation of the entire instruction cache, and prevents useful instructions from being fetched for the period during which the stale instructions are being invalidated. This overhead is not acceptable for user applications, which are expected to respond quickly.\u0000 In this work we introduce a new technique that can, using a single instruction, invalidate cache lines in page-sized chunks as opposed to invalidating only a single line at a time. Lazy cache invalidation reduces the amount of time spent stalling due to instruction cache invalidation by removing stale instructions on demand as they are accessed, as opposed to all at once. The key observation behind lazy cache invalidation is that stale instructions do not necessarily need to be removed from the instruction cache; as long as it is guaranteed that attempts to fetch stale instructions will not hit in the instruction cache, the program will behave as the developer had intended.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114435706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static secure page allocation for light-weight dynamic information flow tracking","authors":"Juan Carlos Martínez Santos, Yunsi Fei, Z. Shi","doi":"10.1145/2380403.2380415","DOIUrl":"https://doi.org/10.1145/2380403.2380415","url":null,"abstract":"Dynamic information flow tracking (DIFT) is an effective security countermeasure for both low-level memory corruptions and high-level semantic attacks. However, many software approaches suffer large performance degradation, and hardware approaches have high logic and storage overhead. We propose a flexible and light-weight hardware/software co-design approach to perform DIFT based on secure page allocation. Instead of associating every data with a taint tag, we aggregate data according to their taints, i.e., putting data with different attributes in separate memory pages. Our approach is a compiler-aided process with architecture support. The implementation and analysis show that the memory overhead is little, and our approach can protect critical information, including return address, indirect jump address, and system call IDs, from being overwritten by malicious users.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"1779 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129568476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From sequential programming to flexible parallel execution","authors":"A. Raman, Jae W. Lee, David I. August","doi":"10.1145/2380403.2380417","DOIUrl":"https://doi.org/10.1145/2380403.2380417","url":null,"abstract":"The embedded computing landscape is being transformed by three trends: growing demand for greater functionality and enriched user experience, increasing diversity and parallelism in the processing substrate, and an accelerating push for ever-greater energy efficiency. For programmers, these trends give rise to three challenges: writing code for a potentially heterogeneous architecture, extracting parallelism in software, and maximizing a multivariate (performance, power, energy, etc.) fitness function of user satisfaction which may vary with time. To meet these challenges, clarion calls have been issued for programmers to start writing software in new parallel programming models. Fundamentally, however, these proposals detract programmers from delivering new features and enriched user experience in the shortest time possible. This paper proposes to attract embedded systems programmers to a vertically integrated approach, comprising extensions to the sequential programming model, a parallelizing compiler, and an optimizing run-time system, to enable them to tackle all three challenges.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121358528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Devendra Rai, Hoeseok Yang, Iuliana Bacivarov, L. Thiele
{"title":"Power agnostic technique for efficient temperature estimation of multicore embedded systems","authors":"Devendra Rai, Hoeseok Yang, Iuliana Bacivarov, L. Thiele","doi":"10.1145/2380403.2380421","DOIUrl":"https://doi.org/10.1145/2380403.2380421","url":null,"abstract":"Temperature plays an increasingly important role in the overall performance and reliability of a computing system. Multi- and many-core systems provide an opportunity to manage the overall temperature profile by cleverly designing the application-to-core mapping and the associated scheduling policies. An uncontrolled temperature profile may lead to an unplanned performance loss, since the system activates protective mechanisms such as voltage and/or frequency scaling to cool itself. Similarly, deep thermal cycles with high frequency lead to severe deterioration in the overall reliability of the system. Design space exploration tools are often used to optimize binding and scheduling choices based on a given set of constraints and objectives, thus motivating the need for fast and accurate temperature estimation techniques. We argue that the currently available techniques are not an ideal fit to design space exploration tools, and suggest a system level technique which is based on application fingerprinting. It does not need any information about the processor floorplan, the physical and thermal structure, or about power consumption. Instead, its temperature estimation is based on a set of application-specific calibration runs and associated temperature measurements using available built-in sensors. We show that a given application possesses a unique thermal signature on the system it executes on, which provides a computationally fast method to calculate accurate temperature traces. Extensive experimental studies show that our technique can estimate temperature on all cores of a system to within $5^{o}C$, and is three orders of magnitude faster than state of the art numerical simulators like emph{Hotspot.}","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121888094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLBT: an LLVM-based static binary translator","authors":"Bor-Yeh Shen, Jiunn-Yeu Chen, W. Hsu, Wuu Yang","doi":"10.1145/2380403.2380419","DOIUrl":"https://doi.org/10.1145/2380403.2380419","url":null,"abstract":"Lack of applications has always been a serious concern for designing machines with a new but incompatible ISA. To address this concern, binary translation is one common technique to migrate applications from one legacy ISA to new ones. In the past, dynamic binary translation (DBT) has been more widely adopted for migrating applications since it avoids some challenging problems for binary translation such as code discovery for variable length ISA and code location issues for handling indirect branches. Static binary translation (SBT) is usually regarded as a less general solution and has not been actively researched on. However, SBT has advantages of performing more aggressive optimizations, which could yield more compact code and greater code quality. In general, SBT translated applications are likely to consume less memory, processor cycles and power, and can be started more quickly. All the above advantages are more critical for embedded systems than for general systems. Therefore, we believe that even though SBT is not as general as DBT, it has a unique role to play for migrating applications in embedded systems.\u0000 In this paper, we designed and implemented a new portable SBT tool, called LLBT, which translates source binary into LLVM IR and then retargets the LLVM IR to various ISAs by using the LLVM compiler infrastructure. Using the LLVM compiler infrastructure, LLBT successfully leverages two important functionalities from LLVM: the comprehensive optimizations and the retargetability. For example, most DBTs map guest architecture states into the host registers to minimize accessing guest architecture states with memory operations, but must deal with guest architecture state saving/reloading at trace/block entry/exit points. LLBT can treat the complete application binary as a single function and uses the global register allocation optimization in LLVM to consistently map guest architecture states in host registers so as to avoid the costly state saving and reloading at trace/block exits.\u0000 In this paper, we have shown our ARM-based LLBT can effectively migrate EEMBC benchmark Suite from ARMv5 to Intel IA32, Intel x64, MIPS, and other ARMs such as ARMv7. On the Intel i7 based host systems, the LLBT generated code can run 3 to 64 times faster than emulating with QEMU, which uses the DBT technique.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126497451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
W. H. Minhass, P. Pop, J. Madsen, Felician Stefan Blaga
{"title":"Architectural synthesis of flow-based microfluidic large-scale integration biochips","authors":"W. H. Minhass, P. Pop, J. Madsen, Felician Stefan Blaga","doi":"10.1145/2380403.2380437","DOIUrl":"https://doi.org/10.1145/2380403.2380437","url":null,"abstract":"Microfluidic biochips are replacing the conventional biochemical analyzers and are able to integrate the necessary functions for biochemical analysis on-chip. In this paper we are interested in flow-based biochips, in which the flow of liquid is manipulated using integrated microvalves. By combining several microvalves, more complex units, such as micropumps, switches, mixers, and multiplexers, can be built. The manufacturing technology, soft lithography, used for the flow-based biochips is advancing faster than Moore's law, resulting in increased architectural complexity. However, the designers are still using full-custom and bottom-up, manual techniques in order to design and implement these chips. As the chips become larger and the applications become more complex, the manual methodologies will not scale, becoming highly inadequate. Therefore, for the first time to our knowledge,we propose a top-down architectural synthesis methodology for the flow-based biochips. Starting from a given biochemical application and a microfluidic component library, we are interested in synthesizing a biochip architecture, i.e., performing component allocation from the library based on the biochemical application, generating the biochip schematic (netlist) and then performing physical synthesis (deciding the placement of the microfluidic components on the chip and performing routing of the microfluidic channels), such that the application completion time is minimized. We evaluate our proposed approach by synthesizing architectures for real-life applications as well as synthetic benchmarks.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"440 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122930052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}