PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556864
Efstathios Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos, D. Soudris
{"title":"Effective Platform-Level Exploration for Heterogeneous Multicores Exploiting Simulation-Induced Slacks","authors":"Efstathios Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos, D. Soudris","doi":"10.1145/2556863.2556864","DOIUrl":"https://doi.org/10.1145/2556863.2556864","url":null,"abstract":"Heterogeneous Multi-Processor Systems-on-Chip (MPSoC) exhibit increased design complexity due to numerous architectural parameters and hardware/software partitioning schemes. Automated Design Space Exploration (DSE) becomes an essential design procedure to discover optimized solutions in a reasonable time. For high-quality DSE, accurate solution evaluation is a strong requirement. To this end, High-Level Synthesis (HLS) can be used for the characterization of the design solutions. In this paper, we propose (a) a platform design methodology that exploits simulation-induced slacks generated by avoiding simulation re-initializations and uses the gained time for HLS, and (b) a DSE tool-flow which takes into account multiple HW/SW partitioning schemes and intelligently schedules system evaluations. Experimental results show that the proposed methodology achieves 17% simulation improvements together with 77% higher accuracy, in comparison to a typical exploration approach.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130260233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556870
Ricardo Nobre, Pedro Pinto, Tiago Carvalho, João MP Cardoso, P. Diniz
{"title":"On Expressing Strategies for Directive-Driven Multicore Programing Models","authors":"Ricardo Nobre, Pedro Pinto, Tiago Carvalho, João MP Cardoso, P. Diniz","doi":"10.1145/2556863.2556870","DOIUrl":"https://doi.org/10.1145/2556863.2556870","url":null,"abstract":"A common migration path for applications to high-performance multicore architectures relies on code annotations with concurrent semantics. Some annotations, however, are very target architecture specific and thus highly non-portable. In this paper we describe a source-to-source code transformation system that allows programmers to specify transformations using an aspect-oriented domain specific language - LARA. LARA allows programmers to specify strategies to search large code transformation design spaces while preserving the original source code. As the experimental results reveal, this approach leads to a substantial reduction in code maintenance costs, and promotes the portability of both programmers and performance.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127807503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556868
G. Massari, Chiara Caffarri, P. Bellasi, W. Fornaciari
{"title":"Extending a Run-time Resource Management framework to support OpenCL and Heterogeneous Systems","authors":"G. Massari, Chiara Caffarri, P. Bellasi, W. Fornaciari","doi":"10.1145/2556863.2556868","DOIUrl":"https://doi.org/10.1145/2556863.2556868","url":null,"abstract":"From Mobile to High-Performance Computing (HPC) systems, performance and energy efficiency are becoming increasingly challenging requirements. In this regard, heterogeneous systems, made of a general-purpose processor and one or more hardware accelerators, are emerging as affordable solutions. However, the effective exploitation of such platforms requires specific programming languages, such as OpenCL, and suitable run-time software layers. This work illustrates the extension of a run-time resource management (RTRM) framework to support the execution of OpenCL applications on systems featuring a multi-core CPU and multiple GPUs. Early results show how this solution leads to benefits both for the applications, in terms of performance, and for the system, in terms of resource utilization, i.e. load balancing and thermal leveling over the computing devices.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130863966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556867
Thomas Sideropoulos, N. Pitsianis
{"title":"A cycle-accurate synthesizable MIPS simulator in Simulink","authors":"Thomas Sideropoulos, N. Pitsianis","doi":"10.1145/2556863.2556867","DOIUrl":"https://doi.org/10.1145/2556863.2556867","url":null,"abstract":"We introduce a novel methodology for creating a synthesizable, cycle-accurate simulator of the MIPS32 processor with concise, high-level programming expressions using Simulink and other MATLAB tools. The simulator, named SimuMIPS, is capable of running binaries generated by the GNU gcc compiler and associated binutils. It can be easily configured, modified and extended not only for academic instruction but also to be included in commercial SoC products. Synthesizable instantiations of SimuMIPS in Verilog and VHDL may be generated by Simulink HDL Coder for FPGA programming and system-on-chip prototyping. In addition, the SimuMIPS simulator can run on embedded processors, rapid prototyping boards, and off-the-shelf microprocessors via the Embedded Coder generated C and C++ implementations.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"187 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114089402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556866
Simone Libutti, G. Massari, P. Bellasi, W. Fornaciari
{"title":"Exploiting Performance Counters for Energy Efficient Co-Scheduling of Mixed Workloads on Multi-Core Platforms","authors":"Simone Libutti, G. Massari, P. Bellasi, W. Fornaciari","doi":"10.1145/2556863.2556866","DOIUrl":"https://doi.org/10.1145/2556863.2556866","url":null,"abstract":"Mainstream multicore architectures allow the execution of mixed workloads where multiple parallel applications run concurrently, competing for shared computational resources. As different applications exhibit different and time-varying resource needs, a suitable allocation policy is required to properly select and map resources at run-time to demanding applications. We demonstrate how a user-space run-time resource manager could be extended to easily take advantage of performance counters in order to optimize both workload execution time and energy consumption. Our approach, initially evaluated on a quad-core Intel machine considering a representative set of mixed workloads from a standard benchmark suite, attains a 49.9% mean energy-delay-product (EDP) speed-up over the standard Linux case, and a 13.4% EDP speed-up over our previous work.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116193808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556869
D. Göhringer, Jan Tepelmann
{"title":"An Interactive Tool based on Polly for Detection and Parallelization of Loops","authors":"D. Göhringer, Jan Tepelmann","doi":"10.1145/2556863.2556869","DOIUrl":"https://doi.org/10.1145/2556863.2556869","url":null,"abstract":"In many applications, such as signal and image processing, most computation time is spent within loops. Therefore, these loops are ideal candidates for performance increase when moving to parallel architectures, such as multi- or many-core systems. However, manual parallelization of existing applications is a complex and cumbersome task. To ease this, we introduce in this paper an interactive tool based on Polly, LLVM and the Linux perf tools. With the help of our tool, compute-intensive loops can be found and parallelized. Polly is a polyhedral optimizer for LLVM. In the polyhedral model, loops are described in an abstract mathematical way and loop optimizations are mathematical transformations on this abstract description. Loops must meet specific requirements to be representable in the polyhedral model. If even one requirement is not satisfied, the loop cannot be optimized with Polly. Our tool can help here by showing the user all the problems which prevent an automatic optimization with Polly. Such an optimization is only worthwhile for compute-intensive loops. To find such loops, our tool uses the Linux perf tools for performance profiling. Evaluation results for the following two applications are presented: Tiff2rgba and 2D Cross-Correlation image processing algorithm.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124071208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PARMA-DITAM '14 Pub Date: 2014-01-20 DOI: 10.1145/2556863.2556865
J. Harbin, L. Indrusiak
{"title":"Fine-Grained Link Locking Within Power and Latency Transaction Level Modelling in Wormhole Switching Non-Preemptive Networks On Chip","authors":"J. Harbin, L. Indrusiak","doi":"10.1145/2556863.2556865","DOIUrl":"https://doi.org/10.1145/2556863.2556865","url":null,"abstract":"An increasingly time-consuming part of the design flow of on-chip multiprocessors is simulation of the network-on-chip (NoC) architecture. Cycle-accurate simulation of state-of-the-art network-on-chip interconnects can be prohibitively slow for realistic application examples. In this paper, we identify a time-predictable non-preemptive network-on-chip architecture and propose a TLM model with fine-grained locking of links. The model is tested via simulation of two benchmark application scenarios. Results demonstrate that the proposed algorithm can model the latency of the majority of flows very closely to the cycle-accurate model, while providing more than 97% accurate power consumption modelling even on the worst-case links. This is achieved while simulating nearly three orders of magnitude faster compared to a cycle-accurate model of the same interconnect.","PeriodicalId":210814,"journal":{"name":"PARMA-DITAM '14","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125416665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}