PARMA-DITAM '16, Pub Date: 2016-01-18, DOI: 10.1145/2872421.2872423
"Deploying and monitoring Hadoop MapReduce analytics on single-chip cloud computer"
A. Georgiadis, S. Xydis, D. Soudris
Abstract: Modern data analytics applications exhibit scale-out characteristics, requiring large amounts of computational power. Recent research has shown that modern manycore architectures form a promising platform for this emerging type of workload. In this paper, we present a framework for the deployment, monitoring and automated exploration of Hadoop MapReduce clusters implementing data analytics applications on the Intel SCC manycore platform. We provide an in-depth analysis of the performance and energy characteristics of Hadoop MapReduce workloads on the Intel SCC, i.e. on a real-silicon manycore that differs substantially from typical server and accelerator architectures. Through extensive experimentation, we show that there is a trade-off between the number of worker nodes and the per-node available I/O bandwidth, and that intelligently scaling the frequency of data nodes yields energy savings with minimal impact on performance.
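The closing claim, that slowing down I/O-bound data nodes saves energy at little performance cost, can be illustrated with a toy model. This sketch is not from the paper: the cubic power law, the 20% CPU-bound fraction, and both function names are assumptions for illustration only.

```python
def runtime_at(freq_ghz, base_runtime_s, cpu_fraction, base_freq_ghz=1.0):
    """Only the CPU-bound fraction of the runtime stretches as frequency drops;
    the I/O-bound portion is unaffected by the core clock."""
    cpu_time = base_runtime_s * cpu_fraction * (base_freq_ghz / freq_ghz)
    io_time = base_runtime_s * (1.0 - cpu_fraction)
    return cpu_time + io_time

def energy_at(freq_ghz, base_runtime_s, cpu_fraction, base_freq_ghz=1.0):
    """Toy model: assume dynamic power dominates and scales roughly as f^3."""
    power = (freq_ghz / base_freq_ghz) ** 3
    return power * runtime_at(freq_ghz, base_runtime_s, cpu_fraction)

# A data node that is only 20% CPU-bound, run at half frequency:
slowdown = runtime_at(0.5, 100.0, 0.2) / 100.0                      # 1.2x runtime
savings = 1.0 - energy_at(0.5, 100.0, 0.2) / energy_at(1.0, 100.0, 0.2)
```

Under these assumptions, a 2x frequency reduction costs only 20% extra runtime while cutting energy by 85%, which is the shape of the trade-off the abstract reports, though the paper's measured numbers come from real SCC silicon, not a model.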
PARMA-DITAM '16, Pub Date: 2016-01-18, DOI: 10.1145/2872421.2872426
"Flexible resource allocation and management for application graphs on ReNÉ MPSoC"
K. Madhu, Anuj Rao, Saptarsi Das, Krishna C. Madhava, S. Nandy, R. Narayan
Abstract: Performance of an application on a many-core machine primarily hinges on the ability of the architecture to exploit parallelism and to provide fast memory accesses. Exploiting parallelism in static application graphs on a multicore target is relatively easy, since compilers can map them onto an optimal set of processing elements and memory modules. Dynamic application graphs have computations and data dependencies that manifest at runtime and hence may not be schedulable statically. Load balancing of such graphs requires runtime support (such as support for work-stealing) but incurs overheads due to data and code movement. In this work, we use the ReNÉ MPSoC as an alternative to traditional many-core processing platforms to target application kernel graphs. ReNÉ is designed to be used as an accelerator to a host; it offers the ability to exploit massive parallelism at multiple granularities and supports work-stealing for dynamic load balancing. Further, it offers handles to enable and disable work-stealing selectively. ReNÉ employs an explicitly managed global memory with minimal hardware support for the address translation required for relocating application kernels. We present a resource management methodology for the ReNÉ MPSoC that encompasses a lightweight resource management hardware module and a compilation flow. Our methodology aims at identifying resource requirements at compile time and creating resource boundaries (per application kernel) to guarantee performance and maximize resource utilization. The approach offers flexibility in resource allocation similar to that of a dynamic scheduling runtime, but guarantees performance since locality of reference of data and code can be ensured.
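The selective work-stealing the abstract describes can be sketched in miniature. This is hypothetical code, not ReNÉ's hardware/runtime interface: the `Worker` class, the `schedule` loop, and the pick-the-longest-deque victim policy are illustrative stand-ins.

```python
from collections import deque

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

def schedule(workers, stealing_enabled=True):
    """Run all queued tasks to completion. A worker executes its own work
    from the front of its deque; when idle, and only if stealing is enabled
    (cf. ReNÉ's handles to enable/disable stealing selectively), it steals
    one task from the back of the fullest victim's deque."""
    done = []
    while any(w.tasks for w in workers):
        for w in workers:
            if w.tasks:
                done.append(w.tasks.popleft())        # own work: FIFO front
            elif stealing_enabled:
                victims = [v for v in workers if v.tasks]
                if victims:
                    victim = max(victims, key=lambda v: len(v.tasks))
                    w.tasks.append(victim.tasks.pop())  # steal from the back
    return done
```

Stealing from the back of the victim's deque is the classic Cilk-style choice: the thief takes the coarsest-grained, least cache-warm task, which limits exactly the data- and code-movement overhead the abstract warns about.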
PARMA-DITAM '16, Pub Date: 2016-01-18, DOI: 10.1145/2872421.2872424
"Predictive modeling methodology for compiler phase-ordering"
Amir H. Ashouri, Andrea Bignoli, G. Palermo, C. Silvano
Abstract: Today's compilers offer a huge number of transformation options to choose from, and this choice can significantly impact the performance of the code being optimized. Not only is the selection of compiler options a hard problem to solve; the ordering of the phases adds further complexity, making phase-ordering a long-standing problem in compilation research. This paper presents an innovative approach for tackling the compiler phase-ordering problem using predictive modeling. The proposed methodology enables i) efficient exploration of the compiler optimization space, including permutations and repetitions of optimizations, and ii) extraction of dynamic application features to predict the next-best optimization to apply to maximize performance given the current status. We assess the proposed methodology using two different search heuristics on the compiler optimization space, and the experimental results demonstrate its effectiveness on the selected set of applications. Using the proposed methodology, we observed an average execution speedup of up to 4% with respect to the LLVM standard baseline.
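The predict-the-next-best-optimization loop can be outlined as follows. This is a hedged sketch, not the paper's method: the pass names, the `extract_features` callback, and the `predict_gain` model are stand-ins for the paper's dynamic application features and trained predictor.

```python
PASSES = ["inline", "licm", "gvn", "loop-unroll"]   # illustrative subset

def choose_sequence(extract_features, predict_gain, max_len=8):
    """Greedily build a pass order: at each step, apply the pass that the
    model predicts gives the largest speedup in the current program state.
    The same pass may be chosen repeatedly, covering the permutations and
    repetitions the abstract mentions."""
    sequence = []
    state = extract_features(sequence=[])
    for _ in range(max_len):
        gains = {p: predict_gain(state, p) for p in PASSES}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:            # model predicts no further improvement
            break
        sequence.append(best)
        state = extract_features(sequence)  # features change after each pass
    return sequence
```

The point of the greedy formulation is that the model only ever has to answer a local question ("which single pass helps most from here?"), which is far cheaper than searching the full factorial space of orderings.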
PARMA-DITAM '16, Pub Date: 2016-01-18, DOI: 10.1145/2872421.2893173
"Runtime resource management for embedded and HPC systems"
W. Fornaciari, G. Pozzi, F. Reghenzani, Andrea Marchese, Mauro Belluschi
Abstract: Resource management is a well-known problem in almost every computing system, ranging from embedded to High Performance Computing (HPC), and is useful for optimizing multiple orthogonal system metrics such as power consumption, performance and reliability. To achieve such an optimization, a resource manager must suitably allocate the available system resources -- e.g. processing elements, memories and interconnect -- to the running applications. This process raises two main problems: a) system resources are usually shared between multiple applications, which induces resource contention; and b) each application requires a different Quality of Service, making it harder for the resource manager to work in an application-agnostic mode. In this scenario, resource management represents a critical and essential component of a computing system and should act at different levels to optimize the whole system while keeping it flexible and versatile. In this paper we describe a multi-layer resource management strategy that operates at the application, operating system and hardware levels and aims to optimize resource allocation on embedded, desktop multi-core and HPC systems.
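As a toy illustration of QoS-aware (rather than application-agnostic) allocation, one could weight core assignments by a declared QoS level. This proportional policy and the `allocate_cores` helper are hypothetical, not the paper's multi-layer strategy.

```python
def allocate_cores(total_cores, qos_weights):
    """qos_weights: {app: weight}. Returns {app: cores} with every app
    getting at least one core and any rounding remainder going to the
    highest-weight applications first."""
    total_w = sum(qos_weights.values())
    alloc = {a: max(1, int(total_cores * w / total_w))
             for a, w in qos_weights.items()}
    leftover = total_cores - sum(alloc.values())
    for a in sorted(qos_weights, key=qos_weights.get, reverse=True):
        if leftover <= 0:
            break
        alloc[a] += 1
        leftover -= 1
    return alloc
```

Even this trivial policy shows why per-application QoS matters: two applications sharing eight cores get very different allocations once their requirements differ, something a purely application-agnostic manager cannot express.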
PARMA-DITAM '16, Pub Date: 2016-01-18, DOI: 10.1145/2872421.2872422
"Low communication overhead dynamic mapping of multiple HEVC video stream decoding on NoCs"
H. R. Mendis, L. Indrusiak
Abstract: The High Efficiency Video Coding (HEVC) standard offers several parallelisation tools, such as wavefront parallel processing (WPP) and tiles (independent frame regions), to better manage computationally expensive workloads on modern multicore/many-core platforms. However, poor allocation of tile-level HEVC decoding tasks to processing elements may result in increased latency and energy consumption due to the data-communication overhead between dependent tiles. In this work, we first discuss the difficulties in decoding multiple HEVC bitstreams with highly varying resolutions and data-dependency characteristics, as seen in HEVC-coded video streams with random-access, adaptive group-of-pictures (GoP) structures. We then introduce a runtime tile allocation scheme that helps reduce energy usage during HEVC decoding. Evaluations against a bin-packing algorithm show that the proposed workload mapping technique maintains acceptable latency whilst reducing communication overhead (8-10%) and increasing the mean processor idle periods (~30%) to support dynamic power management.
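A minimal sketch of dependency-aware tile mapping on a NoC mesh, to make the "poor allocation increases communication overhead" point concrete. This greedy heuristic and the `map_tiles` name are illustrative assumptions, not the paper's runtime scheme.

```python
def manhattan(a, b):
    """Hop distance between two mesh coordinates on an XY-routed NoC."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def map_tiles(tiles, deps, mesh_w, mesh_h):
    """tiles: tile ids in decode order; deps: {tile: parent_tile or None}.
    Greedily place each tile on the free processing element closest to the
    tile it depends on, so inter-tile traffic crosses few NoC hops."""
    free = [(x, y) for y in range(mesh_h) for x in range(mesh_w)]
    placement = {}
    for t in tiles:
        parent = deps.get(t)
        if parent is None or parent not in placement:
            pe = free[0]                                   # no dependency: any free PE
        else:
            pe = min(free, key=lambda p: manhattan(p, placement[parent]))
        placement[t] = pe
        free.remove(pe)
    return placement
```

A pure bin-packing baseline, by contrast, fills processing elements by capacity alone, so two dependent tiles can land at opposite corners of the mesh; hop count (and hence link energy and latency) is exactly what it leaves on the table.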
PARMA-DITAM '16, Pub Date: 2016-01-18, DOI: 10.1145/2872421.2872425
"Stack size estimation on machine-independent intermediate code for OpenCL kernels"
Stefano Cherubin, M. Scandale, G. Agosta
Abstract: Stack size is an important factor in the mapping decision when dealing with embedded heterogeneous architectures, where fast memory is a scarce resource. Trying to map a kernel onto a device with insufficient memory may lead to reduced performance or even failure to run the kernel. OpenCL kernels are often compiled just-in-time, starting from the source code or from an intermediate machine-independent representation. Precise stack size information, however, is only available in machine-dependent code. We provide a method for computing the stack size with sufficient accuracy on machine-independent code, given knowledge of the target ABI and register file architecture. This method can be applied to make mapping decisions early, thus avoiding compiling the code multiple times for each possible accelerator in a complex embedded heterogeneous system.
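The kind of estimate such a method produces can be sketched under simple assumptions: explicit stack allocations plus spill slots for the virtual registers that exceed the physical register file, rounded up to the ABI stack alignment. This is a hedged approximation for illustration, not the paper's exact analysis; all names here are hypothetical.

```python
def frame_size(alloca_bytes, live_vregs, phys_regs, word=4, align=16):
    """Estimate one function's stack frame from machine-independent IR:
    its explicit allocations plus one word-sized spill slot per virtual
    register beyond the target's physical registers, rounded up to the
    ABI-mandated stack alignment."""
    spills = max(0, live_vregs - phys_regs) * word
    raw = sum(alloca_bytes) + spills
    return -(-raw // align) * align          # ceiling to the ABI alignment

def kernel_stack(frames, calls, entry):
    """Worst-case stack of a kernel: its own frame plus the deepest stack
    of any callee (frames: {fn: bytes}, calls: {fn: [callees]})."""
    return frames[entry] + max((kernel_stack(frames, calls, c)
                                for c in calls.get(entry, [])),
                               default=0)
```

With the target ABI's alignment and register file size as the only machine-dependent inputs, an estimate like this can be computed on the portable IR, which is what lets the mapping decision happen before any device-specific code generation.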