Michael Opoku Agyeman, Quoc-Tuan Vien, G. Hill, S. J Turner, T. Mak
{"title":"An Efficient Channel Model for Evaluating Wireless NoC Architectures","authors":"Michael Opoku Agyeman, Quoc-Tuan Vien, G. Hill, S. J Turner, T. Mak","doi":"10.1109/SBAC-PADW.2016.23","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.23","url":null,"abstract":"Wireless Networks-on-Chip (WiNoCs) have emerged to solve the scalability and performance bottleneck of conventional wired NoC architectures. However, unlike communication in the macro-world, on-chip communication poses several constraints; hence there is a need for simulation and design tools that consider the effect of the wireless channel at the nanotechnology level. In this paper, we present a parameterizable channel model for WiNoCs which takes into account practical issues and constraints of the propagation medium, such as transmission frequency, operating temperature, ambient pressure and distance between the on-chip antennas. The proposed channel model demonstrates that the total path loss of the wireless channel in WiNoCs suffers from not only dielectric propagation loss (DPL) but also molecular absorption attenuation (MAA), which reduces the reliability of the system.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114599255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Borin, C. Benedicto, I. Rodrigues, F. Pisani, M. Tygel, M. Breternitz
{"title":"PY-PITS: A Scalable Python Runtime System for the Computation of Partially Idempotent Tasks","authors":"E. Borin, C. Benedicto, I. Rodrigues, F. Pisani, M. Tygel, M. Breternitz","doi":"10.1109/SBAC-PADW.2016.10","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.10","url":null,"abstract":"The popularization of multi-core architectures and cloud services has allowed users access to high performance computing infrastructures. However, programming for these systems might be cumbersome due to challenges involving system failures, load balancing, and task scheduling. Aiming at solving these problems, we previously introduced SPITS, a programming model and reference architecture for executing bag-of-task applications. In this work, we discuss how this programming model allowed us to design and implement PY-PITS, a simple and effective open source runtime system that is scalable, tolerates faults and allows dynamic provisioning of resources during computation of tasks. We also discuss how PY-PITS can be used to improve utilization of multi-user computational clusters equipped with queues to submit jobs and propose a performance model to aid users to understand when the performance of PY-PITS scales with the number of Workers.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130564370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparative Study of SYCL, OpenCL, and OpenMP","authors":"H. C. D. Silva, F. Pisani, E. Borin","doi":"10.1109/SBAC-PADW.2016.19","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.19","url":null,"abstract":"Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices, including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however, creating efficient code for them may require that programmers manage memory assignments and use specialized APIs, compilers, or runtime systems, thus making their programs dependent on specific tools. In this scenario, SYCL is an emerging C++ programming model for OpenCL that allows developers to write code for heterogeneous computing devices that are compatible with standard C++ compilation frameworks. In this paper, we analyze the performance and programming characteristics of SYCL, OpenMP, and OpenCL using both a benchmark and a real-world application. Our performance results indicate that programs that rely on available SYCL runtimes are not yet on par with the ones based on OpenMP and OpenCL. Nonetheless, the gap is getting smaller if we consider the results reported by previous studies. In terms of programmability, SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement kernels and also fewer calls to essential API functions and methods.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132634380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emmanoel M. De Sousa Junior, I. Sardiña, Frederico Lopes
{"title":"Parallelism and Scalability: A Solution Focused on the Cloud Computing Processing Service Billing","authors":"Emmanoel M. De Sousa Junior, I. Sardiña, Frederico Lopes","doi":"10.1109/SBAC-PADW.2016.14","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.14","url":null,"abstract":"Application scheduling is an important requirement in the cloud computing context. It allows defining the resources required to execute application tasks following predefined criteria, for instance, maximum execution time, number of virtual machines, and volume of data, among others. The process of selecting the most appropriate execution structure is driven by scheduling algorithms. This paper proposes a scheduling mechanism for data processing in cloud computing environments. This mechanism analyzes some specific variables in the business context of a software house specialized in software for lawyers and law offices. The main goal of this mechanism is to fulfill the company's seasonal demand using IaaS services and considering two policies: (i) the maximum execution time allowed by the application may not be exceeded and (ii) the data have to be processed at the lowest possible monetary cost. The proposed solution generates strategies to select the best set of virtual machines to process the current batch of data, considering the amount of data, the estimated execution time for each specific strategy and the monetary cost of the virtual machine sets. In the context of this work, the strategy concept means the schedule of a set of virtual machines to process a specific amount of data, load balancing decisions and the parallelism of the application's execution flow. The proposed solution has had a great impact on that company, since it allowed a steep increase in the number of clients served.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122128907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Rios, I. M. Coelho, L. Ochi, Cristina Boeres, R. Farias
{"title":"A Benchmark on Multi Improvement Neighborhood Search Strategies in CPU/GPU Systems","authors":"E. Rios, I. M. Coelho, L. Ochi, Cristina Boeres, R. Farias","doi":"10.1109/SBAC-PADW.2016.17","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.17","url":null,"abstract":"In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component for local search based heuristics. It consists of selecting a solution from a high cardinality set of neighbor solutions, by means of operations called moves. To perform this search, NS algorithms usually adopt one of two main approaches: selecting the first or the best improving move. The Multi Improvement (MI) strategy is a recently proposed method that consists of simultaneously exploring multiple move operations during the NS phase, aiming to reach good quality solutions with shorter computational steps. This paper presents a benchmark for MI strategies in hybrid CPU/GPU systems. This technique efficiently explores the CPU processing power together with the massive parallelism achieved by modern GPUs, emerging as an efficient alternative for classic CPU neighborhood search strategies. The advantage of this approach depends heavily on finding the best tradeoff between CPU and GPU processing, as well as minimizing the memory transfers involved in the process. In the experiments, several MI configurations were tested in a hybrid CPU/GPU environment, presenting better results than classical neighborhood search strategies for the Minimum Latency Problem, a hard combinatorial optimization problem.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120842517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. P. Nascimento, C. Vasconcelos, F. S. Jamel, A. Sena
{"title":"A Hybrid Parallel Algorithm for the Auction Algorithm in Multicore Systems","authors":"A. P. Nascimento, C. Vasconcelos, F. S. Jamel, A. Sena","doi":"10.1109/SBAC-PADW.2016.21","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.21","url":null,"abstract":"The bipartite graph matching problem is based on finding a point that maximizes the chances of similarity with another one, and it is explored in several areas such as Bioinformatics and Computer Vision. To solve that matching problem, the auction algorithm has been widely used, and its parallel implementation is employed to find matching solutions in a reasonable computational time. For example, image analysis may require a large amount of processing, as dense images can have thousands of points to be considered. Furthermore, to exploit the benefits of multicore architectures, a hybrid implementation can be used to deal with communication in both distributed and shared memory. The main goal of this paper is to implement and evaluate the performance of a hybrid parallel auction algorithm for multicore clusters. The experiments carried out analyze the problem size, the number of iterations needed to solve the matching, the impact of these parameters on the communication costs, and how this affects the execution times.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124371794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Processor Workload Distribution Algorithm for Massively Parallel Applications","authors":"Serge Midonnet, Achille Wattelar","doi":"10.1109/SBAC-PADW.2016.13","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.13","url":null,"abstract":"The Directed Acyclic Graph (DAG) is a standard model used to describe tasks that execute according to precedence constraints and that allows intra-task parallelism. This model is well suited to camera-based applications where multiple treatments must be executed in parallel according to the camera input, such as those found, for example, in self-driving cars or in image recognition via convolutional neural networks (CNNs). Such applications are used on embedded systems and therefore require low energy cost and limited hardware space. The main contribution of this paper is to present a new partitioning algorithm based on a DAG stretching technique. This stretching algorithm frees processor cores and thus implies energy savings and leads to new hardware designs using a reduced number of processors. We present an experimental evaluation of this algorithm to show its efficiency.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134098430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Outline of a Thick Control Flow Architecture","authors":"M. Forsell, J. Roivainen, V. Leppänen","doi":"10.1109/SBAC-PADW.2016.9","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.9","url":null,"abstract":"The recently invented thick control flow (TCF) model packs together an unbounded number of fibers, thread-like computational entities, flowing through the same control path. This promises to simplify parallel programming by partially eliminating looping and artificial thread arithmetics. In this paper we outline an architecture for efficiently executing programs written for the TCF model. It features scalable latency hiding via replication of instructions, radical synchronization cost reduction via a wave-based synchronization mechanism, and improved low-level parallelism exploitation via chaining of functional units. Replication of instructions is supported by a dynamic multithreading-like mechanism, which saves the fiber-wise data into special replicated register blocks. The architecture provides programmers with a compact, unbounded notation for fibers and groups of them, together with strong synchronous shared memory algorithmics. According to evaluations, the architecture is able to efficiently handle workloads featuring computational elements with the same control flow, independently of the number of elements. In turn, this pays off as improved performance and lower power consumption due to elimination of redundant parts of computation and machinery.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123868455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Gil-Costa, Nicolás Hidalgo, Erika Rosas, Mauricio Marín
{"title":"A Dynamic Load Balance Algorithm for the S4 Parallel Stream Processing Engine","authors":"V. Gil-Costa, Nicolás Hidalgo, Erika Rosas, Mauricio Marín","doi":"10.1109/SBAC-PADW.2016.12","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.12","url":null,"abstract":"Large streams of data can be analyzed in real time by Parallel Stream Processing Engines (PSPEs) which are based on a graph paradigm where vertices represent processing elements (PEs) and edges represent flows of data among PEs. In this work, we propose a new elastic strategy for the S4 PSPE to adjust the overall load of PEs in accordance with the utilization levels and data traffic at each PE. Our approach exploits a producer/consumer model to achieve load balance where new workers pull events from a buffer queue in order to release the amount of traffic in an overloaded PE. Results show that the proposed strategy prevents saturation of PEs and improves the overall throughput of the system by up to 470%.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133977625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Santos, Ricardo Aguiar, Paulo Soken, Samuel Ferraz, Liana Duenha
{"title":"Thread Footprint Analysis for the Design of Multithreaded Applications and Multicore Systems","authors":"R. Santos, Ricardo Aguiar, Paulo Soken, Samuel Ferraz, Liana Duenha","doi":"10.1109/SBAC-PADW.2016.18","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.18","url":null,"abstract":"This work presents Coretool, a pin tool for thread analysis (identification, scheduling, and instruction workload) of multithreaded applications in multicore systems. The main goal of Coretool is to provide enough information to improve performance in multithreaded applications and multicore systems. Coretool can help multithreaded software developers take the application performance overheads into account when redesigning the application. A multicore system designer/administrator can use the thread scheduling, thread usage, and instruction workload to tune a system to improve performance or to maximize throughput. We have performed a set of experiments to characterize multithreaded applications according to their thread footprint on the available multicore resources. The experiments have shown some applications with thread workload imbalance, thus suggesting the need for application redesign.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114583384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}