{"title":"BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models","authors":"Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim","doi":"10.1109/PACT.2015.42","DOIUrl":"https://doi.org/10.1109/PACT.2015.42","url":null,"abstract":"Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node have caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance. The data communication is not overlapped with the main computation, which reduces execution efficiency. The inefficiency comes from the data movement between where they are stored and where they are processed. In this work, we propose Bounded Staled Sync (BSSync), a hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy. BSSync overlaps the long latency atomic operation with the main computation, targeting iterative convergent machine learning workloads. Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale-writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x times.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121176697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Optimization of Resource Arrangement for Network-on-Chip using Genetic Algorithm","authors":"Daichi Murakami, K. Hiraki","doi":"10.1109/PACT.2015.54","DOIUrl":"https://doi.org/10.1109/PACT.2015.54","url":null,"abstract":"Non-uniform arrangement of on-chip resources could outperform the uniform design. The way of arrangement itself, however, has not been well researched yet. In this research, we proposed an automatic method for better arrangement of the on-chip resources using GA.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129059990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vinita V. Deodhar, Hrushit Parikh, Ada Gavrilovska, S. Pande
{"title":"Compiler Assisted Load Balancing on Large Clusters","authors":"Vinita V. Deodhar, Hrushit Parikh, Ada Gavrilovska, S. Pande","doi":"10.1109/PACT.2015.40","DOIUrl":"https://doi.org/10.1109/PACT.2015.40","url":null,"abstract":"Load balancing of tasks across processing nodes is critical for achieving speed up on large scale clusters. Load balancing schemes typically detect the imbalance and then migrate the load from an overloaded processing node to an idle or lightly loaded processing node and thus, estimation of load critically affects the performance of load balancing schemes. On large scale clusters, the latency of load migration between processing nodes (and the energy) is also a significant overhead and any missteps in load estimation can cause significant migrations and performance losses. Currently, the load estimation is done either by profile or feedback driven approaches in sophisticated systems such as Charm++, but such approaches must be re-thought in light of some workloads such as Adaptive Mesh Refinement (AMR) and multiscale physics where the load variations could be quite dynamic and rapid. In this work we propose a compiler based framework which performs precise prediction of the forthcoming workload. The compiler driven load prediction technique performs static analysis of a task and derives an expression to predict load of a task and hoists it as early as possible in the control flow of execution. The compiler also inserts corrector expressions at strategic program points which refine the reachability probability of the load as well as its estimation. The predictor and the corrector expressions are evaluated at runtime and the predicted load information is refined as the execution proceeds and is eventually used by load balancer to take efficient migration decisions. We present an implementation of the above in the Rose compiler and the Charm++ parallel programming framework. We demonstrate the effectiveness of the framework on some key benchmarks that exhibit dynamic variations and show how the compiler framework assists load balancing schemes in Charm++ to provide significant gains.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115568154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lluc Alvarez, Miquel Moretó, Marc Casas, Emilio Castillo, X. Martorell, Jesús Labarta, E. Ayguadé, M. Valero
{"title":"Runtime-Guided Management of Scratchpad Memories in Multicore Architectures","authors":"Lluc Alvarez, Miquel Moretó, Marc Casas, Emilio Castillo, X. Martorell, Jesús Labarta, E. Ayguadé, M. Valero","doi":"10.1109/PACT.2015.26","DOIUrl":"https://doi.org/10.1109/PACT.2015.26","url":null,"abstract":"The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems in traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This memory organization has the potential to improve performance and to reduce the power consumption and the on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based programming models are a promising alternative to program heterogeneous multicore architectures. In these models the runtime system manages the execution of the tasks on the architecture, allowing them to apply many optimizations in a generic way at the runtime system level. This paper proposes giving the runtime system the responsibility to manage the scratchpad memories of a hybrid memory hierarchy in multicore processors, transparently to the programmer. In the envisioned system, the runtime system takes advantage of the information found in the task dependences to map the inputs and outputs of a task to the scratchpad memory of the core that is going to execute it. In addition, the paper exploits two mechanisms to overlap the data transfers with computation and a locality-aware scheduler to reduce the data motion. In a 32-core multicore architecture, the hybrid memory hierarchy outperforms cache-only hierarchies by up to 16%, reduces on-chip network traffic by up to 31% and saves up to 22% of the consumed power.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131360426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Cost of Atomic Operations on Modern Architectures","authors":"Hermann Schweizer, Maciej Besta, T. Hoefler","doi":"10.1109/PACT.2015.24","DOIUrl":"https://doi.org/10.1109/PACT.2015.24","url":null,"abstract":"Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126290260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ivan Jibaja, P. Jensen, Ningxin Hu, M. Haghighat, J. McCutchan, D. Gohman, S. Blackburn, K. McKinley
{"title":"Vector Parallelism in JavaScript: Language and Compiler Support for SIMD","authors":"Ivan Jibaja, P. Jensen, Ningxin Hu, M. Haghighat, J. McCutchan, D. Gohman, S. Blackburn, K. McKinley","doi":"10.1109/PACT.2015.33","DOIUrl":"https://doi.org/10.1109/PACT.2015.33","url":null,"abstract":"JavaScript is the most widely used web programming language and is increasingly used to implement sophisticated and demanding applications in such domains as graphics, games, video, and cryptography. The performance and energy usage of these applications can benefit from hardware parallelism, including SIMD (Single Instruction, Multiple Data) vector parallel instructions. JavaScript's current support for parallelism is, however, limited and does not directly utilize SIMD capabilities. This paper presents the design and implementation of SIMD language extensions and compiler support that together add fine-grain vector parallelism to JavaScript. The specification for this language extension is in final stages of adoption by the JavaScript standardization committee and our compiler support is available in two open-source production browsers. The design principles seek portability, SIMD performance portability on various SIMD architectures, and compiler simplicity to ease adoption. The design does not require automatic vectorization compiler technology, but does not preclude it either. The SIMD extensions define immutable fixed-length SIMD data types and operations that are common to both ARM and x86 ISAs. The contributions of this work include type speculation and optimizations that generate minimal numbers of SIMD native instructions from high-level JavaScript SIMD instructions. We implement type speculation, optimizations, and code generation in two open-source JavaScript VMs and measure performance improvements between a factor of 1.7× to 8.9× with an average of 3.3× and average energy improvements of 2.9× on micro benchmarks and key graphics kernels on various hardware, browsers, and operating systems. These portable SIMD language extensions significantly improve compute-intensive interactive applications in the browser, such as games and media processing, by exploiting vector parallelism without relying on automatic vectorizing compiler technology, non-portable native code, or plugins.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131062174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems","authors":"K. Murthy, J. Mellor-Crummey","doi":"10.1109/PACT.2015.41","DOIUrl":"https://doi.org/10.1109/PACT.2015.41","url":null,"abstract":"Data movement is a critical bottleneck for future generations of parallel systems. The class of .5D communication-avoiding algorithms were developed to address this bottleneck. These algorithms reduce communication and provide strong scaling in both time and energy. As a firststep towards automating the development of communication-avoiding-libraries, we developed the Maunam compiler. Maunam generates efficient parallel code from a high-level, global view sketch of .5D algorithms that are expressed using symbolic data sizes and numbers of processors. It supports the expression of data movement and communication through-high-level global operations such as TILT and CSHIFT as well as through element-wise copy operations. With the latter, wrap around communication patterns can also be achieved using subscripts based on modulo operations. Maunam employs polyhedral analysis to reason about communication and computation present in a .5D algorithm. After partitioning data and computation, it inserts point-to-point-and collective communication as needed. Maunam also analyzes data dependence patterns and data layouts to identify reductions over processor subsets. Maunam-generated Fortran+MPI code for 2.5D matrix multiplication running on 4096 cores of a Cray XC30 super computer achieves 59 TFlops/s (76% of the machine peak). Our generated parallel code achieves 91% of the performance of a hand-coded version.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132323424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Mitra, G. Bronevetsky, Suhas Javagal, S. Bagchi
{"title":"Dealing with the Unknown: Resilience to Prediction Errors","authors":"S. Mitra, G. Bronevetsky, Suhas Javagal, S. Bagchi","doi":"10.1109/PACT.2015.19","DOIUrl":"https://doi.org/10.1109/PACT.2015.19","url":null,"abstract":"Accurate prediction of applications' performance and functional behavior is a critical component for a widerange of tools, including anomaly detection, task scheduling and approximate computing. Statistical modeling is a very powerful approach for making such predictions and it uses observations of application behavior on a small number of training cases to predict how the application will behave in practice. However, the fact that applications' behavior often depends closely on their configuration parameters and properties of their inputs means that any suite of application training runs will cover only a small fraction of its overall behavior space. Since a model's accuracy often degrades as application configuration and inputs deviate further from its training set, this makes it difficult to act based on the model's predictions. This paper presents a systematic approach to quantify theprediction errors of the statistical models of the application behavior, focusing on extrapolation, where the application configuration and input parameters differ significantly from the model's training set. Given any statistical model of application behavior and a data set of training application runs from which this model is built, our technique predicts the accuracy of the model for predicting application behavior on a new run on hitherto unseen inputs. We validate the utility of this method by evaluating it on the use case of anomaly detection for seven mainstream applications and benchmarks. The evaluation demonstrates that our technique can reduce false alarms while providing high detection accuracy compared to a statistical, input-unaware modeling technique.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121513595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Narges Shahidi, Anand Sivasubramanian, M. Kandemir, C. Das
{"title":"Storage Consolidation on SSDs: Not Always a Panacea, but Can We Ease the Pain?","authors":"Narges Shahidi, Anand Sivasubramanian, M. Kandemir, C. Das","doi":"10.1109/PACT.2015.61","DOIUrl":"https://doi.org/10.1109/PACT.2015.61","url":null,"abstract":"Storage Consolidation is increasing being adopted to reduce system costs, simplify the storage infrastructure, and enhance availability and resource management. However, consolidation leads to interference in shared resources. This poster shows the effect of consolidation on performance of co-located applications and proposes a static approach to reduce interference.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114434935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical Near-Data Processing for In-Memory Analytics Frameworks","authors":"Mingyu Gao, Grant Ayers, C. Kozyrakis","doi":"10.1109/PACT.2015.22","DOIUrl":"https://doi.org/10.1109/PACT.2015.22","url":null,"abstract":"The end of Dennard scaling has made all systemsenergy-constrained. For data-intensive applications with limitedtemporal locality, the major energy bottleneck is data movementbetween processor chips and main memory modules. For such workloads, the best way to optimize energy is to place processing near the datain main memory. Advances in 3D integrationprovide an opportunity to implement near-data processing (NDP) withoutthe technology problems that similar efforts had in the past. This paper develops the hardware and software of an NDP architecturefor in-memory analytics frameworks, including MapReduce, graphprocessing, and deep neural networks. We develop simple but scalablehardware support for coherence, communication, and synchronization, anda runtime system that is sufficient to support analytics frameworks withcomplex data patterns while hiding all thedetails of the NDP hardware. Our NDP architecture provides up to 16x performance and energy advantageover conventional approaches, and 2.5x over recently-proposed NDP systems. We also investigate the balance between processing and memory throughput, as well as the scalability and physical and logical organization of the memory system. Finally, we show that it is critical to optimize software frameworksfor spatial locality as it leads to 2.9x efficiency improvements for NDP.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116964144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}