{"title":"Fixed point exploitation via compiler analyses and transformations: POSTER","authors":"Daniele Cattaneo, Antonio Di Bello, M. Chiari, Stefano Cherubin, G. Agosta","doi":"10.1145/3310273.3323424","DOIUrl":"https://doi.org/10.1145/3310273.3323424","url":null,"abstract":"Fixed point computation represents a key feature in the design process of embedded applications. It is also exploited as a mean to data size tuning for HPC tasks [2]. Since the conversion from floating point to fixed point is generally performed manually, it is time-consuming and error-prone. However, the full automation of such task is currently unfeasible, as existing open source tools are not mature enough for industry adoption. To bridge this gap, we introduce our Tuning Assistant for Floating point to Fixed point Optimization (TAFFO). TAFFO is a toolset of LLVM compiler plugins that automatically converts computations from floating point to fixed point. TAFFO leverages programmer hints to understand the characteristics of the input data, and then performs the code conversion using the most appropriate data types. TAFFO allows programmers to equally apply fine-grained precision tuning to a wide range of programming languages, whereas most current competitors are limited to C. Moreover, it is easily applicable to most embedded [1] and high performance applications [10, 11], and it allows easy maintenance and extensions.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131377818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Abstracting parallel program specification: a case study on k-means clustering","authors":"A. Hommelberg, K. Rietveld, H. Wijshoff","doi":"10.1145/3310273.3322828","DOIUrl":"https://doi.org/10.1145/3310273.3322828","url":null,"abstract":"The Forelem framework was first introduced to optimize database queries using compiler techniques. Since its introduction, Forelem has proven to be more versatile and to be applicable beyond database applications. In this paper we show that Forelem can be used to specify parallel programs at an abstract level whilst still guaranteeing efficient parallel execution. This is achieved by a sequence of transformations that can be directly implemented as an optimizing compiler toolchain. To demonstrate this, a case study is described, k-Means clustering, for which four implementations are mechanically generated that improve standard MPI C/C++ and outperform state-of-the-art Hadoop implementations.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130092869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward a graph-based dependence analysis framework for high level design verification","authors":"John D. Leidel, Frank Conlon","doi":"10.1145/3310273.3323433","DOIUrl":"https://doi.org/10.1145/3310273.3323433","url":null,"abstract":"Recent efforts to deploy FPGA's and application-specific accelerator devices in scalable data center environments has led to a resurgence in research associated with high level synthesis and design verification. The goal of this research has been to accelerate the initial design, verification and deployment process for abstract accelerator platforms. While the research associated with high level synthesis flows has provided significant gains in design acceleration, research in the verification of these designs has largely been based upon augmenting traditional methodologies. This work introduces the CoreGen high level design verification infrastructure. The goal of the CoreGen infrastructure is to provide a rapid, high level design verification infrastructure for complex, heterogeneous hardware architectures. Unlike traditional high-level verification strategies, CoreGen utilizes an intermediate representation (IR) for the target design constructed using a directed acyclic graph (DAG). CoreGen then applies classic compiler dependence analysis techniques using a multitude of graph inference and combinatorial logic solvers. The application of traditional compiler dependence analysis using directed acyclic graphs provides the ability to optimize the performance of the high level verification pipeline regardless of the target design complexity. We highlight this capability by demonstrating the verification performance scaling using a complex, heterogeneous design input. Our results indicate performance competitive with traditional optimizing compilers.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121180096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PGAS for graph analytics: can one sided communications break the scalability barrier?","authors":"J. Langguth","doi":"10.1145/3310273.3324293","DOIUrl":"https://doi.org/10.1145/3310273.3324293","url":null,"abstract":"As the world is becoming increasingly interconnected and systems increasingly complex. Therefore, technologies that can analyze connected systems and their dynamic characteristics become indispensable. Consequently, the last decade has seen increasing interest in graph analytics, which allows obtaining insights from such connected data. Parallel graph analytics can reveal the workings of intricate systems and networks at massive scales, which are found in diverse areas such as social networks, economic transactions, and protein interactions. While sequential graph algorithms have been studied for decades, the recent availability of massive datasets has given rise to the need for parallel graph processing, which poses unique challenges.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NeuPow: artificial neural networks for power and behavioral modeling of arithmetic components in 45nm ASICs technology","authors":"Y. Nasser, Carlo Sau, Jean-Christophe Prévotet, Tiziana Fanni, F. Palumbo, M. Hélard, L. Raffo","doi":"10.1145/3310273.3322820","DOIUrl":"https://doi.org/10.1145/3310273.3322820","url":null,"abstract":"In this paper, we present a flexible, simple and accurate power modeling technique that can be used to estimate the power consumption of modern technology devices. We exploit Artificial Neural Networks for power and behavioral estimation in Application Specific Integrated Circuits. Our method, called NeuPow, relies on propagating the predictors between the connected neural models to estimate the dynamic power consumption of the individual components. As a first proof of concept, to study the effectiveness of NeuPow, we run both component level and system level tests on the Open GPDK 45 nm technology from Cadence, achieving errors below 1.5% and 9% respectively for component and system level. In addition, NeuPow demonstrated a speed up factor of 2490X.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122376291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating tile low-rank GEMM on sunway architecture: POSTER","authors":"Qingchang Han, Hailong Yang, Zhongzhi Luan, D. Qian","doi":"10.1145/3310273.3323425","DOIUrl":"https://doi.org/10.1145/3310273.3323425","url":null,"abstract":"Tile Low-Rank (TLR) GEMM can significantly reduce the amount of computation and memory footprint for matrix multiplication while preserving the same level of accuracy [1]. TLR-GEMM is based on the TLR data format, which is an efficient method to store large-scale sparse matrix. The large matrix is divided into several blocks also known as tile, and non-diagonal tile is compressed into the product of two tall and skinny matrices (in low-rank data format). TLR-GEMM performs the multiplication of TLR matrix A and B to obtain matrix C. TLR-GEMM can be implemented in batch mode, that is, multiple threads are started, and each thread applies the operations onto its corresponding tiles, including dense GEMM, SVD and QR decomposition. One research challenge in the field of TLR-GEMM is that modern high-performance processors often use diverse architectures, which requires adapting to the unique architecture features to achieve better performance.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121193306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data and model convergence: a case for software defined architectures","authors":"Antonino Tumeo","doi":"10.1145/3310273.3323438","DOIUrl":"https://doi.org/10.1145/3310273.3323438","url":null,"abstract":"High Performance Computing, data analytics, and machine learning are often considered three separate and different approaches. Applications, software and now hardware stacks are typically designed to only address one of the areas at a time. This creates a false distinction across the three different areas. In reality, domain scientists need to exercise all the three approaches in an integrated way. For example, large scale simulations generate enormous amount of data, to which Big Data Analytics techniques can be applied. Or, as scientist seek to use data analytics as well as simulation for discovery, machine learning can play an important role in making sense of the disparate source's information. Pacific Northwest National Laboratory is launching a new Laboratory Directed Research and Development (LDRD) Initiative to investigate the integration of the three techniques at all level of the high-performance computing stack, the Data-Model Convergence (DMC) Initiative. The DMC Initiative aims to increase scientist productivity by enabling purpose-built software and hardware and domain-aware ML techniques. In this talk, I will present the objectives of PNNL's DMC Initiative, highlighting the research that will be performed to enable the integration of vastly different programming paradigms and mental models. I will then make the case for how reconfigurable architectures could represent a great opportunity to address the challenges of DMC. In principle, the possibility to dynamically modify the architecture during runtime could provide a way to address the requirement of workloads that have significantly diverse behaviors across phases, without losing too much flexibility or programmer productivity, with respect to highly heterogeneous architectures composed by sea of fixed application specific accelerators. Reconfigurable architectures have been explored since long time ago, and arguably new software breakthroughs are required to make them successful. I will thus present the efforts that the DMC initiative is launching to design a productive toolchain for upcoming novel reconfigurable systems.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"575 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122933406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel edge-based sampling for static and dynamic graphs","authors":"Kartik Lakhotia, R. Kannan, Aditya Gaur, Ajitesh Srivastava, V. Prasanna","doi":"10.1145/3310273.3323052","DOIUrl":"https://doi.org/10.1145/3310273.3323052","url":null,"abstract":"Graph sampling is an important tool to obtain small and manageable subgraphs from large real-world graphs. Prior research has shown that Induced Edge Sampling (IES) outperforms other sampling methods in terms of the quality of subgraph obtained. Even though fast sampling is crucial for several workflows, there has been little work on parallel sampling algorithms in the past. In this paper, we present parIES - a framework for parallel Induced Edge Sampling on shared-memory parallel machines. parIES, equipped with optimized load balancing and synchronization avoiding strategies, can sample both static and streaming dynamic graphs, while achieving high scalability and parallel efficiency. We develop a lightweight concurrent hash table coupled with a space-efficient dynamic graph data structure to overcome the challenges and memory constraints of sampling streaming dynamic graphs. We evaluate parIES on a 16-core (32 threads) Intel server using 7 large synthetic and real-world networks. From a static graph, parIES can sample a subgraph with > 1.4B edges in < 2.5s and achieve upto 15.5X parallel speedup. For dynamic streaming graphs, parIES can process upto 86.7M edges per second achieving 15X parallel speedup.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128558050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Go green radio astronomy: Approximate Computing Perspective: Opportunities and Challenges: POSTER","authors":"G. Gillani, A. Kokkeler","doi":"10.1145/3310273.3323427","DOIUrl":"https://doi.org/10.1145/3310273.3323427","url":null,"abstract":"Modern radio telescopes require highly energy/power-efficient computing systems. Signal processing pipelines of such radio telescopes are dominated by accumulation based iterative processes. As the input signal received at a radio telescope is regarded as Gaussian noise, employing approximate computing looks promising. Therefore, we present opportunities and challenges offered by the approximate computing paradigm to achieve the required efficiency targets.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114609727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High performance, power efficient hardware accelerators: emerging devices, circuits and architecture co-design","authors":"Catherine E. Graves","doi":"10.1145/3310273.3324055","DOIUrl":"https://doi.org/10.1145/3310273.3324055","url":null,"abstract":"General-purpose digital systems have long benefited from favorable scaling, but performance improvements have slowed dramatically in the last decade. Computing is therefore returning to custom and specialized systems, frequently using heterogeneous accelerators. Particularly driven by the data-centric workloads of machine learning and deep learning, an intense development of conventional accelerators (GPUs, FPGAs, CMOS ASICs) but also unconventional accelerators using novel circuits and devices beyond CMOS is currently underway. In this talk, I will discuss some common characteristics of high-performance and power-efficient accelerators in this diverse space and the ecosystem development (such as new interconnects) needed for them to thrive. To illustrate accelerator characteristics and their potential, I will describe our group's efforts to co-design from algorithms and architectures down to novel devices for gains in speed and power. We have developed architectures leveraging the analog and non-volatile nature of memristors (tunable resistance switches) assembled in crossbar arrays to accelerate machine learning, image and signal processing. We have also developed new circuits and assembled architectures to accelerate Finite Automata, enabling rapid pattern matching used in applications from security to genomics. Significant improvements over CPUs, GPUs, and custom digital ASICs are forecasted in both such systems, highlighting the potential for unconventional accelerators in future high-performance computing systems.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130995549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}