Proceedings of the 15th ACM International Conference on Computing Frontiers: Latest Articles

A decoupled access-execute architecture for reconfigurable accelerators

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203267

George Charitopoulos, Charalampos Vatsolakis, Grigorios Chrysos, D. Pnevmatikatos

Abstract: Mapping computationally intensive applications onto reconfigurable technology for acceleration requires two main implementation parts: (a) the data plane, i.e., efficient interconnected units that accelerate processing, and (b) the access plane, i.e., efficient ways to access data and transfer them to/from the accelerator. Data plane construction is well understood, and mature tools, such as High-Level Synthesis (HLS), exist that produce efficient reconfigurable architectures. The access plane, however, is more challenging: data fetching for big-data and high-performance computing applications is even more complex and time consuming than processing. To this end, we present DAER, a Decoupled Access-Execute architecture and framework for Reconfigurable accelerators. Our approach maps the code to be accelerated into two separate parts: (a) the fetch unit, responsible for fetching data to the accelerator and storing results back in memory, and (b) the processing unit, which processes the fetched data in a streaming way. This approach offers the user a structured and well-defined way of mapping applications onto an FPGA. Additionally, it combines well with other hardware-based optimization techniques, e.g., pipelining, custom processing, and data prefetching, which hide the memory access latency. We use the DAER framework and HLS mapping tools on five applications and show that the proposed DAER framework achieves an order-of-magnitude speed-up compared to unmodified applications, and as much as a 2x performance improvement compared to their optimized HLS versions. We also map the DAER-based architectures onto HPC platforms, showing the performance advantages of our approach on real-world platforms.

Citations: 11
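
The split between a fetch unit and a streaming processing unit described in the DAER abstract can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the authors' framework: the names `fetch_unit`, `processing_unit`, and `run_decoupled` are hypothetical, a bounded Python queue stands in for the hardware FIFO between the two units, and result write-back is folded into the processing side for brevity, whereas the abstract assigns it to the fetch unit.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def fetch_unit(memory, out_q, block_size):
    """Access side: read blocks from 'memory' and stream them onward."""
    for start in range(0, len(memory), block_size):
        out_q.put(memory[start:start + block_size])
    out_q.put(SENTINEL)  # tell the consumer that no more data follows


def processing_unit(in_q, results, kernel):
    """Execute side: consume blocks as they arrive and apply the kernel."""
    while True:
        block = in_q.get()
        if block is SENTINEL:
            break
        results.extend(kernel(x) for x in block)


def run_decoupled(memory, kernel, block_size=4, queue_depth=2):
    """Run both units concurrently, coupled only through a bounded FIFO."""
    fifo = queue.Queue(maxsize=queue_depth)
    results = []
    fetcher = threading.Thread(target=fetch_unit, args=(memory, fifo, block_size))
    worker = threading.Thread(target=processing_unit, args=(fifo, results, kernel))
    fetcher.start()
    worker.start()
    fetcher.join()
    worker.join()
    return results


data = list(range(10))
squared = run_decoupled(data, lambda x: x * x)
print(squared)
```

Because the queue is bounded, the fetch side can run ahead of the processing side only by `queue_depth` blocks, which loosely mirrors how prefetching into a finite buffer hides access latency without unbounded storage.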

An FPGA framework for edge-centric graph processing

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203233

Shijie Zhou, R. Kannan, Hanqing Zeng, V. Prasanna

Abstract: Many emerging real-world applications require fast processing of large-scale data represented in the form of graphs. In this paper, we design a Field-Programmable Gate Array (FPGA) framework to accelerate graph algorithms based on the edge-centric paradigm. Our design is flexible enough to accelerate general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix-Vector Multiplication (SpMV), PageRank (PR), Single-Source Shortest Path (SSSP), and Weakly Connected Components (WCC). The target platform consists of a large external memory to store the graph data and an FPGA to accelerate the processing. Taking an edge-centric graph algorithm and hardware resource constraints as inputs, our framework determines the optimal design parameters and produces an optimized Register-Transfer Level (RTL) FPGA accelerator design. To improve data locality and increase parallelism, we partition the input graph into non-overlapping partitions. This enables our framework to efficiently buffer vertex data in the on-chip memory of the FPGA and exploit both inter-partition and intra-partition parallelism. Further, we propose an optimized data layout to improve external memory performance and reduce data communication between the FPGA and external memory. Based on our design methodology, we accelerate two fundamental graph algorithms for performance evaluation: SpMV and PR. Experimental results show that our accelerators sustain a high throughput of up to 2250 Million Traversed Edges Per Second (MTEPS) for SpMV and 2487 MTEPS for PR. Compared with several highly optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV and 17.7× speedup for PR; compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× and 1.8× throughput improvement for SpMV and PR, respectively.

Citations: 41
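
The edge-centric paradigm and the non-overlapping partitioning mentioned in the abstract above can be sketched in software. This is a hedged illustration only: it follows the general scatter/gather pattern of edge-centric processing, the function names and the interval-based partitioning by vertex id are assumptions made for the example, and nothing here reflects the paper's RTL design or data layout. Vertices without outgoing edges are not treated specially.

```python
def partition_edges(edges, num_vertices, num_partitions):
    """Split vertex ids into contiguous intervals; place each edge by its source."""
    size = -(-num_vertices // num_partitions)  # ceiling division
    parts = [[] for _ in range(num_partitions)]
    for src, dst in edges:
        parts[src // size].append((src, dst))
    return parts, size


def pagerank_edge_centric(edges, num_vertices, num_partitions=2,
                          damping=0.85, iterations=20):
    """PageRank expressed as repeated scatter and gather passes over edges."""
    parts, size = partition_edges(edges, num_vertices, num_partitions)
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Scatter: stream each partition's edges and emit one update per edge,
        # binned by the partition that owns the destination vertex.
        bins = [[] for _ in range(num_partitions)]
        for part in parts:
            for src, dst in part:
                bins[dst // size].append((dst, rank[src] / out_degree[src]))
        # Gather: apply each bin's updates to the vertices of that partition.
        incoming = [0.0] * num_vertices
        for updates in bins:
            for dst, value in updates:
                incoming[dst] += value
        rank = [(1.0 - damping) / num_vertices + damping * s for s in incoming]
    return rank


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(pagerank_edge_centric(cycle, num_vertices=4))
```

Binning updates by destination partition means the gather pass for one partition touches only that partition's vertex interval, which is the locality property that makes on-chip buffering of vertex data plausible.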

Taming irregular applications via advanced dynamic parallelism on GPUs

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203243

Jing Zhang, Ashwin M. Aji, Michael L. Chu, Hao Wang, Wu-chun Feng

Abstract: On recent GPU architectures, dynamic parallelism, which enables kernels to be launched from the GPU without CPU involvement, provides a way to improve the performance of irregular applications by generating child kernels dynamically to reduce workload imbalance and improve GPU utilization. In practice, however, dynamic parallelism does not improve performance due to high kernel launch overhead and low child kernel occupancy. Consequently, most existing studies focus on mitigating the kernel launch overhead. As the kernel launch overhead has decreased due to algorithmic redesigns and hardware architectural innovations, the organization of subtasks into child kernels has become a new performance bottleneck. We present an in-depth characterization of existing software approaches for dynamic parallelism optimizations on the latest GPUs. We observe that current approaches to subtask aggregation, which use a "one-size-fits-all" method that treats all subtasks equally, can under-utilize resources and degrade overall performance, as different subtasks require different configurations for optimal performance. To address this problem, we leverage statistical and machine-learning techniques and propose a performance modeling and task scheduling tool that can (1) analyze the performance characteristics of subtasks to identify the critical performance factors, (2) predict the performance of new subtasks, and (3) generate the optimal aggregation strategy for new subtasks. Experimental results show that our approach with the optimal subtask aggregation strategy can achieve up to a 1.8-fold speedup over the existing task aggregation approach for dynamic parallelism.

Citations: 9

Distributed learning-based state prediction for multi-agent systems with reduced communication effort

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203230

Daniel Hinkelmann, A. Schmeink, Guido Dartmann

Citations: 0

Gathering and analyzing identity leaks for a proactive warning of affected users

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203269

Timo Malderle, Matthias Wübbeling, S. Knauer, Arnold Sykosch, M. Meier

Citations: 10

Vulnerability analysis of Android Auto infotainment apps

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203278

A. K. Mandal, Agostino Cortesi, Pietro Ferrara, F. Panarotto, F. Spoto

Citations: 27

Horizon: a multi-abstraction framework for graph analytics

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203270

Adnan Haider, Fabio Checconi, Xinyu Que, L. Schneidenbach, Daniele Buono, Xian-He Sun

Citations: 0

The D.A.V.I.D.E. big-data-powered fine-grain power and performance monitoring support

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3205863

Andrea Bartolini, Andrea Borghesi, Antonio Libri, Francesco Beneventi, D. Gregori, S. Tinti, Cosimo Gianfreda, Piero Altoe

Citations: 22

The SAGE project: a storage centric approach for exascale computing: invited paper

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3205341

Sai B. Narasimhamurthy, N. Danilov, S. Wu, G. Umanesan, Steven W. D. Chien, Sergio Rivas-Gomez, I. Peng, E. Laure, S. D. Witt, D. Pleiter, S. Markidis

Abstract: SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission-funded project working towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with an associated software stack. The SAGE system follows a storage-centric approach, as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data-centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed, and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Jülich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage, which is accessible via the Clovis API and higher-level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts and the prototype of the SAGE platform, and to discuss the software architecture of the SAGE system.

Citations: 20

Comprehensive assessment of run-time hardware-supported malware detection using general and ensemble learning

Proceedings of the 15th ACM International Conference on Computing Frontiers | Pub Date: 2018-05-08 | DOI: 10.1145/3203217.3203264

H. Sayadi, Sai Manoj Pudukotai Dinakarrao, A. Houmansadr, S. Rafatirad, H. Homayoun

Abstract: Recent studies have demonstrated the effectiveness of Hardware Performance Counters (HPCs) for detecting patterns of malicious applications. Hardware-supported detectors use Machine Learning (ML) classifiers for malware detection by analyzing a large number of HPC features, more than the very limited number of HPC registers available in modern microprocessors. Obtaining more HPCs requires running the application (malware or benign) more than once to collect the required data, which in turn makes the solution less practical for run-time detection of malware. In response to this challenge, we first identify the critical HPC features required for malware detection. Next, we explore the use of various ML techniques to classify benign and malware applications using the selected HPCs at run-time. Further, we investigate the effectiveness of ensemble learning in improving the performance of ML classifiers. For this purpose, we apply AdaBoost to all general ML classifiers. We thoroughly compare the general and ensemble ML classifiers in terms of accuracy, robustness, performance, and hardware overhead. The experimental results indicate that ensemble learning enhances the performance of malware detection for rule-based and tree-based algorithms by up to 13%. However, it diminishes the performance of neural network and Bayesian network-based detectors by 6% and 4%, respectively.

Citations: 30
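
The abstract above applies AdaBoost on top of base classifiers. As a rough illustration of what boosting does, the sketch below implements standard discrete AdaBoost over one-feature threshold stumps in plain Python. Everything specific here is an assumption made for the example: the function names are hypothetical, the single feature stands in for one normalized counter reading, and the six samples are fabricated toy values, not measured HPC data or results from the paper.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[f] >= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best


def stump_predict(stump, row):
    _, f, t, polarity = stump
    return polarity if row[f] >= t else -polarity


def adaboost(X, y, rounds):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Fabricated toy data: +1 and -1 are arbitrary class labels, and the middle
# band of feature values belongs to the other class.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
print([predict(single, row) for row in X])
print([predict(boosted, row) for row in X])
```

On this toy data no single threshold can isolate the middle band, so one stump necessarily misclassifies some samples, while the weighted vote of three stumps fits all six. That gap is the kind of gain ensemble learning can bring to simple tree-style learners, though the example says nothing about the magnitudes reported in the abstract.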