Proceedings of the ACM International Conference on Computing Frontiers: Latest Articles

An architecture for near-data processing systems

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903478

E. Vermij, C. Hagleitner, Leandro Fiorin, R. Jongerius, J. V. Lunteren, K. Bertels

Citations: 9

Heterogeneous chip multiprocessor architectures for big data applications

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2908078

H. Homayoun

Abstract: Emerging big data analytics applications require a significant amount of server computational power. The costs of building and running a computing server to process big data, and the capacity to which we can scale it, are driven in large part by those computational resources. However, big data applications share many characteristics that are fundamentally different from traditional desktop, parallel, and scale-out applications. Big data analytics applications rely heavily on specific deep machine learning and data mining algorithms, and run a complex and deep software stack with various components (e.g., Hadoop, Spark, MPI, HBase, Impala, MySQL, Hive, Shark, Apache, and MongoDB) that are bound together with a runtime software system and interact significantly with I/O and the OS, exhibiting high computational intensity, memory intensity, I/O intensity, and control intensity. Current server designs, based on commodity homogeneous processors, will not be the most efficient in terms of performance per watt for this emerging class of applications. In other domains, heterogeneous architectures have emerged as a promising solution to enhance energy efficiency by allowing each application to run on a core that matches its resource needs more closely than a one-size-fits-all core. A heterogeneous architecture integrates cores with various micro-architectures and accelerators to provide more opportunity for efficient workload mapping. In this work, through methodical investigation of power and performance measurements and comprehensive system-level characterization, we demonstrate that a heterogeneous architecture combining high-performance big cores and low-power little cores is required for efficient processing of big data analytics applications, particularly in the presence of accelerators and near-real-time performance constraints.

Citations: 6

Does it sound as it claims: a detailed side-channel security analysis of QuadSeal countermeasure

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2911709

Darshana Jayasinghe, S. Bhasin, S. Parameswaran, A. Ignjatović

Citations: 1

Area-energy tradeoffs of logic wear-leveling for BTI-induced aging

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903171

R. Ashraf, N. Khoshavi, Ahmad Alzahrani, R. Demara, S. Kiamehr, M. Tahoori

Abstract: Ensuring operational reliability in the presence of Bias Temperature Instability (BTI) effects often results in a compromise in the form of lower performance, higher energy consumption, or both. This is because the performance degradation that BTI effects cause over time must be compensated through frequency, voltage, or area margining so that the circuit meets its timing specification until the end of its operational lifetime. In this paper, a circuit-level approach referred to as Logic-Wear-Leveling (LWL) utilizes Dark-Silicon to mitigate BTI effects in logic datapaths. LWL introduces fine-grained spatial redundancy in timing-vulnerable logic components and leverages it at runtime to enable post-Silicon adaptability. The activation interval of redundant datapaths allows for controlled stress and recovery phases. This produces a wear-leveling effect that reduces the BTI-induced performance degradation over time, which in turn reduces the design margins. This approach demonstrates a reduction in energy consumption of up to 31.98% at 10 years compared with a conventional voltage guardbanding approach. The benefit of energy reduction is also assessed against the area overheads of spatial redundancy.

Citations: 8

P-Socket: optimizing a communication library for a PCIe-based intra-rack interconnect

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903168

Liuhang Zhang, Rui Hou, S. Mckee, Jianbo Dong, Lixin Zhang

Citations: 2

Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903175

M. Soltaniyeh, I. Kadayif, Özcan Özturk

Abstract: Chip multiprocessors (CMPs) require effective cache coherence protocols as well as fast virtual-to-physical address translation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-the-art approach in many-core CMPs for keeping data blocks coherent at the last-level private caches. However, the area overhead and high associativity requirement of the directory structures may not scale well with an increasing number of cores. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core; therefore, it is not necessary to keep track of these blocks in the directory structure. In this study, we make two major contributions. First, we show that, compared to the classification of cache blocks at page granularity as done in some previous studies, data block classification at subpage level helps to detect considerably more private data blocks. Consequently, it significantly reduces the percentage of blocks that must be tracked in the directory compared to similar page-level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory structure to scale gracefully with the increasing number of cores. Memory block classification at subpage level, however, may increase the frequency of the operating system's (OS) involvement in updating the maintenance bits belonging to subpages stored in page table entries, nullifying some portion of the performance benefits of subpage-level data classification. To overcome this, we propose a distributed on-chip page table as our second contribution.
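
Illustrative sketch (not taken from the paper): the abstract's claim that subpage-level classification detects more private blocks than page-level classification can be shown on a toy access trace. The page size, block size, subpage count, trace, and the function name `private_blocks` are all assumptions made for this example; the paper's actual mechanism and parameters are not given in the abstract.

```python
from collections import defaultdict

PAGE_SIZE = 4096   # bytes per page (assumed for illustration)
BLOCK_SIZE = 64    # bytes per cache block (assumed for illustration)

def private_blocks(accesses, region_size):
    """Count distinct cache blocks that lie in regions touched by one core only.

    accesses: iterable of (core_id, byte_address) pairs.
    region_size: classification granularity in bytes (a page or a subpage).
    """
    cores_per_region = defaultdict(set)
    blocks_per_region = defaultdict(set)
    for core, addr in accesses:
        region = addr // region_size
        cores_per_region[region].add(core)
        blocks_per_region[region].add(addr // BLOCK_SIZE)
    # A region is private when exactly one core has accessed it.
    return sum(len(blocks_per_region[region])
               for region, cores in cores_per_region.items()
               if len(cores) == 1)

# Hypothetical trace: two cores use different quarters of the same page.
trace = [(0, 0), (0, 64), (1, 1024), (1, 1088), (0, 128)]

page_level = private_blocks(trace, PAGE_SIZE)          # whole page looks shared
subpage_level = private_blocks(trace, PAGE_SIZE // 4)  # each quarter is private
print(page_level, subpage_level)
```

In this toy trace the page as a whole is touched by both cores, so page-level classification marks every block as shared, while classifying at a quarter-page granularity marks all five accessed blocks as private and therefore exempt from directory tracking.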

Citations: 2

Prototyping real-time tracking systems on mobile devices

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903471

Kyunghun Lee, Haifa Ben Salem, T. Damarla, W. Stechele, S. Bhattacharyya

Abstract: In this paper, we address the design and implementation of low-power embedded systems for real-time tracking of humans and vehicles. Such systems are important in applications such as activity monitoring and border security. We motivate the utility of mobile devices in prototyping the targeted class of tracking systems, and demonstrate a dataflow-based, cross-platform design methodology that enables efficient experimentation with key aspects of our tracking system design, including real-time operation, experimentation with advanced sensors, and streamlined management of design versions on host and mobile platforms. Our experiments demonstrate the utility of our mobile-device-targeted design methodology in validating tracking algorithm operation; evaluating real-time performance, energy efficiency, and accuracy of tracking system execution; and quantifying trade-offs involving the use of advanced sensors, which offer improved sensing accuracy at the expense of increased cost and weight. Additionally, through application of a novel, cross-platform, model-based design approach, our design requires no change in source code when migrating from an initial, host-computer-based functional reference to a fully functional implementation on the targeted mobile device.

Citations: 2

A lightweight user tracking method for app providers

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903484

R. M. Frey, Runhua Xu, A. Ilic

Citations: 6

IVM: a task-based shared memory programming model and runtime system to enable uniform access to CPU-GPU clusters

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903174

Kittisak Sajjapongse, Ruidong Gu, M. Becchi

Abstract: GPUs have been widely used to accelerate a variety of applications from different domains and have become part of high-performance computing clusters. Yet, the use of GPUs within distributed applications still faces significant challenges in terms of programmability and performance portability. The use of popular programming models for distributed applications (such as MPI, SHMEM, and Charm++) in combination with GPU programming frameworks (such as CUDA and OpenCL) exposes disjoint memory address spaces to the programmer and provides a non-uniform view of compute resources (i.e., CPUs and GPUs). In addition, these programming models often perform static assignment of tasks to compute resources and require significant programming effort to embed dynamic scheduling and load balancing mechanisms within the application. In this work, we propose a programming framework called Inter-node Virtual Memory (IVM) that provides the programmer with a uniform view of compute resources and memory spaces within a CPU-GPU cluster, and a mechanism to easily incorporate load balancing within the application. We compare MPI, Charm++, and IVM on four distributed GPU applications. Our experimental results show that, while the main goal of IVM is programmer productivity, the use of the load balancing mechanisms offered by this framework can also lead to performance gains over existing frameworks.

Citations: 1

Accelerating the 3D Euler atmospheric solver through heterogeneous CPU-GPU platforms

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903480

Jingheng Xu, H. Fu, L. Gan, Chao Yang, Wei Xue, Guangwen Yang

Abstract: In climate change studies, the atmospheric model is an essential component for building a high-resolution climate simulation system. While the accuracy of atmospheric simulations has long been limited by the computational capabilities of CPU platforms, heterogeneous platforms equipped with accelerators are becoming promising candidates for achieving high simulation performance. However, due to the complex algorithms and heavy communication involved, atmospheric model developers face tough challenges in both algorithmic and architectural aspects. In this paper, we propose a hybrid algorithm to accelerate the solver of the Euler atmospheric equations, which are the most essential equation set for simulating mesoscale atmospheric dynamics. Based on the heterogeneous CPU-GPU platform, we develop a 3-dimensional domain decomposition mechanism, which achieves more efficient utilization of the computing resources. Furthermore, an extensive set of optimization techniques is applied to boost the performance of the solver on both the host and accelerator sides. Compared with a fully optimized version running on two 6-core CPUs, the optimized Euler solver achieves a speedup of 6.64x when running on a hybrid node with two 6-core Intel Xeon E5645 CPUs and one Tesla K20c GPU. In addition, a nearly linear weak scaling result is achieved on a cluster with 12 CPU-GPU nodes. The experimental results demonstrate the promise of applying heterogeneous architectures to the study of atmospheric simulation.

Citations: 1