{"title":"Embracing heterogeneity with dynamic core boosting","authors":"Hyoun Kyu Cho, S. Mahlke","doi":"10.1145/2597917.2597932","DOIUrl":"https://doi.org/10.1145/2597917.2597932","url":null,"abstract":"Uniformly distributing parallel workloads amongst threads is an effective strategy for programmers to increase application performance. However, in any parallel segment, execution time is determined by the longest running thread. Even for embarrassingly parallel programs in the form of SPMD (single program multiple data), the threads are not perfectly balanced due to control flow divergence, non-deterministic memory latencies, and synchronization operations. Such an imbalance can be significantly exacerbated by performance asymmetry among cores, which is likely to exist in future generations of chip multiprocessors (CMPs) either for energy efficiency or due to process variation. We propose Dynamic Core Boosting (DCB), a software-hardware cooperative system that mitigates the workload imbalance problem in performance asymmetric CMPs. Relying on dynamic voltage and frequency scaling to accelerate individual cores at a fine granularity, DCB attempts to balance the workloads by detecting and boosting critical threads. DCB coordinates its compiler and runtime to enable asymmetric CMPs to achieve near-optimal utilization of core boosting. The compiler instruments the program with instructions to give progress hints and the runtime monitors their execution, enabling DCB to intelligently accelerate selected threads within a total core boosting budget for better performance. On a simulated eight core system of varying frequency, our experiments using PARSEC benchmarks show that DCB improves the overall performance by an average of 33%, outperforming a reactive boosting scheme by an average of 10%.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122020274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a performance-portable FFT library for heterogeneous computing","authors":"Carlo C. del Mundo, Wu-chun Feng","doi":"10.1145/2597917.2597943","DOIUrl":"https://doi.org/10.1145/2597917.2597943","url":null,"abstract":"The fast Fourier transform (FFT), a spectral method that computes the discrete Fourier transform and its inverse, pervades many applications in digital signal processing, such as imaging, tomography, and software-defined radio. Its importance has caused the research community to expend significant resources to accelerate the FFT, of which FFTW is the most prominent example. With the emergence of the graphics processing unit (GPU) as a massively parallel computing device for high performance, we seek to identify architecture-aware optimizations across two different generations of high-end AMD and NVIDIA GPUs, namely the AMD Radeon HD 6970 and HD 7970 and the NVIDIA Tesla C2075 and K20c, respectively, to accelerate FFT performance. Despite architectural differences across GPU generations and vendors, we identify the following optimizations, when applied individually and in isolation of one another, as being the most effective in accelerating FFT performance: (1) register preloading, (2) transposition via local memory, and (3) 8- or 16-byte vector access and scalar arithmetic. We then demonstrate the efficacy of combining individual optimizations together and find that the most effective combination of optimizations across all architectures encompasses register preloading, transposition via local memory, and use of constant memory. Our study suggests that FFT performance on GPUs is primarily limited by global memory data transfer. Overall, our optimizations deliver speed-ups as high as 31.5 over a baseline GPU implementation and 9.1 over a multithreaded FFTW CPU implementation with AVX vector extensions.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129803278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A collaborative divide-and-conquer K-means clustering algorithm for processing large data","authors":"Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng","doi":"10.1145/2597917.2597918","DOIUrl":"https://doi.org/10.1145/2597917.2597918","url":null,"abstract":"K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"7 3-4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128609042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PEACH: a model for performance and energy aware cooperative hybrid computing","authors":"Rong Ge, Xizhou Feng, Martin Burtscher, Ziliang Zong","doi":"10.1145/2597917.2597948","DOIUrl":"https://doi.org/10.1145/2597917.2597948","url":null,"abstract":"Accelerator-based heterogeneous systems become increasingly important to high performance computing because of their potentials to deliver high performance and energy efficiency. To fully realize this potential, parallel software must utilize both host processors and accelerators' computing power and power-aware capabilities. We develop PEACH, a model for Performance and Energy Aware Cooperative Hybrid computing. PEACH explores judicious workload distribution between hosts and accelerators and intelligent energy-aware scheduling for further performance and energy efficiency gains on heterogenous systems. With a few system- and application-dependent parameters, PEACH accurately captures the performance and energy impact of workload distribution and energy-aware scheduling.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127241586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ScaffCC: a framework for compilation and analysis of quantum computing programs","authors":"Ali JavadiAbhari, S. Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov, F. Chong, M. Martonosi","doi":"10.1145/2597917.2597939","DOIUrl":"https://doi.org/10.1145/2597917.2597939","url":null,"abstract":"Quantum computing is a promising technology for high-performance computation, but requires mature toolflows that can map large-scale quantum programs onto targeted hardware. In this paper, we present a scalable compiler for large-scale quantum applications, and show the opportunities for reducing compilation and analysis time, as well as output code size. We discuss the similarities and differences between compiling for a quantum computer as opposed to a classical computer, and present a state-of-the-art approach for compilation of classical circuits into quantum circuits. Our work also highlights the importance of high-level quantum compilation for logical circuit translation, quantitative analysis of algorithms, and optimization of circuit lengths.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124933730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for predicting trajectories using global and local information","authors":"William Groves, Ernesto Nunes, Maria L. Gini","doi":"10.1145/2597917.2597934","DOIUrl":"https://doi.org/10.1145/2597917.2597934","url":null,"abstract":"We propose a novel framework for predicting the paths of vehicles that move on a road network. The framework leverages global and local patterns in spatio-temporal data. From a large corpus of GPS trajectories, we predict the subsequent path of an in-progress vehicle trajectory using only spatio-temporal features from the data. Our framework consists of three components: (1) a component that abstracts GPS location data into a graph at the neighborhood or street level, (2) a component that generates policies obtained from the graph data, and (3) a component that predicts the subsequent path of an in-progress trajectory. Hierarchical clustering is used to construct the city graph, where the clusters facilitate a compact representation of the trajectory data to make processing large data sets tractable and efficient. We propose four alternative policy generation algorithms: a frequency-based algorithm (FreqCount), a correlation-based algorithm (EigenStrat), a spectral clusteringbased algorithm (LapStrat), and a Markov Chain-based algorithm (MCStrat). The algorithms explore either global patterns (FreqCount and EigenStrat) or local patterns (MCStrat) in the data, with the exception of LapStrat which explores both. We present an analysis of the performance of the alternative prediction algorithms using a large real-world taxi data set.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114761684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate off-line phase classification for HW/SW co-designed processors","authors":"Aleksandar Brankovic, Kyriakos Stavrou, E. Gibert, Antonio González","doi":"10.1145/2597917.2597937","DOIUrl":"https://doi.org/10.1145/2597917.2597937","url":null,"abstract":"Evaluation techniques in microprocessor design are mostly based on simulating selected application's samples using a cycle-accurate simulator. These samples usually correspond to different phases of the application stream. To identify these phases, relevant high-level application statistics are collected and clustered using a process named \"Off-Line Phase Classification\". The purpose of phase classification is to reduce the number of samples that need to be simulated with the minimum loss in accuracy (compared to simulating the complete set of samples). Unfortunately, when directly applied to HW/SW co-designed processors the traditional phase classifications do not provide a good trade-off between accuracy and the number of samples. As an example, according to our experimental results, to achieve a 4% error (compared to simulating all the samples) one needs to simulate 2.5X more samples for the case of HW/SW co-designed processors compared to what is necessary for HW-only processors. In this paper, we propose a novel off-line phase classification scheme called TOL Description Vector (TDV), which is suitable for HW/SW co-designed processors. TDV targets at estimating the TOL particularities and on average gives significantly better accuracy than traditional phase classification for any number of selected samples. For instance, TDV reaches the average error of 3% with 3X less samples than traditional classification. These benefits apply for different TOL and microarchitecture configurations.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115490917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-conscious graph collaborative filtering on multi-socket multicore systems","authors":"Lifeng Nai, Yinglong Xia, Ching-Yung Lin, Bo Hong, H. Lee","doi":"10.1145/2597917.2597935","DOIUrl":"https://doi.org/10.1145/2597917.2597935","url":null,"abstract":"Recommendation systems using graph collaborative filtering often require responses in real time and high throughput. Therefore, besides recommendation accuracy, it is critical to study high performance concurrent collaborative filtering on modern platforms. To achieve high performance, we study the graph data locality characteristics of collaborative filtering. Our experiments demonstrate that although an individual graph traversal exhibits poor data locality, multiple queries have a tendency of sharing their data footprints, especially in the case of queries with neighboring root vertices. Such characteristics lead to both inter- and intra-thread data locality, which can be utilized to significantly improve collaborative filtering performance. Based on these observations, we present a cache-conscious system for collaborative filtering on modern multi-socket multicore platforms. In this system, we propose a cache-conscious query scheduling technique and an in-memory graph representation, and to maximize cache performance and minimize cross-core/socket communication overhead, we address both inter- and intra-thread data locality. To address the workload balancing issue, this study introduces a dynamic work-stealing mechanism to explore the tradeoff between workload balancing and cache-consciousness. The proposed system was evaluated on a Power7+ system against the IBM Knowledge Repository graph dataset. The results demonstrated both good scalability and throughput. Compared with the basic system that does not perform cache-conscious scheduling, inter-thread scheduling improves throughput by up to 18%. Intra-thread scheduling can further improve throughput by as much as 22%. By enabling dynamic work-stealing, the proposed technique balances workloads across all threads with a low standard deviation of the per-thread processing time.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131314391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of neural network through genetic algorithm searches for the prediction of international crude oil price based on energy products prices","authors":"H. Chiroma, A. Gital, Adamu I. Abubakar, M. Usman, Usman Waziri","doi":"10.1145/2597917.2597956","DOIUrl":"https://doi.org/10.1145/2597917.2597956","url":null,"abstract":"This study investigated the prediction of crude oil price based on energy product prices using genetically optimized Neural Network (GANN). It was found from experimental evidence that the international crude oil price can be predicted based on energy product prices. The comparison of the prediction performance accuracy of the propose GANN with Support Vector Machine (SVM), Vector Autoregression (VAR), and Feed Forward NN (FFNN) suggested that the propose GANN was more accurate than the SVM, VAR, and FFNN in the prediction accuracy and time computational complexity. The propose GANN was able to improve the performance accuracy of the comparison algorithms. Our approach can easily be modified for the prediction of similar commodities.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121184620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A neuro-fuzzy fan speed controller for dynamic thermal management of multi-core processors","authors":"J. M. N. Abad, Bagher Salami, Hamid Noori, A. Soleimani, Farhad Mehdipour","doi":"10.1145/2597917.2597958","DOIUrl":"https://doi.org/10.1145/2597917.2597958","url":null,"abstract":"Cooling equipments is a thermal management technique that reduces the thermal resistance of the heat sink without any performance degradation. However, higher fan speed produces a lower thermal resistance, but at the expense of higher power consumption. Our proposed Neuro-Fuzzy fan controller (NFSC), minimizes fan power consumption while avoiding the temperature increase above a certain threshold. The experimental results indicate that our proposed model can significantly decrease the average fan power with negligible temperature overhead compared to the traditional fan controller.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133735458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}