{"title":"Embracing heterogeneity with dynamic core boosting","authors":"Hyoun Kyu Cho, S. Mahlke","doi":"10.1145/2597917.2597932","DOIUrl":"https://doi.org/10.1145/2597917.2597932","url":null,"abstract":"Uniformly distributing parallel workloads amongst threads is an effective strategy for programmers to increase application performance. However, in any parallel segment, execution time is determined by the longest running thread. Even for embarrassingly parallel programs in the form of SPMD (single program multiple data), the threads are not perfectly balanced due to control flow divergence, non-deterministic memory latencies, and synchronization operations. Such an imbalance can be significantly exacerbated by performance asymmetry among cores, which is likely to exist in future generations of chip multiprocessors (CMPs) either for energy efficiency or due to process variation. We propose Dynamic Core Boosting (DCB), a software-hardware cooperative system that mitigates the workload imbalance problem in performance asymmetric CMPs. Relying on dynamic voltage and frequency scaling to accelerate individual cores at a fine granularity, DCB attempts to balance the workloads by detecting and boosting critical threads. DCB coordinates its compiler and runtime to enable asymmetric CMPs to achieve near-optimal utilization of core boosting. The compiler instruments the program with instructions to give progress hints and the runtime monitors their execution, enabling DCB to intelligently accelerate selected threads within a total core boosting budget for better performance. On a simulated eight core system of varying frequency, our experiments using PARSEC benchmarks show that DCB improves the overall performance by an average of 33%, outperforming a reactive boosting scheme by an average of 10%.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122020274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a performance-portable FFT library for heterogeneous computing","authors":"Carlo C. del Mundo, Wu-chun Feng","doi":"10.1145/2597917.2597943","DOIUrl":"https://doi.org/10.1145/2597917.2597943","url":null,"abstract":"The fast Fourier transform (FFT), a spectral method that computes the discrete Fourier transform and its inverse, pervades many applications in digital signal processing, such as imaging, tomography, and software-defined radio. Its importance has caused the research community to expend significant resources to accelerate the FFT, of which FFTW is the most prominent example. With the emergence of the graphics processing unit (GPU) as a massively parallel computing device for high performance, we seek to identify architecture-aware optimizations across two different generations of high-end AMD and NVIDIA GPUs, namely the AMD Radeon HD 6970 and HD 7970 and the NVIDIA Tesla C2075 and K20c, respectively, to accelerate FFT performance. Despite architectural differences across GPU generations and vendors, we identify the following optimizations, when applied individually and in isolation of one another, as being the most effective in accelerating FFT performance: (1) register preloading, (2) transposition via local memory, and (3) 8- or 16-byte vector access and scalar arithmetic. We then demonstrate the efficacy of combining individual optimizations together and find that the most effective combination of optimizations across all architectures encompasses register preloading, transposition via local memory, and use of constant memory. Our study suggests that FFT performance on GPUs is primarily limited by global memory data transfer. Overall, our optimizations deliver speed-ups as high as 31.5 over a baseline GPU implementation and 9.1 over a multithreaded FFTW CPU implementation with AVX vector extensions.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129803278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A collaborative divide-and-conquer K-means clustering algorithm for processing large data","authors":"Huimin Cui, G. Ruan, Jingling Xue, Rui Xie, Lei Wang, Xiaobing Feng","doi":"10.1145/2597917.2597918","DOIUrl":"https://doi.org/10.1145/2597917.2597918","url":null,"abstract":"K-means clustering plays a vital role in data mining. As an iterative computation, its performance will suffer when applied to tremendous amounts of data, due to poor temporal locality across its iterations. The state-of-the-art streaming algorithm, which streams the data from disk into memory and operates on the partitioned streams, improves temporal locality but can misplace objects in clusters since different partitions are processed locally. This paper presents a collaborative divide-and-conquer algorithm to significantly improve the state-of-the-art, based on two key insights. First, we introduce a break-and-recluster procedure to identify the clusters with misplaced objects. Second, we introduce collaborative seeding between different partitions to accelerate the convergence inside each partition. Compared with the streaming algorithm using a number of wikipedia webpages as our datasets, our collaborative algorithm improves its clustering quality by up to 35.3% with an average of 8.8% while decreasing its execution times from 0.3% to 80.1% with an average of 48.6%.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"7 3-4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128609042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PEACH: a model for performance and energy aware cooperative hybrid computing","authors":"Rong Ge, Xizhou Feng, Martin Burtscher, Ziliang Zong","doi":"10.1145/2597917.2597948","DOIUrl":"https://doi.org/10.1145/2597917.2597948","url":null,"abstract":"Accelerator-based heterogeneous systems become increasingly important to high performance computing because of their potentials to deliver high performance and energy efficiency. To fully realize this potential, parallel software must utilize both host processors and accelerators' computing power and power-aware capabilities. We develop PEACH, a model for Performance and Energy Aware Cooperative Hybrid computing. PEACH explores judicious workload distribution between hosts and accelerators and intelligent energy-aware scheduling for further performance and energy efficiency gains on heterogenous systems. With a few system- and application-dependent parameters, PEACH accurately captures the performance and energy impact of workload distribution and energy-aware scheduling.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127241586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ScaffCC: a framework for compilation and analysis of quantum computing programs","authors":"Ali JavadiAbhari, S. Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov, F. Chong, M. Martonosi","doi":"10.1145/2597917.2597939","DOIUrl":"https://doi.org/10.1145/2597917.2597939","url":null,"abstract":"Quantum computing is a promising technology for high-performance computation, but requires mature toolflows that can map large-scale quantum programs onto targeted hardware. In this paper, we present a scalable compiler for large-scale quantum applications, and show the opportunities for reducing compilation and analysis time, as well as output code size. We discuss the similarities and differences between compiling for a quantum computer as opposed to a classical computer, and present a state-of-the-art approach for compilation of classical circuits into quantum circuits. Our work also highlights the importance of high-level quantum compilation for logical circuit translation, quantitative analysis of algorithms, and optimization of circuit lengths.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124933730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for predicting trajectories using global and local information","authors":"William Groves, Ernesto Nunes, Maria L. Gini","doi":"10.1145/2597917.2597934","DOIUrl":"https://doi.org/10.1145/2597917.2597934","url":null,"abstract":"We propose a novel framework for predicting the paths of vehicles that move on a road network. The framework leverages global and local patterns in spatio-temporal data. From a large corpus of GPS trajectories, we predict the subsequent path of an in-progress vehicle trajectory using only spatio-temporal features from the data. Our framework consists of three components: (1) a component that abstracts GPS location data into a graph at the neighborhood or street level, (2) a component that generates policies obtained from the graph data, and (3) a component that predicts the subsequent path of an in-progress trajectory. Hierarchical clustering is used to construct the city graph, where the clusters facilitate a compact representation of the trajectory data to make processing large data sets tractable and efficient. We propose four alternative policy generation algorithms: a frequency-based algorithm (FreqCount), a correlation-based algorithm (EigenStrat), a spectral clusteringbased algorithm (LapStrat), and a Markov Chain-based algorithm (MCStrat). The algorithms explore either global patterns (FreqCount and EigenStrat) or local patterns (MCStrat) in the data, with the exception of LapStrat which explores both. We present an analysis of the performance of the alternative prediction algorithms using a large real-world taxi data set.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114761684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate off-line phase classification for HW/SW co-designed processors","authors":"Aleksandar Brankovic, Kyriakos Stavrou, E. Gibert, Antonio González","doi":"10.1145/2597917.2597937","DOIUrl":"https://doi.org/10.1145/2597917.2597937","url":null,"abstract":"Evaluation techniques in microprocessor design are mostly based on simulating selected application's samples using a cycle-accurate simulator. These samples usually correspond to different phases of the application stream. To identify these phases, relevant high-level application statistics are collected and clustered using a process named \"Off-Line Phase Classification\". The purpose of phase classification is to reduce the number of samples that need to be simulated with the minimum loss in accuracy (compared to simulating the complete set of samples). Unfortunately, when directly applied to HW/SW co-designed processors the traditional phase classifications do not provide a good trade-off between accuracy and the number of samples. As an example, according to our experimental results, to achieve a 4% error (compared to simulating all the samples) one needs to simulate 2.5X more samples for the case of HW/SW co-designed processors compared to what is necessary for HW-only processors. In this paper, we propose a novel off-line phase classification scheme called TOL Description Vector (TDV), which is suitable for HW/SW co-designed processors. TDV targets at estimating the TOL particularities and on average gives significantly better accuracy than traditional phase classification for any number of selected samples. For instance, TDV reaches the average error of 3% with 3X less samples than traditional classification. These benefits apply for different TOL and microarchitecture configurations.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115490917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-conscious graph collaborative filtering on multi-socket multicore systems","authors":"Lifeng Nai, Yinglong Xia, Ching-Yung Lin, Bo Hong, H. Lee","doi":"10.1145/2597917.2597935","DOIUrl":"https://doi.org/10.1145/2597917.2597935","url":null,"abstract":"Recommendation systems using graph collaborative filtering often require responses in real time and high throughput. Therefore, besides recommendation accuracy, it is critical to study high performance concurrent collaborative filtering on modern platforms. To achieve high performance, we study the graph data locality characteristics of collaborative filtering. Our experiments demonstrate that although an individual graph traversal exhibits poor data locality, multiple queries have a tendency of sharing their data footprints, especially in the case of queries with neighboring root vertices. Such characteristics lead to both inter- and intra-thread data locality, which can be utilized to significantly improve collaborative filtering performance. Based on these observations, we present a cache-conscious system for collaborative filtering on modern multi-socket multicore platforms. In this system, we propose a cache-conscious query scheduling technique and an in-memory graph representation, and to maximize cache performance and minimize cross-core/socket communication overhead, we address both inter- and intra-thread data locality. To address the workload balancing issue, this study introduces a dynamic work-stealing mechanism to explore the tradeoff between workload balancing and cache-consciousness. The proposed system was evaluated on a Power7+ system against the IBM Knowledge Repository graph dataset. The results demonstrated both good scalability and throughput. Compared with the basic system that does not perform cache-conscious scheduling, inter-thread scheduling improves throughput by up to 18%. Intra-thread scheduling can further improve throughput by as much as 22%. By enabling dynamic work-stealing, the proposed technique balances workloads across all threads with a low standard deviation of the per-thread processing time.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131314391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of neural network through genetic algorithm searches for the prediction of international crude oil price based on energy products prices","authors":"H. Chiroma, A. Gital, Adamu I. Abubakar, M. Usman, Usman Waziri","doi":"10.1145/2597917.2597956","DOIUrl":"https://doi.org/10.1145/2597917.2597956","url":null,"abstract":"This study investigated the prediction of crude oil price based on energy product prices using genetically optimized Neural Network (GANN). It was found from experimental evidence that the international crude oil price can be predicted based on energy product prices. The comparison of the prediction performance accuracy of the propose GANN with Support Vector Machine (SVM), Vector Autoregression (VAR), and Feed Forward NN (FFNN) suggested that the propose GANN was more accurate than the SVM, VAR, and FFNN in the prediction accuracy and time computational complexity. The propose GANN was able to improve the performance accuracy of the comparison algorithms. Our approach can easily be modified for the prediction of similar commodities.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121184620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A neuro-fuzzy fan speed controller for dynamic thermal management of multi-core processors","authors":"J. M. N. Abad, Bagher Salami, Hamid Noori, A. Soleimani, Farhad Mehdipour","doi":"10.1145/2597917.2597958","DOIUrl":"https://doi.org/10.1145/2597917.2597958","url":null,"abstract":"Cooling equipments is a thermal management technique that reduces the thermal resistance of the heat sink without any performance degradation. However, higher fan speed produces a lower thermal resistance, but at the expense of higher power consumption. Our proposed Neuro-Fuzzy fan controller (NFSC), minimizes fan power consumption while avoiding the temperature increase above a certain threshold. The experimental results indicate that our proposed model can significantly decrease the average fan power with negligible temperature overhead compared to the traditional fan controller.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133735458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}