{"title":"Evaluation and enhancement of weather application performance on Blue Gene/Q","authors":"G. S. Gill, Vaibhav Saxena, R. Mittal, Thomas George, Yogish Sabharwal, L. Dagar","doi":"10.1109/HiPC.2013.6799138","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799138","url":null,"abstract":"Numerical weather prediction (NWP) models use mathematical models of the atmosphere to predict the weather. Ongoing efforts in the weather and climate community continuously try to improve the fidelity of weather models by employing higher order numerical methods suitable for solving model equations at high resolutions. In realistic weather forecasting scenario, simulating and tracking multiple regions of interest (nests) at fine resolutions is important in understanding the interplay between multiple weather phenomena and for comprehensive predictions. These multiple regions of interest in a simulation can be significantly different in resolution and other modeling parameters. Currently, the weather simulations involving these nested regions process them one after the other in a sequential fashion. There exists a lot of prior work in performance evaluation and optimization of weather models, however most of this work is either limited to simulations involving a single domain or multiple nests with same resolution and model parameters such as model physics options. In this paper, we evaluate and enhance the performance of popular WRF model on IBM Blue Gene/Q system. We consider nested simulations with multiple child domains and study how parameters such as physics options and simulation time steps for child domains affect the computational requirements. We also analyze how such configurations can benefit from parallel execution of the children domains rather than processing them sequentially. We demonstrate that it is important to allocate processors to nested child domains in proportion to the work load associated with them when executing them in parallel. This ensures that the time spent in the different nested simulations is nearly equal, and the nested domains reach the synchronization step with the parent simulation together. Our experimental evaluation using a simple heuristic for allocation of nodes shows that the performance of WRF simulations can be improved by up to 14% by parallel execution of sibling domains with different configuration of domain sizes, temporal resolutions and physics options.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122281216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Web-scale entity annotation using MapReduce","authors":"Shashank Gupta, Varun Chandramouli, Soumen Chakrabarti","doi":"10.1109/HiPC.2013.6799137","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799137","url":null,"abstract":"Cloud computing frameworks such as map-reduce (MR) are widely used in the context of log mining, inverted indexing, and scientific data analysis. Here we address the new and important task of annotating token spans in billions of Web pages that mention named entities from a large entity catalog such as Wikipedia or Freebase. The key step in annotation is disambiguation: given the token Albert, use its mention context to determine which Albert is being mentioned. Disambiguation requires holding in RAM a machine-learnt statistical model for each mention phrase. In earlier work with only two million entities, we could fit all models in RAM, and stream rapidly through the corpus from disk. However, as the catalog grows to hundreds of millions of entities, this simple solution is no longer feasible. Simple adaptations like caching and evicting models online, or making multiple passes over the corpus while holding a fraction of models in RAM, showed unacceptable performance. Then we attempted to write a standard Hadoop MR application, but this hit a serious load skew problem (82.12% idle CPU). Skew in MR application seems widespread. Many skew mitigation approaches have been proposed recently. We tried SkewTune, which showed only modest improvement. We realized that reduce key splitting was essential, and designed simple but effective application-specific load estimation and key-splitting methods. A precise performance model was first created, which led to an objective function that we optimized heuristically. The resulting schedule was executed on Hadoop MR. This approach led to large benefits: our final annotator was 5.4× faster than standard Hadoop MR, and 5.2× faster than even SkewTune. Idle time was reduced to 3%. Although fine-tuned to our application, our technique may be of independent interest.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126024366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A dynamic schema to increase performance in many-core architectures through percolation operations","authors":"E. Garcia, Daniel A. Orozco, R. Khan, Ioannis E. Venetis, Kelly Livingston, G. Gao","doi":"10.1109/HiPC.2013.6799134","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799134","url":null,"abstract":"Optimization of parallel applications under new many-core architectures is challenging even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures just return incremental gains in performance and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantages of static techniques come from the new scenarios present in many-cores: Plenty of thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important in the performance of applications. In particular, we found that scheduling of data movement operations significantly impact performance. To overcome those difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations in order to schedule data movement properly. In addition, we have fused percolation operations with dynamic scheduling into a dynamic percolation approach. We used Dense Matrix Multiplication on a modern manycore to illustrate how our proposed techniques are able to increase the performance under these new environments. In our study on the IBM Cyclops-64, we raised the performance from 44 GFLOPS (out of 80 GFLOPS possible) to 70.0 GFLOPS (operands in on-chip memory) and 65.6 GFLOPS (operands in off-chip memory). The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory respectively.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115363469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iFlatLFS: Performance optimization for accessing massive small files","authors":"Songling Fu, Chenlin Huang, Ligang He, Nadeem Chaudhary, Xiangke Liao, Shazhou Yang, Xiaochuan Wang, Bao Li","doi":"10.1109/HiPC.2013.6799116","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799116","url":null,"abstract":"The processing of massive small files is a challenge in the design of distributed file systems. Currently, the combined-block-storage approach is prevalent. However, the approach employs traditional file systems like ExtFS and may cause inefficiency for random access to small files. This paper focuses on optimizing the performance of data servers in accessing massive small files. We present a Flat Lightweight File System (iFlatLFS) to manage small files, which is based on a simple metadata scheme and a flat storage architecture. iFlatLFS aims to substitute the traditional file system on data servers that are mainly used to store small files, and it can greatly simplify the original data access procedure. The new metadata proposed in this paper occupies only a fraction of the original metadata size based on traditional file systems. We have implemented iFlatLFS in CentOS 5.5 and integrated it into an open source Distributed File System (DFS), called Taobao FileSystem (TFS), which is developed by a top B2C service provider, Alibaba, in China and is managing over 28.6 billion small photos. We have conducted extensive experiments to verify the performance of iFlatLFS. The results show that when the file size ranges from 1KB to 64KB, iFlatLFS is faster than Ext4 by 48% and 54% on average for random read and write in the DFS environment, respectively. Moreover, after iFlatLFS is integrated into TFS, iFlatLFS-based TFS is faster than the existing Ext4-based TFS by 45% and 49% on average for random read access and hybrid access (the mix of read and write accesses), respectively.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114901257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"X10-based distributed and parallel betweenness centrality and its application to social analytics","authors":"Charuwat Houngkaew, T. Suzumura","doi":"10.1109/HiPC.2013.6799143","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799143","url":null,"abstract":"Betweenness centrality is a measure that determines the relative importance of a vertex (or an edge) within a graph based on shortest paths. Recently, large-scale graphs have emerged in many different domains, as social networks, road networks, protein interaction networks, etc., and they are too large to fit into the memory of a single SMP. The algorithm proposed by Edmonds et al. [1] is capable of running on distributed memory systems. However, the algorithm does not expose intra-node parallelism. In this paper we investigated the inter- and intra-node parallelism of computing betweenness centrality on distributed memory systems. We developed the implementation based on the algorithm proposed by Edmonds et al. using X10 programming language [2]. We further improved the performance of the implementation by optimizing the network transport of the X10 runtime. We thoroughly evaluated the performance of our implementation on synthetic graphs of various scales against the existing implementation of Edmonds' algorithm from PBGL. We estimated the betweenness centrality of the huge Twitter networks [3] and found that its distribution follows a power law.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HARP: Adaptive abort recurrence prediction for Hardware Transactional Memory","authors":"Adrià Armejach, A. Negi, A. Cristal, O. Unsal, P. Stenström, T. Harris","doi":"10.1109/HiPC.2013.6799100","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799100","url":null,"abstract":"Hardware Transactional Memory (HTM) exposes parallelism by allowing possibly conflicting sections of code, called transactions, to execute concurrently in multithreaded applications. However, conflicts among concurrent transactions result in wasted computation and expensive rollbacks. Under high contention HTM protocol overheads can, in many cases, amount to several times the useful work done. Blindly scheduling transactions in the presence of contention is therefore clearly suboptimal from a resource utilization standpoint, especially in situations where several scheduling options exist. This paper presents HARP (Hardware Abort Recurrence Predictor), a hardware-only mechanism to avoid speculation when it is likely to fail. Inspired by branch prediction strategies and prior work on contention management and scheduling in HTM, HARP uses past behavior of transactions and locality in conflicting memory references to accurately predict conflicts. The prediction mechanism adapts to variations in workload characteristics and enables better utilization of computational resources. We show that an HTM protocol that integrates HARP exhibits reductions in both wasted execution time and serialization overheads when compared to prior work, leading to a significant increase in throughput (~30%) in both single-application and multi-application scenarios.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"38 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132575255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-based cross-iteration coherence for speculative parallelization","authors":"Andre Baixo, João Paulo Porto, G. Araújo","doi":"10.1109/HiPC.2013.6799113","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799113","url":null,"abstract":"Maximal utilization of cores in multicore architectures is key to realize the potential performance available from higher density devices. In order to achieve scalable performance, parallelization techniques rely on carefully tunning speculative architecture support, run-time environment and software-based transformations. Hardware and software mechanisms have already been proposed to address this problem. They either require deep (and risky) changes on the existing hardware and cache coherence protocols, or exhibit poor performance scalability for a range of applications. The addition of cache tags as an enabler for data versioning, recently announced by the industry (i.e. IBM BlueGene/Q), could allow a better exploitation of parallelism at the microarchitecture level. In this paper, we present an execution model that supports both DOPIPE-based speculation and traditional speculative parallelization techniques. It is based on a simple cache tagging approach for data versioning, which integrates smoothly with typical cache coherence protocols, not requiring any changes to them. Experimental results, using SPEC and PARSEC benchmarks, reveal substantial speedups in a 24-core simulated CMP, while demonstrate improved scalability when compared to a software-only approach.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"17 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132120415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-enabled efficient executions of radiation calculations in climate modeling","authors":"S. Korwar, Sathish S. Vadhiyar, R. Nanjundiah","doi":"10.1109/HiPC.2013.6799141","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799141","url":null,"abstract":"In this paper, we discuss the acceleration of a climate model known as the Community Earth System Model (CESM). The use of Graphics Processor Units (GPUs) to accelerate scientific applications that are computationally intensive is well known. This work attempts to extract the performance of GPUs to enable faster execution of CESM and obtain better model throughput. We focus on two major routines that consume the largest amount of time namely, radabs and radcswmx, which compute parameters related to the long wave (infra-red) and short wave (visible and ultra-violet) radiations respectively. We propose a novel asynchronous execution strategy in which the results computed by the GPU for the current time step are used by the CPU in the subsequent time step. Such a technique effectively hides computational effort on the GPU. By exploiting the parallelism offered by the GPU and using asynchronous executions on the CPU and GPU, we obtain a speed-up of about 26× for the routine radabs and about 5.6× for routine radcswmx.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131013521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel distributed breadth first search on GPU","authors":"Koji Ueno, T. Suzumura","doi":"10.1109/HiPC.2013.6799136","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799136","url":null,"abstract":"In this paper we propose a highly optimized parallel and distributed BFS on GPU for Graph500 benchmark. We evaluate the performance of our implementation using TSUBAME2.0 supercomputer. We achieve 317 GTEPS (billion traversed edges per second) with scale 35 (a large graph with 34.4 billion vertices and 550 billion edges) using 1366 nodes and 4096 GPUs. With this score, TSUBAME2.0 supercomputer is ranked fourth in the ranking list announced in June 2012. We analyze the performance of our implementation and the result shows that inter-node communication limits the performance of our GPU implementation. We also propose SIMD Variable-Length Quantity (VLQ) encoding for compression of communication data with GPU.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117296353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solving tridiagonal systems on a GPU","authors":"B. J. Murphy","doi":"10.1109/HiPC.2013.6799117","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799117","url":null,"abstract":"We implement a parallel tridiagonal solver based on cyclic reduction (CR) for a graphics processing unit (GPU). The bane of such solvers is a low computation to communication ratio. With this our main consideration we focus our effort on lowering communication costs. In so doing we accelerate system solving. Further, in the diagonally dominant case computation is decoupled into independent partitions allowing for efficient processing of larger systems.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"3 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124833212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}