Highly scalable barriers for future high-performance computing clusters
H. Fröning, Alexander Giese, Héctor Montaner, F. Silla, J. Duato
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152729
Abstract: Although large-scale high-performance computing today typically relies on message passing, shared memory can offer significant advantages, since the overhead associated with MPI is avoided entirely. To this end, we have developed an FPGA-based Shared Memory Engine that forwards memory transactions, such as loads and stores, to remote memory locations in large clusters, thus providing a single memory address space. Because coherency protocols do not scale with system size, we avoid global coherency across the cluster altogether; instead, we maintain local coherency domains that keep the cores within each node coherent. In this paper, we show the suitability of our approach by analyzing the performance of barriers, a very common synchronization primitive in parallel programs. Experiments on a real cluster prototype show that our approach allows synchronization among 1024 cores spread over 64 nodes in less than 15 µs, several times faster than other highly optimized barriers. We further demonstrate the feasibility of the approach by executing a shared-memory implementation of FFT. Finally, this barrier can also be leveraged by MPI applications running on our shared-memory cluster architecture, which makes the work useful for applications that are already written.

Improving graph coloring on distributed-memory parallel computers
Ahmet Erdem Sarıyüce, Erik Saule, Ümit V. Çatalyürek
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152726
Abstract: Graph coloring is a combinatorial optimization problem that classically appears in distributed computing to identify sets of tasks that can safely be performed in parallel. Although many efficient sequential algorithms are known for this NP-Complete problem, distributed variants remain challenging. Building on an existing distributed-memory graph coloring framework, we investigate two techniques in this paper. First, we apply two different vertex-visit orderings, Largest First and Smallest Last, in a distributed context and show that they can significantly decrease the number of colors on small- to medium-scale parallel architectures. Second, we investigate a distributed post-processing operation, called recoloring, which further reduces the number of colors substantially while increasing the runtime by no more than a factor of two on large graphs. We also investigate the use of multicore architectures for distributed graph coloring algorithms.

Increasing the energy efficiency of TLS systems using intermediate checkpointing
Salman Khan, Nikolas Ioannou, Polychronis Xekalakis, Marcelo H. Cintra
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152735
Abstract: With the advent of Chip Multiprocessors (CMPs), improving performance relies on programmers and compilers exposing thread-level parallelism to the underlying hardware. However, this is a difficult and error-prone process for programmers, while state-of-the-art compiler techniques are unable to provide significant benefits for many classes of applications. An alternative is offered by systems that support Thread Level Speculation (TLS), which relieve the programmer and compiler from checking thread dependences and instead use the hardware to enforce them. Unfortunately, TLS suffers from power inefficiency because data misspeculations cause threads to roll back to the beginning of the speculative task. For this reason, intermediate checkpointing of TLS threads has been proposed: when a violation occurs, execution rolls back only to a checkpoint preceding the violating instruction rather than to the start of the task. However, previous work omits the microarchitectural details and implementation issues that are essential for effective checkpointing. In this paper we study checkpointing on a state-of-the-art TLS system. We systematically study the costs associated with checkpointing, analyze the trade-offs, and propose changes to the TLS mechanism to allow effective checkpointing. Further, we establish the need to accurately identify points in execution that are appropriate for checkpointing and analyze various techniques for doing so in terms of both effectiveness and viability. We propose program-counter-based and hybrid predictors and show that they outperform previous proposals. Placing checkpoints based on dependence predictors yields power improvements while maintaining the performance advantage of TLS. The proposed checkpointing system achieves an energy saving of up to 14%, with an average of 7%, over normal TLS execution.

Modelling and analyzing the authorization and execution of video workflows
Ligang He, Chenlin Huang, Kenli Li, Hao Chen, Jianhua Sun, Bo Gao, Kewei Duan, S. Jarvis
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152727
Abstract: It is becoming common practice to migrate signal-based video workflows to IT-based video workflows. Video workflows have several inherent features: 1) the necessary human involvement introduces security and authorization concerns; 2) the frequent change of workflow contexts requires a flexible approach to acquiring performance data; and 3) their content-centric nature, in contrast to the business-centric nature of business workflows, requires support for scheduled activities. This paper takes these issues into account and proposes a novel mechanism for modeling video workflow executions in cluster-based resource pools under Role-Based Authorization Control (RBAC) schemes. The Color Timed Petri-Net (CTPN) formalism is applied to construct the models. Various types of authorization constraints are modeled, and scheduled activities are also supported; there is a clear interface between the workflow execution and workflow authorization modules. The constructed models are then simulated and analyzed to obtain performance data, including authorization overhead and system- and application-oriented performance. Based on the model analysis, the paper further proposes methods to improve performance in the presence of authorization policies. This work can be used to plan system capacity subject to authorization control, and to tune performance by changing the scheduling strategy and resource capacity when it is not possible to adjust the authorization policies.

Coordination mechanisms for selfish multi-organization scheduling
Johanne Cohen, Daniel Cordeiro, D. Trystram, Frédéric Wagner
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152720
Abstract: We conduct a game-theoretic analysis of the problem of scheduling jobs on computing platforms composed of several independent and selfish organizations, known as the Multi-Organization Scheduling Problem (MOSP). Each organization shares resources and jobs with the others, expecting to decrease the makespan of its own jobs. We model MOSP as a non-cooperative game in which each agent is responsible for assigning all jobs belonging to a particular organization to the available processors. The local scheduling of these jobs is defined by coordination mechanisms that first prioritize local jobs and then schedule the jobs from other organizations according to some given priority. When different priorities are given individually to the jobs, as in classical scheduling algorithms such as LPT or SPT, no pure ε-approximate equilibrium is possible for values of ε less than 2. We also prove that even deciding whether a given instance admits a pure Nash equilibrium is co-NP-hard. When these priorities are given to entire organizations, we show the existence of an algorithm that always computes a pure ρ-approximate equilibrium using any ρ-approximation list scheduling algorithm. Finally, we prove that the price of anarchy of the MOSP game using this mechanism is asymptotically bounded by 2.

Reliable and randomized data distribution strategies for large scale storage systems
Alberto Miranda, S. Effert, Yangwook Kang, E. L. Miller, A. Brinkmann, Toni Cortes
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152745
Abstract: The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded or scaled down by adding or removing storage devices. To make this approach usable, a solution is needed for locating data items in such a dynamic environment. This paper presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies, and provides a simple and efficient scheme that scales up to exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations, drastically reducing the required amount of randomness while delivering a perfect load distribution.

{"title":"Scalable clustering using multiple GPUs","authors":"K. Mohiuddin, P J Narayanan","doi":"10.1109/HiPC.2011.6152713","DOIUrl":"https://doi.org/10.1109/HiPC.2011.6152713","url":null,"abstract":"K-Means is a popular clustering algorithm with wide applications in Computer Vision, Data mining, Data Visualization, etc. Clustering is an important step for indexing and searching of documents, images, video, etc. Clustering large numbers of high-dimensional vectors is very computation intensive. In this paper, we present the design and implementation of the K-Means clustering algorithm on the modern GPU. All steps are performed entirely on the GPU efficiently in our approach. We also present a load balanced multi-node, multi-GPU implementation which can handle up to 6 million, 128-dimensional vectors. We use efficient memory layout for all steps to get high performance. The GPU accelerators are now present on high-end workstations and low-end laptops. Scalability in the number and dimensionality of the vectors, the number of clusters, as well as in the number of cores available for processing are important for usability to different users. Our implementation scales linearly or near-linearly with different problem parameters. We achieve up to 2 times increase in speed compared to the best GPU implementation for K-Means on a single GPU. We obtain a speed up of over 170 on a single Nvidia Fermi GPU compared to a standard sequential implementation. We are able to execute one iteration of K-Means in 136 seconds on off-the-shelf GPUs to cluster 6 million vectors of 128 dimensions into 4K clusters and in 2.5 seconds to cluster 125K vectors of 128 dimensions into 2K clusters.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121072433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Porting irregular reductions on heterogeneous CPU-GPU configurations","authors":"Xin Huo, Vignesh T. Ravi, G. Agrawal","doi":"10.1109/HiPC.2011.6152715","DOIUrl":"https://doi.org/10.1109/HiPC.2011.6152715","url":null,"abstract":"Heterogeneous architectures are playing a significant role in High Performance Computing (HPC) today, with the popularity of accelerators like the GPUs, and the new trend towards the integration of CPUs and GPUs. Developing applications that can effectively use these architectures is a major challenge. In this paper, we focus on one of the dwarfs in the Berkeley view on parallel computing, which are the irregular applications arising from unstructured grids. We consider the problem of executing these reductions on heterogeneous architectures comprising a multi-core CPU and a GPU. We have developed a Multi-level Partitioning Framework, which has the following features: 1) it supports GPU execution of irregular reductions even when the dataset size exceeds the size of the device memory, 2) it can enable pipelining of partitioning performed on the CPU, and the computations on the GPU, and 3) it supports dynamic distribution of work between the multi-core CPU and the GPU. Our extensive evaluation using two different irregular applications demonstrates the effectiveness of our approach.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121118223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted locality-sensitive scheduling for mitigating noise on multi-core clusters","authors":"Vivek Kale, A. Bhatele, W. Gropp","doi":"10.1109/HiPC.2011.6152722","DOIUrl":"https://doi.org/10.1109/HiPC.2011.6152722","url":null,"abstract":"Recent studies have shown that operating system (OS) interference, popularly called OS noise can be a significant problem as we scale to a large number of processors. One solution for mitigating noise is to turn off certain OS services on the machine. However, this is typically infeasible because full-scale OS services may be required for some applications. Furthermore, it is not a choice that an end user can make. Thus, we need an application-level solution. Building upon previous work that demonstrated the utility of within-node light-weight load balancing, we discuss the technique of weighted micro-scheduling and provide insights based on experimentation for two different machines with very different noise signatures. Through careful enumeration of the search space of scheduler parameters, we allow our weighted micro-scheduler to be dynamic, adaptive and tunable for a specific application running on a specific architecture. By doing this, we show how we can enable running scientific applications efficiently on a very large number of processors, even in the presence of noise.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124313229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}