Highly scalable barriers for future high-performance computing clusters
H. Fröning, Alexander Giese, Héctor Montaner, F. Silla, J. Duato
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152729
Abstract: Although large-scale high-performance computing today typically relies on message passing, shared memory can offer significant advantages, since the overhead associated with MPI is avoided entirely. To this end, we have developed an FPGA-based Shared Memory Engine that forwards memory transactions, such as loads and stores, to remote memory locations in large clusters, thus providing a single memory address space. Because coherency protocols do not scale with system size, we avoid global coherency across the cluster altogether; instead, we maintain local coherency domains that keep the cores within each node coherent. In this paper, we show the suitability of our approach by analyzing the performance of barriers, a very common synchronization primitive in parallel programs. Experiments on a real cluster prototype show that our approach allows synchronization among 1024 cores spread over 64 nodes in less than 15 µs, several times faster than other highly optimized barriers. We further demonstrate the feasibility of the approach by executing a shared-memory implementation of FFT. Finally, this barrier can also be leveraged by MPI applications running on our shared-memory cluster architecture, which makes the work useful for applications that are already written.

Improving graph coloring on distributed-memory parallel computers
Ahmet Erdem Sarıyüce, Erik Saule, Ümit V. Çatalyürek
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152726
Abstract: Graph coloring is a combinatorial optimization problem that classically appears in distributed computing to identify sets of tasks that can safely be performed in parallel. Although many efficient sequential algorithms are known for this NP-Complete problem, distributed variants remain challenging. Building on an existing distributed-memory graph coloring framework, we investigate two techniques in this paper. First, we apply two different vertex-visit orderings, Largest First and Smallest Last, in a distributed context and show that they can significantly decrease the number of colors on small- to medium-scale parallel architectures. Second, we investigate a distributed post-processing operation, called recoloring, which further reduces the number of colors substantially while increasing the runtime by no more than a factor of two on large graphs. We also investigate the use of multicore architectures for distributed graph coloring algorithms.

Increasing the energy efficiency of TLS systems using intermediate checkpointing
Salman Khan, Nikolas Ioannou, Polychronis Xekalakis, Marcelo H. Cintra
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152735
Abstract: With the advent of Chip Multiprocessors (CMPs), improving performance relies on programmers and compilers exposing thread-level parallelism to the underlying hardware. However, this is a difficult and error-prone process for programmers, while state-of-the-art compiler techniques are unable to provide significant benefits for many classes of applications. An alternative is offered by systems that support Thread Level Speculation (TLS), which relieve the programmer and compiler from checking thread dependences and instead use the hardware to enforce them. Unfortunately, TLS suffers from power inefficiency because data misspeculations cause threads to roll back to the beginning of the speculative task. For this reason, intermediate checkpointing of TLS threads has been proposed: when a violation occurs, execution rolls back only to a checkpoint preceding the violating instruction rather than to the start of the task. However, previous work omits the microarchitectural details and implementation issues that are essential for effective checkpointing. In this paper we study checkpointing on a state-of-the-art TLS system. We systematically study the costs associated with checkpointing, analyze the trade-offs, and propose changes to the TLS mechanism to allow effective checkpointing. Further, we establish the need to accurately identify points in execution that are appropriate for checkpointing and analyze various techniques for doing so in terms of both effectiveness and viability. We propose program-counter-based and hybrid predictors and show that they outperform previous proposals. Placing checkpoints based on dependence predictors yields power improvements while maintaining the performance advantage of TLS. The proposed checkpointing system achieves an energy saving of up to 14%, with an average of 7%, over normal TLS execution.

Modelling and analyzing the authorization and execution of video workflows
Ligang He, Chenlin Huang, Kenli Li, Hao Chen, Jianhua Sun, Bo Gao, Kewei Duan, S. Jarvis
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152727
Abstract: It is becoming common practice to migrate signal-based video workflows to IT-based video workflows. Video workflows have several inherent features: 1) the necessary human involvement introduces security and authorization concerns; 2) the frequent change of workflow contexts requires a flexible approach to acquiring performance data; and 3) their content-centric nature, in contrast to the business-centric nature of business workflows, requires support for scheduled activities. This paper takes these issues into account and proposes a novel mechanism for modeling video workflow executions in cluster-based resource pools under Role-Based Authorization Control (RBAC) schemes. The Color Timed Petri-Net (CTPN) formalism is applied to construct the models. Various types of authorization constraints are modeled, and scheduled activities are also supported; there is a clear interface between the workflow execution and workflow authorization modules. The constructed models are then simulated and analyzed to obtain performance data, including authorization overhead and system- and application-oriented performance. Based on the model analysis, the paper further proposes methods to improve performance in the presence of authorization policies. This work can be used to plan system capacity subject to authorization control, and to tune performance by changing the scheduling strategy and resource capacity when it is not possible to adjust the authorization policies.

Coordination mechanisms for selfish multi-organization scheduling
Johanne Cohen, Daniel Cordeiro, D. Trystram, Frédéric Wagner
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152720
Abstract: We conduct a game-theoretic analysis of the problem of scheduling jobs on computing platforms composed of several independent and selfish organizations, known as the Multi-Organization Scheduling Problem (MOSP). Each organization shares resources and jobs with the others, expecting to decrease the makespan of its own jobs. We model MOSP as a non-cooperative game in which each agent is responsible for assigning all jobs belonging to a particular organization to the available processors. The local scheduling of these jobs is defined by coordination mechanisms that first prioritize local jobs and then schedule the jobs from other organizations according to some given priority. When different priorities are given individually to the jobs, as in classical scheduling algorithms such as LPT or SPT, no pure ε-approximate equilibrium is possible for values of ε less than 2. We also prove that even deciding whether a given instance admits a pure Nash equilibrium is co-NP-hard. When these priorities are given to entire organizations, we show the existence of an algorithm that always computes a pure ρ-approximate equilibrium using any ρ-approximation list scheduling algorithm. Finally, we prove that the price of anarchy of the MOSP game using this mechanism is asymptotically bounded by 2.

Reliable and randomized data distribution strategies for large scale storage systems
Alberto Miranda, S. Effert, Yangwook Kang, E. L. Miller, A. Brinkmann, Toni Cortes
2011 18th International Conference on High Performance Computing (HiPC 2011). DOI: 10.1109/HiPC.2011.6152745
Abstract: The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded or scaled down by adding or removing storage devices. To make this approach usable, a solution is needed for locating data items in such a dynamic environment. This paper presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies, and provides a simple and efficient scheme that scales up to exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations, drastically reducing the required amount of randomness while delivering a perfect load distribution.

{"title":"Scalable clustering using multiple GPUs","authors":"K. Mohiuddin, P J Narayanan","doi":"10.1109/HiPC.2011.6152713","DOIUrl":"https://doi.org/10.1109/HiPC.2011.6152713","url":null,"abstract":"K-Means is a popular clustering algorithm with wide applications in Computer Vision, Data mining, Data Visualization, etc. Clustering is an important step for indexing and searching of documents, images, video, etc. Clustering large numbers of high-dimensional vectors is very computation intensive. In this paper, we present the design and implementation of the K-Means clustering algorithm on the modern GPU. All steps are performed entirely on the GPU efficiently in our approach. We also present a load balanced multi-node, multi-GPU implementation which can handle up to 6 million, 128-dimensional vectors. We use efficient memory layout for all steps to get high performance. The GPU accelerators are now present on high-end workstations and low-end laptops. Scalability in the number and dimensionality of the vectors, the number of clusters, as well as in the number of cores available for processing are important for usability to different users. Our implementation scales linearly or near-linearly with different problem parameters. We achieve up to 2 times increase in speed compared to the best GPU implementation for K-Means on a single GPU. We obtain a speed up of over 170 on a single Nvidia Fermi GPU compared to a standard sequential implementation. We are able to execute one iteration of K-Means in 136 seconds on off-the-shelf GPUs to cluster 6 million vectors of 128 dimensions into 4K clusters and in 2.5 seconds to cluster 125K vectors of 128 dimensions into 2K clusters.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121072433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Porting irregular reductions on heterogeneous CPU-GPU configurations","authors":"Xin Huo, Vignesh T. Ravi, G. Agrawal","doi":"10.1109/HiPC.2011.6152715","DOIUrl":"https://doi.org/10.1109/HiPC.2011.6152715","url":null,"abstract":"Heterogeneous architectures are playing a significant role in High Performance Computing (HPC) today, with the popularity of accelerators like the GPUs, and the new trend towards the integration of CPUs and GPUs. Developing applications that can effectively use these architectures is a major challenge. In this paper, we focus on one of the dwarfs in the Berkeley view on parallel computing, which are the irregular applications arising from unstructured grids. We consider the problem of executing these reductions on heterogeneous architectures comprising a multi-core CPU and a GPU. We have developed a Multi-level Partitioning Framework, which has the following features: 1) it supports GPU execution of irregular reductions even when the dataset size exceeds the size of the device memory, 2) it can enable pipelining of partitioning performed on the CPU, and the computations on the GPU, and 3) it supports dynamic distribution of work between the multi-core CPU and the GPU. Our extensive evaluation using two different irregular applications demonstrates the effectiveness of our approach.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121118223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted locality-sensitive scheduling for mitigating noise on multi-core clusters","authors":"Vivek Kale, A. Bhatele, W. Gropp","doi":"10.1109/HiPC.2011.6152722","DOIUrl":"https://doi.org/10.1109/HiPC.2011.6152722","url":null,"abstract":"Recent studies have shown that operating system (OS) interference, popularly called OS noise can be a significant problem as we scale to a large number of processors. One solution for mitigating noise is to turn off certain OS services on the machine. However, this is typically infeasible because full-scale OS services may be required for some applications. Furthermore, it is not a choice that an end user can make. Thus, we need an application-level solution. Building upon previous work that demonstrated the utility of within-node light-weight load balancing, we discuss the technique of weighted micro-scheduling and provide insights based on experimentation for two different machines with very different noise signatures. Through careful enumeration of the search space of scheduler parameters, we allow our weighted micro-scheduler to be dynamic, adaptive and tunable for a specific application running on a specific architecture. By doing this, we show how we can enable running scientific applications efficiently on a very large number of processors, even in the presence of noise.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124313229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}