{"title":"Expressing Parallelism on Many-Core for Deterministic Discrete Ordinates Transport","authors":"Tom Deakin, Simon McIntosh-Smith, W. Gaudin","doi":"10.1109/CLUSTER.2015.127","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.127","url":null,"abstract":"In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will demonstrate our approach utilizing the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which demonstrates a speedup of up to 2.5x the transport sweep performance on a many-core GPGPU device compared to a state-of-the-art multi-core node, the first time this scale of speedup has been achieved for algorithms of this class.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130058191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Cost of Synchronizing Imbalanced Processes in Message Passing Systems","authors":"I. Peng, S. Markidis, E. Laure","doi":"10.1109/CLUSTER.2015.63","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.63","url":null,"abstract":"Synchronization in message passing systems is achieved by communication among processes. System and architectural noise and different workloads cause processes to be imbalanced and to reach synchronization points at different times. Thus, both communication and imbalance impact synchronization performance. In this paper, we study the algorithmic properties that allow the communication in synchronization to absorb the initial imbalance among processes. We quantify the imbalance absorption properties of different barrier algorithms using a LogP Monte Carlo simulator. We found that linear and f-way tournament barriers can absorb up to 95% of random exponential imbalance with a standard deviation equal to the communication time for one message. Dissemination, butterfly and pairwise exchange barriers, on the other hand, do not absorb imbalance but can effectively bound the post-barrier imbalance. We identify that synchronization transitions from communication-dominated to imbalance-dominated when the standard deviation of the imbalance distribution is more than twice the communication time for one message. In our study, f-way tournament barriers provided the best imbalance absorption rate and convenient communication time.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123525958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed-Memory Algorithms for Maximal Cardinality Matching Using Matrix Algebra","authors":"A. Azad, A. Buluç","doi":"10.1109/CLUSTER.2015.62","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.62","url":null,"abstract":"We design and implement distributed-memory parallel algorithms for computing maximal cardinality matching in a bipartite graph. Relying on matrix algebra building blocks, our algorithms expose a higher degree of parallelism on distributed-memory platforms than existing graph-based algorithms. In contrast to existing parallel algorithms, empirical approximation ratios of the new algorithms are insensitive to concurrency and stay relatively constant with increasing processor counts. On real instances, our algorithms achieve up to 300x speedup on 1024 cores of a Cray XC30 supercomputer. Even higher speedups are obtained on larger synthetically generated graphs where our algorithms show good scaling on up to 16,384 processors.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"65 38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125171554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensuring Data Durability with Increasingly Interdependent Content","authors":"Veronica Estrada Galinanes, P. Felber","doi":"10.1109/CLUSTER.2015.33","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.33","url":null,"abstract":"Data entanglement is a novel approach to generate and propagate redundancy across multiple disk nodes in a fault-tolerant data store. In this paper, we analyse and evaluate helical entanglement codes (HEC), an XOR-based erasure coding algorithm that constructs long sequences of entangled data using incoming data and stored parities. The robust topology guarantees low complexity and greater resilience to failures than previous codes in the literature; however, the code pattern requires a minimum fixed amount of storage overhead. A unique characteristic of HEC is that fault tolerance depends on the number of distinct helical strands (p), a parameter that can be changed on the fly and does not add significantly more storage. A p-HEC setting can tolerate arbitrary 5+p failures. Decoding has a low reconstruction cost and good locality. In addition, a deep repair mechanism exploits the available global parities. We perform experiments to compare the repairability of HEC with other codes and present analytical results of its reliability.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130260822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Thread-Safety Violations in Hybrid OpenMP/MPI Programs","authors":"Hongyi Ma, Liqiang Wang, K. Krishnamoorthy","doi":"10.1109/CLUSTER.2015.70","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.70","url":null,"abstract":"We propose an approach that integrates static and dynamic program analyses to detect thread-safety violations in hybrid MPI/OpenMP programs. Our key idea is to transform thread-safety violation problems into race-condition problems. In our approach, the static analysis identifies a list of MPI calls related to thread-safety violations, then replaces them with our own MPI wrappers, which involve accesses to specific shared variables. The static analysis avoids instrumenting unrelated code, which significantly reduces runtime overhead. In the dynamic analysis, both happens-before and lockset-based race detection algorithms are used to detect races on these shared variables. By detecting races, we can identify thread-safety violations according to their specifications. Our experimental evaluation over real-world applications shows that our approach is both accurate and efficient.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130874746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Data Intensive Physics Applications to 10k Cores on Non-dedicated Clusters with Lobster","authors":"A. Woodard, M. Wolf, C. Müller, N. Valls, Benjamín Tovar, P. Donnelly, Peter Ivie, K. H. Anampa, P. Brenner, D. Thain, K. Lannon, M. Hildreth","doi":"10.1109/CLUSTER.2015.53","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.53","url":null,"abstract":"The high energy physics (HEP) community relies upon a global network of computing and data centers to analyze data produced by multiple experiments at the Large Hadron Collider (LHC). However, this global network does not satisfy all research needs. Ambitious researchers often wish to harness computing resources that are not integrated into the global network, including private clusters, commercial clouds, and other production grids. To enable these use cases, we have constructed Lobster, a system for deploying data intensive high throughput applications on non-dedicated clusters. This requires solving multiple problems related to non-dedicated resources, including work decomposition, software delivery, concurrency management, data access, data merging, and performance troubleshooting. With these techniques, we demonstrate Lobster running effectively on 10k cores, producing throughput at a level comparable with some of the largest dedicated clusters in the LHC infrastructure.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"228 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122500556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can Cloud Service Get His Family? A Step Towards Service Family Detecting","authors":"Xinkui Zhao, Jianwei Yin, Chen Zhi, Pengxiang Lin, Zuoning Chen","doi":"10.1109/CLUSTER.2015.80","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.80","url":null,"abstract":"In a cloud computing environment, an application is typically composed of several service components. We call such a collection of service components a service family, and refer to the individual cloud service components as service family members. In this paper, we propose a solution named Icebreaker to assemble service components belonging to the same application without sniffing tenants' private data. Icebreaker characterizes each service component with basic resource-consumption information and proposes a new distance-calculation algorithm named iEntropy to distinguish service components. We adaptively adopt the Affinity Propagation (AP) clustering algorithm and the maximum Silhouette index to identify the number of service families and assemble the service family members. Experiments are conducted on RUBiS, Hadoop and ApacheBench clusters with 169 VMs. Evaluation results show that Icebreaker achieves 96.45% accuracy.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122720451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RDMA-Based Direct Transfer of File Data to Remote Page Cache","authors":"Shin Sasaki, Kazushi Takahashi, Y. Oyama, O. Tatebe","doi":"10.1109/CLUSTER.2015.40","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.40","url":null,"abstract":"The performance of a distributed file system significantly affects data-intensive applications that frequently execute I/O operations on large amounts of data. Although many modern distributed file systems are geared to provide highly efficient I/O performance, their operations are nonetheless affected by runtime overhead in data transfer between client nodes and I/O servers. A large part of the overhead is caused by memory copies executed by the client interface using the FUSE framework or a special kernel module. In this paper, we propose a method based on InfiniBand RDMA that improves data transfer performance between client and server in a distributed file system. The major characteristic of the method is that it transfers file data directly from a server's memory to the page cache of a client node. The method minimizes memory copies that are otherwise executed in the client interface or the operating system kernel. We implemented the proposed method in the Gfarm distributed file system and tested it using I/O benchmark software and real applications. The experimental results showed that our method effected a performance improvement of up to 78.4% and 256.0% in sequential and random file reads, respectively, and an improvement of up to 6.3% in data-intensive applications.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123909547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VEF Traces: A Framework for Modelling MPI Traffic in Interconnection Network Simulators","authors":"Francisco J. Andújar, Juan A. Villar, J. L. Sánchez, F. J. Alfaro, J. Escudero-Sahuquillo","doi":"10.1109/CLUSTER.2015.141","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.141","url":null,"abstract":"Simulation is often used to evaluate the behaviour and measure the performance of computing systems. Specifically, in high-performance interconnection networks, simulation has been extensively used to verify the behaviour of the network itself and to evaluate its performance. In this context, network simulation must be fed with network traffic, also referred to as network workload, whose nature has traditionally been synthetic. These workloads can be used to drive studies on network performance, but often they are not accurate enough if a realistic evaluation is pursued. For this reason, other non-synthetic workloads have gained popularity over the last decades since they better capture the realistic behaviour of existing applications. In this paper, we present the VEF traces framework, a self-related trace model, and all its associated tools. The main novelty of this framework is that, unlike existing ones, it does not provide a network simulation framework but only an MPI task simulation framework, which allows the MPI-based network traffic to be used by any third-party network simulator, since the framework does not depend on any specific simulation platform.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125067840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Evaluation of Unstructured Mesh Physics on Advanced Architectures","authors":"C. Ferenbaugh","doi":"10.1109/CLUSTER.2015.126","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.126","url":null,"abstract":"Unstructured mesh physics codes tend to exhibit different performance characteristics than other types of codes such as structured mesh or particle codes, due to their heavy use of indirection arrays and their irregular memory access patterns. For this reason unstructured mesh mini-apps are needed, alongside other types of mini-apps, to evaluate new architectures and hardware features. This paper uses one such mini-app, PENNANT, to investigate performance trends on architectures such as the Intel Xeon Phi, IBM BlueGene/Q, and NVIDIA K40 GPU. We present basic results comparing the performance of these platforms to each other and to traditional multicore CPUs. We also study the usefulness for unstructured codes of various hardware features such as hardware threading, advanced vector instructions, and fast atomic operations.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123668758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}