{"title":"Hybrid BFS Approach Using Semi-external Memory","authors":"Keita Iwabuchi, Hitoshi Sato, Ryo Mizote, Yuichiro Yasui, K. Fujisawa, S. Matsuoka","doi":"10.1109/IPDPSW.2014.189","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.189","url":null,"abstract":"NVM devices will greatly expand the possibility of processing extremely large-scale graphs that exceed the DRAM capacity of the nodes; however, efficient implementations based on a detailed performance analysis of the access patterns of unstructured graph kernels on systems that use a mixture of DRAM and NVM devices have not been well investigated. We introduce a graph data offloading technique using NVMs that augments the hybrid BFS (breadth-first search) algorithm widely used in the Graph500 benchmark, and we conduct a performance analysis to demonstrate the utility of NVMs for unstructured data. Experimental results for a SCALE 27 problem on a Kronecker graph compliant with the Graph500 benchmark show that our approach sustains up to 4.22 GTEPS (giga traversed edges per second), halving the DRAM footprint with only 19.18% performance degradation on a 4-way AMD Opteron 6172 machine heavily equipped with NVM devices. Although direct comparison is difficult, this is significantly greater than the 0.05 GTEPS reported by Pearce et al. for a SCALE 36 problem using 1 TB of DRAM and 12 TB of NVM. Although our approach uses a higher DRAM-to-NVM ratio, we show that a good compromise between performance and capacity is achievable for processing large-scale graphs. This result, together with a detailed performance analysis of the proposed technique, suggests that we can process extremely large-scale graphs per node with minimal performance degradation by carefully considering the data structures of a given graph and the access patterns to both DRAM and NVM devices. 
As a result, our implementation achieved 4.35 MTEPS/W (mega TEPS per watt) and ranked 4th on the November 2013 edition of the Green Graph500 list in the Big Data category, using only a single fat server heavily equipped with NVMs.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114951556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
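The hybrid BFS this paper builds on alternates between top-down and bottom-up traversal depending on the frontier size. A minimal Python sketch of that switching logic (the threshold `alpha` and the work estimates are illustrative, not the paper's tuned values, and the DRAM/NVM offloading layer is omitted entirely):

```python
def hybrid_bfs(adj, source, alpha=2.0):
    """Hybrid BFS: top-down while the frontier is small, bottom-up once the
    frontier's outgoing edges outweigh those of the unvisited vertices."""
    parent = {source: source}
    frontier = {source}
    while frontier:
        # Rough work estimates: edges out of the frontier (mf) vs. edges
        # out of still-unvisited vertices (mu).
        mf = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in adj if v not in parent]
        mu = sum(len(adj[v]) for v in unvisited)
        nxt = set()
        if mf * alpha > mu:
            # Bottom-up step: each unvisited vertex scans for a frontier parent.
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        nxt.add(v)
                        break
        else:
            # Top-down step: frontier vertices claim their unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.add(v)
        frontier = nxt
    return parent
```

The bottom-up step is what makes the data layout matter: it scans bulk adjacency data sequentially, which is the kind of access pattern that tolerates slower NVM better than random DRAM-style lookups.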
{"title":"Construction of Porous Networks Subjected to Geometric Restrictions by Using OpenMP","authors":"A. Mendez, G. Román-Alonso, F. Rojas-González, M. Castro-García, M. Cornejo, Salomón Cordero-Sánchez","doi":"10.1109/IPDPSW.2014.134","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.134","url":null,"abstract":"The study of porous materials is of great importance for a vast number of industrial applications. In order to study some specific characteristics of materials, in-silico simulations can be employed. The particular simulation of pore networks described in this work finds its basis in the Dual Site-Bond Model (DSBM). Under this approach, a porous material is thought to be made of sites (cavities, bulges) interconnected to each other through bonds (throats, capillaries); every site is connected to a number of bonds, while each bond links exactly two sites. At present, several computing algorithms have been implemented for the simulation of pore networks; nevertheless, only a few of these methods take into account the geometric restrictions that arise during the interconnection of a set of bonds to every site of the network. Introducing restrictions of this sort into the computing algorithms is likely to lead to more realistic pore networks. In this work, a sequential algorithm and its parallel computing version are proposed to construct pore networks while enforcing geometric restrictions among the hollow entities. Our parallel approach uses OpenMP to create a set of threads (computing tasks) that work simultaneously on independent, random pore network regions. 
We discuss the obtained results.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124259477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
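In the DSBM, the key geometric restriction is that a bond can never be larger than either of the two sites it joins. A hypothetical Python sketch of region-parallel construction in that spirit (the chain topology, size ranges, and threading scheme are illustrative stand-ins for the paper's OpenMP implementation, not its actual algorithm):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def build_region(n_sites, seed):
    """Build one independent network region: draw site sizes, then draw each
    bond subject to the restriction that a bond is never larger than either
    of the two sites it connects (chain topology for illustration)."""
    rng = random.Random(seed)
    sites = [rng.uniform(1.0, 10.0) for _ in range(n_sites)]
    bonds = []
    for i in range(n_sites - 1):
        cap = min(sites[i], sites[i + 1])  # geometric restriction
        bonds.append(rng.uniform(0.0, cap))
    return sites, bonds

def build_network(n_regions=4, n_sites=100):
    # Mirror the OpenMP approach: one thread per independent region, so no
    # synchronization is needed while regions are being filled.
    with ThreadPoolExecutor(max_workers=n_regions) as pool:
        return list(pool.map(lambda s: build_region(n_sites, s), range(n_regions)))
```

Giving each thread its own seeded generator keeps the regions both independent and reproducible, which is the property that lets the regions be constructed without locking.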
{"title":"Optimizing Krylov Subspace Solvers on Graphics Processing Units","authors":"H. Anzt, W. Sawyer, S. Tomov, P. Luszczek, I. Yamazaki, J. Dongarra","doi":"10.1109/IPDPSW.2014.107","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.107","url":null,"abstract":"Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating-point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well-optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper, we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPU-host communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix-vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. 
We feel that such optimizations are crucial for the subsequent development of high-level sparse linear algebra libraries.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128066917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
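The "simultaneous reduction of multiple scalar products" can be pictured as fusing several dot products into a single pass over the vectors, so one kernel launch (and one host transfer) replaces several. A toy Python sketch of the idea (the particular triple of BiCGSTAB scalar products shown is an assumption for illustration, not the paper's exact kernel):

```python
def fused_dots(r, r0, v):
    """One pass over the vectors yields three scalar products at once,
    where separate BLAS dot calls would cost three kernel launches."""
    s1 = s2 = s3 = 0.0
    for a, b, c in zip(r, r0, v):
        s1 += a * b   # (r, r0)
        s2 += a * a   # (r, r), the squared residual norm
        s3 += c * b   # (v, r0)
    return s1, s2, s3
```

Besides fewer launches, the fused pass reads each vector once instead of once per product, which is the data-locality gain the abstract refers to.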
{"title":"Towards Energy Efficient Allocation for Applications in Volunteer Cloud","authors":"Congfeng Jiang, Jian Wan, C. Cérin, Paolo Gianessi, Yanik Ngoko","doi":"10.1109/IPDPSW.2014.169","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.169","url":null,"abstract":"The topology of classical cloud infrastructures can be viewed as data centers to which user machines are connected. In these architectures, computation is centered on a subset of the available machines (the data centers). In our study, we propose an alternative view of clouds in which both user machines and data centers are used for servicing requests. We refer to these clouds as volunteer clouds. Volunteer clouds offer potential advantages in elasticity and energy savings, but we also have to manage the unavailability of volunteer nodes. In this paper, we are interested in optimizing the energy consumed by the provisioning of applications in volunteer clouds. Given a set of applications requested by the cloud's clients for a window of time, the objective is to find the least energy-consuming deployment plan. In comparison with many works on resource allocation, our distinguishing feature is the management of the unavailability of volunteer nodes. We show that our core challenge can be formalized as an NP-hard and inapproximable problem. We then propose an ILP (integer linear programming) model and various greedy heuristics to solve it. Finally, we provide an experimental analysis of our proposal using realistic data and modeling for energy consumption. This is a modeling study with simulation results rather than emulation or experiments on real systems; however, the parameters and assumptions made for our simulations are consistent with the knowledge generally accepted by researchers working on energy modeling and volunteer computing. 
Consequently, our work should be seen as a solid building block towards the implementation of allocation mechanisms in volunteer clouds.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124578659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
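As a rough illustration of the allocation problem (not the paper's ILP model or its heuristics), a greedy rule might weight a node's power draw by its availability, so unreliable volunteer nodes look proportionally more expensive:

```python
def greedy_allocate(apps, nodes):
    """Greedy sketch: assign each application (largest load first) to the
    feasible node with the lowest expected energy, where a volunteer node's
    low availability inflates its expected cost (downtime must be covered).
    `apps` maps name -> load; `nodes` maps name -> {free, power, availability}."""
    plan = {}
    for app, load in sorted(apps.items(), key=lambda kv: -kv[1]):
        best, best_cost = None, float("inf")
        for name, n in nodes.items():
            if n["free"] < load:
                continue  # capacity constraint
            cost = n["power"] * load / n["availability"]
            if cost < best_cost:
                best, best_cost = name, cost
        if best is None:
            raise RuntimeError("no feasible node for " + app)
        nodes[best]["free"] -= load
        plan[app] = best
    return plan
```

The availability divisor is the hedged part: it is one plausible way to fold unavailability into an energy objective, chosen here only to make the volunteer-vs-datacenter trade-off concrete.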
{"title":"A Game-Theoretic Approach to Multiobjective Job Scheduling in Cloud Computing Systems","authors":"Jakub Gasior, F. Seredyński","doi":"10.1109/IPDPSW.2014.60","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.60","url":null,"abstract":"This paper presents a distributed, security-driven solution to the multiobjective job scheduling problem in Cloud Computing infrastructures. The goal of this scheme is to allocate a limited quantity of resources to a specific number of jobs while minimizing their execution failure probability and job completion time. As this problem is NP-hard in the strong sense, the NSGA-II meta-heuristic is proposed to solve it. To select the best strategy from the resulting Pareto frontier, we develop decision-making mechanisms based on the game-theoretic model of the Spatial Prisoner's Dilemma, realized by independent, selfish brokering agents. Their behavior is conditioned by the objectives of the various entities involved in the scheduling process and driven towards a Nash equilibrium solution by the employed social welfare criteria. The performance of the applied scheduler is verified by a number of numerical experiments. The results show the effectiveness of the proposed solution for medium- and large-sized scheduling problems.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124579048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
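The Pareto frontier the brokering agents select from consists of the non-dominated schedules. A minimal sketch of that dominance test over the two objectives named in the abstract, failure probability and completion time (both minimized):

```python
def pareto_front(points):
    """Return the non-dominated points. q dominates p if q is no worse in
    both objectives and, being a different point, strictly better in at
    least one. Each point is a (failure_probability, completion_time) pair."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1]
                       for q in points)]
```

Everything after this filter, the agents' strategies and the social welfare criteria, operates only on the surviving points, which is why the frontier is computed first.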
{"title":"Searching for the Optimal Data Partitioning Shape for Parallel Matrix Matrix Multiplication on 3 Heterogeneous Processors","authors":"Ashley M. DeFlumere, Alexey L. Lastovetsky","doi":"10.1109/IPDPSW.2014.8","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.8","url":null,"abstract":"Parallel Matrix-Matrix Multiplication (MMM) is a fundamental part of the linear algebra libraries used by scientific applications on high performance computers. As heterogeneous systems have emerged as high performance computing platforms, the traditional homogeneous algorithms have been adapted to these heterogeneous environments. Although heterogeneous systems have been in use for some time, it remains an open problem how to partition data optimally among heterogeneous processors so as to minimize computation, communication, and execution time. While the question of how to subdivide MMM problems among heterogeneous processors has been studied, the underlying assumption of this prior work is that the data partition shape, the layout of the data within the matrix assigned to each processor, should be rectangular, i.e., that each processor should be assigned a rectangular portion of the matrix to compute. Our previous work in this area questioned the optimality of this traditional rectangular shape and studied the partition shape problem for two processors. In that work, we proposed a novel mathematical method for transforming partition shapes to decrease communication cost and an analytical technique for determining the optimal shape. In this work, we extend this technique to three or more heterogeneous processors. While applying this method to two processors is relatively straightforward, the complexity grows immensely when considering three processors. With this complexity in mind, we propose a hybrid of experimental and analytical techniques. 
We postulate that a small number of partition shapes are potentially optimal and, using a computer-aided method, perform extensive testing with our previously developed analytical technique without finding a counterexample. We identify six data partition shapes that are candidates to be the optimal three-processor shape.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127036325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
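For intuition on why non-rectangular shapes are worth searching for (a standard observation in this line of work, not a result quoted from the paper): in the usual half-perimeter communication model, a column-wise straight-line partition has the same total cost no matter how the processor speeds differ, leaving room for cleverer shapes to do better:

```python
def straight_line_cost(speeds, N=1.0):
    """Half-perimeter communication cost of a column-wise straight-line
    partition of an N x N matrix: each processor gets a full-height
    rectangle with area proportional to its speed, and its communication
    volume is proportional to width + height."""
    total = float(sum(speeds))
    widths = [N * s / total for s in speeds]
    return sum(w + N for w in widths)  # widths sum to N, so cost = N*(p+1)
```

Since the widths always sum to N, the cost is N*(p+1) for p processors regardless of the speed ratio, which is one reason the authors' search over non-rectangular candidate shapes can pay off.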
{"title":"A Novel Computational Model for GPUs with Application to I/O Optimal Sorting Algorithms","authors":"A. Koike, K. Sadakane","doi":"10.1109/IPDPSW.2014.72","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.72","url":null,"abstract":"We propose a novel computational model for GPUs. Known parallel computational models such as the PRAM model are not appropriate for evaluating GPU algorithms. Our model, called AGPU, abstracts the essence of current GPU architectures, such as global and shared memory, memory coalescing, and bank conflicts. We can therefore evaluate the asymptotic behavior of GPU algorithms more accurately than with known models, and we can develop algorithms that are efficient on many real architectures. As a showcase, we first analyze known comparison-based sorting algorithms using the AGPU model and show that they are not I/O optimal, that is, the number of global memory accesses is larger than necessary. Then we propose a new algorithm that uses an asymptotically optimal number of global memory accesses and whose time complexity is also nearly optimal.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127911853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
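A model like AGPU charges global-memory cost per coalesced transaction rather than per individual access. A toy sketch of that counting rule (the segment size `B` and the exact coalescing rule are illustrative, not AGPU's actual definitions):

```python
def global_memory_transactions(addresses, B=32):
    """Count global-memory transactions for one warp's worth of accesses
    under a simple coalescing rule: addresses falling in the same B-word
    aligned segment are served by a single transaction."""
    return len({a // B for a in addresses})
```

Under such a rule, a warp reading consecutive words costs one transaction while the same warp reading with a large stride costs one per thread, which is exactly the gap an I/O-optimal sorting algorithm tries to close.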
{"title":"Large Scale Discriminative Metric Learning","authors":"P. Kirchner, Matthias Boehm, B. Reinwald, D. Sow, J. M. Schmidt, D. Turaga, A. Biem","doi":"10.1109/IPDPSW.2014.181","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.181","url":null,"abstract":"We consider the learning of a distance metric, using the Localized Supervised Metric Learning (LSML) scheme, that discriminates entities characterized by high-dimensional feature attributes with respect to the labels assigned to each entity. LSML is a supervised learning scheme that learns a Mahalanobis distance, grouping together features with the same label and repelling features with different labels. In this paper, we propose an efficient and scalable implementation of LSML that allows us to process large data sets, both in terms of dimensions and instances. This implementation of LSML is programmed in SystemML with an R-like syntax, and compiled, optimized, and executed on Hadoop. We also propose experimental approaches for tuning LSML parameters, yielding significant analytical and empirical improvements in discriminative measures such as label prediction accuracy. 
We present experimental results on both synthetic and real-world data (feature vectors representing patients in an Intensive Care Unit, with labels corresponding to different conditions), assessing, respectively, how well the algorithm scales and how well it works on real-world prediction problems.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127755308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
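The Mahalanobis distance LSML learns is parameterized by a positive semidefinite matrix M. A minimal sketch of the distance itself (the learning of M, which is the substance of LSML, and its SystemML/Hadoop execution are not shown):

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y). Metric learning
    schemes like LSML fit the PSD matrix M so that same-label pairs come
    out close and different-label pairs come out far under this distance."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))
```

With M set to the identity this reduces to the squared Euclidean distance; a learned M effectively reweights and rotates the feature axes to match the labels.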
{"title":"Standby System Reliability through DRBD","authors":"S. Distefano","doi":"10.1109/IPDPSW.2014.149","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.149","url":null,"abstract":"The standby approach is of strategic importance in current technologies, since it reduces environmental impact while extending system lifetime, allowing a trade-off among dependability properties and costs. This is particularly interesting in the IT context, where increased awareness of energy efficiency and environmental issues has pushed towards new forms and paradigms of (\"green\") computing that specifically address such aspects. Standby policies, mechanisms, and techniques are characterized by complex phenomena that should be adequately investigated. This need translates into a strong demand for adequate tools for standby system modelling and evaluation. In this paper, the dynamic reliability block diagram (DRBD) formalism, which extends RBD to the representation of dynamic reliability aspects, is proposed for standby system evaluation. To this end, the DRBD semantics is revised to cover the specific peculiarities of standby systems. 
Then, the effectiveness of the DRBD approach in standby modelling is demonstrated through a case study of a critical-area surveillance system, in which a parametric capacity-planning analysis is performed to design the system with warm standby redundant cameras according to specific reliability requirements.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129103130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
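Warm standby, as used in the camera case study, means the spare ages at a reduced rate until it takes over. A Monte Carlo sketch of a single 1-out-of-2 warm-standby pair (exponential lifetimes and the specific rates are illustrative assumptions, not the paper's DRBD model, which is analytic rather than simulated):

```python
import random

def warm_standby_reliability(lam, lam_standby, t, trials=20000, seed=1):
    """Estimate the probability that a primary unit plus one warm spare
    survives past time t. The spare fails at the reduced rate lam_standby
    while dormant and at the full rate lam after taking over."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        t1 = rng.expovariate(lam)                  # primary's lifetime
        if t1 >= t:
            ok += 1
            continue
        if rng.expovariate(lam_standby) < t1:
            continue                               # spare died while dormant
        if t1 + rng.expovariate(lam) >= t:         # spare's active lifetime
            ok += 1
    return ok / trials
```

The dormant-failure branch is what distinguishes warm from cold standby (where the spare cannot fail while idle) and from hot standby (where it fails at the full rate throughout).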
{"title":"SkewControl: Gini Out of the Bottle","authors":"Si Zheng, Yunhuai Liu, T. He, Shanshan Li, Xiangke Liao","doi":"10.1109/IPDPSW.2014.176","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.176","url":null,"abstract":"In the age of big data, MapReduce plays an important role in extreme-scale data processing systems. Among all the hot issues, data skew weighs heavily on MapReduce system performance. Traditional approaches leave it to the users to address the issue, which requires application-dependent domain knowledge. Other approaches address the issue automatically but in an open-loop manner that lacks sufficient adaptivity to different applications. To address these issues, we conduct trace-driven empirical studies and show that the skew has strongly stable and predictable characteristics, which allows us to design a closed-loop automatic mechanism for task partitioning and scheduling, called SkewControl. We implement SkewControl on top of a Hadoop 1.0.4 production system. The experimental results show that, compared with the state-of-the-art LATE and SkewTune systems, SkewControl consistently improves the system response time by 23.8% and 17%, respectively.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130619004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
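The title's pun refers to the Gini coefficient, a natural single-number measure of partition skew (whether SkewControl uses exactly this statistic is not stated in the abstract):

```python
def gini(sizes):
    """Gini coefficient of partition sizes: 0 means perfectly even
    partitions; values approaching 1 mean one partition holds nearly
    all the data (the straggler a skew controller wants to prevent)."""
    xs = sorted(sizes)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * sum(xs)) - (n + 1.0) / n
```

A closed-loop controller in this spirit would repeatedly measure such a skew statistic on observed partition sizes and feed it back into the next round's partitioning decisions.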