Hao Wang, Jing Zhang, Da Zhang, S. Pumma, Wu-chun Feng
{"title":"PaPar: A Parallel Data Partitioning Framework for Big Data Applications","authors":"Hao Wang, Jing Zhang, Da Zhang, S. Pumma, Wu-chun Feng","doi":"10.1109/IPDPS.2017.119","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.119","url":null,"abstract":"Today, big data applications can generate large-scale data sets at an unprecedented rate; and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle the computational skew, it is difficult to efficiently implement skew-resistant mechanisms, because the runtime of different partitions not only depends on input data size but also algorithms that will be applied on data. As a result, many research efforts have been undertaken to explore user-defined partitioning methods for different types of applications and algorithms. However, manually writing application-specific partitioning methods requires significant coding effort, and finding the optimal data partitioning strategy is particularly challenging even for developers that have mastered sufficient application knowledge. In this paper, we propose PaPar, a Parallel data Partitioning framework for big data applications, to simplify the implementations of data partitioning algorithms. PaPar provides a set of computational operators and distribution strategies for programmers to describe desired data partitioning methods. Taking an input data configuration file and a workflow configuration file as the input, PaPar can automatically generate the parallel partitioning codes by formalizing the user-defined workflow as a sequence of key-value operations and matrix-vector multiplications, and efficiently mapping to the parallel implementations with MPI and MapReduce. 
We apply our approach on two applications: muBLAST, an MPI implementation of BLAST algorithms for biological sequence search; and PowerLyra, a computation and partitioning method for skewed graphs. The experimental results show that compared to the partitioning methods of applications, the codes generated by PaPar can produce the same data partitions with comparable or less partitioning time.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125259867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Rajsbaum, Armando Castañeda, D. Flores-Peñaloza, Manuel Alcántara
{"title":"Fault-Tolerant Robot Gathering Problems on Graphs With Arbitrary Appearing Times","authors":"S. Rajsbaum, Armando Castañeda, D. Flores-Peñaloza, Manuel Alcántara","doi":"10.1109/IPDPS.2017.70","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.70","url":null,"abstract":"The LOOK-COMPUTE-MOVE model for a set of autonomous robots has been thoroughly studied for over two decades. Each robot repeatedly LOOKS at its surroundings and obtains a snapshot containing the positions of all robots; based on this information, the robot COMPUTES a destination and then MOVES to it. Previous work assumed all robots are present at the beginning of the computation. What would be the effect of robots appearing asynchronously? This paper studies this question, for problems of bringing the robots close together, and exposes an intimate connection with combinatorial topology. A central problem in the mobile robots area is the gathering problem. In its discrete version, the robots start at vertices in some graph G known to them, move towards the same vertex and stop. The paper shows that if robots are asynchronous and may crash, then gathering is impossible for any graph G with at least two vertices, even if robots can have unique IDs, remember the past, know the same names for the vertices of G and use an arbitrary number of lights to communicate with each other. Next, the paper studies two weaker variants of gathering: edge gathering and 1-gathering. For both problems we present possibility and impossibility results. The solvability of edge gathering is fully characterized: it is solvable for three or more robots on a given graph if and only if the graph is acyclic. Finally, general robot tasks in a graph are considered. 
A combinatorial topology characterization for the solvable tasks is presented, by a reduction of the asynchronous fault-tolerant LOOK-COMPUTE-MOVE model to a wait-free read/write shared-memory computing model, bringing together two areas that have been independently studied for a long time into a common theoretical foundation.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114793135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel FastTrack Data Race Detector on Multi-core Systems","authors":"Y. Song, Yann-Hang Lee","doi":"10.1109/IPDPS.2017.87","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.87","url":null,"abstract":"Detecting data races in multithreaded programs is critical to ensure the correctness of the programs. To discover data races precisely without false alarms, dynamic detection approaches are often applied. However, the overhead of the existing dynamic detection approaches, even with recent innovations, is still substantially high. In this paper, we present a simple but efficient approach to parallelize data race detection in multicore SMP (Symmetric Multiprocessing) machines. In our approach, data access information needed for dynamic detection is collected at application threads and passed to detection threads. The access information is distributed in a way that the operation performed by each detection thread is independent of that of other detection threads. As a consequence, the overhead caused by locking operations in data race detection can be alleviated and multiple cores can be fully utilized to speed up and scale up the detection. Furthermore, each detection thread deals with only its own assigned memory access region rather than the whole address space. The executions of detection threads can exploit the spatial locality of accesses leading to an improved cache performance. We have applied our parallel approach on the FastTrack algorithm and demonstrated the validity of our approach on an Intel Xeon machine. 
Our experimental results show that the parallel FastTrack detector, on average, runs 2.2 times faster than the original FastTrack detector on the 8 core machine.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123224203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-Tolerant Online Packet Scheduling on Parallel Channels","authors":"P. Garncarek, T. Jurdzinski, Krzysztof Lorys","doi":"10.1109/IPDPS.2017.105","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.105","url":null,"abstract":"We consider the problem of scheduling packets of different lengths via k directed parallel communication links. The links are prone to simultaneous errors --- if an error occurs, all links are affected. Dynamic packet arrivals and errors are modelled by a worst-case adversary. The goal is to optimize competitive throughput of online scheduling algorithms. Two types of failures are considered: jamming, when currently scheduled packets are simply not delivered, and crashes, when additionally the channel scheduler crashes losing its current state. For the former, milder type of failures, we prove an upper bound on competitive throughput of 3/4 - 1/(4k) for odd values of k, and 3/4 - 1/(4k+4) for even values of k. On constructive side, we design an online algorithm that, for packets of two different lengths, matches the upper bound on competitive throughput. To compare, scheduling on independent channels, that is, when adversary could cause errors on each channel independently, reaches throughput of 1/2. This shows that scheduling under simultaneous jamming is provably more efficient than scheduling under channel-independent jamming. In the setting with crash failures we prove a general upper bound for competitive throughput of (√5-1)/2 and design an algorithm achieving it for packets of two different lengths. This result has two interesting implications. First, simultaneous crashes are significantly stronger than simultaneous jamming. 
Second, due to the above mentioned upper bound of 1/2 on throughput under channel-independent errors, scheduling under simultaneous crashes is significantly stronger than channel-independent crashes, similarly as in the case of jamming errors.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124312704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ryan D. Friese, Nathan R. Tallent, Abhinav Vishnu, D. Kerbyson, A. Hoisie
{"title":"Generating Performance Models for Irregular Applications","authors":"Ryan D. Friese, Nathan R. Tallent, Abhinav Vishnu, D. Kerbyson, A. Hoisie","doi":"10.1109/IPDPS.2017.61","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.61","url":null,"abstract":"Many applications have irregular behavior — e.g., input-dependent solvers, irregular memory accesses, or unbiased branches — that cannot be captured using today's automated performance modeling techniques. We describe new hierarchical critical path analyses for the Palm model generation tool. To obtain a good tradeoff between model accuracy, generality, and generation cost, we combine static and dynamic analysis. To create a model's outer structure, we capture tasks along representative MPI critical paths. We create a histogram of critical tasks with parameterized task arguments and instance counts. To model each task, we identify hot instruction-level paths and model each path based on data flow, data locality, and microarchitectural constraints. We describe application models that generate accurate predictions for strong scaling when varying CPU speed, cache and memory speed, microarchitecture, and (with supervision) input data class. Our models' errors are usually below 8%; and always below 13%.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"32 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125707288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maciej Besta, Florian Marending, Edgar Solomonik, T. Hoefler
{"title":"SlimSell: A Vectorizable Graph Representation for Breadth-First Search","authors":"Maciej Besta, Florian Marending, Edgar Solomonik, T. Hoefler","doi":"10.1109/IPDPS.2017.93","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.93","url":null,"abstract":"Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. 
This work shows that vectorization can secure high-performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1995 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125555179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The SEPO Model of Computation to Enable Larger-Than-Memory Hash Tables for GPU-Accelerated Big Data Analytics","authors":"Reza Mokhtari, M. Stumm","doi":"10.1109/IPDPS.2017.122","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.122","url":null,"abstract":"The massive parallelism and high memory bandwidth of GPUs are particularly well matched with the exigencies of Big Data analytics applications, for which many independent computations and high data throughput are prevalent. These applications often produce (intermediary or final) results in the form of key-value (KV) pairs, and hash tables are particularly well-suited for storing these KV pairs in memory. How such hash tables are implemented on GPUs, however, has a large impact on performance. Unfortunately, all hash table solutions designed for GPUs to date have limitations that prevent acceleration for Big Data analytics applications. In this paper, we present the design and implementation of a GPU-based hash table for efficiently storing the KV pairs of Big Data analytics applications. The hash table is able to grow beyond the size of available GPU memory without excessive performance penalties. Central to our hash table design is the SEPO model of computation, where the processing of individual tasks is selectively postponed when processing is expected to be inefficient. A performance evaluation on seven GPU-based Big Data analytics applications, each processing several Gigabytes of input data, shows that our hash table allows the applications to achieve, on average, a speedup of 3.5 over their CPU-based multi-threaded implementations. 
This gain is realized despite having hash tables that grow up to four times larger than the size of available GPU memory.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128077257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization","authors":"Dingwen Tao, S. Di, Zizhong Chen, F. Cappello","doi":"10.1109/IPDPS.2017.115","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.115","url":null,"abstract":"Today's HPC applications are producing extremely large amounts of data, such that data storage and analysis are becoming more challenging for scientific research. In this work, we design a new error-controlled lossy compression algorithm for large-scale scientific data. Our key contribution is significantly improving the prediction hitting rate (or prediction accuracy) for each data point based on its nearby data values along multiple dimensions. We derive a series of multilayer prediction formulas and their unified formula in the context of data compression. One serious challenge is that the data prediction has to be performed based on the preceding decompressed values during the compression in order to guarantee the error bounds, which may degrade the prediction accuracy in turn. We explore the best layer for the prediction by considering the impact of compression errors on the prediction accuracy. Moreover, we propose an adaptive error-controlled quantization encoder, which can further improve the prediction hitting rate considerably. The data size can be reduced significantly after performing the variable-length encoding because of the uneven distribution produced by our quantization encoder. We evaluate the new compressor on production scientific data sets and compare it with many other state-of-the-art compressors: GZIP, FPZIP, ZFP, SZ-1.1, and ISABELA. Experiments show that our compressor is the best in class, especially with regard to compression factors (or bit-rates) and compression errors (including RMSE, NRMSE, and PSNR). 
Our solution is better than the second-best solution by more than a 2x increase in the compression factor and 3.8x reduction in the normalized root mean squared error on average, with reasonable error bounds and user-desired bit-rates.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115037932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores","authors":"Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, Weimin Zheng","doi":"10.1109/IPDPS.2017.53","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.53","url":null,"abstract":"Interest has recently grown in efficiently analyzing unstructured data such as social network graphs and protein structures. A fundamental graph algorithm for doing such tasks is the Breadth-First Search (BFS) algorithm, the foundation for many other important graph algorithms such as calculating the shortest path or finding the maximum flow in graphs. In this paper, we share our experience of designing and implementing the BFS algorithm on Sunway TaihuLight, a newly released machine with 40,960 nodes and 10.6 million accelerator cores. It tops the Top500 list of June 2016 with a 93.01 petaflops Linpack performance [1]. Designed for extremely large-scale computation and power efficiency, processors on Sunway TaihuLight employ a unique heterogeneous many-core architecture and memory hierarchy. With its extremely large size, the machine provides both opportunities and challenges for implementing high-performance irregular algorithms, such as BFS. We propose several techniques, including pipelined module mapping, contention-free data shuffling, and group-based message batching, to address the challenges of efficiently utilizing the features of this large scale heterogeneous machine. 
We ultimately achieved 23755.7 giga-traversed edges per second (GTEPS), which is the best among heterogeneous machines and the second overall in the Graph500s June 2016 list [2].","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125845179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooling-Aware Job Scheduling and Node Allocation for Overprovisioned HPC Systems","authors":"Thang Cao, Wei Huang, Yuan He, Masaaki Kondo","doi":"10.1109/IPDPS.2017.19","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.19","url":null,"abstract":"Limited power budget is becoming one of the most crucial challenges in developing supercomputer systems. Hardware overprovisioning which installs a larger number of nodes beyond the limitations of the power constraint is an attractive way to design next generation supercomputers. In air cooled HPC centers, about half of the total power is consumed by cooling facilities. Reducing cooling power and effectively utilizing power resource for computing nodes are important challenges. It is known that the cooling power depends on the hotspot temperature of the node inlets. Therefore, if we minimize the hotspot temperature, performance efficiency of the HPC system will be increased. One of the ways to reduce the hotspot temperature is to allocate power-hungry jobs to compute nodes whose effect on the hotspot temperature is small. It can be accomplished by optimizing job-to-node mapping in the job scheduler. In this paper, we propose a cooling and node location-aware job scheduling strategy which tries to optimize job-to-node mapping while improving the total system throughput under the constraint of total system (compute nodes and cooling facilities) power consumption. 
Experimental results with the job scheduling simulation show that our scheduling scheme achieves 1.49X higher total system throughput than the conventional scheme.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134025689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}