{"title":"Computing Large Sparse Multivariate Optimization Problems with an Application in Biophysics","authors":"E. Brookes, R. Boppana, B. Demeler","doi":"10.1145/1188455.1188541","DOIUrl":"https://doi.org/10.1145/1188455.1188541","url":null,"abstract":"We present a novel divide and conquer method for parallelizing a large scale multivariate linear optimization problem, which is commonly solved using a sequential algorithm with the entire parameter space as the input. The optimization solves a large parameter estimation problem where the result is sparse in the parameters. By partitioning the parameters and the associated computations, our technique overcomes memory constraints when used in the context of a single workstation and achieves high processor utilization when large workstation clusters are used. We implemented this technique in a widely used software package for the analysis of a biophysics problem, which is representative of a large class of problems in the physical sciences. We evaluate the performance of the proposed method on a 512-processor cluster and offer an analytical model for predicting the performance of the algorithm","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116644033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software Routing and Aggregation of Messages to Optimize the Performance of HPCC Randomaccess Benchmark","authors":"R. Garg, Yogish Sabharwal","doi":"10.1145/1188455.1188569","DOIUrl":"https://doi.org/10.1145/1188455.1188569","url":null,"abstract":"The HPC challenge (HPCC) benchmark suite is increasingly being used to evaluate the performance of supercomputers. It augments the traditional LINPACK benchmark by adding six more benchmarks, each designed to measure a specific aspect of the system performance. In this paper, we analyze the HPCC randomaccess benchmark which is designed to measure the performance of random memory updates. We show that, on many systems, the bisection bandwidth of the network may be the performance bottleneck of this benchmark. We suggest an aggregation and software routing based technique that may be used to optimize this benchmark. We report the performance results obtained using this technique on the Blue Gene/L supercomputer","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116280568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is High-Performance, Reconfigurable Computing the Next Supercomputing Paradigm?","authors":"T. El-Ghazawi, Dave Bennett, D. Poznanovic, Allan J. Cantle, K. Underwood, R. Pennington, D. Buell, A. George, V. Kindratenko","doi":"10.1145/1188455.1188530","DOIUrl":"https://doi.org/10.1145/1188455.1188530","url":null,"abstract":"High-Performance Reconfigurable Computers (HPRCs) based on integrating conventional microprocessors and Field Programmable Gate Arrays (FPGAs) have been gaining increasing attention in the past few years. With offerings from rising companies such as SRC and major high-performance computing vendors such as Cray and SGI, a wide array of such architectures is already available, and more offerings from others are expected to emerge. Furthermore, in spite of the recent birth of this class of high-performance computing architectures, the approaches followed by hardware and software vendors are starting to converge, signaling progress towards the maturity of this area and perhaps making room for standardization.","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125711775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Level-wise Scheduling Algorithm for Fat Tree Interconnection Networks","authors":"Zhu Ding, R. Hoare, A. Jones, R. Melhem","doi":"10.1145/1188455.1188556","DOIUrl":"https://doi.org/10.1145/1188455.1188556","url":null,"abstract":"This paper presents an efficient hardware architecture for scheduling connections on a fat-tree interconnection network for parallel computing systems. Our technique utilizes global routing information to select upward routing paths so that most conflicts can be resolved. Thus, more connections can be successfully scheduled compared with a local scheduler. As a result of applying our technique to two-level, three-level and four-level fat-tree interconnection networks of various sizes in the range of 64 to 4096 nodes, we observe that the improvement of schedulability ratio averages 30% compared with greedy or random local scheduling. Our technique is also scalable and shows increased benefits for large system sizes","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126870722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications","authors":"M. Gardner, Wu-chun Feng, J. Archuleta, Heshan Lin, Xiasong Ma","doi":"10.1145/1188455.1188564","DOIUrl":"https://doi.org/10.1145/1188455.1188564","url":null,"abstract":"The Basic local alignment search tool (BLAST) allows bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. mpiBLAST, our parallel BLAST, decreases the search time of a 300 KB query on the current NT database from over two full days to under 10 minutes on a 128-processor cluster and allows larger query files to be compared. Consequently, we propose to compare the largest query available, the entire NT database, against the largest database available, the entire NT database. The result of this comparison can provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database. Preliminary projections indicated that to complete the task in a reasonable length of time required more processors than were available to us at a single site. Hence, we assembled GreenGene, an ad-hoc grid that was constructed \"on the fly\" from donated computational, network, and storage resources during last year's SC|05. GreenGene consisted of 3048 processors from machines that were distributed across the United States.\nThis paper presents a case study of mpiBLAST on GreenGene - specifically, a pre-run characterization of the computation, the hardware and software architectural design, experimental results, and future directions","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116827525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data","authors":"S. Weil, S. Brandt, E. L. Miller, C. Maltzahn","doi":"10.1145/1188455.1188582","DOIUrl":"https://doi.org/10.1145/1188455.1188582","url":null,"abstract":"Emerging large-scale distributed storage systems are faced with the task of distributing petabytes of data among tens or hundreds of thousands of storage devices. Such systems must evenly distribute data and workload to efficiently utilize available resources and maximize system performance, while facilitating system growth and managing hardware failures. We have developed CRUSH, a scalable pseudorandom data distribution function designed for distributed object-based storage systems that efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data in terms of user-defined policies that enforce separation of replicas across failure domains","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133453476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Mesh Generation to Scientific Visualization: An End-to-End Approach to Parallel Supercomputing","authors":"Tiankai Tu, Hongfeng Yu, L. Ramírez-Guzmán, J. Bielak, O. Ghattas, K. Ma, D. O'Hallaron","doi":"10.1145/1188455.1188551","DOIUrl":"https://doi.org/10.1145/1188455.1188551","url":null,"abstract":"Parallel supercomputing has traditionally focused on the inner kernel of scientific simulations: the solver. The front and back ends of the simulation pipeline - problem description and interpretation of the output - have taken a back seat to the solver when it comes to attention paid to scalability and performance, and are often relegated to offline, sequential computation. As the largest simulations move beyond the realm of the terascale and into the petascale, this decomposition in tasks and platforms becomes increasingly untenable. We propose an end-to-end approach in which all simulation components - meshing, partitioning, solver, and visualization - are tightly coupled and execute in parallel with shared data structures and no intermediate I/O. We present our implementation of this new approach in the context of octree-based finite element simulation of earthquake ground motion. Performance evaluation on up to 2048 processors demonstrates the ability of the end-to-end approach to overcome the scalability bottlenecks of the traditional approach","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122342556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Grid Resource Allocation via Integrated Selection and Binding","authors":"Yang-Suk Kee, K. Yocum, Andrew A. Chien, H. Casanova","doi":"10.1145/1188455.1188559","DOIUrl":"https://doi.org/10.1145/1188455.1188559","url":null,"abstract":"Discovering and acquiring appropriate, complex resource collections in large-scale distributed computing environments is a fundamental challenge and is critical to application performance. This paper presents a new formulation of the resource selection problem and a new solution to the resource selection and binding problem called integrated selection and binding. Composition operators in our resource description language and efficient data organization enable our approach to allocate complex resource collections efficiently and effectively even in the presence of competition for resources. Our empirical evaluation shows that the integrated approach can produce solutions of significantly higher quality at higher success rate and lower cost than the traditional separate approach. The success rate of the integrated approach can tolerate as much as 15%-60% lower resource availability than the separate approach. Moreover, most requests have at least the 98th percentile rank and can be served in 6 seconds with a population of 1 million hosts","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"261 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122861346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topology Mapping for Blue Gene/L Supercomputer","authors":"Hao Yu, I-Hsin Chung, Jose Moreira","doi":"10.1145/1188455.1188576","DOIUrl":"https://doi.org/10.1145/1188455.1188576","url":null,"abstract":"Mapping virtual processes onto physical processors is one of the most important issues in parallel computing. The problem of mapping processes/tasks onto processors is equivalent to the graph embedding problem, which has been studied extensively. Although many techniques have been proposed for embeddings of two-dimensional grids, hypercubes, etc., there are few efforts on embeddings of three-dimensional grids and tori. Motivated by the need for better support of task mapping on the Blue Gene/L supercomputer, in this paper we present embedding and integration techniques for the embeddings of three-dimensional grids and tori. The topology mapping library based on these techniques generates high-quality embeddings of two/three-dimensional grids/tori. In addition, the library is used in the BG/L MPI library for scalable support of MPI topology functions. With extensive empirical studies on large-scale systems against popular benchmarks and real applications, we demonstrate that the library can significantly improve the communication performance and the scalability of applications","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123932764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Software Based Approach for Providing Network Fault Tolerance in Clusters with uDAPL interface: MPI Level Design and Performance Evaluation","authors":"Abhinav Vishnu, Prachi Gupta, A. Mamidala, D. Panda","doi":"10.1145/1188455.1188545","DOIUrl":"https://doi.org/10.1145/1188455.1188545","url":null,"abstract":"In the arena of cluster computing, MPI has emerged as the de facto standard for writing parallel applications. At the same time, the introduction of high speed RDMA-enabled interconnects like InfiniBand, Myrinet, Quadrics, and RDMA-enabled Ethernet has escalated the trends in cluster computing. Network APIs like uDAPL (user direct access provider library) are being proposed to provide a network-independent interface to different RDMA-enabled interconnects. Clusters with combination(s) of these interconnects are being deployed to leverage their unique features and to provide network failover in the wake of transmission errors. In this paper, we design a network fault tolerant MPI using the uDAPL interface, making this design portable for existing and upcoming interconnects. Our design provides failover to available paths, asynchronous recovery of previously failed paths, and recovery from network partitions without application restart. In addition, the design is able to handle network heterogeneity, making it suitable for current state-of-the-art clusters. We implement our design and evaluate it with micro-benchmarks and applications. Our performance evaluation shows that the proposed design provides significant performance benefits to both homogeneous and heterogeneous clusters. Using a heterogeneous combination of IBA and Ammasso-GigE, we are able to improve the performance by 10-15% for different NAS parallel benchmarks on an 8x1 configuration. For simple micro-benchmarks on a homogeneous configuration, we are able to achieve an improvement of 15-20% in throughput.\nIn addition, experiments with simple MPI micro-benchmarks and NAS applications reveal that the network fault tolerance modules incur negligible overhead and provide optimal performance in the wake of network partitions","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121285080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}