{"title":"Towards informatic analysis of syslogs","authors":"Jon Stearley","doi":"10.1109/CLUSTR.2004.1392628","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392628","url":null,"abstract":"The complexity and cost of isolating the root cause of system problems in large parallel computers generally scales with the size of the system. Syslog messages provide a primary source of system feedback, but manual review is tedious and error prone. Informatic analysis can be used to detect subtle anomalies in the syslog message stream, thereby increasing the availability of the overall system. In this work the author describes the use of the bioinformatic-inspired Teiresias algorithm to automatically classify syslog messages, and compares it to an existing log analysis tool (SLCT). He then describes the use of occurrence statistics to group time-correlated messages, and presents a simple graphical user interface for viewing analysis results. Finally, example analyses of syslogs from three independent clusters are presented.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122194137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of microbenchmarks for performance tuning of clusters","authors":"M. Sottile, R. Minnich","doi":"10.1109/CLUSTR.2004.1392636","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392636","url":null,"abstract":"Microbenchmarks, i.e. very small computational kernels, have become commonly used for quantitative measures of node performance in clusters. For example, a commonly used benchmark measures the amount of time required to perform a fixed quantum of work. Unfortunately, this benchmark is one of many that violate well-known rules from sampling theory, leading to erroneous, contradictory or misleading results. At a minimum, these types of benchmarks cannot be used to identify time-based activities that may interfere with and hence limit application performance. Our original and primary goal remains to identify noise in the system due to periodic activities that are not part of user application code. We discuss why the 'fixed quantum of work' benchmark provides data that is of limited use for analysis; and we show code for, discuss, and analyze results from a microbenchmark which follows good rules of sampling hygiene, and hence provides useful data for analysis.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131831716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NIC-based offload of dynamic user-defined modules for Myrinet clusters","authors":"A. Wagner, Hyun-Wook Jin, D. Panda, R. Riesen","doi":"10.1109/CLUSTR.2004.1392618","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392618","url":null,"abstract":"Many of the modern networks used to interconnect nodes in cluster-based computing systems provide network-interface cards (NICs) that offer programmable processors. Substantial research has been done with the focus of offloading processing from the host to the NIC processor. However, the research has primarily focused on the static offload of specific features to the NIC, mainly to support the optimization of common collective and synchronization-based communications. We describe the design and implementation of a framework based on MPICH-GM to support the dynamic NIC-based offload of user-defined modules for Myrinet clusters. We evaluate our implementation on a 16-node cluster using a NIC-based version of the common broadcast operation and we find a maximum factor of improvement of 1.2 with respect to total latency as well as a maximum factor of improvement of 2.2 with respect to average CPU utilization under conditions of process skew. In addition, we see that these improvements increase with system size, indicating that our NIC-based framework offers enhanced scalability when compared to a purely host-based approach.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131202082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage","authors":"Yifeng Zhu, Hong Jiang, Jun Wang","doi":"10.1109/CLUSTR.2004.1392614","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392614","url":null,"abstract":"An efficient and distributed scheme for file mapping or file lookup is critical in decentralizing metadata management within a group of metadata servers. This work presents a technique called HBA (hierarchical Bloom filter arrays) to map file names to the servers holding their metadata. Two levels of probabilistic arrays, i.e., Bloom filter arrays, with different accuracies are used on each metadata server. One array, with lower accuracy and representing the distribution of the entire metadata, trades accuracy for significantly reduced memory overhead, while the other array, with higher accuracy, caches partial distribution information and exploits the temporal locality of file access patterns. Extensive trace-driven simulations have shown our HBA design to be highly effective and efficient in improving performance and scalability of file systems in clusters with 1,000 to 10,000 nodes (or superclusters).","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125620142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QMP-MVIA: a message passing system for Linux clusters with gigabit Ethernet mesh connections","authors":"Jie Chen, R. Edwards, W. Mao","doi":"10.1109/CLUSTR.2004.1392651","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392651","url":null,"abstract":"Recent progress in performance coupled with a decline in price for copper-based gigabit Ethernet (GigE) interconnects makes them an attractive alternative to expensive high speed network interconnects (NIC) when constructing Linux clusters. However, traditional message passing systems based on TCP for GigE interconnects cannot fully utilize the raw performance of today's GigE interconnects due to the overhead of kernel involvement and multiple memory copies during sending and receiving messages. The overhead is more evident in the case of mesh connected Linux clusters using multiple GigE interconnects in a single host. We present a general message passing system called QMP-MVIA (QCD Message Passing over M-VIA) for Linux clusters with mesh connections using GigE interconnects. In particular, we evaluate and compare the performance characteristics of TCP and M-VIA (an implementation of the VIA specification) software for a mesh communication architecture to demonstrate the feasibility of using M-VIA as a point-to-point communication software, on which QMP-MVIA is based. Furthermore, we illustrate the design and implementation of QMP-MVIA for mesh connected Linux clusters with emphasis on both point-to-point and collective communications, and demonstrate that QMP-MVIA message passing system using GigE interconnects achieves bandwidth and latency that are not only better than systems based on TCP but also compare favorably to systems using some of the specialized high speed interconnects in a switched architecture at much lower cost.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126250489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XChange: coupling parallel applications in a dynamic environment","authors":"H. Abbasi, M. Wolf, K. Schwan, G. Eisenhauer, Andrew D. Hilton","doi":"10.1109/CLUSTR.2004.1392646","DOIUrl":"https://doi.org/10.1109/CLUSTR.2004.1392646","url":null,"abstract":"Modern computational science applications are becoming increasingly multidisciplinary, involving widely distributed research teams and their underlying computational platforms. A common problem for the grid applications used in these environments is the necessity to couple multiple, parallel subsystems, with examples ranging from data exchanges between cooperating, linked parallel programs, to concurrent data streaming to distributed storage engines. This work presents the XChange_mxn middleware infrastructure for coupling componentized distributed applications. XChange_mxn implements the basic functionality of well-known services like the CCA Forum's MxN project, by providing efficient data redistribution across parallel application components. Beyond such basic functionality, however, XChange_mxn also addresses two of the problems faced by wide area scientific collaborations, which are (1) the need to deal with dynamic application/component behaviors, such as dynamic arrivals and departures due to the availability of additional resources, and (2) the need to 'match' data formats across disparate application components and research teams. In response to these needs, XChange_mxn uses an anonymous publish/subscribe model for linking interacting components, and the data being exchanged is dynamically specialized and transformed to match end point requirements. The pub/sub paradigm makes it easy to deal with dynamic component arrivals and departures. Dynamic data transformation enables the 'inflight' correction of data or needs mismatches for cooperating components. This work describes the design and implementation of XChange_mxn, and it evaluates its implementation compared to those of less flexible transports like MPI. It also highlights the utility of XChange_mxn's 'inflight' data specialization, by applying it to the SmartPointer parallel data visualization environment developed at our institution. Interestingly, using XChange_mxn did not significantly affect performance but led to a reduction in the size of the code base.","PeriodicalId":123512,"journal":{"name":"2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129859953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}