{"title":"A fine-grained parallel pipelined Karhunen-Loeve transform","authors":"M. Fleury, Bob Self, A. Downton","doi":"10.1109/IPDPS.2003.1213476","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213476","url":null,"abstract":"A high-performance Karhunen-Loeve transform for multi-spectral imagery suitable for remote-sensing applications has been prototyped on a platform FPGA, by means of a PC-based development board. Performance estimates suggest that the design will already outperform implementation on a high-end microprocessor, given due attention to I/O (input/output). General conclusions are reached for the utility of this architecture for fine-grained parallel processing, when the design is extended to massively parallel processing.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133648803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A programmable and highly pipelined PPP architecture for Gigabit IP over SDH/SONET","authors":"C. Toal, S. Sezer","doi":"10.1109/IPDPS.2003.1213331","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213331","url":null,"abstract":"This paper details the implementation of a highly pipelined 2.5 Gbit/s point-to-point-protocol packet processor (P⁵) aimed at the latest system-on-a-programmable-chip (SoPC) technology. Throughput rates beyond 2.5 Gbit/s based on FPGA technology could be achieved by designing a new highly pipelined and parallel processing architecture for frames and datagrams. A novel pipelined data sorting mechanism with an extremely low resynchronization buffer and backpressure scheme are introduced to keep the data memory requirements as low as possible for embedded on-chip applications.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131913111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"System management in the BlueGene/L supercomputer","authors":"G. Almási, Leonardo R. Bachega, Ralph Bellofatto, J. Brunheroto, Calin Cascaval, J. Castaños, P. Crumley, C. Erway, J. Gagliano, D. Lieber, Pedro Mindlin, J. Moreira, R. Sahoo, A. Sanomiya, E. Schenfeld, R. Swetz, M. Bae, G. Laib, K. Ranganathan, Y. Aridor, T. Domany, Ya'akov Gal, O. Goldshmidt, Edi Shmueli","doi":"10.1109/IPDPS.2003.1213483","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213483","url":null,"abstract":"The BlueGene/L supercomputer will use system-on-a-chip integration and a highly scalable cellular architecture to deliver 360 teraflops of peak computing power. With 65536 compute nodes, BlueGene/L represents a new level of scalability for parallel systems. As such, it is natural for many scalability challenges to arise. In this paper, we discuss system management and control, including machine booting, software installation, user account management, system monitoring, and job execution. We address the issue of scalability by organizing the system hierarchically. The 65536 compute nodes are organized in 1024 clusters of 64 compute nodes each, called processing sets. Each processing set is under control of a 65th node, called an I/O node. The 1024 processing sets can then be managed to a great extent as a regular Linux cluster, of which there are several successful examples. Regular cluster management is complemented by BlueGene/L specific services, performed by a service node over a separate control network. Our software development and experiments have been conducted so far using an architecturally accurate simulator of BlueGene/L, and we are gearing up to test real prototypes in 2003.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134332041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A performance analysis of 4X InfiniBand data transfer operations","authors":"Ariel Cohen","doi":"10.1109/IPDPS.2003.1213372","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213372","url":null,"abstract":"The performance of 4X InfiniBand send/receive and RDMA operations is studied by running tests to measure latency, data rate, number of operations per second, and CPU load. The measurements performed are for application-to-application data transfers using user-level InfiniBand (IB) verbs. It is shown that IB is capable of low latencies (10 μs for small messages) and very high data rates at low CPU loads (over 6 Gb/s with 64 KB messages at under 20% CPU load). A very large number of operations per second (over 400,000) is obtained for small messages. Some comparisons are made with the performance of TCP/IP on Gigabit Ethernet. In addition, the paper studies the impact of varying the number of outstanding requests on the obtained throughput, and shows when the peak throughput can be obtained for messages of varying sizes. Finally, an approach for handling completions in user space without a busy wait and without the use of signals is introduced and CPU load results based on this approach are presented.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134074367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CREC: a novel reconfigurable computing design methodology","authors":"O. Creţ, K. Pusztai, C. Vancea, Balint Szente","doi":"10.1109/IPDPS.2003.1213323","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213323","url":null,"abstract":"The main research done in the field of reconfigurable computing was oriented towards applications involving low granularity operations and high intrinsic parallelism. CREC is an original, low-cost general-purpose reconfigurable computer whose architecture is generated through a hardware/software codesign process. The main idea of the CREC system is to generate the best-suited hardware architecture for the execution of each software application. The CREC parallel compiler parses the source code and generates the hardware architecture, based on multiple execution units. The hardware architecture is described in VHDL code, generated by a program. Finally, CREC is implemented in an FPGA device. The great flexibility offered by the general-purpose CREC system makes it interesting for a wide class of applications that mainly involve high intrinsic parallelism, but also any other kinds of computations.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115102685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Programming metasystems with active objects","authors":"M. D. Santo, Franco Frattolillo, N. Ranaldo, W. Russo, E. Zimeo","doi":"10.1109/IPDPS.2003.1213257","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213257","url":null,"abstract":"The widespread diffusion of metasystems and grid environments makes it necessary to employ programming models able to well exploit a high, variable number of distributed heterogeneous resources. Many software frameworks designed for Grid computing do not address this problem. They only allow the use of existing programming libraries based on explicit message-passing communication models, often not suitable to manage the variability of a Grid. In this paper we present the customization of a component-based middleware for metacomputing, HiMM (Hierarchical Metacomputer Middleware), in order to support distributed programming based on the Active Object model provided by ProActive. This way a meta-system can be efficiently and transparently programmed by unifying the asynchronous remote method invocation model and the reflection provided by meta-objects.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115152987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of two emerging media processors: VIRAM and Imagine","authors":"Sourav Chatterji, Manikandan Narayanan, J. Duell, L. Oliker","doi":"10.1109/IPDPS.2003.1213417","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213417","url":null,"abstract":"This work presents two emerging media microprocessors, VIRAM and Imagine, and compares the implementation strategies and performance results of these unique architectures. VIRAM is a complete system on a chip which uses PIM technology to combine vector processing with embedded DRAM. Imagine is a programmable streaming architecture with a specialized memory hierarchy designed for computationally intensive data-parallel codes. First, we present a simple and effective approach for understanding and optimizing vector/stream applications. Performance results are then presented from a number of multimedia benchmarks and a computationally intensive scientific kernel. We explore the complex interactions between programming paradigms, the architectural support at the ISA level and the underlying microarchitecture of these two systems. Our long term goal is to evaluate leading media microprocessors as possible building blocks for future high performance systems.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114385553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel tabu search in a heterogeneous environment","authors":"Ahmad A. Al-Yamani, S. M. Sait, H. Barada, H. Youssef","doi":"10.1109/IPDPS.2003.1213149","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213149","url":null,"abstract":"We discuss a parallel tabu search algorithm with implementation in a heterogeneous environment. Two parallelization strategies are integrated: functional decomposition and multi-search threads. In addition, domain decomposition strategy is implemented probabilistically. The performance of each strategy is observed and analyzed in terms of speeding up the search and finding better quality solutions. Experiments were conducted for the VLSI cell placement. The objective was to achieve the best possible solution in terms of interconnection length, timing performance, circuit speed, and area. The multiobjective nature of this problem is addressed using a fuzzy goal-based cost computation.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114514108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An object-oriented programming framework for parallel finite element analysis with application: liquid composite molding","authors":"B. Henz, D. Shires","doi":"10.1109/IPDPS.2003.1213459","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213459","url":null,"abstract":"The use of object-oriented programming techniques in development of parallel, finite element analysis software enhances software reuse and makes application development more efficient. In this paper, an object-oriented programming framework for developing parallel finite element software is described. All required steps, from data file parsing and equation solving to post processing and graphical user interfaces, are discussed. After development of the framework, a sample parallel finite element code, namely COMPOSE, is taken from its original functional programming paradigm and implemented in the new framework. Besides ease of development, the use of generic visualization and interface tools for software utilizing the framework speeds delivery of research codes to end users.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114432035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A reconfigurable processor architecture and software development environment for embedded systems","authors":"F. Campi, A. Cappelli, R. Guerrieri, Andrea Lodi, M. Toma, A. L. Rosa, L. Lavagno, C. Passerone, R. Canegallo","doi":"10.1109/IPDPS.2003.1213314","DOIUrl":"https://doi.org/10.1109/IPDPS.2003.1213314","url":null,"abstract":"Flexibility, high computing power and low energy consumption are strong guidelines when designing new generation embedded processors. Traditional architectures are no longer suitable to provide a good compromise among these contradictory implementation requirements. In this paper we present a new reconfigurable processor that tightly couples a VLIW architecture with a configurable unit implementing an additional configurable pipeline. A software development environment is also introduced providing a user-friendly tool for application development and performance simulation. Finally, we show that the HW/SW reconfigurable platform proposed achieves dramatic improvement in both speed and energy consumption on signal processing computation kernels.","PeriodicalId":177848,"journal":{"name":"Proceedings International Parallel and Distributed Processing Symposium","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114500157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}