{"title":"An efficient deadlock-free tree-based routing algorithm for irregular wormhole-routed networks based on the turn model","authors":"Yau-Ming Sun, Chih-Hsueh Yang, Yeh-Ching Chung, Tai-Yi Huang","doi":"10.1109/ICPP.2004.1327941","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327941","url":null,"abstract":"We proposed an efficient deadlock-free tree-based routing algorithm, the DOWN/UP routing, for irregular wormhole-routed networks based on the turn model. In a tree-based routing algorithm, hot spots around the root of a spanning tree and the uneven traffic distribution are the two main facts degrade the performance of the routing algorithm. To solve the hot spot and the uneven traffic distribution problems, in the DOWN/UP routing, it tries to push the traffic downward to the leaves of a spanning tree as much as possible and remove prohibited turn pairs with opposite directions in each node, respectively. To evaluate the performance of DOWN/UP routing, the simulation is conducted. We have implemented the DOWN/UP routing along with the L-turn routing on the IRFlexSim0.5 simulator. Irregular networks that contain 128 switches with 4-port and 8-port configurations are simulated. The simulation results show that the proposed routing algorithm outperforms the L-turn routing for all test samples in terms of the degree of hot spots, the traffic load distribution, and throughput.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. 
ICPP 2004.","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123690433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel FDTD application featuring OpenMP-MPI hybrid parallelization","authors":"M. Su, I. El-Kady, David A. Bader, Shawn-Yu Lin","doi":"10.1109/ICPP.2004.1327945","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327945","url":null,"abstract":"We have developed a high performance hybridized parallel finite difference time domain (FDTD) algorithm featuring both OpenMP shared memory programming and MPl message passing. Our goal is to effectively model the optical characteristics of a novel light source created by utilizing a new class of materials known as photonic band-gap crystals. Our method is based on the solution of the second order discretized Maxwell's equations in space and time. This novel hybrid parallelization scheme allows us to take advantage of the new generation parallel machines possessing connected SMP nodes. By using parallel computations, we are able to complete a calculation on 24 processors in less than a day, where a serial version would have taken over three weeks. We present a detailed study of this hybrid scheme on an SGI origin 2000 distributed shared memory ccNUMA system along with a complete investigation of the advantages versus drawbacks of this method.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117096723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The k-valent graph: a new family of Cayley graphs for interconnection networks","authors":"S. Hsieh, T. Hsiao","doi":"10.1109/ICPP.2004.1327923","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327923","url":null,"abstract":"This work introduces a new family of Cayley graphs, named the k-valent graphs, for building interconnection networks. It includes the trivalent Cayley graphs (Vadapalli and Srimani, 1995) as a subclass. These new graphs are shown to be regular with the node-degree k, to have logarithmic diameter subject to the number of nodes, and to be k-connected as well as maximally fault tolerant. We also propose a shortest path routing algorithm and investigate some algebraic properties like cycles or cliques embedding.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115985379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RMAC: a reliable multicast MAC protocol for wireless ad hoc networks","authors":"Weisheng Si, Chengzhi Li","doi":"10.1109/ICPP.2004.1327959","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327959","url":null,"abstract":"This work presents a new MAC protocol called RMAC that supports reliable multicast for wireless ad hoc networks. By utilizing the busy tone mechanism to realize multicast reliability, RMAC has the following three novelties: (1) it uses a variable-length control frame to stipulate an order for the receivers to respond, such that the problem of feedback collision is solved; (2) it extends the traditional usage of busy tone for preventing data frame collisions into the multicast scenario; and (3) it introduces a new usage of busy tone for acknowledging data frames. In addition, we also generalize RMAC into a comprehensive MAC protocol that provides both reliable and unreliable services for all the three modes of communications: unicast, multicast, and broadcast. Our evaluation shows that RMAC achieves high reliability with very limited overhead. We also compare RMAC with other reliable multicast MAC protocols, showing that RMAC not only provides higher reliability but also involves lower cost.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115151849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying array contraction to a sequence of DOALL loops","authors":"Yonghong Song, Zhiyuan Li","doi":"10.1109/ICPP.2004.1327903","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327903","url":null,"abstract":"Efficient program execution on multiprocessor computers requires both sufficient parallelism and good data locality. Recent research found that, using a combination of loop shifting, loop fusion, and array contraction, one can reduce the memory required to execute a sequence of serial loops, thereby to improve the cache locality. This paper studies how to extend such a memory-reduction scheme to a sequence of DOALL loops, which are executed in parallel on multiprocessors. Two methods are proposed to overcome difficulties caused by loop-carried dependences. Data copy-in is performed to remove anti-dependences between different parallel threads, and computation duplication is performed to remove flow dependences. Experiments performed on a number of benchmark programs show that the proposed technique improves both cache locality and parallel execution speed for the DOALL loops. The scheme achieves an average speedup of 1.41 for 17 programs on a 4-processor SUN machine.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126065477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving load/store queues usage in scientific computing","authors":"C. Lemuet, W. Jalby, S. Touati","doi":"10.1109/ICPP.2004.1327902","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327902","url":null,"abstract":"Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial to increase instruction level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address bits comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such that Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate in this article that memory address disambiguation mechanisms can cause significant performance loss. We demonstrate that, even if data are located in low cache levels and enough ILP exist, the performance degradation can be up to 21 times slower if no care is taken on the order of accessing independent memory addresses. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al., (1998), S. Sethumadhavan et al., (2003), I. Park et al., (2003), A. Yoaz et al., (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. Such solution is based on the classical (and robust) Id/st vectorization. Our experiments highlight the effectiveness of such method on BLAS 1 codes that are representative of vector scientific loops.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. 
ICPP 2004.","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124068269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using tiling to scale parallel data cube construction","authors":"R. Jin, K. Vaidyanathan, Ge Yang, G. Agrawal","doi":"10.1109/ICPP.2004.1327944","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327944","url":null,"abstract":"Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. Also, for both sequential and parallel data cube construction, effectively using the main memory is an important challenge. In our prior work, we have developed parallel algorithms for this problem. We show how sequential and parallel data cube construction algorithms can be further scaled to handle larger problems, when the memory requirements could be a constraint. This is done by tiling the input and output arrays on each node. We address the challenges in using tiling while still maintaining the other desired properties of a data cube construction algorithm, which are, using minimal parents, and achieving maximal cache and memory reuse. We present a parallel algorithm that combines tiling with interprocessor communication. Our experimental results show the following. First, tiling helps in scaling data cube construction in both sequential and parallel environments. Second, choosing tiling parameters as per our theoretical results does result in better performance.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. 
ICPP 2004.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134282174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global partial replicate computation partitioning","authors":"Yiran Wang, Li Chen, Xiaobing Feng, Zhaoqing Zhang","doi":"10.1109/ICPP.2004.1327910","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327910","url":null,"abstract":"Early parallelizing compilers use the owner-computes rule to partition computation. Partial replication is then introduced to eliminate near-neighbor communication at the cost of some replicated computation, hence improves the performance and scalability. Current exploration of partial replicate computation partitioning is limited within a single loop nest. We present a formal description of the global partial replicate computation partitioning problem, a simplified cost model and a heuristic solution. Experimental results show that the solution is superior to local approaches.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125133220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture and implementation of chip multiprocessors: custom logic components and software for rapid prototyping","authors":"N. Manjikian, Huang Jin, J. Reed, N. Cordeiro","doi":"10.1109/ICPP.2004.1327958","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327958","url":null,"abstract":"This work describes components and software tools in support of rapid prototyping in programmable logic for research on chip multiprocessors. Contemporary programmable logic chips offer considerable on-chip logic and memory resources. Prototyping of systems in programmable logic chips is faster and less costly than full-custom chip design. The first contribution that is described in this paper is a collection of original research-oriented logic components that provides processor, memory, and interconnect functionality for rapid prototyping. Because these are original components, and not proprietary vendor-supplied components, they may be arbitrarily extended and modified to suit research needs. The second contribution is a set of enhanced software tools for generating executable code. The third contribution is user-configurable software for testing and evaluating prototype chip multiprocessor implementations in hardware. In addition to describing these contributions, this paper provides results from implementing and testing prototype components and complete chip multiprocessors, including simulation waveforms, logic chip resource utilization, and observations of hardware operation.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. 
ICPP 2004.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134440601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OSCAR - an opportunistic call admission protocol for LEO satellite networks","authors":"S. Olariu, Rajendra Shirhatti, Albert Y. Zomaya","doi":"10.1109/ICPP.2004.1327965","DOIUrl":"https://doi.org/10.1109/ICPP.2004.1327965","url":null,"abstract":"The main contribution of this work is to propose OSCAR - an opportunistic call admission protocol that provides a simple and robust solution to call admission and handoff management in LEO satellite networks. One of the features that sets OSCAR apart from existing protocols is that it avoids the overhead of reserving resources for users in a series of spotbeams along predicted user trajectories. Instead, OSCAR relies on a novel opportunistic bandwidth allocation mechanism that is very simple and efficient and does not involve maintaining complicated data structures or making expensive reservations. Extensive simulation results have shown that OSCAR achieves results comparable to those of Q-Win: it features very low call dropping probability, thus providing for reliable handoff of on-going calls, low call blocking probability for new call requests, and high bandwidth utilization.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131292162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}