{"title":"Communication in parallel applications: characterization and sensitivity analysis","authors":"Dale Seed, A. Sivasubramaniam, C. Das","doi":"10.1109/ICPP.1997.622679","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622679","url":null,"abstract":"Communication characterization of parallel applications is essential to understand the interplay between architectures and applications in determining the maximum achievable performance. Although a significant amount of research has been conducted on execution-based architectural evaluations, very little effort has gone into capturing the communication behavior of an application mathematically. In this paper, we attempt to characterize the communication behavior of applications by temporal, spatial and volume attributes. We also study the impact of variation in application and architectural parameters on the communication behavior in terms of the three attributes. Our results show that for the chosen suite of applications, the message arrival and spatial distributions can be closely approximated by known statistical distributions and that the temporal as well as spatial distributions of all applications remain unchanged with respect to four parameters considered in this study. These results lead us closer to the belief that it is possible to abstract the communication properties of parallel applications in convenient mathematical forms that have wide applicability.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121457470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the impact of run-time uncertainty on optimal computation scheduling using feedback","authors":"R. Dietz, T. Casavant, T. Scheetz, T. Braun, M. Andersland","doi":"10.1109/ICPP.1997.622683","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622683","url":null,"abstract":"Increasingly, feedback of measured run-time information is being used in the optimization of computation execution. This paper introduces a model relating the static view of a computation to its run-time variance that is useful in this context. A notion of uncertainty is then used to provide bounds on key scheduling parameters of the run-time computation. To illustrate the relationship between fidelity in measured information and minimum schedulable grain size, we apply the bounds to three existing parallel architectures for the case of run-time variance caused by monitoring intrusion. We also outline a hybrid static-dynamic scheduling paradigm, SEDIA, that uses the model of uncertainty to optimize computation for execution in the presence of run-time variance from sources other than monitoring intrusion.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"3 Suppl N 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116895618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The affinity entry consistency protocol","authors":"C. Bentes, R. Bianchini, C. Amorim","doi":"10.1109/ICPP.1997.622646","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622646","url":null,"abstract":"In this paper we propose a novel software-only distributed shared memory system (SW-DSM), the Affinity Entry Consistency (AEC) protocol. The protocol is based on Entry Consistency but, unlike previous approaches, does not require the explicit association of shared data to synchronization variables, uses the page as its coherence unit, and generates the set of modifications (in the form of diffs) made to shared pages eagerly. The AEC protocol hides the overhead of generating and applying diffs behind synchronization delays, and uses a novel technique, Lock Acquirer Prediction (LAP), to tolerate the overhead of transferring diffs through the network. LAP attempts to predict the next acquirer of a lock at the time of the release, so that the acquirer can be updated even before requesting ownership of the lock. Using execution-driven simulation of real applications, we show that LAP performs very well under AEC; LAP predictions are within the 80-97% range of accuracy. Our results also show that LAP improves performance by 7-28% for our applications. In addition we find that most of the diff creation overhead in the AEC protocol can usually be overlapped with synchronization latencies. A comparison against simulated TreadMarks shows that AEC outperforms TreadMarks by as much as 47%. We conclude that LAP is a useful technique for improving the performance of update-based SW-DSMs, while AEC is an efficient implementation of the Entry Consistency model.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115489049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decisive path scheduling: a new list scheduling method","authors":"G. Park, B. Shirazi, J. Marquis, Hyunseung Choo","doi":"10.1109/ICPP.1997.622682","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622682","url":null,"abstract":"Scheduling parallel tasks represented as a Directed Acyclic Graph (DAG), on a multiprocessor system has been an important research area in the past decades. One of the critical aspects of a class of scheduling algorithms, called \"List Scheduling\", is how to decide which task is to be scheduled next. This is achieved by assigning priorities to the nodes or the edges of the input DAG, and thus the task with the highest priority will be scheduled next. This paper proposes a low complexity scheduling algorithm to improve the priority node selection criteria in list scheduling algorithms. The worst case performance of the proposed algorithm is analyzed for general input DAGs. Also, the worst case performance and the optimality conditions are obtained for free structured input DAGs. The performance comparison study shows that the proposed algorithm outperforms existing scheduling algorithms especially for input DAGs with high communication overheads. The performance improvement over existing algorithms becomes larger as the input DAG becomes more dense and the level of parallelism in the DAG is increased.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129970218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware versus software implementation of COMA","authors":"Adrian Moga, M. Dubois, A. Gefflaut","doi":"10.1109/ICPP.1997.622652","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622652","url":null,"abstract":"Traditionally, cache coherence in multiprocessors has been maintained in hardware. However, the cost-effectiveness of hardwired protocols is questionable. Virtual Shared Memory systems have highlighted the many advantages of software-implemented protocols, albeit at a performance price. The performance gap is narrowed by hybrid systems with the addition of hardware support for fine-grain sharing. We have developed a software protocol for a COMA (Cache-Only Memory Architecture). We call the system SC-COMA for Software-Controlled COMA, to emphasize that the protocol engine is emulated by software executed on the main processor. Contrary to user-level protocols, the software handling coherence events in SC-COMA runs in sub-kernel mode, transparently providing the same services to applications as a hardware counterpart. The software emulation layer has been written and we compare SC-COMA to an idealized hardware COMA through detailed simulations. Our results show that SC-COMA is competitive. On systems with 32 processors, it achieves a slowdown of 11-56% with respect to its hardware counterpart, across a range of applications and memory pressures. SC-COMA scales well, up to 32 nodes. A study on the impact of faster processors on SC-COMA's relative performance indicates a consistent improvement, but with a limitation due to the loosely-integrated design. We conclude that SC-COMA is a viable solution to easily transform networks of workstations into powerful multiprocessors.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"246 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120892870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data distribution analysis and optimization for Pointer-based distributed programs","authors":"Jenq Kuen Lee, Daniel Ho, Yue-Chee Chuang","doi":"10.1109/ICPP.1997.622556","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622556","url":null,"abstract":"A critical question remains open if the compiler can understand the distribution pattern of pointer-based distributed objects built by application programmers, and perform optimization as effectively as the HPF compiler does with distributed arrays. In this paper, we address this challenging issue. In our work, we first present a parallel programming model which allows application programmers to build pointer-based distributed objects at application levels. Next we propose a distribution analysis algorithm which can automatically summarize the distribution pattern of pointer-based distributed objects built by application programmers. Our work, to our best knowledge, is the first work to attempt to address this open issue. Our distribution analysis framework employs Feautrier's parametric integer programming as the basic solver, and can always obtain precise distribution information from the class of programs written in our parallel programming model with static control. Experimental results done on a 16-node IBM SP-2 machine show that the compiler with the help of distribution analysis algorithm can significantly improve the performance of pointer-based distributed programs.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122782055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting task and data parallelism in parallel Hough and Radon transforms","authors":"D. Krishnaswamy, P. Banerjee","doi":"10.1109/ICPP.1997.622678","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622678","url":null,"abstract":"Edge detection and shape detection in digital images are very computationally intensive problems. Parallel algorithms can potentially provide significant speedups while preserving the quality of the result obtained. Hough and Radon Transforms are projection-based transforms which are commonly used for edge detection and shape detection respectively. We propose in this paper various new parallel algorithms which exploit both task and data parallelism available in Hough and Radon transforms algorithms. A memory scalable aggressive task parallel algorithm is shown to be the most optimal algorithm in terms of memory scalability and performance on an IBM SP2.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116456652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local search for DAG scheduling and task assignment","authors":"Minyou Wu, W. Shu, J. Gu","doi":"10.1109/ICPP.1997.622584","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622584","url":null,"abstract":"Scheduling DAGs to multiprocessors is one of the key issues in high-performance computing. Local search can be used to effectively improve the quality of a scheduling algorithm. In this paper, based on topological ordering, we present a fast local search algorithm which can improve the quality of DAG scheduling algorithms. This low complexity algorithm can effectively reduce the length of a given schedule.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127309968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of scalable and multicast capable cut-through switches for high-speed LANs","authors":"Mingyao Yang, L. Ni","doi":"10.1109/ICPP.1997.622662","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622662","url":null,"abstract":"High-speed switches play an important role in building switched LANs. Among different techniques used in switch design, cut-through switching promises short latency delivery and thus is well suited to distributed/parallel applications. The back pressure flow control of cut-through switching also prevents packet loss due to buffer overflow. This paper presents an incremental switch design based on modular building blocks using cut-through switching technique. The switch can be either nonblocking with full configuration and deterministic routing, or blocking but having more flexibility in configuration and fault tolerance. A kind of switch configuration that fits the client/server computing paradigm is presented. Simulation results are given for various switch configurations and traffic loads. The switch also has built-in hardware multicast capability. Issues of physical layout and integration into practical LANs are also discussed.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131981145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Throttle and preempt: a new flow control for real-time communications in wormhole networks","authors":"Hyojeong Song, Boseob Kwon, H. Yoon","doi":"10.1109/ICPP.1997.622589","DOIUrl":"https://doi.org/10.1109/ICPP.1997.622589","url":null,"abstract":"We study wormhole routed networks and their suitability for real-time traffic in a priority-driven paradigm. A traditional blocking flow control in wormhole routing may lead to a priority inversion in the sense that high priority packets are blocked by low priority packets for unlimited time. This uncontrolled priority inversion causes frequent deadline misses. This paper therefore proposes a new flow control called throttle and preempt flow control, where high priority packets can preempt network resources held by low priority packets, if necessary. As a result, this flow control does not cause priority inversion. Our simulations show that the throttle and preempt flow control dramatically reduces the deadline miss ratio without extra virtual channels. It is also observed that the throttle and preempt flow control offers shorter delay for non-real-time traffic than existing real-time flow control does.","PeriodicalId":221761,"journal":{"name":"Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115090426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}