ACM/IEEE SC 2006 Conference (SC'06): Latest Papers, Page 6

Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems)

ACM/IEEE SC 2006 Conference (SC'06) Pub Date : 2006-11-11 DOI: 10.1145/1188455.1188573

J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Dongarra

Abstract: Recent microprocessors deliver substantially higher performance for 32 bit floating point arithmetic (single precision) than for 64 bit floating point arithmetic (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures, and IBM's Cell Broadband Engine processor. In single precision, floating point operations can be performed up to two times faster than in double precision on the Pentium and up to ten times faster on the Cell. These architectures obtain the performance gains through extensions to the basic architecture, such as SSE2 on the Pentium and the vector functions on the IBM Cell. The motivation for this paper is to exploit single precision operations whenever possible and resort to double precision at critical stages while attempting to provide full double precision results. The results described here are fairly general and can be applied to various problems in linear algebra, such as solving large sparse systems using direct or iterative methods, and some eigenvalue problems. The process has limitations, such as when the condition number of the problem exceeds the reciprocal of the accuracy of the single precision computations. In that case the double precision algorithm should be used.
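The idea in this abstract (do the expensive work in single precision, then recover double precision accuracy at a few critical steps) can be illustrated with a small sketch of mixed-precision iterative refinement for a dense system. This is not the authors' implementation: the function names `f32`, `lu_factor_f32`, `lu_solve_f32`, and `refine_solve` are hypothetical, single precision is emulated by rounding through the standard `struct` module, and the stopping tolerance is an assumed value.

```python
import struct


def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float (emulated single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting; every result is rounded to 32 bits."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Solve with the 32-bit factors; all arithmetic is rounded to 32 bits."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):  # forward substitution (unit lower triangle)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # back substitution
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def refine_solve(a, b, max_iter=20, tol=1e-14):
    """Solve a x = b: factor once in single precision, refine in double precision."""
    n = len(a)
    lu, perm = lu_factor_f32(a)    # the O(n^3) work, done in single precision
    x = lu_solve_f32(lu, perm, b)  # initial single precision solution
    for _ in range(max_iter):
        # Critical step: the residual r = b - a x is computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve_f32(lu, perm, r)        # correction reuses the 32-bit factors
        x = [x[i] + d[i] for i in range(n)]  # update accumulated in double precision
    return x
```

Calling `refine_solve(a, b)` on a small, well-conditioned matrix should return a solution close to double precision accuracy even though the factorization is done in single precision. As the abstract notes, the refinement is not expected to converge when the condition number approaches the reciprocal of single precision accuracy.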

Citations: 159

Locality and Parallelism Optimization for Dynamic Programming Algorithm in Bioinformatics

ACM/IEEE SC 2006 Conference (SC'06) Pub Date : 2006-11-01 DOI: 10.1145/1188455.1188538

Guangming Tan, S. Feng, Ninghui Sun

Abstract: Dynamic programming has been one of the most efficient approaches to sequence analysis and structure prediction in biology. However, its performance is limited by the drastic increase in both the volume of biological data and the variety of computer architectures. This paper presents algorithms that address the challenges of improving memory efficiency and network latency tolerance for nonserial polyadic dynamic programming where the dependences are nonuniform. By relaxing the nonuniform dependences, we propose a new cache-oblivious scheme to enhance performance on memory hierarchy architectures. Moreover, we develop and extend a tiling technique to parallelize this nonserial polyadic dynamic programming using an alternate block-cyclic mapping strategy that balances the computational and memory load. An analytical parameterized model is formulated to determine the tile volume size that minimizes the total execution time, and an algorithmic transformation schedules the tiles to overlap communication with computation, further reducing communication overhead on parallel architectures. The numerical experiments were carried out on several high performance computer systems. The new cache-oblivious dynamic programming algorithm achieves a 2-10x speedup, and the parallel tiling algorithm with communication-computation overlapping shows promising potential for fine-grained parallel computing on massively parallel computer systems.
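For readers unfamiliar with the term, the dependence pattern of nonserial polyadic dynamic programming can be illustrated with the classic matrix-chain recurrence: each cell depends on two earlier cells (polyadic) taken from many non-adjacent stages (nonserial). The sketch below is only a plain baseline for that pattern. It is not the cache-oblivious or tiled algorithm from the paper, and the function name `min_split_cost` is hypothetical.

```python
def min_split_cost(dims):
    """Minimum scalar-multiplication cost of multiplying a chain of matrices.

    dims has length n + 1; matrix i has shape dims[i] x dims[i + 1].
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    # Fill the upper triangle diagonal by diagonal (increasing chain length).
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # cost[i][j] reads the whole row segment cost[i][i..j-1] and the
            # column segment cost[i+1..j][j]: the nonuniform dependences.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1] if n > 0 else 0
```

For example, `min_split_cost([10, 30, 5, 60])` evaluates to 4500, the cost of multiplying the first two matrices before the third. The long row and column reads in the inner loop are what make this class of recurrence hard on caches, which is the problem the abstract's cache-oblivious and tiling schemes target.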

Citations: 39

FFT Program Generation for Shared Memory: SMP and Multicore

ACM/IEEE SC 2006 Conference (SC'06) Pub Date : 2006-11-01 DOI: 10.1145/1188455.1188575

F. Franchetti, Y. Voronenko, Markus Püschel

Citations: 93

Detecting Distributed Scans Using High-Performance Query-Driven Visualization

ACM/IEEE SC 2006 Conference (SC'06) Pub Date : 2006-09-01 DOI: 10.1145/1188455.1188542

Kurt Stockinger, E. W. Bethel, S. Campbell, E. Dart, K. Wu

Citations: 49

Data Intensive Computing Panel Discussion

ACM/IEEE SC 2006 Conference (SC'06) Pub Date : 1900-01-01 DOI: 10.1109/sc.2006.21

Citations: 0

MPI Performance Analysis Tools on Blue Gene/L

ACM/IEEE SC 2006 Conference (SC'06) Pub Date : 1900-01-01 DOI: 10.1109/SC.2006.43

I. Chung, R. Walkup, H. Wen, Hao Yu

Citations: 34