2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS): Latest Papers

Using surrogate-based modeling to predict optimal I/O parameters of applications at the extreme scale
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097855
Michael Matheny, Stephen Herbein, N. Podhorszki, S. Klasky, M. Taufer
Abstract: On petascale systems, the selection of optimal values for I/O parameters without taking into account the I/O size and pattern can cause the I/O time to dominate the simulation time, compromising the application's scalability. In this paper, we adopt and adapt an engineering method called surrogate-based modeling to efficiently search for the optimal I/O parameter values and accurately predict the associated I/O times at the extreme scale. Our approach allows us to address both the search and prediction in a short time, even when the application's I/O is large and exhibits irregular patterns.
Citations: 8
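The surrogate-based search described in the abstract above can be illustrated with a deliberately small example: sample the expensive I/O time at a few parameter values, fit a cheap model to the samples, and take the model's minimum as the predicted optimum. The abstract does not say which surrogate the authors use, so everything here is an assumption: a single tunable parameter, a quadratic surrogate through three samples, the function names `fit_quadratic` and `predict_optimum`, and the synthetic sample values.

```python
def fit_quadratic(points):
    """Return (a, b, c) of the parabola a*x^2 + b*x + c through three (x, y) points."""
    (x0, y0), (x1, y1), (x2, y2) = points
    # Newton divided differences of the three samples.
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    a = (d12 - d01) / (x2 - x0)
    b = d01 - a * (x0 + x1)
    c = y0 - a * x0 * x0 - b * x0
    return a, b, c


def predict_optimum(points, lo, hi):
    """Minimize the quadratic surrogate on [lo, hi]; return (best_x, predicted_time)."""
    a, b, c = fit_quadratic(points)

    def surrogate(x):
        return a * x * x + b * x + c

    candidates = [lo, hi]
    if a > 0:  # the parabola has an interior minimum only when it opens upward
        vertex = -b / (2 * a)
        if lo <= vertex <= hi:
            candidates.append(vertex)
    best = min(candidates, key=surrogate)
    return best, surrogate(best)


# Synthetic (parameter value, measured I/O time) samples, for illustration only.
samples = [(2, 26.0), (8, 14.0), (16, 110.0)]
best_value, predicted_time = predict_optimum(samples, lo=1, hi=32)
print(best_value, predicted_time)
```

In practice the surrogate would be refit as new measurements arrive, and a real parameter space is multi-dimensional; the sketch only shows the sample, fit, and minimize loop in its simplest form.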
Achieving cost effective cloud video services via fine grained multicore scheduling
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097843
Hao-Che Kao, Hao-Ping Kang, Che-Rung Lee, Kun-Hsien Lu, Shu-Hsin Chang
Abstract: Cloud computing, with its highly accessible and elastic computing resources, matches the demands of video services, which employ massive storage and intensive computational power to store, transmit, compress, enhance, and analyze videos uploaded from commodity devices and surveillance cameras. However, most existing video processing programs are neither designed to run in parallel environments nor able to efficiently utilize the computational power of cloud platforms, which not only wastes computing resources but also increases the cost of using cloud platforms. In this paper, we present three strategies to enhance multicore utilization for video processing, namely the producer-consumer model, intra-process overlapping, and inter-process overlapping. We tested our strategies on a video enhancement program that performs decoding, dehazing, and encoding, and the results showed that CPU utilization can be improved by up to 31% for an 8-core instance, which can significantly reduce cost in the long run.
Citations: 0
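The producer-consumer strategy named in the abstract above can be sketched as a chain of stages connected by bounded queues, so that decoding, dehazing, and encoding of different frames overlap. This is a minimal illustration and not the authors' implementation: `run_pipeline`, the stage functions, and the queue capacity are hypothetical, and the stage bodies are string placeholders rather than real codecs.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the frame stream


def _stage(work, inbox, outbox):
    """Consume items from inbox, apply work, and pass results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def _feed(frames, outbox):
    """Producer: push every frame into the first queue, then signal completion."""
    for frame in frames:
        outbox.put(frame)
    outbox.put(_DONE)


def run_pipeline(frames, stages, capacity=4):
    """Run frames through the stages, one thread per stage, preserving order."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=_feed, args=(frames, queues[0]))]
    threads += [
        threading.Thread(target=_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    results = []
    while True:  # the calling thread drains the last queue
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Placeholder stages standing in for the real decode, dehaze, and encode steps.
def decode(frame):
    return f"decoded({frame})"


def dehaze(frame):
    return f"dehazed({frame})"


def encode(frame):
    return f"encoded({frame})"


print(run_pipeline(["f0", "f1", "f2"], [decode, dehaze, encode]))
```

The bounded queues apply back-pressure so a fast stage cannot run arbitrarily far ahead of a slow one. In standard CPython builds, threads overlap only work that releases the global interpreter lock (I/O or native library calls), so a CPU-bound version would more likely use processes or native threads.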
Combine thread with memory scheduling for maximizing performance in multi-core systems
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097821
Gangyong Jia, Guangjie Han, Liang Shi, Jian Wan, Dong Dai
Abstract: The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM's speed and throughput. Moreover, on multi-core platforms, DRAM memory shared by all cores usually suffers from memory contention and interference, which can cause serious performance degradation and unfairness among threads running in parallel. To address these problems, this paper proposes techniques that combine two advantages: partitioning cores, threads, and memory banks into groups to reduce interference among different groups, and grouping memory accesses to the same row together to reduce the cache miss rate. We propose a memory optimization framework that combines thread scheduling with memory scheduling (CTMS), which simultaneously minimizes memory access schedule length and memory access time and reduces interference to maximize performance for multi-core systems. Experimental results show that CTMS shortens memory access time by 12.6% while improving throughput by 11.8% on average. Moreover, CTMS also saves 5.8% of the energy consumption.
Citations: 4
Building a large-scale direct network with low-radix routers
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097830
Yong Su, Zheng Cao, Zhiguo Fan, Zhan Wang, Xiaoli Liu, Xiaobing Liu, Li Qiang, Xuejun An, Ninghui Sun
Abstract: Communication locality is an important characteristic of parallel applications. A great deal of research shows that exploiting this characteristic benefits most applications. To exploit communication locality, we present a hierarchical direct network topology that accelerates neighbor communication. Combining mesh topology and complete graph topology, it can be used to optimize local communication and build large-scale networks with low-radix routers. Analyzing the characteristics of the hierarchical topology, we find that the presented topology offers high cost-performance and excellent expandability. We also design two minimum path routing algorithms and compare them with the Mesh, Dragonfly, and PERCS topologies. The results show that the saturated throughput of the hierarchical topology is nearly 40% with a uniform random trace and 70% with a local communication model of 4K nodes. That indicates high scalability for applications with local communication and cost efficiency for uniform random traces.
Citations: 1
Heterogeneous CPU-GPU computing for the finite volume method on 3D unstructured meshes
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097808
J. Langguth, Xing Cai
Abstract: A recent trend in modern high-performance computing environments is the introduction of accelerators such as GPUs and Xeon Phi, i.e., specialized computing devices that are optimized for highly parallel applications and coexist with CPUs. In regular compute-intensive applications with predictable data access patterns, these devices often outperform traditional CPUs by far and thus relegate them to pure control functions instead of computations. For irregular applications, however, the gap in relative performance can be much smaller, and sometimes even reversed. Thus, maximizing overall performance in such systems requires making full use of all available computational resources. In this paper we study the attainable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes using heterogeneous systems consisting of CPUs and multiple GPUs. Finite volume methods are widely used numerical strategies for solving partial differential equations. The advantages of using finite volumes include built-in support for conservation laws and suitability for unstructured meshes. Our focus lies in demonstrating how a workload distribution that maximizes overall performance can be derived from the actual performance attained by the different computing devices in the heterogeneous environment. We also highlight the dual role of partitioning software in reordering and partitioning the input mesh, thus giving rise to a new combined approach to partitioning.
Citations: 8
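The abstract above mentions deriving a workload distribution from the performance each device actually attains. One simple reading of that idea, and only an assumption about how it might be done, is to give every device a share of the mesh cells proportional to its measured throughput. The function name `split_by_throughput`, the device labels, and the throughput figures below are hypothetical.

```python
def split_by_throughput(num_cells, throughput):
    """Assign each device a cell count proportional to its measured throughput.

    throughput maps a device name to a measured rate (for example cells per second).
    """
    total = sum(throughput.values())
    # Round every share down first so the shares never exceed num_cells.
    shares = {dev: int(num_cells * rate / total) for dev, rate in throughput.items()}
    # Hand the few remaining cells to the fastest devices.
    leftover = num_cells - sum(shares.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for dev in fastest_first[:leftover]:
        shares[dev] += 1
    return shares


# Illustrative rates only: one CPU and two GPUs of different speeds.
measured = {"cpu": 1.0, "gpu0": 4.0, "gpu1": 5.0}
print(split_by_throughput(1000, measured))
```

A mesh partitioner would then be asked for parts of these target sizes; balancing communication between parts is a separate concern that this sketch ignores.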
Performance analysis of HPC applications with irregular tree data structures
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097837
A. Khawaja, Jiajun Wang, A. Gerstlauer, L. John, D. Malhotra, G. Biros
Abstract: Adaptive mesh refinement (AMR) numerical methods utilizing octree data structures are an important class of HPC applications, in particular for the solution of partial differential equations. Much effort goes into the implementation of efficient versions of these types of programs, where the emphasis is often on increasing multi-node performance when utilizing GPUs and coprocessors. By contrast, our analysis aims to characterize these workloads on traditional CPUs, as we believe that single-threaded intra-node performance of critical kernels is still a key factor for achieving performance at scale. Irregular workloads such as AMR methods in particular, however, exhibit severe underutilization on general-purpose processors. In this paper, we analyze the single-core performance of two state-of-the-art, highly scalable adaptive mesh refinement codes, one based on the Fast Multipole Method (FMM) and one based on the Finite Element Method (FEM), when running on an x86 CPU. We examined both scalar and vectorized implementations to identify performance bottlenecks. We demonstrate that vectorization can provide a significant benefit in achieving high performance. The greatest bottleneck to peak performance is the high fraction of non-floating point instructions in the kernels.
Citations: 1
ArPat: Accurate RFID reader positioning with mere boundary tags
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097901
Guanglian Liu, Shigeng Zhang, Jianxin Wang, Xuan Liu
Abstract: The Radio Frequency IDentification (RFID) technology provides a promising solution to location discovery in indoor environments. Existing RFID reader positioning algorithms usually use all the collected reference tags to determine the position of the target reader, and thus are time-consuming as well as susceptible to communication irregularity between the reader and reference tags. In particular, they usually perform poorly when the target reader is near a wall or in a corner. In this paper, we propose ArPat, an Accurate RFID reader Positioning algorithm that uses mere boundary reference Tags to calculate the position of the reader. ArPat uses only boundary tags to determine the position of the target reader, which effectively mitigates the negative impact of communication irregularity on the localization accuracy. The localization accuracy of ArPat is higher than 0.2 ft when the space between reference tags is 1 ft. Compared with state-of-the-art solutions for RFID reader positioning, ArPat improves localization accuracy by up to 42 percent and by 36 percent on average. Furthermore, it uses a geometric approach rather than the iterative optimization approaches employed by previous solutions, making it superior in time efficiency. Compared with previous solutions, the computational time of ArPat is nearly two orders of magnitude less. This is critical for a localization system to provide real-time location discovery and tracking services.
Citations: 1
Atomic reduction based sparse matrix-transpose vector multiplication on GPUs
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097920
Yuan Tao, Yangdong Deng, Shuai Mu, Mingfa Zhu, Limin Xiao, Li Ruan, Zhibin Huang
Abstract: Sparse Matrix-Transpose Vector Product (SMTVP) is a frequently used computation pattern in High Performance Computing applications. It is typically solved by transposition followed by a Sparse Matrix-Vector Product (SMVP) in current linear algebra packages. However, the transposition process can be a serious bottleneck on modern parallel computing platforms. A previous work proposed a relatively complex data structure for efficiently computing SMTVP with multi-core CPUs, but it proved to be inefficient on GPUs. In this work, we show that the Compressed Sparse Row (CSR) based SMVP algorithm can also be efficient for SMTVP computation on modern GPUs. The proposed method exploits atomic operations to perform the reduce operation in the computation of each inner product of a row in the transposed matrix and the vector. Experimental results show that the simple technique can outperform the SMTVP flow of transposition plus SMVP released in the CUSPARSE package by up to 405-fold.
Citations: 6
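The idea in the abstract above, computing the transpose product directly from the CSR arrays instead of transposing first, can be shown with a sequential sketch. Each stored nonzero in row i and column j contributes its value times x[i] to y[j], so the result is built by scattered additions. The abstract does not give the authors' kernel; the function name `csr_transpose_matvec` and the mapping of work to GPU threads mentioned in the comments are assumptions.

```python
def csr_transpose_matvec(num_cols, row_ptr, col_idx, values, x):
    """Compute y = A^T x for a matrix A stored in CSR form, without forming A^T."""
    y = [0.0] * num_cols
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # On a GPU, rows would be processed in parallel and different rows
            # can hit the same y entry, so this update would be an atomic add.
            y[col_idx[k]] += values[k] * x[row]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]]  stored in CSR form.
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
x = [10.0, 20.0]
print(csr_transpose_matvec(3, row_ptr, col_idx, values, x))
```

Reading the CSR arrays in row order keeps the access to the matrix data sequential; the cost is the scattered writes to y, which is where the atomic reduction comes in on parallel hardware.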
Design and analysis of software defined Vehicular Cyber Physical Systems
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097836
P. Duan, Chao Peng, Qin Zhu, Jingmin Shi, Haibin Cai
Abstract: VCPS (Vehicular Cyber Physical Systems) is a special kind of networked cyber physical system in which each vehicle is regarded as a communication unit. Vehicle movement in VCPS is restricted by roads and the environment, and the traditional random mobility and waypoint mobility models cannot reflect realistic vehicle traces. Because vehicles in VCPS move at high speed, the network topology changes constantly, which greatly undermines the stability of communication between vehicles. The diversity and complexity of traffic scenarios in VCPS have also increased the difficulty of designing an efficient and stable routing protocol. In this paper, we combine SDN (Software Defined Networking) and VCPS and propose a new VCPS communication architecture, which enables VCPS to be managed by a remote controller. The resulting software defined VCPS (SD-VCPS) can flexibly change routing policies depending on different traffic scenes or traffic periods, adjusting the topology of VCPS to adapt to different network requirements. We further present a new location-based routing protocol for SD-VCPS, and corroborate the efficiency of our proposed framework by experiments using the network simulator NS3.
Citations: 7
Providing hybrid block storage for virtual machines using object-based storage
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097803
Sixiang Ma, Hao-peng Chen, Yuxi Shen, Heng Lu, Bin Wei, P. He
Abstract: This paper presents the design, implementation, and evaluation of a multi-tiered storage system called MOBBS, which provides hybrid block storage for Virtual Machines (VMs) on top of object-based storage infrastructure. MOBBS is mainly motivated by the gap between the lack of studies on hybrid block storage for VMs and the increasing prevalence of hybrid storage systems. By stripping disk images into partitions and intelligently storing them on different storage tiers according to real-time workload patterns, MOBBS achieves efficient use of multiple storage devices and relieves the burden of data placement. Leveraging the benefits of object-based storage, MOBBS is able to dynamically perform non-disruptive and fine-grained data migration between storage tiers and distribute the complexity of data migration across all storage nodes. Such designs enable our system to deliver storage for VMs with high scalability and availability while using SSDs efficiently. We evaluated a Ceph implementation of MOBBS using both block and file system workloads. The results comprehensively demonstrate MOBBS's effectiveness in performance improvement as well as efficient utilization of different storage devices.
Citations: 8