2015 44th International Conference on Parallel Processing: Latest Publications

Slowing Little Quickens More: Improving DCTCP for Massive Concurrent Flows
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.78
Mao Miao, Peng Cheng, Fengyuan Ren, Ran Shu
Abstract: DCTCP is a potential TCP replacement designed to satisfy the requirements of data center networks, and it has received wide attention in both academia and industry. However, DCTCP supports only tens of concurrent flows well; facing numerous concurrent flows, it suffers timeouts and throughput collapse. This falls far short of what data center networks require: data centers employing the partition/aggregate pattern routinely involve hundreds of concurrent flows. In this paper, after tracing DCTCP's dynamic behavior through experiments, we identify two root causes of DCTCP's failure under this high fan-in traffic pattern: (1) the sending-window regulation mechanism becomes ineffective once cwnd has been decreased to its minimum size, and (2) bursts induced by synchronized flows with small cwnd cause fatal packet losses that lead to severe timeouts. We enhance DCTCP to support massive concurrent flows by regulating the sending time interval and desynchronizing sending times under particular conditions. The new protocol, called DCTCP+, outperforms DCTCP when the number of concurrent flows grows to several hundred. DCTCP+ effectively supports the short concurrent query responses in a benchmark from real production clusters, and maintains the same good performance when mixed with background traffic.
Citations: 6
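The remedy the abstract sketches, slowing down in time once the window can no longer shrink, can be illustrated with a toy pacing calculation. All constants and the `pacing_interval` helper are hypothetical illustrations, not DCTCP+'s actual mechanism, which lives inside the kernel TCP stack:

```python
import random

MIN_CWND = 2      # segments; the floor below which cwnd cannot shrink
RTT = 200e-6      # assumed 200 us datacenter round-trip time

def pacing_interval(cwnd, backoff=1.0, jitter=0.0):
    """Inter-segment send interval. Above the floor, the window does the
    regulating; at the floor, further backoff stretches the interval
    instead ("slowing little"), and random jitter desynchronizes the
    sending times of otherwise synchronized flows."""
    base = RTT / cwnd                 # pace cwnd segments evenly over one RTT
    if cwnd <= MIN_CWND:
        base *= backoff               # regulate time, not window
    return base * (1.0 + jitter * random.random())
```

The point of the sketch: once `cwnd == MIN_CWND`, rate reduction has to come from the time axis, since the window axis is exhausted.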
On Maximizing Reliability of Lifetime Constrained Data Aggregation Tree in Wireless Sensor Networks
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.17
M. Shan, Guihai Chen, Fan Wu, Xiaobing Wu, Xiaofeng Gao, Pan Wu, Haipeng Dai
Abstract: Tree-based routing structures are widely used to gather data in wireless sensor networks (WSNs). Along with tree structures, in-network aggregation is adopted to reduce transmissions, save energy, and prolong network lifetime. Most existing work on lifetime optimization for data aggregation does not take link quality into consideration. In this paper, we study the problem of Maximizing Reliability of Lifetime Constrained data aggregation trees (MRLC) in WSNs. Given the NP-completeness of the MRLC problem, we propose the Iterative Relaxation Algorithm (IRA), which iteratively relaxes the optimization program to find an aggregation tree that meets the lifetime bound at a sub-optimal cost. To suit the distributed nature of WSNs in practice, we further propose a Prüfer-code-based distributed updating protocol. Through extensive simulations, we demonstrate that IRA outperforms the best known related work in terms of reliability.
Citations: 4
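The distributed updating protocol builds on the Prüfer code, which represents any labeled tree on n nodes as a sequence of n − 2 node labels. A minimal decoder for the standard code (this is the textbook algorithm, not the paper's protocol itself):

```python
import heapq

def prufer_decode(seq, n):
    """Rebuild a labeled tree (as an edge list) from its Prufer sequence.
    Nodes are 0..n-1 and len(seq) must equal n - 2."""
    degree = [1] * n
    for v in seq:
        degree[v] += 1               # each appearance adds one tree edge
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)  # smallest current leaf
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # exactly two leaves remain; they form the last edge
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges
```

For example, `prufer_decode([3, 3], 4)` yields the star centered at node 3. The compactness of the code (n − 2 labels instead of n − 1 edges) is what makes it attractive for shipping tree updates between sensor nodes.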
Accelerating Spectral Calculation through Hybrid GPU-Based Computing
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.13
Jian Xiao, Xingyu Xu, Ce Yu, Jiawan Zhang, Shuinai Zhang, Li Ji, Ji-zhou Sun
Abstract: Spectral calculation and analysis have important practical applications in astrophysics. The main portion of a spectral calculation is solving a large number of one-dimensional numerical integrals, one at each point of a large three-dimensional parameter space. However, existing widely used solutions stop at process-level parallelism, which is ill-suited to numerous compute-intensive small integral tasks. This paper presents a GPU-optimized approach to accelerate the numerical integration in massive spectral calculations. We also propose a load-balancing strategy for a hybrid architecture of multiple CPUs and GPUs, using shared memory, to maximize performance. The approach was prototyped and tested on the Astrophysical Plasma Emission Code (APEC), a commonly used spectral toolset. Compared with the original serial version and a 24-core (2.5 GHz) CPU parallel version, our implementation on 3 Tesla C2075 GPUs achieves speedups of up to 300x and 22x, respectively.
Citations: 0
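The computational core described above, many small independent 1D integrals with one per point of the parameter space, can be mimicked on the CPU. Simpson's rule here is a stand-in assumption for whatever quadrature APEC actually uses:

```python
from math import sin, pi

def simpson(f, a, b, n=64):
    """Composite Simpson's rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

# One tiny independent integral per parameter point: the task pattern that
# maps naturally to one GPU thread (or thread block) per integral.
tasks = [(lambda x, k=k: sin(k * x), 0.0, pi) for k in range(1, 5)]
results = [simpson(f, a, b) for f, a, b in tasks]
```

Each task is tiny, so launching one OS process per task (process-level parallelism) wastes most of its time on overhead; batching thousands of such kernels is where the GPU wins.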
Optimizing Image Sharpening Algorithm on GPU
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.32
Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao
Abstract: Sharpness is an algorithm used to sharpen images. As image sizes, resolutions, and real-time processing requirements grow, the performance of sharpening must improve greatly. Because each pixel is computed independently, sharpening is a good candidate for GPU acceleration. However, one challenge in porting it to the GPU is that sharpening executes in several stages, each with its own characteristics and with or without data dependencies on other stages. Based on those characteristics, this paper proposes a complete solution to implement and optimize sharpening on the GPU. Our solution includes five effective techniques: data transfer optimization, kernel fusion, vectorization for data locality, and border and reduction optimization. Experiments show that, compared with a well-optimized CPU version, our GPU solution reaches a 10.7x to 69.3x speedup for different image sizes on an AMD FirePro W8000 GPU.
Citations: 0
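The per-pixel independence the abstract leans on is easiest to see in a scalar sketch of unsharp masking, a common sharpening formulation. The paper does not specify its exact kernel, so the 3x3 box blur below is an assumption:

```python
def sharpen(img, amount=1.0):
    """Unsharp masking: out = img + amount * (img - blur(img)).
    img is a list of rows of grayscale floats. Each output pixel depends
    only on a fixed neighborhood of the input, which is why this loop nest
    maps directly to one GPU work-item per pixel."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = cnt = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # clamp at the border
                        acc += img[yy][xx]
                        cnt += 1
            blur = acc / cnt
            out[y][x] = img[y][x] + amount * (img[y][x] - blur)
    return out
```

The border clamp is the scalar analogue of the "border optimization" stage: edge pixels need a different code path than interior pixels, which on a GPU is usually handled by a separate specialized kernel to keep the hot loop branch-free.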
PDTL: Parallel and Distributed Triangle Listing for Massive Graphs
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.46
Ilias Giechaskiel, G. Panagopoulos, Eiko Yoneki
Abstract: This paper presents the first distributed triangle listing algorithm with provable CPU, I/O, memory, and network bounds. Finding all triangles (3-cliques) in a graph has numerous applications in density and connectivity metrics, but the majority of existing algorithms for massive graphs are sequential, and distributed versions do not guarantee their CPU, I/O, memory, or network requirements. Our Parallel and Distributed Triangle Listing (PDTL) framework focuses on efficient external-memory access in distributed environments instead of fitting subgraphs into memory. It works by performing efficient orientation and load-balancing steps and replicating graphs across machines using an extended version of Hu et al.'s Massive Graph Triangulation algorithm. PDTL suits a variety of computational environments, from single-core machines to high-end clusters, and computes the exact triangle count on graphs of over 6B edges and 1B vertices (e.g. the Yahoo graphs), outperforming the state-of-the-art systems PowerGraph, OPT, and PATRIC by 2x to 4x while using fewer resources. Our approach thus highlights the importance of I/O in distributed environments.
Citations: 30
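The orientation step mentioned above is standard in triangle listing: direct each edge toward the endpoint of higher (degree, id) rank, so every triangle is discovered exactly once, at the edge between its two lowest-ranked vertices. An in-memory sketch of that core idea (PDTL's actual contribution is doing the intersections out-of-core and across machines, which this toy omits):

```python
from collections import defaultdict

def list_triangles(edges):
    """List each triangle of an undirected graph exactly once using
    degree-based edge orientation plus out-neighbor set intersection."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda v: (deg[v], v)          # total order: degree, then id
    out = defaultdict(set)
    for u, v in edges:
        a, b = sorted((u, v), key=rank)
        out[a].add(b)                     # orient low-rank -> high-rank
    tris = []
    for u, v in edges:
        a, b = sorted((u, v), key=rank)
        for w in out[a] & out[b]:         # common out-neighbors close a triangle
            tris.append(tuple(sorted((a, b, w))))
    return sorted(tris)
```

Orienting by degree bounds every out-neighborhood by O(sqrt(m)) edges, which is what keeps the per-edge intersections, and hence the I/O of an external-memory variant, under control.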
SZTS: A Novel Big Data Transportation System Benchmark Suite
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.91
Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu
Abstract: Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, place unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose Shenzhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprising real-life transportation analysis applications with real-life input data sets from Shenzhen, China. SZTS uniquely focuses on a specific real-life application domain, whereas existing Hadoop benchmark suites such as HiBench and CloudRank-D consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at both the job and microarchitecture levels, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as the general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior to input data size, and propose a methodology for identifying representative input data sets.
Citations: 10
Parallel (Probable) Lock-Free Hash Sieve: A Practical Sieving Algorithm for the SVP
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.68
Artur Mariano, C. Bischof, Thijs Laarhoven
Abstract: In this paper, we assess the practicability of Hash Sieve, a recently proposed sieving algorithm for the Shortest Vector Problem (SVP) on lattices, on multi-core shared-memory systems. To this end, we devised a parallel implementation that scales well, based on a probably lock-free system for handling concurrency. This system, implemented with spin-locks that are in turn built on CAS operations, is in practice likely to behave as a lock-free mechanism, since threads block only when strictly required, and chances are they will not need to block. With our implementation, we were able to solve the SVP on an arbitrary lattice in dimension 96 in less than 17.5 hours, using 16 physical cores. A least-squares fit of the execution times of our implementation, in seconds, lies between 2^(0.32n - 15) and 2^(0.33n - 16). These results are of paramount importance for the selection of parameters in lattice-based cryptography, as they indicate that sieving algorithms are far more practical for solving the SVP than previously believed.
Citations: 36
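As a quick plausibility check on the reported fits (plain arithmetic, using nothing from the paper beyond the two formulas and the dimension-96 timing):

```python
# The fitted running time lies between 2^(0.32n - 15) and 2^(0.33n - 16)
# seconds; both curves should be consistent with the reported observation
# that n = 96 was solved in under 17.5 hours.
def fit_hours(c, d, n):
    return 2 ** (c * n - d) / 3600   # seconds -> hours

lo = fit_hours(0.33, 16, 96)         # roughly 14.6 h
hi = fit_hours(0.32, 15, 96)         # roughly 15.0 h
```

Both fits predict about 15 hours at dimension 96, comfortably below the measured 17.5-hour bound, so the formulas and the headline result agree.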
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.107
Da Li, Hancheng Wu, M. Becchi
Abstract: The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and thereby limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations, in particular recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead of their invocation. Our results show that using our parallelization templates on applications with irregular nested loops can yield a 2-6x speedup over baseline GPU codes that lack load-balancing mechanisms. Nested-parallelism-based templates applied to recursive tree traversal algorithms can yield substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism remain unclear for recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
Citations: 16
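One family of templates for irregular nested loops flattens the two loop levels into a single index space so that work is spread evenly across threads. A sequential sketch of that mapping (the prefix-sum layout is a common device for this; it is not necessarily the authors' exact template):

```python
# Instead of assigning one outer iteration (with an irregular inner trip
# count) per thread, expand all (outer, inner) pairs into one flat index
# space via an exclusive prefix sum over the inner-loop lengths.
from itertools import accumulate
from bisect import bisect_right

inner_len = [1, 5, 0, 3]                     # irregular inner trip counts
offsets = [0] + list(accumulate(inner_len))  # exclusive prefix sum + total
total = offsets[-1]                          # 9 flat work items in all

def unflatten(flat_id):
    """Map a flat work-item id back to its (outer, inner) pair --
    exactly what each GPU thread would do with its global id."""
    outer = bisect_right(offsets, flat_id) - 1
    return outer, flat_id - offsets[outer]
```

With this layout a thread assigned flat id 5 works on inner iteration 4 of outer iteration 1, while the zero-length outer iteration 2 consumes no thread at all, which is precisely the underutilization the naïve one-outer-iteration-per-thread mapping suffers from.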
Optimization of Resource Allocation and Energy Efficiency in Heterogeneous Cloud Data Centers
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.9
Amer Qouneh, Ming Liu, Tao Li
Abstract: Performance and energy efficiency are major concerns in cloud computing data centers, and they often carry conflicting requirements that make optimization a challenge. Further complications arise when heterogeneous hardware and data center management technologies are combined: heterogeneous hardware such as general-purpose graphics processing units (GPGPUs) improves performance at the cost of greater power consumption, while virtualization technologies improve resource management and utilization at the cost of degraded performance. In this paper, we focus on exploiting the heterogeneity introduced by GPUs to reduce the power budget requirements of servers while maintaining performance. To maintain or improve overall server performance at a reduced power budget, we propose two enhancements: (a) we borrow power from co-located multi-threaded virtual machines (VMs) and reallocate it to GPU VMs, and (b) to compensate the multi-threaded VMs and re-boost their performance, we borrow virtual computing resources from GPU VMs and reallocate them to CPU VMs. Combining the two techniques minimizes the server power budget while maintaining overall server performance. Our results show that the server power budget can be reduced by almost 18% at an average cost of 13% performance degradation per virtual machine. In addition, reallocating virtual resources improves the performance of multi-threaded applications by 30% without affecting GPU applications. Combining both techniques reduces server energy consumption by 47% with minimal performance degradation.
Citations: 6
Joint Media Streaming Optimization of Energy and Rebuffering Time in Cellular Networks
2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.49
Zeqi Lai, Yong Cui, Yayun Bao, Jiangchuan Liu, Yingchao Zhao, Xiao Ma
Abstract: Streaming services are gaining popularity and contribute a tremendous fraction of today's cellular network traffic. Both playback fluency and battery endurance are significant performance metrics for mobile streaming services. However, because of unpredictable network conditions and the loose coupling between upper-layer streaming protocols and underlying network configurations, jointly optimizing rebuffering time and energy consumption for mobile streaming remains a significant challenge. In this paper, we propose a novel framework that effectively addresses these limitations and optimizes video transmission in cellular networks. Within this framework we design two complementary algorithms, the Rebuffering Time Minimization Algorithm (RTMA) and the Energy Minimization Algorithm (EMA), to achieve smooth playback and energy efficiency on demand in multi-user scenarios. Our algorithms integrate cross-layer parameters to schedule video delivery: RTMA aims to achieve the minimum rebuffering time within a limited energy budget, while EMA seeks the minimum energy consumption while meeting a rebuffering time constraint. Extensive simulation demonstrates that RTMA reduces rebuffering time by at least 68% and EMA achieves more than 27% energy reduction compared with other state-of-the-art solutions.
Citations: 1