Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection
Ricardo Nobre, A. Ilic, Sergio Santander-Jiménez, L. Sousa
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 338-347. DOI: 10.1109/IPDPS47924.2020.00043

Abstract: Genome-wide association studies are performed to correlate a number of diseases and other physical or even psychological conditions (phenotypes) with substitutions of nucleotides at specific positions in the human genome, mainly single-nucleotide polymorphisms (SNPs). Some conditions, possibly because of the complexity of the mechanisms that give rise to them, have been found to be more statistically correlated with genotype when multiple SNPs are jointly taken into account. However, the discovery of new associations between genotype and phenotype is exponentially slowed down by the increase in computational power required when epistasis, i.e., interactions between SNPs, is considered. This paper proposes a novel graphics processing unit (GPU)-based approach for epistasis detection that combines modern tensor cores, which natively support binarized inputs, with algorithmic and target-focused optimizations. Using only a single mid-range Turing-based GPU, the proposed approach evaluates 64.8×10^12 and 25.4×10^12 sets of SNPs per second, normalized to the number of patients, for 2-way and 3-way epistasis detection, respectively. It surpasses the state-of-the-art approach by 6× and 8.2× in terms of the number of pairs and triplets of SNP allelic patient data evaluated per unit of time per GPU.
Packet-in Request Redirection for Minimizing Control Plane Response Time
Rui Xia, Haipeng Dai, Jiaqi Zheng, Hong Xu, M. Li, Guihai Chen
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 926-935. DOI: 10.1109/IPDPS47924.2020.00099

Abstract: A distributed control plane is more scalable and robust in software-defined networking. This paper focuses on controller load balancing using packet-in request redirection, that is, given the instantaneous state of the system, determining whether to redirect packet-in requests for each switch so that the overall control plane response time (CPRT) is minimized. To address this problem, we propose a framework based on Lyapunov optimization. First, we use the drift-plus-penalty algorithm to combine the CPRT minimization problem with controller capacity constraints and derive a non-linear program whose optimal solution is obtained by brute force using standard linearization techniques. Second, we present a greedy strategy that efficiently obtains a solution with a bounded approximation ratio. Third, we reformulate the program as the problem of maximizing a non-monotone submodular function subject to matroid constraints. We implement a controller prototype for packet-in request redirection and conduct trace-driven simulations to validate our theoretical results. The results show that our algorithms can reduce the average CPRT by 81.6% compared to static controller-switch assignment and achieve a 3× improvement in the maximum controller capacity violation ratio.
Understanding and Improving Persistent Transactions on Optane™ DC Memory
P. Zardoshti, Michael F. Spear, A. Vosoughi, G. Swart
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 348-357. DOI: 10.1109/IPDPS47924.2020.00044

Abstract: Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery, as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory. In this paper, we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.
Efficient I/O for Neural Network Training with Compressed Data
Zhao Zhang, Lei Huang, J. G. Pauloski, Ian T. Foster
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 409-418. DOI: 10.1109/IPDPS47924.2020.00050

Abstract: FanStore is a shared object store that enables efficient and scalable neural network training on supercomputers. By providing a global cache layer on node-local burst buffers using a compressed representation, it significantly enhances the processing capability of deep learning (DL) applications on existing hardware. In addition, FanStore allows POSIX-compliant file access to the compressed data in user space. We investigate the tradeoff between runtime overhead and data compression ratio using real-world datasets and applications, and propose a compressor selection algorithm to maximize storage capacity given performance constraints. We consider both asynchronous (i.e., with prefetching) and synchronous I/O strategies, and propose mechanisms for selecting compressors for both approaches. Using FanStore, the same storage hardware can host 2–13× more data for example applications without significant runtime overhead. Empirically, our experiments show that FanStore scales to 512 compute nodes with near-linear performance scalability.
FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications
Pinchao Liu, Hailu Xu, D. D. Silva, Qingyang Wang, Sarker Tanzir Ahmed, Liting Hu
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 1102-1111. DOI: 10.1109/IPDPS47924.2020.00116

Abstract: Streaming computations are by nature long-running. They run in highly dynamic distributed environments where many stream operators may leave or fail at the same time. Most of them are stateful: stream operators need to store and maintain large state in memory, which makes recovery expensive in both time and space. State-of-the-art stream processing systems offer failure recovery mainly through three approaches, namely replication recovery, checkpointing recovery, and DStream-based lineage recovery, which are either slow, resource-expensive, or unable to handle many simultaneous failures. We present FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications. The novelty of FP4S is that we organize all of an application's operators into a distributed hash table (DHT)-based consistent ring to associate each operator with a unique set of neighbors. We then divide each operator's in-memory state into many fragments and periodically save them on each node's neighbors, ensuring that different sets of available fragments can reconstruct lost state in parallel. This makes the failure recovery mechanism extremely scalable and allows it to tolerate many simultaneous operator failures. We apply FP4S to Apache Storm and evaluate it in large-scale real-world experiments, which demonstrate its scalability, efficiency, and fast failure recovery. Compared to the state-of-the-art solution (Apache Storm), FP4S reduces state recovery latency by 37.8% and saves more than half of the hardware costs. It scales to many simultaneous failures and successfully recovers state when up to 66.6% of the state fails or is lost.
GPU-Based Static Data-Flow Analysis for Fast and Scalable Android App Vetting
Xiaodong Yu, Fengguo Wei, Xinming Ou, M. Becchi, Tekin Bicer, D. Yao
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 274-284. DOI: 10.1109/IPDPS47924.2020.00037

Abstract: Many popular vetting tools for Android applications use static code analysis techniques. In particular, Interprocedural Data-Flow Graph (IDFG) construction is the computation at the core of Android static data-flow analysis and consumes most of the analysis time. Many analysis tools use a worklist algorithm, an iterative fixed-point approach, to construct the IDFG. In this paper, we observe that a straightforward GPU parallelization of the worklist algorithm leads to significant underutilization of GPU resources. We identify four performance bottlenecks: frequent dynamic memory allocations, high branch divergence, workload imbalance, and irregular memory access patterns. Accordingly, we propose GDroid, a GPU-based worklist algorithm implementation with multiple fine-grained optimizations tailored to common characteristics of Android applications: a matrix-based data structure, memory-access-based node grouping, and worklist merging. Our experimental evaluation on 1000 Android applications shows that the proposed optimizations are beneficial to performance, and that GDroid achieves speedups of up to 128× over a plain GPU implementation.
An Active Learning Method for Empirical Modeling in Performance Tuning
Jiepeng Zhang, Jingwei Sun, Wenju Zhou, Guangzhong Sun
In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020, pp. 244-253. DOI: 10.1109/IPDPS47924.2020.00034

Abstract: Tuning the performance of scientific applications is challenging because performance can be a complicated nonlinear function of the application parameters. Empirical performance modeling is a useful approach to approximate this function and enable efficient heuristic methods to find near-optimal parameter configurations. However, empirical performance modeling requires a large number of samples from the parameter space, which is resource- and time-consuming. To address this issue, existing work based on active learning proposed the PBU sampling method, which considers performance before uncertainty: it iteratively performs performance-biased sampling to model the high-performance subspace instead of the entire space, and then evaluates the most uncertain samples to reduce redundancy. Compared with uniform random sampling, this approach reduces the number of samples, but it still involves redundant sampling that can potentially be avoided. We propose a novel active-learning-based method that exploits the information of evaluated samples and explores possible high-performance parameter configurations. Specifically, we adopt a Performance Weighted Uncertainty (PWU) sampling strategy to identify configurations with either high performance or high uncertainty and determine which ones to select for evaluation. To evaluate the effectiveness of the proposed method, we construct random forests to predict the execution time of kernels from the SPAPT suite and of two typical scientific parallel applications, Kripke and hypre. Experimental results show that, compared with existing methods, our method reduces the cost of modeling by up to 21× (3× on average) while maintaining the same prediction accuracy.