2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): Latest Articles

LBE: A Computational Load Balancing Algorithm for Speeding up Parallel Peptide Search in Mass-Spectrometry Based Proteomics

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00040

Muhammad Haseeb, Fatima Afzali, F. Saeed

Abstract: The most commonly employed method for peptide identification in mass-spectrometry based proteomics compares experimentally obtained tandem MS/MS spectra against a set of theoretical MS/MS spectra, which are predicted from a protein sequence database. Most state-of-the-art peptide search algorithms index the theoretical spectra so that, when searching an experimental MS/MS spectrum, they can quickly filter in the relevant (similar) indexed spectra. This filtration substantially reduces the number of computationally expensive spectrum-to-spectrum comparisons required. However, the number of predicted (and indexed) theoretical spectra grows exponentially with the number of post-translational modifications, creating a memory and I/O bottleneck. In this paper, we present a parallel algorithm, called LBE, for efficient partitioning of theoretical spectra data on a distributed-memory architecture. The algorithm first groups similar theoretical spectra. The groups are then finely split across the system, so that the machines perform an almost equal amount of work when querying an MS/MS spectrum. Our results show that the compute load imbalance with LBE-based data distribution is ≤ 20%, allowing speedups of orders of magnitude over existing methods. The algorithm has been implemented on a compute cluster using MPI. Experimental results for increasing index sizes are reported in terms of execution time, speedup and memory footprint. To the best of our knowledge, LBE is the first load-balancing technique for MS/MS proteomics data on distributed-memory clusters that incorporates proteomics domain knowledge for efficient load balancing. Source code is available at: https://github.com/pcdslab/lbdslim/tree/mpi
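The group-then-split idea in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the grouping key (precursor mass), the round-robin dealing, and the names `partition_grouped` and `imbalance` are all assumptions made for the example.

```python
from collections import defaultdict


def partition_grouped(items, key, num_ranks):
    """Group items by `key`, then deal each group out across ranks in turn."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    parts = [[] for _ in range(num_ranks)]
    next_rank = 0  # carried across groups so small groups do not all start on rank 0
    for group_key in sorted(groups):
        for item in groups[group_key]:
            parts[next_rank].append(item)
            next_rank = (next_rank + 1) % num_ranks
    return parts


def imbalance(parts, is_hit):
    """Relative gap between the busiest rank and the mean load for one query."""
    loads = [sum(1 for item in part if is_hit(item)) for part in parts]
    mean = sum(loads) / len(loads)
    return 0.0 if mean == 0 else (max(loads) - mean) / mean


# Toy "theoretical spectra": (id, precursor_mass). Hypothetical data.
spectra = [(i, 500 + (i % 7) * 10) for i in range(700)]
parts = partition_grouped(spectra, key=lambda s: s[1], num_ranks=4)

# A query that only matches spectra with mass 530 touches every rank equally.
query_imbalance = imbalance(parts, is_hit=lambda s: s[1] == 530)
```

Because every group of similar entries is spread over all ranks, a query that filters down to one group still finds roughly the same number of candidates on each machine, which is the property the abstract attributes to LBE.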

Citations: 6

Software-Defined Events through PAPI

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00069

Anthony Danalis, Heike Jagode, T. Hérault, P. Luszczek, J. Dongarra

Abstract: PAPI has been used for almost two decades as an abstraction and standardization layer for profiling hardware-specific performance metrics. However, application developers and profiling software packages are quite often interested in information beyond hardware counters, such as the behavior of libraries used by the software being profiled. So far, accessing this information has required either interfacing directly with the libraries on a case-by-case basis or low-level binary instrumentation. In this paper, we introduce the new Software-Defined Event (SDE) component of PAPI, which aims to enable PAPI to serve as an abstraction and standardization layer for events that originate in software layers as well. Extending PAPI to include SDEs enables monitoring of both types of performance events (hardware- and software-related) in a uniform way, through the same consistent PAPI interface. Furthermore, implementing SDE as a PAPI component means that the new API is aimed only at library developers who wish to export events from within their libraries. The API for reading PAPI events, both hardware and software, remains the same, so all legacy codes and tools that use PAPI will not only continue to work but will automatically be able to read SDEs wherever those are available. The goal of this paper is threefold. First, we outline our design decisions regarding the functionality offered through the new SDE interface and give simple examples of usage. Second, we illustrate how those events can be utilized by different software packages, specifically by showcasing their use in the task-based runtime PaRSEC and the HPCG supercomputing benchmark. Third, we provide a thorough performance analysis of the overhead that results from monitoring different types of SDEs, and show that the overhead of using PAPI SDE is negligible even in cases of extremely heavy use.
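The split described in the abstract (libraries export events; tools read hardware and software events through one interface) can be illustrated with a toy model. The sketch below is purely conceptual: `EventRegistry`, `TaskQueue`, and the event names are hypothetical and do not reproduce PAPI's actual C API.

```python
class EventRegistry:
    """Toy stand-in for a uniform event layer (hypothetical, not PAPI's API)."""

    def __init__(self):
        self._readers = {}

    def register(self, name, reader):
        # `reader` is a zero-argument callable returning the current value.
        self._readers[name] = reader

    def read(self, name):
        return self._readers[name]()

    def names(self):
        return sorted(self._readers)


class TaskQueue:
    """A library that exports one internal counter as a software-defined event."""

    def __init__(self, registry):
        self._pending = []
        registry.register("taskqueue::pending", lambda: len(self._pending))

    def push(self, task):
        self._pending.append(task)

    def pop(self):
        return self._pending.pop(0)


registry = EventRegistry()
registry.register("hw::cycles", lambda: 123456)  # placeholder for a hardware counter

queue = TaskQueue(registry)
queue.push("a")
queue.push("b")
queue.pop()

# The reading side uses the same call for hardware and software events.
pending_now = registry.read("taskqueue::pending")
```

Only the library author touches the registration side; a profiling tool that already knows how to call `read` picks up the new event without modification, which mirrors the compatibility argument made in the abstract.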

Citations: 2

OpenMP to FPGA Offloading Prototype Using OpenCL SDK

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00072

Marius Knaust, Florian Mayer, T. Steinke

Citations: 14

It Can Understand the Logs, Literally

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00084

Aidi Pi, Wei Chen, W. Zeller, Xiaobo Zhou

Citations: 4

FPGA-Based Embedded System Implementation of Audio Signal Alignment

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00031

Luca Stornaiuolo, Massimo Perini, M. Santambrogio, D. Sciuto

Abstract: FPGAs are considered a valuable solution for embedded system applications thanks to their performance, energy efficiency and ability to cope with system failures. However, the number of available applications is limited by the learning curve needed to customize FPGA-based accelerators. As evidence of this, Xilinx recently released PYNQ, a platform for Zynq SoCs that relies on Python and overlays to ease the integration of programmable-logic functionality into applications. In this work, we build upon this framework to implement an optimized embedded design for audio alignment and integrate it into the Python application workflow. In particular, we provide a custom accelerator designed for PYNQ and a software interface that transparently exploits the programmable logic from Python code running on the embedded CPU. We then compare executions on two different devices: the PYNQ-Z1 and the Raspberry Pi 3. Our FPGA-accelerated implementation reaches a speedup of 12.4x with respect to the PYNQ-Z1 when only its CPU is used, and a speedup of 5.5x with respect to the Raspberry Pi 3 version.

Citations: 3

Load Imbalance Mitigation Optimizations for GPU-Accelerated Similarity Joins

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00078

Benoît Gallet, M. Gowanlock

Abstract: The distance similarity self-join is widely used in database applications and is defined as joining a table on itself using a distance predicate. The similarity self-join is often used in spatial applications and is a building block of other algorithms, such as those used for data analysis. In this paper, we propose several new optimizations that mitigate the load imbalance of a GPU-accelerated self-join algorithm. The data-dependent nature of the self-join makes the algorithm potentially unsuitable for the GPU's architecture, due to variance in the workloads assigned to threads. Consequently, we propose a method that reduces load imbalance, and the resulting divergence between threads executing in a warp, by considering the total workload assigned to each thread and forcing the GPU's hardware scheduler to group threads with similar workloads within the same warp. Also, by leveraging a grid-based index, we propose a new balanced computational pattern that reduces both the number of distance calculations and the workload variance between threads. Moreover, we exploit additional parallelism by increasing the workload granularity to further improve computational throughput and workload balance within warps. Our solution achieves a speedup of up to 9.7x (1.6x on average) over another GPU algorithm, and up to 10.7x (2.5x on average) over a state-of-the-art parallel CPU algorithm.
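Why grouping threads with similar workloads into the same warp helps can be seen with a small cost model. The sketch below is an illustration under stated assumptions, not the paper's method: it assumes a warp takes as long as its slowest thread, uses sorting by workload as the grouping step, and the workload values are hypothetical.

```python
WARP_SIZE = 32


def warp_cost(workloads, warp_size=WARP_SIZE):
    """Sum, over consecutive warps, of the largest per-thread workload in each warp.

    Threads in a warp execute in lockstep, so each warp is modelled as
    taking as long as its slowest thread.
    """
    return sum(
        max(workloads[i:i + warp_size])
        for i in range(0, len(workloads), warp_size)
    )


# Hypothetical per-point workloads (e.g. candidate neighbours to check):
# one heavy point at the start of every warp, the rest light.
workloads = [100 if i % 32 == 0 else 1 for i in range(1024)]

unsorted_cost = warp_cost(workloads)                       # every warp waits on a heavy thread
sorted_cost = warp_cost(sorted(workloads, reverse=True))   # heavy threads share one warp
```

In this toy input the heavy points end up in a single warp after sorting, so only that warp pays the high cost. Real gains depend on the workload distribution and on how the kernel maps sorted points to threads.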

Citations: 6

SmarTmem: Intelligent Management of Transcendent Memory in a Virtualized Server

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00151

Luis A. Garrido, Rajiv Nishtala, P. Carpenter

Abstract: Managing memory capacity in virtualized environments is still a challenging problem. Many solutions have been proposed and implemented, including memory ballooning and memory hotplug, but these mechanisms are slow to respond to changes in virtual machine (VM) memory demands. Transcendent Memory (tmem) was introduced to improve responsiveness in memory provisioning by pooling idle and fallow memory in the hypervisor and making these physical pages available to the VMs as additional memory through a key-value store. However, tmem has limitations of its own. State-of-the-art hypervisors do not implement any efficient way to manage tmem capacity; by default, they let VMs compete for it greedily, regardless of their actual memory demand. In this paper, we demonstrate the need for intelligent memory capacity management for tmem, and we present the design and implementation of SmarTmem, a mechanism that integrates coarse-grained user-space memory management with fine-grained allocation and enforcement at the virtualization layer. Our results show that our solution can improve the running time of applications from the Cloudsuite benchmarks by up to 35% compared to the default tmem allocation mechanism.
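The contrast between greedy competition and demand-aware capacity management can be made concrete with a simple policy. The abstract does not specify SmarTmem's allocation policy, so the proportional rule below, the function name `allocate_by_demand`, and the page counts are assumptions chosen only for illustration.

```python
def allocate_by_demand(capacity_pages, demand_pages):
    """Split a shared page pool across VMs in proportion to their demand.

    No VM receives more than it asked for. Illustrative policy only.
    """
    total = sum(demand_pages.values())
    if total <= capacity_pages:
        return dict(demand_pages)

    shares = {vm: capacity_pages * d // total for vm, d in demand_pages.items()}

    # Hand out pages lost to integer rounding, largest demand first.
    leftover = capacity_pages - sum(shares.values())
    for vm in sorted(demand_pages, key=demand_pages.get, reverse=True):
        if leftover == 0:
            break
        if shares[vm] < demand_pages[vm]:
            shares[vm] += 1
            leftover -= 1
    return shares


# Hypothetical demands, in pages, against a 1000-page pool.
targets = allocate_by_demand(1000, {"vm1": 1500, "vm2": 400, "vm3": 100})
```

Under greedy competition the first VM to fill the pool could take all 1000 pages; with per-VM targets like these, a user-space manager can compute the split at coarse intervals while the hypervisor enforces it on each allocation.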

Citations: 2

Random Walk Gradient Descent for Decentralized Learning on Graphs

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00157

Ghadir Ayache, S. Rouayheb

Abstract: We design a new variant of the stochastic gradient descent (SGD) algorithm for learning a global model from data distributed over the nodes of a network. Motivated by settings such as decentralized learning, we suppose that one special node in the network, which we call node 1, is interested in learning the global model. We seek a decentralized and distributed algorithm for many reasons, including privacy and fault tolerance. A natural candidate here is gossip-style SGD. However, it suffers from slow convergence and high communication cost, mainly because in the end all nodes, and not only the special node, learn the model. We propose a distributed SGD algorithm that uses a weighted random walk to sample the nodes. The Markov chain is designed to have a stationary probability distribution proportional to the smoothness bound L_i of the local loss function at node i. We study the convergence rate of this algorithm and prove that it depends on the average of the smoothness bounds L_i. This outperforms the uniform sampling algorithm obtained with a Metropolis-Hastings random walk (MHRW), whose rate depends on the supremum of all the L_i. We present numerical simulations that substantiate our theoretical findings and show that our algorithm outperforms random walk and gossip-style algorithms.
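A random walk whose stationary distribution is proportional to given node weights can be built with the standard Metropolis-Hastings construction. The sketch below shows that generic construction only; the abstract does not say how the authors design their chain, and the ring graph, the values of L_i, and the function name `weighted_walk` are hypothetical.

```python
import random


def weighted_walk(neighbors, weight, start, steps, rng):
    """Metropolis-Hastings walk with stationary distribution proportional to weight[i]."""
    visits = {node: 0 for node in neighbors}
    current = start
    for _ in range(steps):
        # Propose a uniformly random neighbour of the current node.
        proposal = rng.choice(neighbors[current])
        # The acceptance ratio corrects for unequal degrees and targets pi_i ~ weight[i].
        ratio = (weight[proposal] * len(neighbors[current])) / (
            weight[current] * len(neighbors[proposal])
        )
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return visits


# Hypothetical 4-node ring and smoothness bounds L_i.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
L = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

steps = 200_000
visits = weighted_walk(neighbors, L, start=0, steps=steps, rng=random.Random(0))
freq = {i: visits[i] / steps for i in visits}
target = {i: L[i] / sum(L.values()) for i in L}
```

Over a long run the visit frequencies in `freq` approach `target`, so nodes with larger smoothness bounds are sampled more often. Each step only needs the degrees and weights of the current node and one neighbour, which keeps the walk decentralized.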

Citations: 4

EdgeL^3: Compressing L^3-Net for Mote Scale Urban Noise Monitoring

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00145

Sangeeta Kumari, Dhrubojyoti Roy, M. Cartwright, J. Bello, A. Arora

Abstract: Urban noise sensing in deeply embedded devices at the edge of the Internet of Things (IoT) is challenging not only because of the lack of sufficiently labeled training data but also because device resources are quite limited. Look, Listen, and Learn (L^3), a recently proposed state-of-the-art transfer learning technique, mitigates the first challenge by training self-supervised deep audio embeddings through binary Audio-Visual Correspondence (AVC); the resulting embeddings can be used to train a variety of downstream audio classification tasks. However, with close to 4.7 million parameters, the multi-layer L^3-Net CNN is still prohibitively expensive to run on small edge devices, such as "motes" that use a single microcontroller and limited memory to achieve long-lived self-powered operation. In this paper, we comprehensively explore the feasibility of compressing L^3-Net for mote-scale inference. We use pruning, ablation, and knowledge distillation techniques to show that the originally proposed L^3-Net architecture is substantially overparameterized, not only for AVC but also for the target task of sound classification, as evaluated on two popular downstream datasets. Our findings demonstrate the value of fine-tuning and knowledge distillation in regaining the performance lost through aggressive compression strategies. Finally, we present EdgeL^3, the first L^3-Net reference model compressed by 1-2 orders of magnitude for real-time urban noise monitoring on resource-constrained edge devices, which fits in just 0.4 MB of memory through half-precision floating-point representation.
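The memory figures in the abstract can be related to parameter counts with simple arithmetic, and magnitude pruning is one common way to reduce those counts. The sketch below is not the authors' pipeline: it assumes 1 MB = 10^6 bytes, that the whole budget holds weights, and that pruning is done by global magnitude; `model_bytes` and `magnitude_prune` are hypothetical helper names.

```python
def model_bytes(num_params, bytes_per_param):
    """Storage needed for the weights alone."""
    return num_params * bytes_per_param


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        # The `removed < k` guard keeps ties from pruning more than k weights.
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


# About 4.7 million parameters stored as 32-bit floats (4 bytes each).
full_fp32_mb = model_bytes(4_700_000, 4) / 1e6

# Parameters that fit in 0.4 MB at half precision (2 bytes each).
params_in_budget = int(0.4e6 // 2)

# Pruning half of a tiny, made-up weight vector.
pruned = magnitude_prune([0.5, -0.01, 0.2, 0.03, -0.9, 0.002], sparsity=0.5)
```

Under these assumptions a 0.4 MB half-precision budget holds at most about 200,000 parameters, roughly a 23-fold reduction from 4.7 million, which is consistent with the 1-2 orders of magnitude stated in the abstract. Pruned networks are usually fine-tuned afterwards, which is where the abstract reports that fine-tuning and knowledge distillation recover lost accuracy.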

Citations: 15

Automatic Tool-Flow for Mapping Applications to an Application-Specific CGRA Architecture

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-20 DOI: 10.1109/IPDPSW.2019.00033

Florian Fricke, André Werner, Keyvan Shahin, F. Werner, M. Hübner

Abstract: This paper presents a holistic tool-chain for generating, configuring and evaluating application-specific Coarse-Grained Reconfigurable Array (CGRA) architectures. This development was part of a large EU-funded project named EXTRA. The reduced complexity of the architecture in comparison to fine-grained architectures such as FPGAs is exploited to evaluate the just-in-time generation of VCGRA configurations. The manuscript presents the tool-chain responsible for implementing applications on the coarse-grained architecture. In particular, it examines the tools for partitioning the applications, mapping the partitions, and controlling the execution of the entire application on the target architecture. In addition, both the user interface and the interfaces between the components of the tool-chain are described. The presented tools are then evaluated using a practical example and various metrics. We show that configurations for the presented architectures can be created rapidly and that generating new configurations at run-time is therefore feasible.

Citations: 1