{"title":"An Approach for Evaluating and Mitigating Intra-Application I/O Performance Variability Over Parallel File Systems","authors":"E. C. Inacio, M. Dantas","doi":"10.5753/wscad_estendido.2019.8709","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8709","url":null,"abstract":"To meet ever increasing capacity and performance requirements of emerging data-intensive applications, highly distributed and multilayered back-end storage systems have been employed in large-scale high performance computing (HPC) environments. A main component of these storage infrastructures is the parallel file system (PFS), a file system specially designed for absorbing bulk data transfers from applications with thousands of concurrent processes. Load distribution on PFS data servers constitutes a major source of intra-application input/output (I/O) performance variability. Although mitigating variability is desirable, as it is known to harm application-perceived performance, understanding and dealing with I/O performance variability in such complex environments remains a challenging task. In this research, a differentiated approach for evaluating and mitigating intra-application I/O performance variability over PFSs is proposed. More specifically, from the evaluation perspective, a comprehensive approach combining complementary methods is proposed. An analytical model, named DTSMaxLoad, provides estimates for the maximum load in a PFS data server. To complement DTSMaxLoad by modeling conditions and mechanisms that are hard to represent analytically, the Parallel I/O and Storage System (PIOSS) simulation model was proposed. Finally, for experimental evaluation over real environments, a flexible and distributed I/O performance evaluation tool, named IOR-Extended (IORE), was proposed. Furthermore, a high-level file distribution approach for PFSs, called N-N Round-Robin (N2R2), was proposed, focusing on mitigating I/O performance variability for distributed applications where each process accesses an individual and independent file. An extensive experimental effort, including measurements on real environments, was conducted in this research work for evaluating each of the proposed approaches. In summary, this evaluation indicated that both the DTSMaxLoad and PIOSS modeling proposals can represent load distribution behavior on PFSs with significant fidelity. Moreover, results demonstrated that N2R2 successfully reduced intra-application I/O performance variability for 270 distinct experimental scenarios, which ultimately translated into overall application I/O performance improvements.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126754310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Avaliação de Máquinas Preemptáveis nos Provedores de Nuvem Pública Amazon e Google","authors":"J. Soares, Aleteia Araujo","doi":"10.5753/wscad_estendido.2019.8695","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8695","url":null,"abstract":"Given the diversity of services in the cloud computing market, choosing the most suitable service and provider is a non-trivial challenge for users. In this context, this work proposes a comparative analysis of preemptible machines offered by public cloud providers, whose execution may be terminated when their computational resources are needed for other tasks of the service provider. To this end, experimental tests are run on instances offered by the providers, using benchmarks from the literature. Based on the cost and performance results obtained, the work concludes which instances, providers, and regions are best suited to workloads similar to the benchmarks executed.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127585560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Soluções Paralelas para o Problema de Roteamento Usando o Algoritmo de Lee","authors":"William Tavares, Nahri Moreano","doi":"10.5753/wscad_estendido.2019.8692","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8692","url":null,"abstract":"Lee's algorithm is a popular technique for routing traces on a circuit board. In the context of VLSI, this task becomes computationally intensive and requires a large amount of memory. This paper evaluates optimizations described in the literature that reduce the algorithm's time and memory consumption, and proposes, constructively, techniques for parallelizing it. The final result showed a speedup of 2.25 with 2 threads and 3.70 with 4 threads.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131210870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure and efficient software implementation of QC-MDPC code-based cryptography","authors":"A. Guimarães, Diego F. Aranha, E. Borin","doi":"10.5753/wscad_estendido.2019.8710","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8710","url":null,"abstract":"The emergence of quantum computers is pushing an unprecedented transition in the public key cryptography field. Conventional algorithms, mostly represented by elliptic curves and RSA, are vulnerable to attacks using quantum computers and need, therefore, to be replaced. Cryptosystems based on error-correcting codes are considered some of the most promising candidates to replace them for encryption schemes. Among the code families, QC-MDPC codes achieve the smallest key sizes while maintaining the desired security properties. Their performance, however, still needs to be greatly improved to reach a competitive level. In this work, we focus on optimizing the performance of QC-MDPC code-based cryptosystems through improvements concerning both their implementations and algorithms. We first present a new enhanced version of QcBits' key encapsulation mechanism, which is a constant-time implementation of the Niederreiter cryptosystem using QC-MDPC codes. In this version, we updated the implementation parameters to meet the 128-bit quantum security level, replaced some of the core algorithms to avoid slower instructions, vectorized the entire code using the AVX-512 instruction set extension, and introduced some other minor improvements. Compared with the current state-of-the-art implementation for QC-MDPC codes, the BIKE implementation, our code performs 1.9 times faster when decrypting messages. We then optimize the performance of QC-MDPC code-based cryptosystems through the insertion of a configurable failure rate in their arithmetic procedures. We present constant-time algorithms with a configurable failure rate for multiplication and inversion over binary polynomials, the two most expensive subroutines used in QC-MDPC implementations. Using a failure rate negligible compared to the security level (2^{-128}), our multiplication is 2 times faster than the one used in the NTL library on sparse polynomials and 1.6 times faster than a naive constant-time sparse polynomial multiplication. Our inversion algorithm, based on the inversion algorithm of Wu et al., is 2 times faster than the original and 12 times faster than the inversion algorithm of Itoh and Tsujii using the same modulus polynomial (x^{32749} - 1). By inserting these algorithms in our enhanced version of QcBits, we were able to achieve a speedup of 1.9 on the key generation and up to 1.4 on the decryption time. Compared with BIKE, our final version of QcBits performs the uniform decryption 2.7 times faster. Moreover, the techniques presented in this work can also be applied to BIKE, opening new possibilities for further improvements.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"804 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133138563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Paralelização do Algoritmo de Indexação de Dados Multimídia Baseado em Quantização","authors":"André Fernandes, G. Teodoro","doi":"10.5753/wscad_estendido.2019.8699","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8699","url":null,"abstract":"This paper presents an efficient parallelization of the similarity search algorithm Product Quantization Approximate Nearest Neighbor Search (PQANNS). This method can answer queries with reduced memory demand and, together with the proposed parallelization, can efficiently handle large datasets. An execution using 128 nodes/3584 CPU cores achieved a parallel efficiency of 0.97 on a dataset containing 256 billion SIFT descriptors.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"38 14","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114006722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Otimização para Ambientes Intel(R) de um Metodo Numérico para o Escoamento Bifásico de Fluidos em Meios Porosos Através da Eliminação de Barreiras OpenMP","authors":"Weber Ribeiro, Thiago Teixeira, F. Cabral, M. R. Borges, C. Osthoff","doi":"10.5753/wscad_estendido.2019.8700","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8700","url":null,"abstract":"This paper presents optimizations of a numerical method for two-phase fluid flow in porous media, targeting parallel execution on Intel(R) environments. The tools of the Intel(R) Parallel Studio XE suite were used to study possible implementations. The EWS-SYNC implementation consists of replacing OpenMP barriers with an explicit synchronization mechanism between threads; MPI is implemented for communication among multiple distributed processors and to make the code usable in a cluster environment. Results for increasing the number of processes in the new MPI code were compared with results for increasing the number of threads in the EWS-SYNC code. The EWS-SYNC implementation achieved a speedup of 27x compared with serial execution, using Intel(R) Xeon Phi (KNL) @ 1.40GHz hardware with 68 physical cores and 4 threads/core, in a machine that contains an Intel Xeon CPU E5-2698 v3 @ 2.30GHz with 32 physical cores, as reported in [Teixeira et al. 2018]. Comparing the EWS-SYNC code with the serial code on the Intel(R) Xeon CPU E5-2698 v3 @ 2.30GHz architecture with 16 physical cores, the speedup was 10x, and on this same architecture the MPI code, still at an early stage of implementation, achieved a speedup of 23x relative to EWS-SYNC.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"65 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122202494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uma Abordagem em Ambiente Domiciliar Assistido Baseada no Paradigma de Segurança Orientada a Contexto","authors":"Franco Umilio, E. C. Inacio, M. Dantas","doi":"10.5753/WSCAD_ESTENDIDO.2019.8693","DOIUrl":"https://doi.org/10.5753/WSCAD_ESTENDIDO.2019.8693","url":null,"abstract":"The increasing life expectancy of the world population brings challenges that will have to be faced in the coming decades. In the health area, AAL (Ambient Assisted Living) has emerged. Ever-advancing network technologies and computers with greater processing power facilitate the creation of AAL systems. Thus, in this work, the context-oriented paradigm for security is addressed, proposing a possible solution with a distributed database system (SBDD, Sistema de Banco de Dados Distribuído), which is tested using two scenarios with three users who have different levels of database access permissions.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115723904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilizando a biblioteca PAPI para avaliar diferentes abordagens de construção de curvas b-spline","authors":"Vitor David, D. Araujo, Marcelo Zamith, Ubiratam de Paula","doi":"10.5753/wscad_estendido.2019.8698","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8698","url":null,"abstract":"Given the importance of parallelizing sections of an algorithm to obtain performance, this paper proposes a performance analysis of different levels of parallel approaches, making it possible to determine the best approach for B-spline curves. Performance was evaluated using different levels of parallelization: at the instruction level, through vectorization, and at the core level, through threads. Our results showed that the core-level approach achieved very good utilization, whereas vectorization yielded no gains.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126036163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structural testing criteria for concurrent programs considering loop executions","authors":"Sílvia M. D. Diaz, P. S. Souza","doi":"10.5753/wscad_estendido.2019.8711","DOIUrl":"https://doi.org/10.5753/wscad_estendido.2019.8711","url":null,"abstract":"Parallel programs are imperative for improving performance and problem solving, and there is an increasing demand for efficient parallel programming techniques. This entails new challenges in software testing to ensure their quality and reliability. Structural testing is a technique that allows the identification of concurrency defects by analyzing the internal structure of the program. However, the non-determinism of concurrent programs has implications for the testing activity, requiring the use of structured methods to reveal defects. Testing criteria support the selection of test cases in a systematic form by statically analysing elements of concurrent programs. We found that there are currently gaps in the definition of testing criteria contemplating scenarios with elements that are dynamically evaluated, such as the execution of communication primitives inside loops. The objective of this project is to define structural testing criteria to guide the selection of test cases, improving the reliability of concurrent programs by revealing non-determinism-related errors present in repetition structures. We developed a Concurrent Defects Taxonomy, identifying and classifying the types of concurrency defects found in related literature. The analysis of such defects, paths inside loops, number of loop iterations, and nested loops allows us to model the proposed structural testing criteria. We define new sets and associations related to communication and synchronization flows for message-passing programs, establishing a model for testing criteria. We implemented the proposed test model in ValiMPI, a testing tool prototype, considering the new concepts defined in our test model, generating required elements and evaluating coverage after constructing loop paths. To evaluate the application of the criteria, we performed an empirical study with statistical validation, indicating the results for cost, effectiveness and strength. Our experimental evaluation demonstrated that the proposed testing criteria generate required elements that support the identification of concurrency defects occurring in different loop iterations, in the presence of communication events with non-deterministic behavior.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115426014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}