Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication

Proceedings of the 19th International Conference on Distributed Computing and Networking, Pub Date: 2018-01-04, DOI: 10.1145/3154273.3154320

S. Boychenko, M. Zerlauth, J. Garnier, M. Z. Rela


Citations: 1

Abstract


Over the last decades, distributed file systems and processing engines have been the primary choice for applications requiring access to large amounts of data. Since the introduction of the MapReduce paradigm, relational databases have increasingly been replaced by more efficient and scalable architectures, particularly in environments where a single query is expected to process terabytes or even petabytes of data. That is the situation at CERN, where the data storage systems critical for the safe operation, exploitation and optimization of the particle accelerator complex are based on traditional databases or file system solutions that already operate well beyond their initially provisioned capacity. Although modern distributed data storage and processing engines handle large amounts of data efficiently, they are not optimized for heterogeneous workloads such as those that arise in the dynamic environment of one of the world's largest scientific facilities. This contribution presents a Mixed Partitioning Scheme Replication (MPSR) solution that outperforms the conventional distributed processing environment configurations at CERN for virtually the entire parameter space of the accelerator monitoring systems' workload variations. The main strategy is to replicate the data using a different partitioning scheme for each replica, with each partitioning criterion derived dynamically from the observed workload. To assess the efficiency of this approach in a wide range of scenarios, a behavioral simulator was developed to compare the performance of MPSR with that of the current solution. Furthermore, we present the first actual results of a Hadoop-based prototype running on a relatively small cluster, which not only validate the simulation predictions but also confirm the higher efficiency of the proposed technique.
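The replication strategy described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' Hadoop prototype or simulator: the function names (`derive_criteria`, `build_replicas`, `run_query`), the record fields (`device`, `day`, `signal`), the sample data, and the rule of picking the most frequently filtered attributes are all assumptions made for illustration, and only equality filters are handled.

```python
from collections import Counter, defaultdict

def derive_criteria(query_log, n_replicas):
    """Pick the attributes that the observed queries filter on most often
    (a hypothetical heuristic, one attribute per replica)."""
    counts = Counter(attr for query in query_log for attr in query)
    return [attr for attr, _ in counts.most_common(n_replicas)]

def build_replicas(records, criteria):
    """Keep one full copy of the data per criterion, each copy partitioned
    by a different attribute."""
    replicas = {}
    for attr in criteria:
        partitions = defaultdict(list)
        for rec in records:
            partitions[rec[attr]].append(rec)
        replicas[attr] = dict(partitions)
    return replicas

def run_query(replicas, query):
    """Serve an equality query from the replica that reads the fewest records.
    Returns (replica used, records read, matching rows)."""
    best = None
    for attr, partitions in replicas.items():
        if attr in query:
            # Partition pruning: only one partition of this replica is read.
            candidates = partitions.get(query[attr], [])
        else:
            # This replica's layout does not help: full scan.
            candidates = [r for part in partitions.values() for r in part]
        if best is None or len(candidates) < len(best[1]):
            best = (attr, candidates)
    attr, candidates = best
    rows = [r for r in candidates if all(r[k] == v for k, v in query.items())]
    return attr, len(candidates), rows

# Hypothetical monitoring records and observed workload.
records = [
    {"device": "magnet_a", "day": 1, "signal": "current"},
    {"device": "magnet_a", "day": 2, "signal": "voltage"},
    {"device": "magnet_b", "day": 1, "signal": "current"},
    {"device": "magnet_b", "day": 2, "signal": "current"},
    {"device": "magnet_c", "day": 1, "signal": "voltage"},
    {"device": "magnet_c", "day": 2, "signal": "voltage"},
]
query_log = [
    {"device": "magnet_a"}, {"day": 1}, {"device": "magnet_b", "day": 2},
    {"signal": "current"}, {"day": 2}, {"device": "magnet_c"},
]

criteria = derive_criteria(query_log, n_replicas=2)
replicas = build_replicas(records, criteria)
used, scanned, rows = run_query(replicas, {"day": 1})
print(f"replica={used} records_read={scanned} matches={len(rows)}")
```

In this toy setup a query on `day` is routed to the day-partitioned copy and a query on `device` to the device-partitioned copy, so each reads a single partition, whereas a single uniformly partitioned layout would force a full scan for one of the two. A query on an attribute that no replica is partitioned by (here `signal`) still scans everything, which is why the choice of criteria has to follow the observed workload.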

