Big Data Analytics for the ATLAS EventIndex Project with Apache Spark

IF 1.2 Q3 MATHEMATICS, APPLIED

Computational and Mathematical Methods, Pub Date: 2023-09-27, DOI: 10.1155/2023/6900908

Álvaro Fernández Casaní, Carlos García Montoro, Santiago González de la Hoz, José Salt, Javier Sánchez, Miguel Villaplana Pérez

Citations: 0

Abstract

Big Data Analytics for the ATLAS EventIndex Project with Apache Spark

The ATLAS EventIndex was designed to provide a global event catalogue and limited event-level metadata for the ATLAS experiment at the Large Hadron Collider (LHC) and for its analysis groups and users during Run 2 (2015-2018), and it has been running in production ever since. LHC Run 3, which started in 2022, has seen increased data-taking and simulation production rates; the current infrastructure can still cope with them but may be stretched to its limits by the end of Run 3. A new core storage service is being developed on HBase/Phoenix, with work in progress to provide at least the same functionality as the current service while supporting higher data ingestion and search rates and growing volumes of stored data. In addition, new tools are being developed to cover the access use cases required within the new storage. This paper describes a new tool, built on Spark and implemented in Scala, for accessing the large volumes of EventIndex data stored in HBase/Phoenix. With this tool, we can offer data discovery capabilities at different granularities, providing Spark DataFrames that can be used or refined within the same framework. Data analytics use cases of the EventIndex project are implemented, such as the search for duplicate events within the same dataset or across different datasets. An algorithm and its implementation for calculating overlap matrices of events across different datasets are presented. Our approach can be used by other higher-level tools and by users to ease access to the data in a performant and standard way through Spark abstractions. The provided tools decouple data access from the actual data schema, which makes it convenient to hide the complexity of the backing storage and possible changes to it.
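The abstract names two analytics cases, duplicate-event search and overlap matrices across datasets, without giving their details. The sketch below illustrates both ideas on in-memory data. It is not the paper's Spark/Scala implementation: plain Python sets stand in for Spark DataFrames, and the event identifier (a run number and event number pair), the function names, and the dataset names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumption: an event is identified by a (run_number, event_number) pair.
Event = tuple[int, int]


def find_duplicates(events: list[Event]) -> dict[Event, int]:
    """Return each event that occurs more than once, with its count."""
    counts = Counter(events)
    return {event: n for event, n in counts.items() if n > 1}


def overlap_matrix(datasets: dict[str, list[Event]]) -> dict[tuple[str, str], int]:
    """Count the distinct events shared by every pair of datasets.

    The diagonal entry (a, a) holds the number of distinct events in a.
    """
    # Deduplicate each dataset first so repeated events are counted once.
    unique = {name: set(events) for name, events in datasets.items()}
    matrix: dict[tuple[str, str], int] = {}
    for name, events in unique.items():
        matrix[(name, name)] = len(events)
    for a, b in combinations(unique, 2):
        shared = len(unique[a] & unique[b])
        matrix[(a, b)] = shared
        matrix[(b, a)] = shared  # the matrix is symmetric
    return matrix


# Toy data, invented for illustration only (hypothetical dataset names).
datasets = {
    "stream_A": [(1, 10), (1, 11), (1, 11), (2, 5)],
    "stream_B": [(1, 11), (2, 5), (3, 7)],
    "stream_C": [(9, 1)],
}

duplicates = find_duplicates(datasets["stream_A"])
matrix = overlap_matrix(datasets)
```

At EventIndex scale the sets would not fit in memory; in a Spark setting the same two steps would presumably be expressed as grouping with counts for duplicates and as joins or intersections between DataFrames for the pairwise overlaps.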

Source journal

Computational and Mathematical Methods

CiteScore

2.20

Self-citation rate

0.00%

Number of publications