Software Agents in Data and Workflow Management

T. Barrass, Y. Wu, I. Semeniouk, D. Bonacorsi, D. Newbold, L. Tuura, T. Wildish, C. Charlot, Nicola De Filippis, S. Metson, I. Fisk, J. Hernández, C. Grandi, A. Afaq, J. Rehn, O. Maroney, K. Rabbertz, W. Jank, P. Garcia-Abia, M. Ernst, A. Fanfani
{"title":"数据和工作流管理中的软件代理","authors":"T. Barrass, Y. Wu, I. Semeniouk, D. Bonacorsi, D. Newbold, L. Tuura, T. Wildish, C. Charlot, Nicola De Filippis, S. Metson, I. Fisk, J. Hernández, C. Grandi, A. Afaq, J. Rehn, O. Maroney, K. Rabbertz, W. Jank, P. Garcia-Abia, M. Ernst, A. Fanfani","doi":"10.5170/CERN-2005-002.838","DOIUrl":null,"url":null,"abstract":"CMS currently uses a number of tools to transfer data which, taken together, form the basis of a heterogeneous datagrid. The range of tools used, and the directed, rather than optimized nature of CMS recent large scale data challenge required the creation of a simple infrastructure that allowed a range of tools to operate in a complementary way. The system created comprises a hierarchy of simple processes (named ‘agents’) that propagate files through a number of transfer states. File locations and some application metadata were stored in POOL file catalogues, with LCG LRC or MySQL back-ends. Agents were assigned limited responsibilities, and were restricted to communicating state in a well-defined, indirect fashion through a central transfer management database. In this way, the task of distributing data was easily divided between different groups for implementation. The prototype system was developed rapidly, and achieved the required sustained transfer rate of ~10 MBps, with O(10) files distributed to 6 sites from CERN. Experience with the system during the data challenge raised issues with underlying technology (MSS write/read, stability of the LRC, maintenance of file catalogues, synchronization of filespaces), all of which have been successfully identified and handled. The development of this prototype infrastructure allows us to plan the evolution of backbone CMS data distribution from a simple hierarchy to a more autonomous, scalable model drawing on emerging agent and grid technology. DATA DISTRIBUTION FOR CMS The Compact Muon Solenoid (CMS) experiment at the LHC will produce Petabytes of data a year [1]. This data is then to be distributed to multiple sites which form a hierarchical structure based on available resources: the detector is associated with a Tier 0 site; Tier 1 sites are typically large national computing centres; and Tier 2 sites are Institutes with a more restricted availability of resources and/or services. A core set of Tier 1 sites with large tape, disk and network resources will receive raw and reconstructed data to safeguard against data loss at CERN. Smaller sites, associated with certain analysis groups or Universities, will also subscribe to certain parts of the data. Sites at all levels will be involved in producing Monte Carlo data for comparison with detector data. At the Tier 0 the raw experiment data undergoes a process called reconstruction in which it is restructured to represent physics objects. This data will be grouped hierarchically by stream and dataset based on physics content, then further subdivided by finer granularity metadata. There are therefore three main use cases for distribution in CMS. The first can be described as a push with high priority, in which raw data is replicated to tape at Tier 1s. The second is a subscription pull, where a site subscribes to all data in a given set and data is transferred as it is produced. This use case corresponds to a site registering an interest in the data produced by an ongoing Monte Carlo simulation. The third is a random pull, where a site or individual physicist just wishes to replicate an extant dataset in a one-off transfer. 
Although these use cases are here discussed in terms of push and pull these can be slightly misleading descriptions. The key point is the effective handover of responsibility for replication between distribution components; for example, it is necessary to determine whether a replica has been created safely in a Tier 1 tape store before being able to delete it from a buffer at the source. This handover is enabled with well-defined handshakes or exchanges of state messages between distribution components. The conceptual basis of data distribution for CMS is then distribution through a hierarchy of sites, with smaller sites associating themselves to larger by subscribing to some subset of the data stored at the larger site. The management of this data poses two overall problems. The first problem is that sustained transfers at the 100+ MBps estimated for CMS alone are currently only approached by existing experiments. The second problem is one of managing the logisitics of subscription transfer based on metadata at granularities between high","PeriodicalId":66916,"journal":{"name":"高能物理与核物理","volume":"86 1","pages":"838-841"},"PeriodicalIF":0.0000,"publicationDate":"2004-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"Software Agents in Data and Workflow Management\",\"authors\":\"T. Barrass, Y. Wu, I. Semeniouk, D. Bonacorsi, D. Newbold, L. Tuura, T. Wildish, C. Charlot, Nicola De Filippis, S. Metson, I. Fisk, J. Hernández, C. Grandi, A. Afaq, J. Rehn, O. Maroney, K. Rabbertz, W. Jank, P. Garcia-Abia, M. Ernst, A. Fanfani\",\"doi\":\"10.5170/CERN-2005-002.838\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"CMS currently uses a number of tools to transfer data which, taken together, form the basis of a heterogeneous datagrid. The range of tools used, and the directed, rather than optimized nature of CMS recent large scale data challenge required the creation of a simple infrastructure that allowed a range of tools to operate in a complementary way. The system created comprises a hierarchy of simple processes (named ‘agents’) that propagate files through a number of transfer states. File locations and some application metadata were stored in POOL file catalogues, with LCG LRC or MySQL back-ends. Agents were assigned limited responsibilities, and were restricted to communicating state in a well-defined, indirect fashion through a central transfer management database. In this way, the task of distributing data was easily divided between different groups for implementation. The prototype system was developed rapidly, and achieved the required sustained transfer rate of ~10 MBps, with O(10) files distributed to 6 sites from CERN. Experience with the system during the data challenge raised issues with underlying technology (MSS write/read, stability of the LRC, maintenance of file catalogues, synchronization of filespaces), all of which have been successfully identified and handled. The development of this prototype infrastructure allows us to plan the evolution of backbone CMS data distribution from a simple hierarchy to a more autonomous, scalable model drawing on emerging agent and grid technology. DATA DISTRIBUTION FOR CMS The Compact Muon Solenoid (CMS) experiment at the LHC will produce Petabytes of data a year [1]. 
This data is then to be distributed to multiple sites which form a hierarchical structure based on available resources: the detector is associated with a Tier 0 site; Tier 1 sites are typically large national computing centres; and Tier 2 sites are Institutes with a more restricted availability of resources and/or services. A core set of Tier 1 sites with large tape, disk and network resources will receive raw and reconstructed data to safeguard against data loss at CERN. Smaller sites, associated with certain analysis groups or Universities, will also subscribe to certain parts of the data. Sites at all levels will be involved in producing Monte Carlo data for comparison with detector data. At the Tier 0 the raw experiment data undergoes a process called reconstruction in which it is restructured to represent physics objects. This data will be grouped hierarchically by stream and dataset based on physics content, then further subdivided by finer granularity metadata. There are therefore three main use cases for distribution in CMS. The first can be described as a push with high priority, in which raw data is replicated to tape at Tier 1s. The second is a subscription pull, where a site subscribes to all data in a given set and data is transferred as it is produced. This use case corresponds to a site registering an interest in the data produced by an ongoing Monte Carlo simulation. The third is a random pull, where a site or individual physicist just wishes to replicate an extant dataset in a one-off transfer. Although these use cases are here discussed in terms of push and pull these can be slightly misleading descriptions. The key point is the effective handover of responsibility for replication between distribution components; for example, it is necessary to determine whether a replica has been created safely in a Tier 1 tape store before being able to delete it from a buffer at the source. This handover is enabled with well-defined handshakes or exchanges of state messages between distribution components. The conceptual basis of data distribution for CMS is then distribution through a hierarchy of sites, with smaller sites associating themselves to larger by subscribing to some subset of the data stored at the larger site. The management of this data poses two overall problems. The first problem is that sustained transfers at the 100+ MBps estimated for CMS alone are currently only approached by existing experiments. 
The second problem is one of managing the logisitics of subscription transfer based on metadata at granularities between high\",\"PeriodicalId\":66916,\"journal\":{\"name\":\"高能物理与核物理\",\"volume\":\"86 1\",\"pages\":\"838-841\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"高能物理与核物理\",\"FirstCategoryId\":\"1087\",\"ListUrlMain\":\"https://doi.org/10.5170/CERN-2005-002.838\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"高能物理与核物理","FirstCategoryId":"1087","ListUrlMain":"https://doi.org/10.5170/CERN-2005-002.838","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 32

Abstract

CMS currently uses a number of tools to transfer data which, taken together, form the basis of a heterogeneous datagrid. The range of tools used, and the directed, rather than optimized, nature of CMS's recent large-scale data challenge, required the creation of a simple infrastructure that allowed a range of tools to operate in a complementary way. The system created comprises a hierarchy of simple processes (named 'agents') that propagate files through a number of transfer states. File locations and some application metadata were stored in POOL file catalogues, with LCG LRC or MySQL back-ends. Agents were assigned limited responsibilities, and were restricted to communicating state in a well-defined, indirect fashion through a central transfer management database. In this way, the task of distributing data was easily divided between different groups for implementation. The prototype system was developed rapidly, and achieved the required sustained transfer rate of ~10 MBps, with O(10) files distributed to 6 sites from CERN. Experience with the system during the data challenge raised issues with the underlying technology (MSS write/read, stability of the LRC, maintenance of file catalogues, synchronization of filespaces), all of which have been successfully identified and handled. The development of this prototype infrastructure allows us to plan the evolution of backbone CMS data distribution from a simple hierarchy to a more autonomous, scalable model drawing on emerging agent and grid technology.

DATA DISTRIBUTION FOR CMS

The Compact Muon Solenoid (CMS) experiment at the LHC will produce petabytes of data a year [1]. This data is then to be distributed to multiple sites which form a hierarchical structure based on available resources: the detector is associated with a Tier 0 site; Tier 1 sites are typically large national computing centres; and Tier 2 sites are institutes with a more restricted availability of resources and/or services. A core set of Tier 1 sites with large tape, disk and network resources will receive raw and reconstructed data to safeguard against data loss at CERN. Smaller sites, associated with certain analysis groups or universities, will also subscribe to certain parts of the data. Sites at all levels will be involved in producing Monte Carlo data for comparison with detector data. At the Tier 0, the raw experiment data undergoes a process called reconstruction in which it is restructured to represent physics objects. This data will be grouped hierarchically by stream and dataset based on physics content, then further subdivided by finer-granularity metadata.

There are therefore three main use cases for distribution in CMS. The first can be described as a push with high priority, in which raw data is replicated to tape at Tier 1s. The second is a subscription pull, where a site subscribes to all data in a given set and data is transferred as it is produced; this use case corresponds to a site registering an interest in the data produced by an ongoing Monte Carlo simulation. The third is a random pull, where a site or individual physicist simply wishes to replicate an extant dataset in a one-off transfer.
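All three use cases are served by the agent pattern described in the abstract: simple processes, each with a narrow responsibility, that advance files through a sequence of transfer states and coordinate only indirectly through the central transfer management database. The following is a minimal sketch of that pattern rather than the CMS implementation: the state names, table layout and polling interval are assumptions, and SQLite stands in for the MySQL or LCG LRC back-ends actually used.

```python
import sqlite3
import time

# Illustrative transfer states: the paper says only that agents propagate
# files "through a number of transfer states", so these names are assumed.
STATES = ["queued", "staged", "transferring", "safe_on_tape", "done"]


def agent_loop(db, from_state, to_state, work):
    """One agent with one limited responsibility: poll the shared
    transfer-management database for files in `from_state`, do its piece of
    work, then publish the new state. Agents never talk to each other
    directly; all coordination happens through these rows."""
    while True:
        rows = db.execute(
            "SELECT id, filename FROM transfers WHERE state = ? LIMIT 10",
            (from_state,),
        ).fetchall()
        for file_id, filename in rows:
            work(filename)  # e.g. stage from MSS or invoke a transfer tool
            db.execute(
                "UPDATE transfers SET state = ? WHERE id = ?",
                (to_state, file_id),
            )
        db.commit()
        if not rows:
            time.sleep(30)  # idle poll interval (arbitrary choice)


if __name__ == "__main__":
    db = sqlite3.connect("transfers.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS transfers "
        "(id INTEGER PRIMARY KEY, filename TEXT, state TEXT)"
    )
    db.commit()
    # Each agent runs its own loop over one pair of adjacent states, e.g.:
    # agent_loop(db, "queued", "staged", work=lambda f: print("staging", f))
```

Dividing the chain into such narrowly scoped loops is what allowed the task of distributing data to be split easily between different groups for implementation.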
Although these use cases are discussed here in terms of push and pull, these can be slightly misleading descriptions. The key point is the effective handover of responsibility for replication between distribution components; for example, it is necessary to determine whether a replica has been created safely in a Tier 1 tape store before being able to delete it from a buffer at the source. This handover is enabled with well-defined handshakes or exchanges of state messages between distribution components. The conceptual basis of data distribution for CMS is then distribution through a hierarchy of sites, with smaller sites associating themselves with larger ones by subscribing to some subset of the data stored at the larger site.

The management of this data poses two overall problems. The first problem is that sustained transfers at the 100+ MBps estimated for CMS alone are currently only approached by existing experiments. The second problem is one of managing the logistics of subscription transfer based on metadata at granularities between high
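The handover of responsibility described above can be sketched in the same assumed schema: a destination-side agent publishes a 'safe on tape' state once the replica is secure in the Tier 1 MSS, and only then may a source-side agent release the corresponding buffer copy. Function, table and state names are again illustrative, not taken from the CMS system.

```python
import sqlite3


def confirm_tape_copy(db, filename):
    """Destination-side agent: after a successful migration to the Tier 1
    MSS, record the fact in the shared database for other agents to see."""
    db.execute(
        "UPDATE transfers SET state = 'safe_on_tape' WHERE filename = ?",
        (filename,),
    )
    db.commit()


def clean_source_buffer(db, buffered_files, delete):
    """Source-side agent: delete a buffered file only once the destination
    has recorded its replica as safely on tape, i.e. only after the
    handshake has handed responsibility for the replica over."""
    for filename in buffered_files:
        row = db.execute(
            "SELECT state FROM transfers WHERE filename = ?", (filename,)
        ).fetchone()
        if row and row[0] == "safe_on_tape":
            delete(filename)  # handover complete; safe to free buffer space
        # otherwise keep the buffer copy until the handshake completes
```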