Modeling a Large Data-Acquisition Network in a Simulation Framework

2015 IEEE International Conference on Cluster Computing. Pub Date: 2015-09-08. DOI: 10.1109/CLUSTER.2015.137

T. Colombo, H. Fröning, P. García, W. Vandelli


Citations: 4

Abstract

The ATLAS detector at CERN records particle collision "events" delivered by the Large Hadron Collider. Its data-acquisition system identifies, selects, and stores interesting events in near real-time, with an aggregate throughput of several tens of GB/s. It is a distributed software system executed on a farm of roughly 2000 commodity worker nodes communicating via TCP/IP on an Ethernet network. Event data fragments are received from the many detector readout channels and are buffered, collected together, analyzed, and either stored permanently or discarded. This system, like data-acquisition systems in general, is sensitive to the latency of the data transfer from the readout buffers to the worker nodes. Challenges affecting this transfer include the many-to-one communication pattern and the inherently bursty nature of the traffic. In this paper we introduce the main performance issues brought about by this workload, focusing in particular on the so-called TCP incast pathology. Since systematic studies of these issues are often impeded by operational constraints related to the mission-critical nature of these systems, we focus instead on the development of a simulation model of the ATLAS data-acquisition system, used as a case study. The simulation is based on the well-established OMNeT++ framework. Its results are compared with existing measurements of the system's behavior. The successful reproduction of the measurements by the simulations validates the modeling approach. We share some of the preliminary findings obtained from the simulation, as an example of the additional possibilities it enables, and outline the planned future investigations.
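The many-to-one, bursty pattern mentioned in the abstract can be illustrated with a toy model. The sketch below is not the authors' OMNeT++ model and does not implement TCP: it is a discrete-time approximation, written in Python for brevity, of a single switch output port with a finite buffer. The function name `simulate_incast` and all of its parameters are hypothetical, and dropped packets are simply counted rather than retransmitted. It only aims to show why many senders answering at once can overflow a buffer that copes well with a few.

```python
def simulate_incast(num_senders, packets_per_sender, buffer_packets, drain_per_tick):
    """Toy model of one switch output port fed by many synchronized senders.

    Assumptions (illustrative only, not taken from the paper):
    - every sender injects one packet per tick until its fragment is sent;
    - the port forwards at most `drain_per_tick` packets per tick;
    - packets arriving at a full buffer are dropped and never retransmitted.
    """
    queue = 0          # packets currently waiting in the port buffer
    delivered = 0      # packets forwarded to the single receiver
    dropped = 0        # packets lost to buffer overflow
    ticks = 0
    remaining = [packets_per_sender] * num_senders

    while queue > 0 or any(remaining):
        # Arrival phase: all senders that still have data send simultaneously.
        for i in range(num_senders):
            if remaining[i] > 0:
                remaining[i] -= 1
                if queue < buffer_packets:
                    queue += 1
                else:
                    dropped += 1
        # Service phase: the output link drains a fixed number of packets.
        sent = min(queue, drain_per_tick)
        queue -= sent
        delivered += sent
        ticks += 1

    return {"delivered": delivered, "dropped": dropped, "ticks": ticks}


if __name__ == "__main__":
    # A few senders fit in the buffer; many senders overflow it.
    for senders in (4, 40):
        result = simulate_incast(senders, packets_per_sender=2,
                                 buffer_packets=10, drain_per_tick=1)
        print(senders, result)
```

In a real TCP setting each loss would also trigger retransmissions and possibly timeouts, which is where the latency penalty of incast comes from; the toy model leaves that out and only exposes the buffer overflow itself.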



