Manasvi Goyal, Andrea Zonca, Amy Roberts, Jim Pivarski, Ianna Osborne
arXiv - CS - Programming Languages, 2024-07-19, arxiv-2407.14461 (https://doi.org/arxiv-2407.14461)
Describe Data to get Science-Data-Ready Tooling: Awkward as a Target for Kaitai Struct YAML
In some fields, scientific data formats differ across experiments due to
specialized hardware and data acquisition systems. Researchers need to develop,
document, and maintain experiment-specific analysis software to interact with
these data formats. This software is often tightly coupled to a particular
data format. This proliferation of custom data formats has been a prominent
challenge for small to mid-scale experiments. The widespread adoption of ROOT
has largely mitigated this problem for the Large Hadron Collider experiments.
However, many smaller experiments continue to use custom data formats to meet
specific research needs. Simplifying access to such one-off data formats
for analysis is therefore valuable to scientific communities
in high-energy physics (HEP). We have added Awkward Arrays as a target language for Kaitai Struct
for this purpose. Researchers can describe their custom data format in the
Kaitai Struct YAML (KSY) language. From this description, the Kaitai Struct
Compiler generates C++ code that fills the LayoutBuilder buffers. In a few steps,
the Kaitai Struct Awkward Runtime API can convert the generated C++ code into a
compiled Python module. Finally, the raw data can be passed to the module to
produce Awkward Arrays. This paper introduces the Awkward Target for the Kaitai
Struct Compiler and the Kaitai Struct Awkward Runtime API. It also demonstrates
the conversion of a given KSY for a specific custom file format to Awkward
Arrays.
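
To make the idea of filling columnar buffers from a described binary format concrete, here is a minimal pure-Python sketch. It is not the code generated by the Kaitai Struct Compiler and does not use the Awkward or Kaitai libraries. The binary layout (a little-endian event count, then for each event a hit count followed by that many 32-bit floats) and the name `parse_events` are hypothetical, chosen only for illustration. The sketch shows how variable-length records can be flattened into an offsets buffer and a content buffer, which is the general shape of a list-type Awkward Array.

```python
import struct


def parse_events(raw: bytes):
    """Parse a hypothetical custom format into columnar buffers.

    Assumed layout (illustrative only, all little-endian):
      u4           number of events
      per event:
        u2         number of hits
        f4 * hits  hit energies
    """
    offsets = [0]  # event i spans content[offsets[i]:offsets[i + 1]]
    content = []   # all hit energies, stored contiguously

    (n_events,) = struct.unpack_from("<I", raw, 0)
    pos = 4
    for _ in range(n_events):
        (n_hits,) = struct.unpack_from("<H", raw, pos)
        pos += 2
        content.extend(struct.unpack_from(f"<{n_hits}f", raw, pos))
        pos += 4 * n_hits
        offsets.append(len(content))
    return offsets, content


def to_lists(offsets, content):
    """Rebuild the nested (jagged) view from the two flat buffers."""
    return [content[a:b] for a, b in zip(offsets, offsets[1:])]


# Sample data: three events with 2, 0, and 1 hits.
raw = struct.pack("<I", 3)
raw += struct.pack("<H2f", 2, 1.5, 2.5)
raw += struct.pack("<H", 0)
raw += struct.pack("<H1f", 1, 4.0)

offsets, content = parse_events(raw)
nested = to_lists(offsets, content)
```

The nested view is recovered only on demand; the data itself lives in two flat buffers regardless of how many events there are, which is what makes this layout suitable for array-at-a-time analysis.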