Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data

arXiv - CS - Databases Pub Date: 2024-09-03 DOI: arxiv-2409.01517

Gavin Chait


Citations: 0


Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data

This paper presents an open-source curatorial toolkit intended to produce well-structured and interoperable data. Curation is divided into discrete components, with a schema-centric focus for auditable restructuring of complex and scattered tabular data to conform to a destination schema. Task separation allows development of software and analysis without source data being present. Transformations are captured as high-level sequential scripts describing schema-to-schema mappings, reducing complexity and resource requirements. Ultimately, data are transformed, but the objective is that any data meeting a schema definition can be restructured using a crosswalk. The toolkit is available both as a Python package, and as a 'no-code' visual web application. A visual example is presented, derived from a longitudinal study where scattered source data from hundreds of local councils are integrated into a single database.
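The idea of a crosswalk as a sequential, schema-to-schema script can be illustrated with a minimal sketch. This is not the toolkit's API: the `Schema`, `Action`, `apply_crosswalk`, and `audit_trail` names, the field names, and the sample row are all hypothetical, chosen only to show how a mapping can be written against schema definitions alone and then applied to any data that meets the source schema.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema: a name plus the fields a row must contain."""
    name: str
    fields: tuple[str, ...]

    def validate(self, row: Row) -> None:
        missing = [f for f in self.fields if f not in row]
        if missing:
            raise ValueError(f"{self.name}: missing fields {missing}")


@dataclass(frozen=True)
class Action:
    """One crosswalk step: fill `target` from `sources` using `fn`."""
    description: str  # human-readable, so the script can be audited
    target: str
    sources: tuple[str, ...]
    fn: Callable[..., object]


def apply_crosswalk(rows, source, destination, actions):
    """Restructure rows that meet `source` so that they meet `destination`."""
    out = []
    for row in rows:
        source.validate(row)
        new_row = {a.target: a.fn(*(row[s] for s in a.sources)) for a in actions}
        destination.validate(new_row)
        out.append(new_row)
    return out


def audit_trail(actions):
    """Render the crosswalk as an ordered, readable list of steps."""
    return [
        f"{i}. {a.target} <- {list(a.sources)}: {a.description}"
        for i, a in enumerate(actions, 1)
    ]


# Assumed schemas for illustration; defined without any source data present.
SOURCE = Schema("council_source", ("Property Ref", "Rates Payable (GBP)", "Occupied?"))
DESTINATION = Schema("destination", ("property_id", "rates_gbp", "is_occupied"))

# The crosswalk itself: a sequential script of schema-to-schema mappings.
CROSSWALK = (
    Action("rename property reference", "property_id", ("Property Ref",), str),
    Action(
        "parse rates as a number",
        "rates_gbp",
        ("Rates Payable (GBP)",),
        lambda v: float(str(v).replace(",", "")),
    ),
    Action(
        "categorise occupancy as a boolean",
        "is_occupied",
        ("Occupied?",),
        lambda v: str(v).strip().lower() == "yes",
    ),
)

if __name__ == "__main__":
    sample = [{"Property Ref": 1001, "Rates Payable (GBP)": "1,250.50", "Occupied?": " Yes "}]
    print(apply_crosswalk(sample, SOURCE, DESTINATION, CROSSWALK))
    print("\n".join(audit_trail(CROSSWALK)))
```

Because `CROSSWALK` refers only to field names in the two schemas, the same script can in principle be reused on any table that validates against `SOURCE`, and `audit_trail` gives a plain-text record of what each step does.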
