Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data

arXiv - CS - Databases | Pub Date: 2024-09-03 | DOI: arxiv-2409.01517

Gavin Chait

Citations: 0

Abstract

This paper presents an open-source curatorial toolkit intended to produce well-structured and interoperable data. Curation is divided into discrete components, with a schema-centric focus on auditable restructuring of complex and scattered tabular data so that it conforms to a destination schema. Separating these tasks allows software and analysis to be developed without the source data being present. Transformations are captured as high-level sequential scripts describing schema-to-schema mappings, which reduces complexity and resource requirements. Ultimately, data are transformed, but the objective is that any data meeting a schema definition can be restructured using a crosswalk. The toolkit is available both as a Python package and as a 'no-code' visual web application. A visual example is presented, derived from a longitudinal study in which scattered source data from hundreds of local councils are integrated into a single database.
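To make the idea of a crosswalk as a "high-level sequential script" concrete, the following is a minimal sketch in plain Python. It is not the toolkit's API: the action helpers (`rename`, `to_float`, `select`), the `apply_crosswalk` function, the field names, and the sample row are all hypothetical and invented for illustration. The sketch only assumes that a crosswalk is an ordered list of schema-to-schema actions that can be written without the source data and later applied to any rows that match the source schema.

```python
from typing import Callable

# A row is a mapping of field name to value; an action maps one row to another.
Row = dict[str, object]
Action = Callable[[Row], Row]


def rename(source: str, destination: str) -> Action:
    """Hypothetical action: move a value from a source field to a destination field."""
    def apply(row: Row) -> Row:
        out = dict(row)  # copy so the source row is never mutated
        out[destination] = out.pop(source)
        return out
    return apply


def to_float(field: str) -> Action:
    """Hypothetical action: parse a text value such as '12,500' into a float."""
    def apply(row: Row) -> Row:
        out = dict(row)
        out[field] = float(str(out[field]).replace(",", ""))
        return out
    return apply


def select(fields: list[str]) -> Action:
    """Hypothetical action: keep only the fields named in the destination schema."""
    def apply(row: Row) -> Row:
        return {name: row[name] for name in fields}
    return apply


def apply_crosswalk(rows: list[Row], crosswalk: list[Action]) -> list[Row]:
    """Apply each action in order to every row and return the restructured rows."""
    result = []
    for row in rows:
        for action in crosswalk:
            row = action(row)
        result.append(row)
    return result


# Assumed destination schema (illustrative field names only).
DESTINATION_FIELDS = ["council", "property_ref", "rateable_value"]

# The crosswalk is defined against schemas alone; no data is needed to write it.
crosswalk = [
    rename("Authority", "council"),
    rename("Prop Ref", "property_ref"),
    rename("RV (GBP)", "rateable_value"),
    to_float("rateable_value"),
    select(DESTINATION_FIELDS),
]

# Invented sample row standing in for one source table that matches the source schema.
source_rows = [
    {"Authority": "Council A", "Prop Ref": "0001", "RV (GBP)": "12,500", "Notes": "n/a"},
]

restructured = apply_crosswalk(source_rows, crosswalk)
print(restructured)
```

Because the crosswalk here is an ordered list of small, named steps, it can be reviewed on its own and reused on any other table that meets the same source schema, which is the property the abstract describes as auditable and reusable.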

