Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data
Gavin Chait
arXiv - CS - Databases, arXiv 2409.01517, published 2024-09-03
This paper presents an open-source curatorial toolkit intended to produce
well-structured and interoperable data. Curation is divided into discrete
components, with a schema-centric focus for auditable restructuring of complex
and scattered tabular data to conform to a destination schema. Task separation
allows development of software and analysis without source data being present.
Transformations are captured as high-level sequential scripts describing
schema-to-schema mappings, reducing complexity and resource requirements.
Ultimately, data are transformed, but the objective is that any data meeting a
schema definition can be restructured using a crosswalk. The toolkit is
available both as a Python package and as a 'no-code' visual web application.
A visual example is presented, derived from a longitudinal study where
scattered source data from hundreds of local councils are integrated into a
single database.
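
The idea of a crosswalk as a sequential script of schema-to-schema mappings can be illustrated with a minimal sketch. This is not the toolkit's API: the action helpers (`rename`, `cast`, `select`), the field names, and the destination schema below are all hypothetical, chosen only to show how an ordered list of actions, written against field names rather than against any particular dataset, might restructure source rows to conform to a destination schema.

```python
from typing import Callable, Iterable

Row = dict
Action = Callable[[Row], Row]

# Hypothetical destination schema: field name -> expected Python type.
DESTINATION_SCHEMA = {"council": str, "year": int, "rateable_value": float}


def rename(source: str, target: str) -> Action:
    """Build an action that moves the value in `source` to `target`."""
    def apply(row: Row) -> Row:
        out = dict(row)  # copy so the source row is never mutated
        out[target] = out.pop(source)
        return out
    return apply


def cast(field: str, to_type: type) -> Action:
    """Build an action that converts `field` to `to_type`."""
    def apply(row: Row) -> Row:
        out = dict(row)
        out[field] = to_type(out[field])
        return out
    return apply


def select(fields: Iterable[str]) -> Action:
    """Build an action that keeps only the named fields."""
    keep = list(fields)
    def apply(row: Row) -> Row:
        return {name: row[name] for name in keep}
    return apply


# The crosswalk: an ordered, human-readable list of actions. It refers only
# to (hypothetical) source and destination field names, so it can be written
# and reviewed without the source data being present.
CROSSWALK = [
    rename("Authority", "council"),
    rename("Yr", "year"),
    rename("RV (GBP)", "rateable_value"),
    cast("year", int),
    cast("rateable_value", float),
    select(DESTINATION_SCHEMA),
]


def apply_crosswalk(rows: Iterable[Row], crosswalk: list) -> list:
    """Apply every action, in order, to every row."""
    result = []
    for row in rows:
        for action in crosswalk:
            row = action(row)
        result.append(row)
    return result


def conforms(row: Row, schema: dict) -> bool:
    """Check that a row has exactly the schema's fields with matching types."""
    return set(row) == set(schema) and all(
        isinstance(row[name], expected) for name, expected in schema.items()
    )
```

Because the crosswalk is plain data (a list of named steps), the same list can be reapplied to any source table that meets the assumed source schema, and the list itself serves as the audit record of what was done.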