Enhancing Clinical Data Warehouses with Provenance and Large File Management: The gitOmmix Approach for Clinical Omics Data

Authors: Maxime Wack (CRC, HeKA, HEGP, CHNO), Adrien Coulet (CRC, HeKA), Anita Burgun (HEGP, Imagine), Bastien Rance (UPCité, HEGP, CRC, HeKA)
arXiv - QuanBio - Quantitative Methods, 2024-09-05. DOI: arxiv-2409.03288 (https://doi.org/arxiv-2409.03288)

Background. Clinical data warehouses (CDWs) are essential to the reuse of
hospital data in observational studies or predictive modeling. However,
state-of-the-art CDW systems present two drawbacks. First, they do not support the
management of large data files, which is critical in medical genomics,
radiology, digital pathology, and other domains where such files are generated.
Second, they do not provide provenance management or means to represent
longitudinal relationships between patient events. Indeed, a disease diagnosis
and its follow-up rely on multiple analyses. In these cases, no relationship
between the data (e.g., a large file) and its associated analysis and decision
can be documented.

Method. We introduce gitOmmix, an approach that overcomes
these limitations, and illustrate its usefulness in the management of medical
omics data. gitOmmix relies on (i) a file versioning system: git, (ii) an
extension that handles large files: git-annex, (iii) a provenance knowledge
graph: PROV-O, and (iv) an alignment between the git versioning information and
the provenance knowledge graph.

Results. Capabilities inherited from git and
git-annex enable retracing the history of a clinical interpretation back to the
patient sample, through supporting data and analyses. In addition, the
provenance knowledge graph, aligned with the git versioning information,
enables querying and browsing provenance relationships between these
elements.

Conclusion. gitOmmix adds a provenance layer to CDWs, while scaling to
large files and remaining agnostic to the CDW system. For these reasons, we think
that it is a viable and generalizable solution for clinical omics studies.
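
To make the idea of retracing a clinical interpretation back to the patient sample more concrete, here is a minimal sketch in plain Python. It is not the gitOmmix implementation. The property names (prov:wasGeneratedBy, prov:used) are real PROV-O terms, but every identifier, file name, and commit id below is hypothetical, and the dictionary that maps provenance nodes to commits is only an assumed stand-in for the alignment between git versioning information and the provenance knowledge graph.

```python
from collections import deque

# Property names taken from the W3C PROV-O vocabulary.
WAS_GENERATED_BY = "prov:wasGeneratedBy"
USED = "prov:used"

# Hypothetical provenance triples (subject, predicate, object).
# Each entity points to the activity that produced it, and each
# activity points to the entity it consumed.
TRIPLES = [
    ("interpretation:report", WAS_GENERATED_BY, "activity:clinical_review"),
    ("activity:clinical_review", USED, "result:variants.vcf"),
    ("result:variants.vcf", WAS_GENERATED_BY, "activity:variant_calling"),
    ("activity:variant_calling", USED, "data:reads.bam"),
    ("data:reads.bam", WAS_GENERATED_BY, "activity:sequencing"),
    ("activity:sequencing", USED, "sample:biopsy"),
]

# Assumed alignment: provenance node -> git commit id (placeholder values).
COMMITS = {
    "interpretation:report": "c3c3c3c",
    "result:variants.vcf": "b2b2b2b",
    "data:reads.bam": "a1a1a1a",
}


def trace_back(start, triples):
    """Walk provenance edges breadth-first from `start` towards its origins."""
    edges = {}
    for subject, predicate, obj in triples:
        edges.setdefault(subject, []).append((predicate, obj))

    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _predicate, target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


def commits_for(nodes, commits=COMMITS):
    """Return (node, commit id) pairs for the nodes that are versioned."""
    return [(node, commits[node]) for node in nodes if node in commits]


lineage = trace_back("interpretation:report", TRIPLES)
for node in lineage:
    print(node, COMMITS.get(node, "(no commit recorded)"))
```

In this sketch the graph answers the question "what did this interpretation depend on?", while the commit lookup indicates which versioned state of the repository holds each supporting file. A real deployment would presumably store the triples in an RDF store and query them with SPARQL rather than walk a Python list.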