Project Victoria: A pragmatic data model to automate RWE generation from the national French claims database.

IF 2.3 · Medicine, Tier 3 · Q2 Health Care Sciences & Services

Health Informatics Journal · Pub Date: 2025-01-01 · DOI: 10.1177/14604582251318250

Kevin Ouazzani, Xavier Ansolabehere, Florence Journeau, Alexandre Vidal, Nicolas Jaubourg, Maxime Doublet, Raphael Thollot, Arnaud Fabre, Nicolas Glatt

Abstract

Objective: This paper describes Victoria, an empirically built data pipeline for the SNDS, designed to:
- build an automated, scalable pipeline that supports the data model changes inherent to the use of large databases,
- deliver a documented pipeline with clear processes, enabling scientific and epidemiological research,
- ease access to SNDS data in compliance with regulatory requirements.

Methods: This paper describes the two-step process of the Victoria pipeline and its final output. The initial cleaning step consists of formatting the data, deleting empty, erroneous or duplicate records, and renaming variables without changing their values, in accordance with the official SNDS documentation. The second step consists of creating 2 linearised data models: every line of each table is an event, and each table is indexed with a unique patient identifier, without the need for a central patient or identifier table. These 2 models are:
- the epidemiological model, used to answer most research questions requiring population phenotyping (demography, diagnoses, procedure characteristics),
- the medico-economic model, used for cost and healthcare consumption analyses. It contains more complex information about reimbursement rates, and its data quality assessment focuses on costs rather than on medico-administrative information.

Results: The pipeline was executed on 2 different datasets representing ∼85 000 and ∼870 000 beneficiaries, with the following configuration: one master with 4 cores and 16 GB of RAM, and 4 and 6 workers, respectively. The total execution time was 25 h for the smaller dataset and 96 h for the larger one. Format conversion to Parquet accounted for the largest share of those times. The cleaning step took only 4 h in both cases. The epidemiological model took 344 min for the smaller dataset and 1934 min for the larger one. The medico-economic model took the longest, with 704 min and 2145 min, respectively.

Conclusion: Victoria is a successfully implemented SNDS pipeline. Compared to previous pipelines, reviewability is part of its design, as unit tests and quality assessments can be developed natively to ensure data and analysis quality. The pipeline has been used for 2 published studies. Recent work toward OMOP conversion will be integrated in upcoming versions and, as Victoria is set to run on a CD platform, potential evolutions of the SNDS format can be accommodated.
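The two steps described in the Methods can be illustrated with a minimal Python sketch. It is not the Victoria implementation: the column names (`pid`, `dt`, `cd`), the rename mapping, the rule that a record without a patient identifier counts as an error, and the function names `clean` and `linearise` are all assumptions made for illustration, since the abstract does not give the actual SNDS variables or rules.

```python
# Hypothetical mapping from raw to documented variable names
# (the real SNDS mapping is not given in the abstract).
RENAME = {"pid": "patient_id", "dt": "event_date", "cd": "code"}


def clean(records, rename=RENAME):
    """Step 1 (sketch): drop empty, erroneous and duplicate records,
    then rename variables without changing their values."""
    seen = set()
    cleaned = []
    for rec in records:
        # Empty record: every value is missing or blank.
        if all(v is None or str(v).strip() == "" for v in rec.values()):
            continue
        renamed = {rename.get(k, k): v for k, v in rec.items()}
        # Assumed error rule: a record without a patient identifier is unusable.
        if not renamed.get("patient_id"):
            continue
        # Duplicate record: same variables and same values as one already kept.
        key = tuple(sorted(renamed.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(renamed)
    return cleaned


def linearise(tables):
    """Step 2 (sketch): one row per event, each row carrying the patient
    identifier, so no central patient table is needed to follow a patient."""
    events = []
    for event_type, rows in tables.items():
        for row in rows:
            events.append({
                "patient_id": row["patient_id"],
                "event_type": event_type,
                "event_date": row["event_date"],
                "code": row["code"],
            })
    # ISO dates sort correctly as strings.
    events.sort(key=lambda e: (e["patient_id"], e["event_date"]))
    return events


raw_diagnoses = [
    {"pid": "P1", "dt": "2020-01-05", "cd": "E11"},
    {"pid": "P1", "dt": "2020-01-05", "cd": "E11"},   # duplicate
    {"pid": "", "dt": "", "cd": ""},                  # empty
    {"pid": None, "dt": "2020-02-01", "cd": "I10"},   # error: no patient id
    {"pid": "P2", "dt": "2020-03-01", "cd": "I10"},
]
raw_procedures = [{"pid": "P1", "dt": "2019-12-01", "cd": "PROC1"}]

events = linearise({
    "diagnosis": clean(raw_diagnoses),
    "procedure": clean(raw_procedures),
})
for event in events:
    print(event)
```

Because every event row carries its own `patient_id`, a patient's history is obtained by filtering the event list directly, which mirrors the abstract's point that no central patient or identifier table is required.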

Source journal: Health Informatics Journal (Health Care Sciences & Services; Medical Informatics)

CiteScore: 7.80
Self-citation rate: 6.70%
Review time: 6 months

Journal description: Health Informatics Journal is an international peer-reviewed journal. All papers submitted to Health Informatics Journal are subject to peer review by members of a carefully appointed editorial board. The journal operates a conventional single-blind reviewing policy in which the reviewer's name is always concealed from the submitting author.