{"title":"使用监督机器学习将临床试验遗留数据半自动转换为CDISC SDTM标准格式。","authors":"Takuma Oda, Shih-Wei Chiu, Takuhiro Yamaguchi","doi":"10.1055/s-0041-1731388","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong> This study aimed to develop a semi-automated process to convert legacy data into clinical data interchange standards consortium (CDISC) study data tabulation model (SDTM) format by combining human verification and three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.</p><p><strong>Materials and methods: </strong> Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are string data, they had been converted to a distributed representation to make them usable as machine learning features. For this purpose, we utilized the following methods for distributed representation: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. In this study, we examined five algorithms-namely decision tree, random forest, gradient boosting, neural network, and an ensemble that combines the four algorithms-to identify the one that could generate the best prediction model.</p><p><strong>Results: </strong> The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also showed a split between the correct and incorrect distributions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.</p><p><strong>Conclusion: </strong> By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"60 1-02","pages":"49-61"},"PeriodicalIF":1.3000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning.\",\"authors\":\"Takuma Oda, Shih-Wei Chiu, Takuhiro Yamaguchi\",\"doi\":\"10.1055/s-0041-1731388\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong> This study aimed to develop a semi-automated process to convert legacy data into clinical data interchange standards consortium (CDISC) study data tabulation model (SDTM) format by combining human verification and three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.</p><p><strong>Materials and methods: </strong> Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are string data, they had been converted to a distributed representation to make them usable as machine learning features. For this purpose, we utilized the following methods for distributed representation: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. In this study, we examined five algorithms-namely decision tree, random forest, gradient boosting, neural network, and an ensemble that combines the four algorithms-to identify the one that could generate the best prediction model.</p><p><strong>Results: </strong> The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also showed a split between the correct and incorrect distributions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.</p><p><strong>Conclusion: </strong> By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.</p>\",\"PeriodicalId\":49822,\"journal\":{\"name\":\"Methods of Information in Medicine\",\"volume\":\"60 1-02\",\"pages\":\"49-61\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2021-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Methods of Information in Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1055/s-0041-1731388\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2021/7/8 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods of Information in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/s-0041-1731388","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2021/7/8 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning.
Objective: This study aimed to develop a semi-automated process to convert legacy data into clinical data interchange standards consortium (CDISC) study data tabulation model (SDTM) format by combining human verification and three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.
Materials and methods: Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are string data, they had been converted to a distributed representation to make them usable as machine learning features. For this purpose, we utilized the following methods for distributed representation: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. In this study, we examined five algorithms-namely decision tree, random forest, gradient boosting, neural network, and an ensemble that combines the four algorithms-to identify the one that could generate the best prediction model.
Results: The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also showed a split between the correct and incorrect distributions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
Conclusion: By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.
期刊介绍:
Good medicine and good healthcare demand good information. Since the journal''s founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal''s issue.