{"title":"Developing an Analytical Pipeline to Classify Patient Safety Event Reports Using Optimized Predictive Algorithms.","authors":"Asa Adadey, Robert Giannini, Lorraine B Possanza","doi":"10.1055/s-0041-1735620","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Patient safety event reports provide valuable insight into systemic safety issues but deriving insights from these reports requires computational tools to efficiently parse through large volumes of qualitative data. Natural language processing (NLP) combined with predictive learning provides an automated approach to evaluating these data and supporting the work of patient safety analysts.</p><p><strong>Objectives: </strong>The objective of this study was to use NLP and machine learning techniques to develop a generalizable, scalable, and reliable approach to classifying event reports for the purpose of driving improvements in the safety and quality of patient care.</p><p><strong>Methods: </strong>Datasets for 14 different labels (themes) were vectorized using a bag-of-words, <i>tf-idf</i>, or document embeddings approach and then applied to a series of classification algorithms via a hyperparameter grid search to derive an optimized model. Reports were also analyzed for terms strongly associated with each theme using an adjusted F-score calculation.</p><p><strong>Results: </strong>F<sub>1</sub> score for each optimized model ranged from 0.951 (\"Fall\") to 0.544 (\"Environment\"). The bag-of-words approach proved optimal for 12 of 14 labels, and the naïve Bayes algorithm performed best for nine labels. Linear support vector machine was demonstrated as optimal for three labels and XGBoost for four of the 14 labels. Labels with more distinctly associated terms performed better than less distinct themes, as shown by a Pearson's correlation coefficient of 0.634.</p><p><strong>Conclusions: </strong>We were able to demonstrate an analytical pipeline that broadly applies NLP and predictive modeling to categorize patient safety reports from multiple facilities. This pipeline allows analysts to more rapidly identify and structure information contained in patient safety data, which can enhance the evaluation and the use of this information over time.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"60 5-06","pages":"147-161"},"PeriodicalIF":1.3000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods of Information in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/s-0041-1735620","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2021/10/31 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Patient safety event reports provide valuable insight into systemic safety issues but deriving insights from these reports requires computational tools to efficiently parse through large volumes of qualitative data. Natural language processing (NLP) combined with predictive learning provides an automated approach to evaluating these data and supporting the work of patient safety analysts.
Objectives: The objective of this study was to use NLP and machine learning techniques to develop a generalizable, scalable, and reliable approach to classifying event reports for the purpose of driving improvements in the safety and quality of patient care.
Methods: Datasets for 14 different labels (themes) were vectorized using a bag-of-words, tf-idf, or document embeddings approach and then applied to a series of classification algorithms via a hyperparameter grid search to derive an optimized model. Reports were also analyzed for terms strongly associated with each theme using an adjusted F-score calculation.
Results: F1 score for each optimized model ranged from 0.951 ("Fall") to 0.544 ("Environment"). The bag-of-words approach proved optimal for 12 of 14 labels, and the naïve Bayes algorithm performed best for nine labels. Linear support vector machine was demonstrated as optimal for three labels and XGBoost for four of the 14 labels. Labels with more distinctly associated terms performed better than less distinct themes, as shown by a Pearson's correlation coefficient of 0.634.
Conclusions: We were able to demonstrate an analytical pipeline that broadly applies NLP and predictive modeling to categorize patient safety reports from multiple facilities. This pipeline allows analysts to more rapidly identify and structure information contained in patient safety data, which can enhance the evaluation and the use of this information over time.
期刊介绍:
Good medicine and good healthcare demand good information. Since the journal''s founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal''s issue.