Hamed Jafarpour, Guosong Wu, Cheligeer (Ken) Cheligeer, Jun Yan, Yuan Xu, Danielle A. Southern, Cathy A. Eastwood, Yong Zeng, Hude Quan
Preprocessing narrative texts in electronic medical records to identify hospital adverse events: A scoping review
Artificial Intelligence in Medicine, volume 170, Article 103281, published 2025-10-08
DOI: 10.1016/j.artmed.2025.103281
https://www.sciencedirect.com/science/article/pii/S0933365725002167
Abstract
Background:
Narrative electronic medical records (EMR), the textual notes that clinicians write in healthcare settings, are a significant resource for documenting many facets of patient care. Such text has distinctive characteristics: grammatically incorrect sentences, abbreviations, frequent acronyms, special characters with particular meanings, negation expressions, and sporadic misspellings. A primary goal in processing these notes is therefore to apply effective preprocessing techniques that improve data quality and ensure consistency across entries. Recent advances in natural language processing (NLP), machine learning (ML), and large language models (LLMs) have prompted researchers to use narrative EMR to detect hospital adverse events (HAE).
Methods:
The scoping review adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines. A scoping review protocol was developed and utilized to guide the research process, clearly outlining the eligibility criteria, information sources, search strategies, data management, selection process, data collection procedures, data items, outcomes and prioritization, data synthesis, and meta-bias considerations. The search strategy was implemented across nine engineering and medical electronic databases.
Results:
Of 3,264 studies retrieved, 48 unique studies were included in the review. Responses to the research questions were systematically extracted from these studies. The review identified challenges associated with preprocessing narrative EMR text for HAE identification. It also identified three research gaps: (1) the need for a pipeline to preprocess narrative EMR for HAE identification, (2) the need for a robust system capable of managing the large volume of narrative EMR data, and (3) the need for a temporal event system, all of which are essential for effective HAE detection. The study underscored the essential role of preprocessing in improving HAE detection performance. In particular, it emphasized three preprocessing tasks that significantly affect HAE detection performance: extracting N-grams from clinical text, normalizing these N-grams through lemmatization and/or stemming, and extracting semantic features. Although LLM-based systems incorporate tokenization and normalization within their frameworks, it remains crucial during preprocessing to address features that are semantically relevant to the specific type of HAE.
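The N-gram extraction and stemming steps named in the Results can be illustrated with a minimal sketch. The abstract does not specify any implementation, so everything here is an assumption made for illustration: the function names (`tokenize`, `toy_stem`, `ngrams`, `ngram_counts`), the tiny suffix list, and the sample note are hypothetical, and the toy stemmer merely stands in for an established stemmer or lemmatizer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list. A real pipeline would use an
# established stemmer or lemmatizer rather than this toy rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the note and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n=2):
    """Count n-grams over the stemmed tokens of one note."""
    stems = [toy_stem(t) for t in tokenize(text)]
    return Counter(ngrams(stems, n))


# Invented sample note: the singular and plural forms collapse to one bigram.
note = "Pressure ulcers seen; pressure ulcer worsening"
counts = ngram_counts(note, n=2)
print(counts[("pressure", "ulcer")])
```

Because "ulcers" and "ulcer" reduce to the same stem, the two mentions are counted as a single bigram feature, which is the kind of consistency that normalization is meant to provide. The sketch ignores sentence boundaries, negation, and abbreviations, all of which the review lists as characteristics of clinical text.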
Conclusion:
This scoping review provides valuable insights for researchers working on HAE detection from narrative EMR data. It shows how preprocessing tasks can improve HAE detection performance and draws attention to neglected research gaps in the field. Addressing these gaps will require further research.
About the journal:
Artificial Intelligence in Medicine publishes original articles from a wide variety of interdisciplinary perspectives concerning the theory and practice of artificial intelligence (AI) in medicine, medically-oriented human biology, and health care.
Artificial intelligence in medicine may be characterized as the scientific discipline pertaining to research studies, projects, and applications that aim at supporting decision-based medical tasks through knowledge- and/or data-intensive computer-based solutions that ultimately support and improve the performance of a human care provider.