Bushra Hossain, Sarah M Preum, Md Fazle Rabbi, Rifat Ara, Mohammed Eunus Ali
{"title":"从在线话语中提取复杂情况的症状(Subreddit to Symptomatology):基于词典的方法。","authors":"Bushra Hossain, Sarah M Preum, Md Fazle Rabbi, Rifat Ara, Mohammed Eunus Ali","doi":"10.2196/70940","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Millions of people affected with complex medical conditions with diverse symptoms often turn to online discourse to share their experiences. While some studies have explored natural language processing methods and medical information extraction tools, these typically focus on generic symptoms in clinical notes and struggle to identify patient-reported, disease-specific, subtle symptoms from online health discourse.</p><p><strong>Objective: </strong>We aimed to extract patient-reported, disease-specific symptoms shared on social media reflecting the lived experiences of thousands of affected individuals and explore the characteristics, prevalence, and occurrence patterns of the symptoms.</p><p><strong>Methods: </strong>We propose a lexicon-based symptom extraction (LSE) method to identify a diverse list of disease-specific, patient-reported symptoms. We initially used a large language model to accelerate the extraction of symptom-related key phrases that formed the lexicon. We evaluated the effectiveness of lexicon extraction against human annotation using a Jaccard index score. We then leveraged BERT-Base, BioBERT, and Phrase-BERT-based embeddings to learn representations of these symptom-related key phrases and cluster similar symptoms using k-means and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Among the different options explored in our experiments, BioBERT-based k-means clustering was found to be the most effective. Finally, we applied symptom normalization to eliminate duplicate and redundant entries in the comprehensive symptom list.</p><p><strong>Results: </strong>In a real-world polycystic ovary syndrome (PCOS) subreddit dataset, we found that LSE significantly outperformed state-of-the-art baselines, achieving at least 41% and 20% higher F<sub>1</sub>-scores (mean 86.10) than automatic medical extraction tools and large language models, respectively. Notably, the comprehensive list of 64 PCOS symptoms generated via LSE ensured extensive coverage of symptoms reported in 7 reputable eHealth forums. Analyzing PCOS symptomatology revealed 28 potentially emerging symptoms and 8 self-reported comorbidities co-occurring with PCOS.</p><p><strong>Conclusions: </strong>The comprehensive patient-reported, disease-specific symptom list can help patients and health practitioners resolve uncertainties surrounding the disease, eliminating the variability of PCOS symptoms prevailing in the community. Analyzing PCOS symptomatology across varied dimensions provides valuable insights for public health research.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e70940"},"PeriodicalIF":3.8000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12475878/pdf/","citationCount":"0","resultStr":"{\"title\":\"Extracting Symptoms of Complex Conditions From Online Discourse (Subreddit to Symptomatology): Lexicon-Based Approach.\",\"authors\":\"Bushra Hossain, Sarah M Preum, Md Fazle Rabbi, Rifat Ara, Mohammed Eunus Ali\",\"doi\":\"10.2196/70940\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Millions of people affected with complex medical conditions with diverse symptoms often turn to online discourse to share their experiences. While some studies have explored natural language processing methods and medical information extraction tools, these typically focus on generic symptoms in clinical notes and struggle to identify patient-reported, disease-specific, subtle symptoms from online health discourse.</p><p><strong>Objective: </strong>We aimed to extract patient-reported, disease-specific symptoms shared on social media reflecting the lived experiences of thousands of affected individuals and explore the characteristics, prevalence, and occurrence patterns of the symptoms.</p><p><strong>Methods: </strong>We propose a lexicon-based symptom extraction (LSE) method to identify a diverse list of disease-specific, patient-reported symptoms. We initially used a large language model to accelerate the extraction of symptom-related key phrases that formed the lexicon. We evaluated the effectiveness of lexicon extraction against human annotation using a Jaccard index score. We then leveraged BERT-Base, BioBERT, and Phrase-BERT-based embeddings to learn representations of these symptom-related key phrases and cluster similar symptoms using k-means and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Among the different options explored in our experiments, BioBERT-based k-means clustering was found to be the most effective. Finally, we applied symptom normalization to eliminate duplicate and redundant entries in the comprehensive symptom list.</p><p><strong>Results: </strong>In a real-world polycystic ovary syndrome (PCOS) subreddit dataset, we found that LSE significantly outperformed state-of-the-art baselines, achieving at least 41% and 20% higher F<sub>1</sub>-scores (mean 86.10) than automatic medical extraction tools and large language models, respectively. Notably, the comprehensive list of 64 PCOS symptoms generated via LSE ensured extensive coverage of symptoms reported in 7 reputable eHealth forums. Analyzing PCOS symptomatology revealed 28 potentially emerging symptoms and 8 self-reported comorbidities co-occurring with PCOS.</p><p><strong>Conclusions: </strong>The comprehensive patient-reported, disease-specific symptom list can help patients and health practitioners resolve uncertainties surrounding the disease, eliminating the variability of PCOS symptoms prevailing in the community. Analyzing PCOS symptomatology across varied dimensions provides valuable insights for public health research.</p>\",\"PeriodicalId\":56334,\"journal\":{\"name\":\"JMIR Medical Informatics\",\"volume\":\"13 \",\"pages\":\"e70940\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12475878/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Medical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/70940\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/70940","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Extracting Symptoms of Complex Conditions From Online Discourse (Subreddit to Symptomatology): Lexicon-Based Approach.
Background: Millions of people affected with complex medical conditions with diverse symptoms often turn to online discourse to share their experiences. While some studies have explored natural language processing methods and medical information extraction tools, these typically focus on generic symptoms in clinical notes and struggle to identify patient-reported, disease-specific, subtle symptoms from online health discourse.
Objective: We aimed to extract patient-reported, disease-specific symptoms shared on social media reflecting the lived experiences of thousands of affected individuals and explore the characteristics, prevalence, and occurrence patterns of the symptoms.
Methods: We propose a lexicon-based symptom extraction (LSE) method to identify a diverse list of disease-specific, patient-reported symptoms. We initially used a large language model to accelerate the extraction of symptom-related key phrases that formed the lexicon. We evaluated the effectiveness of lexicon extraction against human annotation using a Jaccard index score. We then leveraged BERT-Base, BioBERT, and Phrase-BERT-based embeddings to learn representations of these symptom-related key phrases and cluster similar symptoms using k-means and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Among the different options explored in our experiments, BioBERT-based k-means clustering was found to be the most effective. Finally, we applied symptom normalization to eliminate duplicate and redundant entries in the comprehensive symptom list.
Results: In a real-world polycystic ovary syndrome (PCOS) subreddit dataset, we found that LSE significantly outperformed state-of-the-art baselines, achieving at least 41% and 20% higher F1-scores (mean 86.10) than automatic medical extraction tools and large language models, respectively. Notably, the comprehensive list of 64 PCOS symptoms generated via LSE ensured extensive coverage of symptoms reported in 7 reputable eHealth forums. Analyzing PCOS symptomatology revealed 28 potentially emerging symptoms and 8 self-reported comorbidities co-occurring with PCOS.
Conclusions: The comprehensive patient-reported, disease-specific symptom list can help patients and health practitioners resolve uncertainties surrounding the disease, eliminating the variability of PCOS symptoms prevailing in the community. Analyzing PCOS symptomatology across varied dimensions provides valuable insights for public health research.
期刊介绍:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.