Zasim Azhar Siddiqui, Maryam Pathan, Sabina Nduaguba, Traci LeMasters, Virginia G Scott, Usha Sambamoorthi, Jay S Patel
Leveraging social media data to study disease and treatment characteristics of Hodgkin's lymphoma using natural language processing methods
PLOS Digital Health, volume 4, issue 3, e0000765. Published 2025-03-19 (eCollection). DOI: 10.1371/journal.pdig.0000765
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11922232/pdf/
Abstract
Background: The use of social media platforms in health research is increasing, yet their application in studying rare diseases is limited. Hodgkin's lymphoma (HL) is a rare malignancy with a high incidence in young adults. This study evaluates the feasibility of using social media data to study the disease and treatment characteristics of HL.
Methods: We used the X (formerly Twitter) API v2 developer portal to download posts (formerly tweets) published from January 2010 to October 2022. We developed annotation guidelines from the literature and manually reviewed a limited set of posts to identify the classes and attributes (characteristics) of HL discussed on X and to create a gold standard dataset. This dataset was then used to train, test, and validate a named entity recognition (NER) natural language processing (NLP) application.
Results: After data preparation, the dataset comprised 80,811 posts: 500 for annotation guideline development, 2,000 for NLP application development, and the remaining 78,311 for deploying the application. We identified nine classes related to HL, including HL classification, etiopathology, stages and progression, and treatment. Treatment and stages and progression were the most frequently discussed classes: 20,013 posts (25.56% of the 78,311 deployment posts) mentioned HL treatments and 17,177 (21.93%) mentioned HL stages and progression. The model achieved 86% accuracy and an 87% F1 score. The etiopathology class performed particularly well, with 93% accuracy and a 95% F1 score.
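The counts in the Results paragraph can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported figures, not part of the authors' pipeline; the constant and function names are illustrative, and the choice of the 78,311 deployment posts as the denominator is an assumption that happens to reproduce the reported percentages.

```python
# Post counts as reported in the Results paragraph.
TOTAL_POSTS = 80_811
GUIDELINE_POSTS = 500
DEVELOPMENT_POSTS = 2_000
DEPLOYMENT_POSTS = 78_311

TREATMENT_POSTS = 20_013
STAGE_POSTS = 17_177


def share(count: int, denominator: int) -> float:
    """Return count as a percentage of denominator, rounded to two decimals."""
    return round(100 * count / denominator, 2)


# The three subsets should add up to the full collection.
partition_ok = GUIDELINE_POSTS + DEVELOPMENT_POSTS + DEPLOYMENT_POSTS == TOTAL_POSTS

# Assumption: percentages are taken over the deployment subset.
treatment_share = share(TREATMENT_POSTS, DEPLOYMENT_POSTS)
stage_share = share(STAGE_POSTS, DEPLOYMENT_POSTS)

if __name__ == "__main__":
    print(partition_ok, treatment_share, stage_share)
```

Using the full 80,811 posts as the denominator instead would give a lower treatment share than the one reported, which is why the deployment subset is assumed here.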
Discussion: The NLP application was effective at extracting and characterizing HL-related information from social media posts, as evidenced by the high F1 score. However, the data did not allow us to distinguish reliably between patients, providers, and caregivers, or to establish temporal relationships between classes and attributes. Further research is necessary to bridge these gaps.
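For readers less familiar with the metric, the F1 score cited above is the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition; the function name and the example counts are hypothetical and are not taken from the study, whose confusion counts are not reported in this abstract.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def precision_recall_f1(tp: int, fp: int, fn: int) -> Scores:
    """Compute precision, recall, and F1 from entity-level counts.

    tp: predicted entities that match the gold standard
    fp: predicted entities with no gold-standard match
    fn: gold-standard entities the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return Scores(precision, recall, f1)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(precision_recall_f1(tp=90, fp=10, fn=15))
```

Because F1 ignores true negatives, it is usually more informative than accuracy for NER, where most tokens belong to no entity at all.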
Conclusion: Our study demonstrated the potential of social media as a valuable preliminary research source for understanding the characteristics of rare diseases such as Hodgkin's lymphoma.