Understanding Cancer Survivorship Care Needs Using Amazon Reviews: Content Analysis, Algorithm Development, and Validation Study.

Impact factor: 2.7; JCR Q2 (Oncology)

JMIR Cancer. Pub Date: 2025-09-23. DOI: 10.2196/71102

Liwei Wang, Qiuhao Lu, Rui Li, Taylor B Harrison, Heling Jia, Ming Huang, Heidi Dowst, Rui Zhang, Hoda Badr, Jungwei W Fan, Hongfang Liu

JMIR Cancer 2025;11:e71102. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456872/pdf/

Citations: 0

Abstract

Background: Complementary therapies are being increasingly used by cancer survivors. As a channel for customers to share their feelings, outcomes, and perceived knowledge about the products purchased from e-commerce platforms, Amazon consumer reviews are a valuable real-world data source for understanding cancer survivorship care needs.

Objective: In this study, we aimed to highlight the potential of using Amazon consumer reviews as a novel source for identifying cancer survivorship care needs, particularly those related to symptom self-management. Specifically, we present a publicly available, manually annotated corpus derived from Amazon reviews of health-related products and develop baseline natural language processing models using deep learning and a large language model (LLM) to demonstrate the usability of this dataset.

Methods: We preprocessed the Amazon review dataset to identify sentences with cancer mentions through a rule-based method and conducted content analysis, including text feature analysis, sentiment analysis, topic modeling, and cancer type and symptom association analysis. We then designed an annotation guideline targeting survivorship-relevant constructs. A total of 159 reviews were annotated, and baseline models based on deep learning and a large language model (LLM) were developed for named entity recognition and text classification tasks.
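
The rule-based identification of cancer mentions described in the Methods can be pictured with a minimal Python sketch. The abstract does not give the study's actual rules, so the keyword pattern, the negation pattern, and the function names `split_sentences` and `positive_cancer_sentences` are all illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical keyword list: the study's real lexicon is not given in the abstract.
CANCER_TERMS = re.compile(
    r"\b(cancer|tumor|tumour|chemo(?:therapy)?|leukemia|lymphoma|melanoma)\b",
    re.IGNORECASE,
)

# Hypothetical negation rule: a negation cue shortly before a cancer term
# within the same sentence marks the mention as not "positive".
NEGATION = re.compile(
    r"\b(no|not|never|without|don't|doesn't|didn't)\b[^.!?]{0,30}?\b(cancer|tumor|tumour)\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def positive_cancer_sentences(review):
    """Return sentences that mention cancer and are not caught by the negation rule."""
    return [
        s
        for s in split_sentences(review)
        if CANCER_TERMS.search(s) and not NEGATION.search(s)
    ]


if __name__ == "__main__":
    # Made-up review text for illustration only.
    demo = (
        "I bought this tea during chemotherapy. It tastes great. "
        "I do not have cancer, just wanted antioxidants."
    )
    for sentence in positive_cancer_sentences(demo):
        print(sentence)
```

A production pipeline would likely use a proper sentence tokenizer and a richer lexicon; the sketch only shows the shape of a keyword-plus-negation filter.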

Results: A total of 4703 sentences containing positive cancer mentions were identified, drawn from 3349 reviews associated with 2589 distinct products. The topics identified through topic modeling revealed meaningful insights into cancer symptom management and survivorship experiences. Examples included discussions of green tea use during chemotherapy, cancer prevention strategies, and product recommendations for breast cancer. The top 15 symptoms in reviews were also identified, with pain being the most frequent, followed by inflammation and fatigue. The annotation labels were designed to capture cancer types, indicated symptoms, and symptom management outcomes. The resulting annotation corpus contains 2067 labels from 159 Amazon reviews. It is publicly accessible, together with the annotation guideline, through the Open Health Natural Language Processing (OHNLP) GitHub. Our baseline model, Bert-base-cased, achieved the highest weighted average F1-score for named entity recognition (66.92%), and the LLM gpt4-1106-preview-chat achieved the highest F1-scores for the text classification tasks: 66.67% for "Harmful outcome," 88.46% for "Favorable outcome," and 73.33% for "Ambiguous outcome."
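
For readers unfamiliar with the weighted average F1-score reported above, the following Python sketch shows how such a score is commonly computed: per-label F1 values are averaged with weights proportional to each label's number of gold instances. It works at the token level on toy labels as a simplifying assumption; the abstract does not state the study's exact evaluation protocol, and the function names and label strings here are hypothetical.

```python
from collections import Counter


def per_label_f1(gold, pred, label):
    """F1 for one label, computed from paired gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(gold, pred):
    """Average per-label F1, weighting each label by its gold support."""
    if not gold:
        return 0.0
    support = Counter(gold)
    total = len(gold)
    return sum(per_label_f1(gold, pred, lab) * n for lab, n in support.items()) / total


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    gold = ["Symptom", "Symptom", "CancerType", "CancerType"]
    pred = ["Symptom", "CancerType", "CancerType", "CancerType"]
    print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means frequent labels dominate the average, which is worth keeping in mind when label counts in a corpus are imbalanced.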

Conclusions: Our results demonstrate the potential of Amazon consumer reviews as a novel data source for identifying persistent symptoms, concerns, and self-management strategies among cancer survivors. This corpus, along with the baseline natural language processing models developed for named entity recognition and text classification, lays the groundwork for future methodological advancements in cancer survivorship research. Importantly, insights from this study could be evaluated against established clinical guidelines for symptom management in cancer survivorship care. By revealing the feasibility of using consumer-generated data for mining survivorship-related experiences, this study offers a promising foundation for future research and argumentation analysis aimed at improving long-term outcomes and support for cancer survivors.
