Improving Suicidal Ideation Detection in Social Media Posts: Topic Modeling and Synthetic Data Augmentation Approach.

IF 2.0 · Q3 · Health Care Sciences & Services
Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman
JMIR Formative Research, vol. 9, e63272. Published 2025-06-11 (Journal Article). DOI: 10.2196/63272. Citations: 0.

Abstract


Background: In an era dominated by social media conversations, it is pivotal to comprehend how suicide, a critical public health issue, is discussed online. Discussions around suicide often highlight a range of topics, such as mental health challenges, relationship conflicts, and financial distress. However, certain sensitive issues, like those affecting marginalized communities, may be underrepresented in these discussions. This underrepresentation is a critical issue to investigate because it is mainly associated with underserved demographics (eg, racial and sexual minorities), and models trained on such data will underperform on such topics.

Objective: The objective of this study was to bridge the gap between established psychology literature on suicidal ideation and social media data by analyzing the topics discussed online. Additionally, by generating synthetic data, we aimed to ensure that datasets used for training classifiers have high coverage of critical risk factors to address and adequately represent underrepresented or misrepresented topics. This approach enhances both the quality and diversity of the data used for detecting suicidal ideation in social media conversations.

Methods: We first performed unsupervised topic modeling to analyze suicide-related data from social media and identify the most frequently discussed topics within the dataset. Next, we conducted a scoping review of established psychology literature to identify core risk factors associated with suicide. Using these identified risk factors, we then performed guided topic modeling on the social media dataset to evaluate the presence and coverage of these factors. After identifying topic biases and gaps in the dataset, we explored the use of generative large language models to create topic-diverse synthetic data for augmentation. Finally, the synthetic dataset was evaluated for readability, complexity, topic diversity, and utility in training machine learning classifiers compared to real-world datasets.

Results: Our study found that several critical suicide-related topics, particularly those concerning marginalized communities and racism, were significantly underrepresented in the real-world social media data. The introduction of synthetic data, generated using GPT-3.5 Turbo, and the augmented dataset improved topic diversity. The synthetic dataset showed levels of readability and complexity comparable to those of real data. Furthermore, the incorporation of the augmented dataset in fine-tuning classifiers enhanced their ability to detect suicidal ideation, with the F1-score improving from 0.87 to 0.91 on the University of Maryland Reddit Suicidality Dataset test subset and from 0.70 to 0.90 on the synthetic test subset, demonstrating its utility in improving model accuracy for suicidal narrative detection.
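The F1-score comparison reported above can be reproduced in form (not in values) with scikit-learn; the labels below are toy values chosen for illustration, not the paper's predictions.

```python
# Illustrative F1 comparison between a classifier trained on real data
# only and one fine-tuned on the augmented dataset. Toy labels only.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred_base = [1, 0, 0, 1, 0, 1, 1, 1]  # before augmentation
y_pred_aug = [1, 0, 1, 1, 0, 1, 0, 1]   # after augmentation

f1_base = f1_score(y_true, y_pred_base)
f1_aug = f1_score(y_true, y_pred_aug)
print(round(f1_base, 2), round(f1_aug, 2))
```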

Conclusions: Our results demonstrate that synthetic datasets can be useful to obtain an enriched understanding of online suicide discussions as well as build more accurate machine learning models for suicidal narrative detection on social media.

Source journal: JMIR Formative Research — Medicine (miscellaneous). CiteScore: 2.70; self-citation rate: 9.10%; articles per year: 579; review time: 12 weeks.