Extracting medical entities from social media

S. Šćepanović, E. Martin-Lopez, D. Quercia, Khan Baykaner
{"title":"Extracting medical entities from social media","authors":"S. Šćepanović, E. Martin-Lopez, D. Quercia, Khan Baykaner","doi":"10.1145/3368555.3384467","DOIUrl":null,"url":null,"abstract":"Accurately extracting medical entities from social media is challenging because people use informal language with different expressions for the same concept, and they also make spelling mistakes. Previous work either focused on specific diseases (e.g., depression) or drugs (e.g., opioids) or, if working with a wide-set of medical entities, only tackled individual and small-scale benchmark datasets (e.g., AskaPatient). In this work, we first demonstrated how to accurately extract a wide variety of medical entities such as symptoms, diseases, and drug names on three benchmark datasets from varied social media sources, and then also validated this approach on a large-scale Reddit dataset. We first implemented a deep-learning method using contextual embeddings that upon two existing benchmark datasets, one containing annotated AskaPatient posts (CADEC) and the other containing annotated tweets (Micromed), outperformed existing state-of-the-art methods. Second, we created an additional benchmark dataset by annotating medical entities in 2K Reddit posts (made publicly available under the name of MedRed) and showed that our method also performs well on this new dataset. Finally, to demonstrate that our method accurately extracts a wide variety of medical entities on a large scale, we applied the model pre-trained on MedRed to half a million Reddit posts. The posts came from disease-specific subreddits so we could categorise them into 18 diseases based on the subreddit. We then trained a machine-learning classifier to predict the post's category solely from the extracted medical entities. The average F1 score across categories was .87. 
These results open up new cost-effective opportunities for modeling, tracking and even predicting health behavior at scale.","PeriodicalId":87342,"journal":{"name":"Proceedings of the ACM Conference on Health, Inference, and Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Conference on Health, Inference, and Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3368555.3384467","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26

Abstract

Accurately extracting medical entities from social media is challenging because people use informal language, different expressions for the same concept, and frequent misspellings. Previous work either focused on specific diseases (e.g., depression) or drugs (e.g., opioids) or, when covering a wide set of medical entities, tackled only individual, small-scale benchmark datasets (e.g., AskaPatient). In this work, we first demonstrated how to accurately extract a wide variety of medical entities, such as symptoms, diseases, and drug names, on three benchmark datasets from varied social media sources, and then validated this approach on a large-scale Reddit dataset. We first implemented a deep-learning method using contextual embeddings that outperformed existing state-of-the-art methods on two existing benchmark datasets: one containing annotated AskaPatient posts (CADEC) and the other containing annotated tweets (Micromed). Second, we created an additional benchmark dataset by annotating medical entities in 2K Reddit posts (made publicly available under the name MedRed) and showed that our method also performs well on this new dataset. Finally, to demonstrate that our method accurately extracts a wide variety of medical entities at scale, we applied the model pre-trained on MedRed to half a million Reddit posts. The posts came from disease-specific subreddits, so we could categorise them into 18 diseases based on the subreddit. We then trained a machine-learning classifier to predict each post's category solely from the extracted medical entities; the average F1 score across categories was 0.87. These results open up new, cost-effective opportunities for modeling, tracking, and even predicting health behavior at scale.
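The two-stage pipeline the abstract describes (extract medical entities, then predict a post's disease category from those entities alone) can be illustrated with a deliberately tiny sketch. This is not the paper's model: the paper uses a deep-learning extractor with contextual embeddings, whereas the `LEXICON` and `CATEGORY_CUES` dictionaries below are hypothetical stand-ins that only show the shape of the task, including the normalization of informal spellings and brand names that makes social-media text hard.

```python
# Toy illustration (not the paper's method): dictionary-based entity
# extraction with spelling normalization, then a naive category vote
# using only the extracted entities.
from collections import Counter

# Hypothetical lexicon mapping informal variants to canonical entities.
LEXICON = {
    "headache": "headache",
    "migrane": "migraine",      # common misspelling
    "migraine": "migraine",
    "ibuprofen": "ibuprofen",
    "advil": "ibuprofen",       # brand name normalized to the drug
    "insulin": "insulin",
    "thirsty": "polydipsia",    # lay term normalized to the symptom
}

# Hypothetical entity-to-category cues, standing in for a trained classifier.
CATEGORY_CUES = {
    "headache": "Headache",
    "migraine": "Headache",
    "ibuprofen": "Headache",
    "insulin": "Diabetes",
    "polydipsia": "Diabetes",
}

def extract_entities(post: str) -> list[str]:
    """Return canonical entities whose variants occur in the post."""
    text = post.lower()
    return [canon for variant, canon in LEXICON.items() if variant in text]

def predict_category(entities: list[str]):
    """Vote on a disease category using only the extracted entities."""
    votes = Counter(CATEGORY_CUES[e] for e in entities if e in CATEGORY_CUES)
    return votes.most_common(1)[0][0] if votes else None

ents = extract_entities("Awful migrane again, advil barely helps")
print(ents)                    # ['migraine', 'ibuprofen']
print(predict_category(ents))  # Headache
```

In the paper's actual setup, the extractor is a neural sequence labeller and the classifier is trained on half a million posts across 18 subreddit-derived categories; the sketch only mirrors the interface between the two stages.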