A biomedically oriented automatically annotated Twitter COVID-19 dataset.

Q2 Agricultural and Biological Sciences
Genomics and Informatics. Pub Date: 2021-09-01. Epub Date: 2021-09-30. DOI: 10.5808/gi.21011
Luis Alberto Robles Hernandez, Tiffany J Callahan, Juan M Banda
Volume 19(3), e21. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8510871/pdf/
Citations: 1

Abstract

A biomedically oriented automatically annotated Twitter COVID-19 dataset.

The use of social media data, such as Twitter, for biomedical research has gradually increased over the years. With the coronavirus disease 2019 (COVID-19) pandemic, researchers have turned to non-traditional sources of clinical data to characterize the disease in near-real time and to study the societal implications of interventions as well as the sequelae that recovered COVID-19 cases present. However, manually curated social media datasets are difficult to come by because manual annotation is expensive and identifying the correct texts takes considerable effort. When datasets are available, they are usually very small, and their annotations do not generalize well over time or to larger sets of documents. As part of the 2021 Biomedical Linked Annotation Hackathon, we release our dataset of over 120 million automatically annotated tweets for biomedical research purposes. Incorporating best practices, we identify tweets with potentially high clinical relevance. We evaluated our work by comparing several spaCy-based annotation frameworks against a manually annotated gold-standard dataset. After selecting the best method for automatic annotation, we annotated the 120 million tweets and released them publicly for downstream use within the biomedical domain.
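
The abstract does not say which metrics were used to compare the annotation frameworks against the gold standard. As a minimal sketch of one common way such a comparison could be done, the Python below scores predicted spans against gold spans using exact-match precision, recall, and F1. The `Annotation` fields, the labels, and the toy data are all assumptions made for illustration; they are not the dataset's schema or the authors' evaluation code.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotated span; field names are hypothetical, not the dataset's schema."""
    tweet_id: str
    start: int
    end: int
    label: str


def score(gold: set[Annotation], predicted: set[Annotation]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 of predicted spans against gold spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
gold = {
    Annotation("t1", 0, 5, "SYMPTOM"),
    Annotation("t1", 10, 17, "DRUG"),
    Annotation("t2", 3, 8, "SYMPTOM"),
}
predicted = {
    Annotation("t1", 0, 5, "SYMPTOM"),
    Annotation("t2", 3, 8, "DRUG"),  # right span, wrong label: counts as a miss
}

result = score(gold, predicted)
print(result)
```

Exact matching is the strictest choice: a span with the right boundaries but the wrong label earns no credit. A relaxed variant that accepts overlapping spans would be a reasonable alternative when annotators disagree on boundaries.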

Source journal: Genomics and Informatics (Agricultural and Biological Sciences: Ecology, Evolution, Behavior and Systematics)
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 0
Review time: 12 weeks