Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

Q3 Environmental Science
Asahi Ushio, Leonardo Neves, Vítor Silva, Francesco Barbieri, José Camacho-Collados
DOI: 10.48550/arXiv.2210.03797 (https://doi.org/10.48550/arXiv.2210.03797)
Published: 2022-10-07
Citations: 7

Abstract

Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has mainly been tested on well-formatted documents such as news, Wikipedia, or scientific articles. Social media presents a different landscape: its noisy and dynamic nature adds another layer of complexity. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and analyze language model performance on the task, with particular attention to the impact of different time periods. Our analysis focuses on three temporal aspects: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative when recently labeled data is lacking. TweetNER7 is released publicly (https://huggingface.co/datasets/tner/tweetner7) along with the models fine-tuned on it (NER models have been integrated into TweetNLP and can be found at https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper).
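As a rough illustration of how short-term degradation might be measured, the sketch below computes span-level micro F1 separately for test tweets from two periods. It is not the authors' evaluation code: the toy tag sequences, the period keys, and the label names `person` and `location` are assumptions made for this example, and BIO tagging is assumed as the annotation scheme.

```python
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged token sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Span-level micro F1: a span counts only if boundaries and type match."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: (gold, predicted) tag sequences grouped by test period.
by_period: Dict[str, Tuple[List[List[str]], List[List[str]]]] = {
    "2020": (
        [["B-person", "I-person", "O", "B-location"]],
        [["B-person", "I-person", "O", "B-location"]],
    ),
    "2021": (
        [["B-person", "O", "B-location", "I-location"]],
        [["B-person", "O", "B-location", "O"]],  # location span cut short
    ),
}

scores = {period: micro_f1(gold, pred) for period, (gold, pred) in by_period.items()}
for period, score in scores.items():
    print(f"{period}: span-level micro F1 = {score:.2f}")
```

Comparing the per-period scores of a model trained only on the earlier period gives a simple view of how performance changes on more recent tweets; a real study would use the released dataset splits rather than toy sequences.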
Source journal: AACL Bioflux (Environmental Science: Management, Monitoring, Policy and Law)
CiteScore: 1.40
Self-citation rate: 0.00%
Articles published: 0