Twitter中的命名实体识别:短期时间变化的数据集和分析

Q3 Environmental Science

AACL Bioflux Pub Date : 2022-10-07 DOI:10.48550/arXiv.2210.03797

Asahi Ushio, Leonardo Neves, V'itor Silva, Francesco Barbieri, José Camacho-Collados

{"title":"Twitter中的命名实体识别:短期时间变化的数据集和分析","authors":"Asahi Ushio, Leonardo Neves, V'itor Silva, Francesco Barbieri, José Camacho-Collados","doi":"10.48550/arXiv.2210.03797","DOIUrl":null,"url":null,"abstract":"Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested in well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different, in which it adds another layer of complexity due to its noisy and dynamic nature. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and perform an analysis on the language model performance on the task, especially analyzing the impact of different time periods. In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data. TweetNER7 is released publicly (https://huggingface.co/datasets/tner/tweetner7) along with the models fine-tuned on it (NER models have been integrated into TweetNLP and can be found at https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper).","PeriodicalId":39298,"journal":{"name":"AACL Bioflux","volume":"2 1","pages":"309-319"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts\",\"authors\":\"Asahi Ushio, Leonardo Neves, V'itor Silva, Francesco Barbieri, José Camacho-Collados\",\"doi\":\"10.48550/arXiv.2210.03797\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested in well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different, in which it adds another layer of complexity due to its noisy and dynamic nature. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and perform an analysis on the language model performance on the task, especially analyzing the impact of different time periods. In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data. TweetNER7 is released publicly (https://huggingface.co/datasets/tner/tweetner7) along with the models fine-tuned on it (NER models have been integrated into TweetNLP and can be found at https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper).\",\"PeriodicalId\":39298,\"journal\":{\"name\":\"AACL Bioflux\",\"volume\":\"2 1\",\"pages\":\"309-319\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AACL Bioflux\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.03797\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Environmental Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AACL Bioflux","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.03797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Environmental Science","Score":null,"Total":0}

引用次数: 7

摘要

语言模型预训练的最新进展导致了命名实体识别(NER)的重要改进。尽管如此，这种进步主要是在格式良好的文档中进行测试的，比如新闻、维基百科或科学文章。在社交媒体中，情况是不同的，由于其嘈杂和动态的性质，它增加了另一层复杂性。在本文中，我们关注最大的社交媒体平台之一Twitter中的NER，并构建了一个新的NER数据集TweetNER7，该数据集包含7种实体类型，标注了2019年9月至2021年8月的11,382条推文。该数据集是通过仔细分布推文并以代表性趋势为基础构建的。与数据集一起，我们提供了一组语言模型基线，并对语言模型在任务上的性能进行了分析，特别是分析了不同时间段的影响。在我们的分析中，我们特别关注了三个重要的时间方面:随着时间的推移，NER模型的短期退化，在不同时期微调语言模型的策略，以及作为缺乏最近标记数据的替代方法的自标记。TweetNER7是公开发布的(https://huggingface.co/datasets/tner/tweetner7)，同时还发布了经过微调的模型(NER模型已经集成到TweetNLP中，可以在https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper上找到)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested in well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different, in which it adds another layer of complexity due to its noisy and dynamic nature. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and perform an analysis on the language model performance on the task, especially analyzing the impact of different time periods. In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data. TweetNER7 is released publicly (https://huggingface.co/datasets/tner/tweetner7) along with the models fine-tuned on it (NER models have been integrated into TweetNLP and can be found at https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

AACL Bioflux Environmental Science-Management, Monitoring, Policy and Law

CiteScore

1.40

自引率

0.00%

发文量