Subhabrata Dutta, Rudra Dhar, Prantik Guha, Arpan Murmu, Dipankar Das
Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, published 2022-12-09. DOI: 10.1145/3574318.3574348 (https://doi.org/10.1145/3574318.3574348)
A Multilingual Dataset for Identification of Factual Claims in Indian Twitter
The need for automated fact-checking grows more pressing every day as misinformation spreads through the ever-increasing stream of online content. We focus on fine-grained labelling of factual information in tweets, with the aim of supporting fact-checking systems that can provide better justifications. In this paper, we present a token-level annotation of factual claims in tweets from Indian Twitter. To handle the multilingual variety of the Indian diaspora, we include tweets in English, Bengali, Hindi, and their code-mixed variants. To the best of our knowledge, this dataset is the first of its kind in terms of both its labelling scheme and its data sources.