DICE:意大利犯罪事件新闻的数据集

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2023-07-18 DOI:10.1145/3539618.3591904

Giovanni Bonisoli, Maria Pia di Buono, Laura Po, Federica Rollo

{"title":"DICE:意大利犯罪事件新闻的数据集","authors":"Giovanni Bonisoli, Maria Pia di Buono, Laura Po, Federica Rollo","doi":"10.1145/3539618.3591904","DOIUrl":null,"url":null,"abstract":"Extracting events from news stories as the aim of several Natural Language Processing (NLP) applications (e.g., question answering, news recommendation, news summarization) is not a trivial task, due to the complexity of natural language and the fact that news reporting is characterized by journalistic style and norms. Those aspects entail scattering an event description over several sentences within one document (or more documents), applying a mechanism of gradual specification of event-related information. This implies a widespread use of co-reference relations among the textual elements, conveying non-linear temporal information. In addition to this, despite the achievement of state-of-the-art results in several tasks, high-quality training datasets for non-English languages are rarely available. This paper presents our preliminary study to develop an annotated Dataset for Italian Crime Event news (DICE). The contribution of the paper are: (1) the creation of a corpus of 10,395 crime news; (2) the annotation schema; (3) a dataset of 10,395 news with automatic annotations; (4) a preliminary manual annotation using the proposed schema of 1000 documents. The first tests on DICE have compared the performance of a manual annotator with that of single-span and multi-span question answering models and shown there is still a gap in the models, especially when dealing with more complex annotation tasks and limited training data. This underscores the importance of investing in the creation of high-quality annotated datasets like DICE, which can provide a solid foundation for training and testing a wide range of NLP models.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DICE: a Dataset of Italian Crime Event news\",\"authors\":\"Giovanni Bonisoli, Maria Pia di Buono, Laura Po, Federica Rollo\",\"doi\":\"10.1145/3539618.3591904\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting events from news stories as the aim of several Natural Language Processing (NLP) applications (e.g., question answering, news recommendation, news summarization) is not a trivial task, due to the complexity of natural language and the fact that news reporting is characterized by journalistic style and norms. Those aspects entail scattering an event description over several sentences within one document (or more documents), applying a mechanism of gradual specification of event-related information. This implies a widespread use of co-reference relations among the textual elements, conveying non-linear temporal information. In addition to this, despite the achievement of state-of-the-art results in several tasks, high-quality training datasets for non-English languages are rarely available. This paper presents our preliminary study to develop an annotated Dataset for Italian Crime Event news (DICE). The contribution of the paper are: (1) the creation of a corpus of 10,395 crime news; (2) the annotation schema; (3) a dataset of 10,395 news with automatic annotations; (4) a preliminary manual annotation using the proposed schema of 1000 documents. The first tests on DICE have compared the performance of a manual annotator with that of single-span and multi-span question answering models and shown there is still a gap in the models, especially when dealing with more complex annotation tasks and limited training data. This underscores the importance of investing in the creation of high-quality annotated datasets like DICE, which can provide a solid foundation for training and testing a wide range of NLP models.\",\"PeriodicalId\":425056,\"journal\":{\"name\":\"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539618.3591904\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539618.3591904","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

由于自然语言的复杂性以及新闻报道具有新闻风格和规范的事实，从新闻故事中提取事件作为几个自然语言处理(NLP)应用程序(例如，问答，新闻推荐，新闻摘要)的目标并不是一项微不足道的任务。这些方面需要将事件描述分散到一个文档(或多个文档)中的几个句子中，应用逐步规范事件相关信息的机制。这意味着在文本元素之间广泛使用共指关系，传递非线性的时间信息。除此之外，尽管在一些任务中取得了最先进的结果，但非英语语言的高质量训练数据集很少可用。本文介绍了我们为意大利犯罪事件新闻(DICE)开发注释数据集的初步研究。本文的贡献有:(1)建立了一个包含10395条犯罪新闻的语料库;(2)标注模式;(3)带有自动注释的10,395条新闻数据集;(4)使用1000个文档的拟议模式的初步手工注释。DICE上的第一次测试将手动注释器的性能与单跨和多跨问答模型的性能进行了比较，结果表明模型仍然存在差距，特别是在处理更复杂的注释任务和有限的训练数据时。这强调了投资创建高质量注释数据集(如DICE)的重要性，它可以为训练和测试广泛的NLP模型提供坚实的基础。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

DICE: a Dataset of Italian Crime Event news

Extracting events from news stories as the aim of several Natural Language Processing (NLP) applications (e.g., question answering, news recommendation, news summarization) is not a trivial task, due to the complexity of natural language and the fact that news reporting is characterized by journalistic style and norms. Those aspects entail scattering an event description over several sentences within one document (or more documents), applying a mechanism of gradual specification of event-related information. This implies a widespread use of co-reference relations among the textual elements, conveying non-linear temporal information. In addition to this, despite the achievement of state-of-the-art results in several tasks, high-quality training datasets for non-English languages are rarely available. This paper presents our preliminary study to develop an annotated Dataset for Italian Crime Event news (DICE). The contribution of the paper are: (1) the creation of a corpus of 10,395 crime news; (2) the annotation schema; (3) a dataset of 10,395 news with automatic annotations; (4) a preliminary manual annotation using the proposed schema of 1000 documents. The first tests on DICE have compared the performance of a manual annotator with that of single-span and multi-span question answering models and shown there is still a gap in the models, especially when dealing with more complex annotation tasks and limited training data. This underscores the importance of investing in the creation of high-quality annotated datasets like DICE, which can provide a solid foundation for training and testing a wide range of NLP models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

自引率

0.00%

发文量