From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning

International Conference on Computational Processing of the Portuguese Language, pp. 492-502 | Pub Date: 2024-01-24 | DOI: 10.48550/arXiv.2401.13229

Alexandre Alcoforado, Thomas Palmeira Ferraz, Lucas Hideki Okamura, Israel Campos Fama, Arnold Moya Lavado, Bárbara Dias Bueno, Bruno Veloso, Anna Helena Reali Costa


Citations: 0

Abstract


From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning

A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. One option is to use crowdsourcing platforms for data annotation. However, crowdsourcing introduces issues related to annotators' experience, consistency, and biases. An alternative is to use zero-shot methods, which in turn have limitations compared to their few-shot or fully supervised counterparts. Recent advances driven by large language models show potential, but these models struggle to adapt to specialized domains with severely limited data. The most common approach therefore has a human annotate a randomly sampled set of data points to build an initial dataset. But randomly sampling the data to be annotated is often inefficient, as it ignores the characteristics of the data and the specific needs of the model. The situation worsens with imbalanced datasets, as random sampling is heavily biased towards the majority classes, leading to an excessive amount of annotated data. To address these issues, this paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning. The proposal minimizes the quantity and maximizes the diversity of data selected for human annotation, while improving model performance.
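The abstract does not specify how the selection architecture works, so the following is only an illustrative sketch of what diversity-based selection could look like, not the authors' method. It assumes each datapoint already has an embedding (here, hypothetical 2-D toy points) and uses greedy farthest-point selection, a standard k-center heuristic swapped in for illustration. The function names, the toy data, and the 95/5 class split are all assumptions made for this example.

```python
import math
import random


def farthest_point_selection(points, k, start=0):
    """Greedy k-center heuristic (an illustrative stand-in, not the paper's
    method): repeatedly pick the point farthest from all selected points."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from each point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


def make_toy_data(seed=0):
    """Hypothetical imbalanced 'embeddings': 95 majority points near (0, 0)
    and 5 minority points near (10, 10)."""
    rng = random.Random(seed)
    majority = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(95)]
    minority = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(5)]
    return majority + minority, [0] * 95 + [1] * 5


points, labels = make_toy_data()

# Select 4 points to send for annotation, once by diversity and once at random.
diverse_idx = farthest_point_selection(points, k=4)
random_idx = random.Random(1).sample(range(len(points)), 4)

print("diverse selection labels:", [labels[i] for i in diverse_idx])
print("random selection labels: ", [labels[i] for i in random_idx])
```

On well-separated toy clusters like these, the greedy step moves to the minority cluster as soon as it looks for the point farthest from the starting one, whereas a uniform random draw of four points from a 95/5 split will usually contain only majority examples. Real text embeddings are rarely this cleanly separated, so this shows the intuition only and says nothing about the paper's reported results.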

