International Conference on Computational Processing of the Portuguese Language: Latest Publications

GlórIA: A Generative and Open Large Language Model for Portuguese

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2024-02-20 | DOI: 10.48550/arXiv.2402.12969

Ricardo Lopes, João Magalhães, David Semedo

Abstract: Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.

Pages: 441-453

Citations: 0

From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2024-01-24 | DOI: 10.48550/arXiv.2401.13229

Alexandre Alcoforado, Thomas Palmeira Ferraz, Lucas Hideki Okamura, Israel Campos Fama, Arnold Moya Lavado, Bárbara Dias Bueno, Bruno Veloso, Anna Helena Reali Costa

Abstract: A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. An option is the use of crowdsourcing platforms for data annotation. However, crowdsourcing introduces issues related to the annotator's experience, consistency, and biases. An alternative is to use zero-shot methods, which in turn have limitations compared to their few-shot or fully supervised counterparts. Recent advancements driven by large language models show potential, but struggle to adapt to specialized domains with severely limited data. The most common approaches therefore involve the human itself randomly annotating a set of datapoints to build initial datasets. But randomly sampling data to be annotated is often inefficient as it ignores the characteristics of the data and the specific needs of the model. The situation worsens when working with imbalanced datasets, as random sampling tends to heavily bias towards the majority classes, leading to excessive annotated data. To address these issues, this paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning. Our proposal minimizes the quantity and maximizes diversity of data selected for human annotation, while improving model performance.

Pages: 492-502

Citations: 0

Tracking environmental policy changes in the Brazilian Federal Official Gazette

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2022-02-11 | DOI: 10.1007/978-3-030-98305-5_24

Flávio Nakasato Cação, A. H. R. Costa, Natalie Unterstell, Liuca Yonaha, T. Stec, Fábio Ishisaki

Citations: 1

MINT - Mainstream and Independent News Text Corpus

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2021-08-13 | DOI: 10.1007/978-3-030-98305-5_3

Danielle Caled, Paula Carvalho, Mário J. Silva

Citations: 0

Brazilian Portuguese Speech Recognition Using Wav2vec 2.0

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2021-07-23 | DOI: 10.1007/978-3-030-98305-5_31

L. Gris, Edresson Casanova, F. S. Oliveira, A. S. Soares, A. Júnior

Citations: 11

Inferring the Source of Official Texts: Can SVM Beat ULMFiT?

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2020-03-02 | DOI: 10.1007/978-3-030-41505-1_8

Pedro Henrique Luz de Araujo, T. D. Campos, Marcelo Magalhães Silva de Sousa

Citations: 5

Screening of Email Box in Portuguese with SVM at Banco do Brasil

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2020-03-02 | DOI: 10.1007/978-3-030-41505-1_15

Rafael Faria de Azevedo, Rafael Pereira de Araujo, R. G. Araújo, Régis Moreira Bittencourt, Rafael Ferreira Alves da Silva, Gabriel de Melo Vaz Nogueira, Thiago Marques Franca, J. Nunes, Klailton Ralff da Silva, Emmanuelle Regiane Cunha de Oliveira

Citations: 1

A Dataset for the Evaluation of Lexical Simplification in Portuguese for Children

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2020-03-02 | DOI: 10.1007/978-3-030-41505-1_6

N. Hartmann, Gustavo Paetzold, S. Aluísio

Citations: 3

Exploiting Siamese Neural Networks on Short Text Similarity Tasks for Multiple Domains and Languages

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2020-03-02 | DOI: 10.1007/978-3-030-41505-1_34

João Vitor Andrioli de Souza, Lucas E. S. Oliveira, Yohan Bonescki Gumiel, D. Carvalho, C. Moro

Citations: 8

A Portuguese Dataset for Evaluation of Semantic Question Answering

International Conference on Computational Processing of the Portuguese Language | Pub Date: 2020-03-02 | DOI: 10.1007/978-3-030-41505-1_21

Denis Andrei de Araújo, S. Rigo, P. Quaresma, João Henrique Muniz

Citations: 0