{"title":"GlórIA: A Generative and Open Large Language Model for Portuguese","authors":"Ricardo Lopes, João Magalhães, David Semedo","doi":"10.48550/arXiv.2402.12969","DOIUrl":"https://doi.org/10.48550/arXiv.2402.12969","url":null,"abstract":"Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. 
The model also exhibits strong potential for various downstream tasks.","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"85 7","pages":"441-453"},"PeriodicalIF":0.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140448003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning","authors":"Alexandre Alcoforado, Thomas Palmeira Ferraz, Lucas Hideki Okamura, Israel Campos Fama, Arnold Moya Lavado, Bárbara Dias Bueno, Bruno Veloso, Anna Helena Reali Costa","doi":"10.48550/arXiv.2401.13229","DOIUrl":"https://doi.org/10.48550/arXiv.2401.13229","url":null,"abstract":"A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. An option is the use of crowdsourcing platforms for data annotation. However, crowdsourcing introduces issues related to the annotator's experience, consistency, and biases. An alternative is to use zero-shot methods, which in turn have limitations compared to their few-shot or fully supervised counterparts. Recent advancements driven by large language models show potential, but struggle to adapt to specialized domains with severely limited data. The most common approaches therefore involve the human itself randomly annotating a set of datapoints to build initial datasets. But randomly sampling data to be annotated is often inefficient as it ignores the characteristics of the data and the specific needs of the model. The situation worsens when working with imbalanced datasets, as random sampling tends to heavily bias towards the majority classes, leading to excessive annotated data. To address these issues, this paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning. 
Our proposal minimizes the quantity and maximizes diversity of data selected for human annotation, while improving model performance.","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"20 1","pages":"492-502"},"PeriodicalIF":0.0,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140497627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking environmental policy changes in the Brazilian Federal Official Gazette","authors":"Flávio Nakasato Cação, A. H. R. Costa, Natalie Unterstell, Liuca Yonaha, T. Stec, Fábio Ishisaki","doi":"10.1007/978-3-030-98305-5_24","DOIUrl":"https://doi.org/10.1007/978-3-030-98305-5_24","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115046863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MINT - Mainstream and Independent News Text Corpus","authors":"Danielle Caled, Paula Carvalho, Mário J. Silva","doi":"10.1007/978-3-030-98305-5_3","DOIUrl":"https://doi.org/10.1007/978-3-030-98305-5_3","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125911062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brazilian Portuguese Speech Recognition Using Wav2vec 2.0","authors":"L. Gris, Edresson Casanova, F. S. Oliveira, A. S. Soares, A. Júnior","doi":"10.1007/978-3-030-98305-5_31","DOIUrl":"https://doi.org/10.1007/978-3-030-98305-5_31","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123559284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inferring the Source of Official Texts: Can SVM Beat ULMFiT?","authors":"Pedro Henrique Luz de Araujo, T. D. Campos, Marcelo Magalhães Silva de Sousa","doi":"10.1007/978-3-030-41505-1_8","DOIUrl":"https://doi.org/10.1007/978-3-030-41505-1_8","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115669554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Screening of Email Box in Portuguese with SVM at Banco do Brasil","authors":"Rafael Faria de Azevedo, Rafael Pereira de Araujo, R. G. Araújo, Régis Moreira Bittencourt, Rafael Ferreira Alves da Silva, Gabriel de Melo Vaz Nogueira, Thiago Marques Franca, J. Nunes, Klailton Ralff da Silva, Emmanuelle Regiane Cunha de Oliveira","doi":"10.1007/978-3-030-41505-1_15","DOIUrl":"https://doi.org/10.1007/978-3-030-41505-1_15","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122581294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dataset for the Evaluation of Lexical Simplification in Portuguese for Children","authors":"N. Hartmann, Gustavo Paetzold, S. Aluísio","doi":"10.1007/978-3-030-41505-1_6","DOIUrl":"https://doi.org/10.1007/978-3-030-41505-1_6","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127799727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Siamese Neural Networks on Short Text Similarity Tasks for Multiple Domains and Languages","authors":"João Vitor Andrioli de Souza, Lucas E. S. Oliveira, Yohan Bonescki Gumiel, D. Carvalho, C. Moro","doi":"10.1007/978-3-030-41505-1_34","DOIUrl":"https://doi.org/10.1007/978-3-030-41505-1_34","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117175368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Portuguese Dataset for Evaluation of Semantic Question Answering","authors":"Denis Andrei de Araújo, S. Rigo, P. Quaresma, João Henrique Muniz","doi":"10.1007/978-3-030-41505-1_21","DOIUrl":"https://doi.org/10.1007/978-3-030-41505-1_21","url":null,"abstract":"","PeriodicalId":291757,"journal":{"name":"International Conference on Computational Processing of the Portuguese Language","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130595280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}