International Conference on Computational Processing of the Portuguese Language: Latest Publications

GlórIA: A Generative and Open Large Language Model for Portuguese
International Conference on Computational Processing of the Portuguese Language Pub Date : 2024-02-20 DOI: 10.48550/arXiv.2402.12969
Ricardo Lopes, João Magalhães, David Semedo
Abstract: Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first Portuguese zero-shot language-modeling benchmark. Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling and that it can generate sound, knowledge-rich, and coherent PT-PT text. The model also exhibits strong potential for various downstream tasks.
Pages: 441-453
Citations: 0
From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning
International Conference on Computational Processing of the Portuguese Language Pub Date : 2024-01-24 DOI: 10.48550/arXiv.2401.13229
Alexandre Alcoforado, Thomas Palmeira Ferraz, Lucas Hideki Okamura, Israel Campos Fama, Arnold Moya Lavado, Bárbara Dias Bueno, Bruno Veloso, Anna Helena Reali Costa
Abstract: A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. An option is the use of crowdsourcing platforms for data annotation. However, crowdsourcing introduces issues related to the annotator's experience, consistency, and biases. An alternative is to use zero-shot methods, which in turn have limitations compared to their few-shot or fully supervised counterparts. Recent advancements driven by large language models show potential, but struggle to adapt to specialized domains with severely limited data. The most common approaches therefore involve the human itself randomly annotating a set of datapoints to build initial datasets. But randomly sampling data to be annotated is often inefficient as it ignores the characteristics of the data and the specific needs of the model. The situation worsens when working with imbalanced datasets, as random sampling tends to heavily bias towards the majority classes, leading to excessive annotated data. To address these issues, this paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning. Our proposal minimizes the quantity and maximizes diversity of data selected for human annotation, while improving model performance.
Pages: 492-502
Citations: 0
Tracking environmental policy changes in the Brazilian Federal Official Gazette
International Conference on Computational Processing of the Portuguese Language Pub Date : 2022-02-11 DOI: 10.1007/978-3-030-98305-5_24
Flávio Nakasato Cação, A. H. R. Costa, Natalie Unterstell, Liuca Yonaha, T. Stec, Fábio Ishisaki
Citations: 1
MINT - Mainstream and Independent News Text Corpus
International Conference on Computational Processing of the Portuguese Language Pub Date : 2021-08-13 DOI: 10.1007/978-3-030-98305-5_3
Danielle Caled, Paula Carvalho, Mário J. Silva
Citations: 0
Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
International Conference on Computational Processing of the Portuguese Language Pub Date : 2021-07-23 DOI: 10.1007/978-3-030-98305-5_31
L. Gris, Edresson Casanova, F. S. Oliveira, A. S. Soares, A. Júnior
Citations: 11
Inferring the Source of Official Texts: Can SVM Beat ULMFiT?
International Conference on Computational Processing of the Portuguese Language Pub Date : 2020-03-02 DOI: 10.1007/978-3-030-41505-1_8
Pedro Henrique Luz de Araujo, T. D. Campos, Marcelo Magalhães Silva de Sousa
Citations: 5
Screening of Email Box in Portuguese with SVM at Banco do Brasil
International Conference on Computational Processing of the Portuguese Language Pub Date : 2020-03-02 DOI: 10.1007/978-3-030-41505-1_15
Rafael Faria de Azevedo, Rafael Pereira de Araujo, R. G. Araújo, Régis Moreira Bittencourt, Rafael Ferreira Alves da Silva, Gabriel de Melo Vaz Nogueira, Thiago Marques Franca, J. Nunes, Klailton Ralff da Silva, Emmanuelle Regiane Cunha de Oliveira
Citations: 1
A Dataset for the Evaluation of Lexical Simplification in Portuguese for Children
International Conference on Computational Processing of the Portuguese Language Pub Date : 2020-03-02 DOI: 10.1007/978-3-030-41505-1_6
N. Hartmann, Gustavo Paetzold, S. Aluísio
Citations: 3
Exploiting Siamese Neural Networks on Short Text Similarity Tasks for Multiple Domains and Languages
International Conference on Computational Processing of the Portuguese Language Pub Date : 2020-03-02 DOI: 10.1007/978-3-030-41505-1_34
João Vitor Andrioli de Souza, Lucas E. S. Oliveira, Yohan Bonescki Gumiel, D. Carvalho, C. Moro
Citations: 8
A Portuguese Dataset for Evaluation of Semantic Question Answering
International Conference on Computational Processing of the Portuguese Language Pub Date : 2020-03-02 DOI: 10.1007/978-3-030-41505-1_21
Denis Andrei de Araújo, S. Rigo, P. Quaresma, João Henrique Muniz
Citations: 0