{"title":"Multilingual Controllable Transformer-Based Lexical Simplification","authors":"Sheang Cheng Kim, Horacio Saggion","doi":"10.48550/arXiv.2307.02120","DOIUrl":"https://doi.org/10.48550/arXiv.2307.02120","url":null,"abstract":"Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Therefore, suggesting simpler alternatives for complex words without compromising meaning would help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fined-tuned with the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. The evaluation results on three well-known LS datasets -- LexMTurk, BenchLS, and NNSEval -- show that our model outperforms the previous state-of-the-art models like LSBert and ConLS. Moreover, further evaluation of our approach on the part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively when compared with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Moreover, our model obtains performance gains also for Spanish and Portuguese.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131319012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning Methods for Extracting Metaphorical Names of Flowers and Plants","authors":"A. Haddad, Damith Premasiri, Tharindu Ranasinghe, R. Mitkov","doi":"10.48550/arXiv.2305.10833","DOIUrl":"https://doi.org/10.48550/arXiv.2305.10833","url":null,"abstract":"The domain of Botany is rich with metaphorical terms. Those terms play an important role in the description and identification of flowers and plants. However, the identification of such terms in discourse is an arduous task. This leads in some cases to committing errors during translation processes and lexicographic tasks. The process is even more challenging when it comes to machine translation, both in the cases of single-word terms and multi-word terms. One of the recent concerns of Natural Language Processing (NLP) applications and Machine Translation (MT) technologies is the automatic identification of metaphor-based words in discourse through Deep Learning (DL). In this study, we seek to fill this gap through the use of thirteen popular transformer based models, as well as ChatGPT, and we show that discriminative models perform better than GPT-3.5 model with our best performer reporting 92.2349% F1 score in metaphoric flower and plant names identification task.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132795262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aligning a medium-size GPT model in English to a small closed domain in Spanish using reinforcement learning","authors":"Oscar R. Navarrete-Parra, Víctor Uc Cetina, Jorge Reyes-Magaña","doi":"10.48550/arXiv.2303.17649","DOIUrl":"https://doi.org/10.48550/arXiv.2303.17649","url":null,"abstract":"In this paper, we propose a methodology to align a medium-sized GPT model, originally trained in English for an open domain, to a small closed domain in Spanish. The application for which the model is finely tuned is the question answering task. To achieve this we also needed to train and implement another neural network (which we called the reward model) that could score and determine whether an answer is appropriate for a given question. This component served to improve the decoding and generation of the answers of the system. Numerical metrics such as BLEU and perplexity were used to evaluate the model, and human judgment was also used to compare the decoding technique with others. Finally, the results favored the proposed method, and it was determined that it is feasible to use a reward model to align the generation of responses.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"7 Suppl 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124983406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Número 61","authors":"VV.AA","doi":"10.18537/auc.61","DOIUrl":"https://doi.org/10.18537/auc.61","url":null,"abstract":"\u0000Ejemplar completo del número 61 \u0000 \u0000 ","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126464338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lessons learned from the evaluation of Spanish Language Models","authors":"Rodrigo Agerri, Eneko Agirre","doi":"10.48550/arXiv.2212.08390","DOIUrl":"https://doi.org/10.48550/arXiv.2212.08390","url":null,"abstract":"Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) Previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) Results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need of more research to understand the factors underlying them. In this sense, the effect of corpus size, quality and pre-training techniques need to be further investigated to be able to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, specially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires to marry resources (monetary and/or computational) with the best research expertise and practice.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123269857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling","authors":"Javier de la Rosa, E. G. Ponferrada, Paulo Villegas, Pablo González de Prado Salas, Manu Romero, María Grandury","doi":"10.48550/arXiv.2207.06814","DOIUrl":"https://doi.org/10.48550/arXiv.2207.06814","url":null,"abstract":"The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name $textit{perplexity sampling}$ that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget. Our models are available at this $href{https://huggingface.co/bertin-project}{URL}$.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126478521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Número 60","authors":"Vvaa","doi":"10.18537/auc.60","DOIUrl":"https://doi.org/10.18537/auc.60","url":null,"abstract":"Ejemplar completo del número 60","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"276 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133947108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts","authors":"Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias, Martin Krallinger","doi":"10.26342/2021-67-21","DOIUrl":"https://doi.org/10.26342/2021-67-21","url":null,"abstract":"Among the socio-demographic patient characteristics, occupations play an important role regarding not only occupational health, work-related accidents and exposure to toxic/pathogenic agents, but also their impact on general physical and mental health. This paper presents the Medical Documents Profession Recogni-tion (MEDDOPROF) shared task (held within IberLEF/SEPLN 2021), focused on the recognition and normalization of occupations in medical documents in Spanish. MEDDOPROF proposes three challenges: NER (recognition of professions, employ-ment statuses and activities in text), CLASS (classifying each occupation mention to its holder, i.e. patient or family member) and NORM (normalizing mentions to their identifier in ESCO or SNOMED CT). From the total of 40 registered teams, 15 submitted a total of 94 runs for the various sub-tracks. Best-performing systems were based on deep-learning technologies (incl. transformers) and achieved 0.818 F-score in occupation detection (NER), 0.793 in classifying occupations to their ref-erent (CLASS) and 0.619 in normalization (NORM). Future initiatives should also address multilingual aspects and application to other domains like social services, human resources, legal or job market data analytics and policy makers.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"107 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115497817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"El sentimiento de las letras de las canciones y su relación con las características musicales","authors":"M. Palomeque, J. Lucio","doi":"10.26342/2021-67-8","DOIUrl":"https://doi.org/10.26342/2021-67-8","url":null,"abstract":"Los autores agradecen la financiacion recibida por la Comunidad de Madrid y la UAH (ref: EPU-INV/2020/006).","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"305 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121260731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Text Length for Information Retrieval Tasks based on Probabilistic Topics","authors":"Carlos Badenes-Olmedo, Borja Lozano-Álvarez, Óscar Corcho","doi":"10.26342/2021-67-2","DOIUrl":"https://doi.org/10.26342/2021-67-2","url":null,"abstract":"This work is supported by the project KnowledgeSpaces with reference PID2020-118274RB-I00, financed by the Spanish Ministry of Science and Innovation.","PeriodicalId":258781,"journal":{"name":"Proces. del Leng. Natural","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114758610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}