Paula M. L. Pedroso, F. Lobato, Eveline Sá, A. Jacob
Title: Handling out of vocabulary words at the semantical level using recurrent neural networks
Venue: 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)
Published: 2022-11-01
DOI: 10.1109/WI-IAT55865.2022.00022 (https://doi.org/10.1109/WI-IAT55865.2022.00022)
Citation count: 1
Abstract
Text recognition through natural language processing (NLP) faces challenges when it encounters a word that is not in the system's vocabulary. Such words are called out-of-vocabulary (OOV) words. They often arise from representation issues, local slang, or typing mistakes. This kind of content has grown rapidly as the Internet has become popular and people interact more and more through text. Given the importance of this subject, we present three OOV classification models based on deep learning, using a corpus of Portuguese words as a case study. These models are bidirectional simple recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU). The purpose is to enable the system to recognize the embeddings of OOV words and place them in a vector space. In addition, the meaning of the words was verified using cosine similarity. The LSTM results are promising for identifying OOV words and generating semantically similar words. The model can be used in pre-processing pipelines for user-generated content analysis, adding more value to social media studies.
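
The cosine-similarity check mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the three-dimensional toy vectors, and the example OOV token "vc" are all assumptions made for illustration, and a real system would take the predicted vector from a trained model and compare it against a full embedding table.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(query_vec, vocab_vectors, k=3):
    """Return the k in-vocabulary words most similar to query_vec."""
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vocab_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical, hand-made embeddings (not taken from the paper).
vocab_vectors = {
    "você": [0.9, 0.1, 0.0],
    "casa": [0.0, 1.0, 0.2],
    "obrigado": [0.1, 0.0, 1.0],
}

# Hypothetical vector that a model might predict for the OOV token "vc".
predicted_vc = [0.8, 0.2, 0.1]

neighbours = nearest_words(predicted_vc, vocab_vectors, k=2)
for word, score in neighbours:
    print(f"{word}: {score:.3f}")
```

If the predicted vector for an OOV token lands close to the vector of the word it stands for, the nearest neighbours by cosine similarity give a quick sanity check on whether the embedding is semantically plausible.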