Handling out of vocabulary words at the semantical level using recurrent neural networks

Paula M. L. Pedroso, F. Lobato, Eveline Sá, A. Jacob
Published in: 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 2022-11-01
DOI: https://doi.org/10.1109/WI-IAT55865.2022.00022
Citations: 1

Abstract

Text recognition through natural language processing (NLP) faces challenges when it encounters a word that has not been categorized. Such words are called out-of-vocabulary (OOV) words. They often stem from variant representations of a word, local slang, or typing mistakes. This kind of content has grown rapidly as the Internet has become popular and people interact more and more through text messages. Given the importance of this subject, we present three deep-learning models for OOV classification, using a corpus of Portuguese words as a case study. The models are a bidirectional simple recurrent neural network (RNN), a long short-term memory (LSTM) network, and a gated recurrent unit (GRU) network. The purpose is to enable the system to recognize embeddings for OOV words and place them in a vector space. In addition, the meaning of the words was verified using cosine similarity. The LSTM results are promising for identifying OOV words and generating semantically similar words. The model can be used in pre-processing pipelines for user-generated content analysis, adding more value to social media studies.
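The cosine-similarity check mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the three-dimensional vectors, and the example words are all assumptions made up for illustration, and a real system would use learned embeddings with far more dimensions. The idea is that a vector predicted for an OOV word is compared against the vectors of known words, and the closest ones are taken as its semantic neighbors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


def nearest_words(query_vec, vocab_embeddings, k=3):
    """Return the k vocabulary words most similar to query_vec, best first."""
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vocab_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, made-up embeddings for three in-vocabulary Portuguese words.
vocab_embeddings = {
    "você": [0.9, 0.1, 0.0],
    "obrigado": [0.1, 0.9, 0.1],
    "casa": [0.0, 0.2, 0.9],
}

# Hypothetical vector that a trained model might predict for the slang "vc".
predicted_oov_vec = [0.8, 0.2, 0.1]

neighbors = nearest_words(predicted_oov_vec, vocab_embeddings, k=2)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```

With these toy values the nearest neighbor of the hypothetical "vc" vector is "você", which is the kind of outcome the similarity check is meant to confirm.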