Daniel Maciel, L. S. Artese, Alexandre Leopoldo Goncalves
19th CONTECSI International Conference on Information Systems and Technology Management
DOI: 10.5748/19contecsi/pse/dsc/7035 (https://doi.org/10.5748/19contecsi/pse/dsc/7035)
WORD EMBEDDING FOR UNKNOWN WORDS: ADDING NEW WORDS INTO BERT’S VOCABULARY
In natural language processing, the dynamics of language, such as the emergence of new words, can be a challenge for models. In deep learning models, a word that does not appear in the training dataset is unknown to the model and is therefore considered out of vocabulary (OOV). Although many models manage to work around this barrier, it is sometimes necessary to learn the embedding of a new word. To this end, we present a method for obtaining a dynamic contextual vector representation of a new word based on the BERT language model. To evaluate the method, we took the case of the emergence of the word 'voip' in scientific publications and obtained an embedding close to 'telecommunications' and 'signalling', some of the most significant words in the context of the word under study, which demonstrates that the proposed method offers an efficient way to obtain embeddings for new words.
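The abstract does not spell out how the embedding is computed, so the following is only an illustrative sketch of one plausible reading, not the authors' procedure. It assumes the new word's embedding is the mean of the contextual vectors observed for it, and that closeness to known words is measured by cosine similarity. The three-dimensional vectors are invented toy values standing in for real BERT outputs, and the names `mean_vector`, `cosine`, and `nearest` are hypothetical helpers.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vocab, k=2):
    """Return the k vocabulary words most similar to the query vector."""
    ranked = sorted(vocab, key=lambda w: cosine(query, vocab[w]), reverse=True)
    return ranked[:k]


# ASSUMPTION: toy vectors, one per sentence in which 'voip' occurs.
# In practice these would come from a contextual model such as BERT.
contexts_for_voip = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.8, 0.2, 0.0],
]

# ASSUMPTION: toy embeddings for words already in the vocabulary.
known = {
    "telecommunications": [0.85, 0.2, 0.05],
    "signalling": [0.7, 0.35, 0.1],
    "biology": [0.0, 0.1, 0.95],
}

voip_vec = mean_vector(contexts_for_voip)
neighbours = nearest(voip_vec, known, k=2)
print(neighbours)
```

With real data, the toy lists would be replaced by vectors extracted from the model for each occurrence of the new word, and the resulting average could then be assigned to a newly added vocabulary entry.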