Review and Visualization of Facebook's FastText Pretrained Word Vector Model

J. Young, A. Rusli
2019 International Conference on Engineering, Science, and Industrial Applications (ICESI), published 2019-08-01
DOI: 10.1109/ICESI.2019.8863015 (https://doi.org/10.1109/ICESI.2019.8863015)
Citations: 10

Abstract

Word2Vec is one of the most popular machine learning methods for processing natural language. As with several other machine learning methods, there are concerns about the interpretability of the resulting model. This paper reviews and analyzes FastText, a pretrained word vector model for Bahasa Indonesia released by Facebook. The analysis begins by comparing the words in the pretrained model with those in the official dictionary of the Indonesian language (KBBI); the words in the model are then visualized for further analysis and review. A combination of Principal Component Analysis (PCA) and the t-SNE algorithm is used as the dimensionality reduction technique for visualizing the word set. Based on the analysis and visualization results, the paper proposes several considerations for using the FastText pretrained word vector model to process Indonesian text, such as whether common natural language text preprocessing techniques are needed.
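The two steps named in the abstract, the vocabulary comparison and the PCA plus t-SNE reduction, can be sketched roughly as follows. This is a minimal illustration and not the authors' code. The function names `vocabulary_overlap` and `reduce_to_2d`, the exact-string matching rule, the choice of 50 intermediate PCA dimensions, the perplexity value, and the toy word lists are all assumptions made for this example. Random vectors stand in for the real FastText embeddings, and scikit-learn is assumed as the library for both reductions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def vocabulary_overlap(model_words, dictionary_words):
    """Split the model vocabulary into words found in / missing from the dictionary.

    Assumption: words are compared by exact string match, with no
    normalization such as lowercasing or stemming.
    """
    model_set = set(model_words)
    dict_set = set(dictionary_words)
    return model_set & dict_set, model_set - dict_set


def reduce_to_2d(vectors, pca_dims=50, perplexity=30.0, seed=0):
    """Reduce word vectors with PCA first, then project to 2-D with t-SNE."""
    vectors = np.asarray(vectors, dtype=np.float64)
    n_samples, n_features = vectors.shape

    # PCA cannot produce more components than min(n_samples, n_features).
    n_components = min(pca_dims, n_samples, n_features)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(vectors)

    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(perplexity, (n_samples - 1) / 3.0)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)


if __name__ == "__main__":
    # Toy word lists invented for illustration; the paper uses the FastText
    # vocabulary and the KBBI dictionary.
    model_vocab = ["makan", "minum", "rumah", "xyz123"]
    kbbi_words = ["makan", "minum", "rumah", "buku"]
    found, missing = vocabulary_overlap(model_vocab, kbbi_words)
    print("in dictionary:", sorted(found))
    print("not in dictionary:", sorted(missing))

    # Random stand-in for 60 word vectors of dimension 300.
    rng = np.random.default_rng(0)
    coords = reduce_to_2d(rng.normal(size=(60, 300)))
    print("2-D coordinates shape:", coords.shape)
```

Running PCA before t-SNE is a common way to cut noise and computation time on high-dimensional vectors. The resulting 2-D coordinates can be passed to any plotting library to produce a scatter plot of the word set.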