Review and Visualization of Facebook's FastText Pretrained Word Vector Model

2019 International Conference on Engineering, Science, and Industrial Applications (ICESI). Pub Date: 2019-08-01. DOI: 10.1109/ICESI.2019.8863015

J. Young, A. Rusli


Citations: 10

Abstract


Word2Vec is one of the most popular machine learning methods for natural language processing. As with several other machine learning methods, the interpretability of the resulting model is a concern. This paper reviews and analyzes FastText, a pretrained word vector model for Bahasa Indonesia released by Facebook. The analysis begins by comparing the words in the pretrained model with those in the official dictionary of the Indonesian language (KBBI); words in the model are then visualized for further analysis and review. Principal Component Analysis (PCA) combined with the t-SNE algorithm is used as the dimensionality reduction technique to visualize the word set. Based on the analysis and visualization results, the paper proposes several considerations for using the FastText pretrained word vector model to process Indonesian text, such as whether common text preprocessing techniques are needed.
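The vocabulary comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `load_vec_vocabulary` and `compare_vocabularies`, the toy word lists, and the assumption that the dictionary is available as a plain set of headwords are all hypothetical. The only format detail relied on is that fastText's text (`.vec`) files start with a header line giving the word count and dimension, followed by one word and its vector per line.

```python
from typing import Dict, Iterable, Set


def load_vec_vocabulary(lines: Iterable[str]) -> Set[str]:
    """Collect the word column from fastText's text (.vec) format.

    The first line holds "<word count> <dimension>"; every later line
    is a word followed by its space-separated vector components.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header line
    return {line.split(" ", 1)[0] for line in rows if line.strip()}


def compare_vocabularies(model_words: Set[str],
                         dictionary_words: Set[str]) -> Dict[str, object]:
    """Split two vocabularies into shared and one-sided entries."""
    shared = model_words & dictionary_words
    coverage = len(shared) / len(dictionary_words) if dictionary_words else 0.0
    return {
        "shared": shared,
        "model_only": model_words - dictionary_words,
        "dictionary_only": dictionary_words - model_words,
        "dictionary_coverage": coverage,
    }


# Toy stand-ins (hypothetical data, not taken from the paper).
toy_vec_file = [
    "4 3",
    "makan 0.1 0.2 0.3",
    "Makan 0.1 0.1 0.3",
    "minum 0.4 0.5 0.6",
    "</s> 0.0 0.0 0.0",
]
toy_dictionary = {"makan", "minum", "tidur"}

model_vocab = load_vec_vocabulary(toy_vec_file)
report = compare_vocabularies(model_vocab, toy_dictionary)

for key in ("shared", "model_only", "dictionary_only"):
    print(key, sorted(report[key]))
print("dictionary_coverage", round(report["dictionary_coverage"], 3))
```

In this toy data the model-only entries are a case variant and a special token. Entries of that kind are one plausible reason the question of preprocessing (for example, lowercasing) comes up when matching a pretrained vocabulary against a dictionary; the paper's actual findings are not reproduced here.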
