Review and Visualization of Facebook's FastText Pretrained Word Vector Model
Authors: J. Young, A. Rusli
Published in: 2019 International Conference on Engineering, Science, and Industrial Applications (ICESI), 2019-08-01
DOI: 10.1109/ICESI.2019.8863015 (https://doi.org/10.1109/ICESI.2019.8863015)
Word2Vec is one of the most popular machine learning methods for natural language processing. As with several other machine learning methods, there are concerns about the interpretability of the resulting model. This paper reviews and analyzes FastText, a pretrained word vector model for Bahasa Indonesia released by Facebook. The analysis begins by comparing the words in the pretrained model against those in the official dictionary of the Indonesian language (KBBI); words in the model are then visualized for further analysis and review. Principal Component Analysis (PCA) combined with the t-SNE algorithm is used as the dimensionality reduction technique to visualize the word set. Based on the analysis and visualization results, this paper proposes several considerations for using the FastText pretrained word vector model to process Indonesian text, such as whether common natural language text preprocessing techniques are needed.
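The vocabulary comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function names `read_vec_vocab` and `compare_vocab` are hypothetical, the toy words stand in for the real model file and the KBBI word list, and the lowercase normalization is an assumption made here to show how one preprocessing choice changes the overlap. The only fact relied on is the plain-text `.vec` layout used for the published fastText vectors, where a header line gives the word count and dimension and each following line holds a word and its vector components.

```python
import io


def read_vec_vocab(lines):
    """Collect the words from text in the fastText .vec layout.

    The first line is a header "<word count> <dimension>"; every later
    line is a word followed by its vector components.
    """
    it = iter(lines)
    next(it, None)  # skip the header line
    vocab = set()
    for line in it:
        parts = line.rstrip("\n").split(" ")
        if parts and parts[0]:
            vocab.add(parts[0])
    return vocab


def compare_vocab(model_words, dictionary_words, lowercase=True):
    """Split model words into those found in the dictionary and the rest.

    lowercase=True is an assumed normalization step, not something the
    abstract specifies.
    """
    norm = (lambda w: w.lower()) if lowercase else (lambda w: w)
    dictionary = {norm(w) for w in dictionary_words}
    found = {w for w in model_words if norm(w) in dictionary}
    return found, set(model_words) - found


# Toy data standing in for the real model file and the KBBI word list.
toy_vec = io.StringIO(
    "4 3\n"
    "makan 0.1 0.2 0.3\n"
    "Makan 0.1 0.2 0.4\n"
    "rumah 0.5 0.1 0.0\n"
    "asdfgh 0.9 0.9 0.9\n"
)
toy_dictionary = ["makan", "rumah", "buku"]

model_vocab = read_vec_vocab(toy_vec)
in_dict, not_in_dict = compare_vocab(model_vocab, toy_dictionary)
coverage = len(in_dict) / len(model_vocab)
print(f"in dictionary: {sorted(in_dict)}")
print(f"not in dictionary: {sorted(not_in_dict)}")
print(f"share of model words found in dictionary: {coverage:.2f}")
```

Calling `compare_vocab(model_vocab, toy_dictionary, lowercase=False)` on the same toy data treats `Makan` as a non-dictionary word, which is the kind of effect the question about preprocessing is concerned with.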