Implementation of ontology-based on Word2Vec and DBSCAN for part-of-speech

Proceedings of the 5th International Conference on Sustainable Information Engineering and Technology. Pub Date: 2020-11-16. DOI: 10.1145/3427423.3427431

Parmonangan R. Togatorop, Rosa Siagian, Yolanda Nainggolan, Kaleb Simanungkalit


Citations: 2

Abstract



Implementation of ontology-based on Word2Vec and DBSCAN for part-of-speech

Part-of-speech (POS) tagging is the process of assigning each word in a text to an appropriate word class based on its definition and its relationships to other words. Several POS tagging approaches have been applied to Bahasa Indonesia, namely rule-based, stochastic, and neural approaches. Another approach, based on ontology, has been applied to English but not yet to Bahasa Indonesia, so we implement an ontology to perform POS tagging in Bahasa Indonesia. In this study, the ontology is constructed using Word2Vec and the DBSCAN clustering method. The Word2Vec model represents each word as a vector based on its context, and DBSCAN groups the words into word classes based on the vectors produced by Word2Vec. POS tagging with the ontology is carried out in several stages: data collection by web scraping online news articles from Kompas.com and Detik.com, text preprocessing, Word2Vec feature building, clustering with DBSCAN, ontology construction, and evaluation. The experiments in this study select the optimal DBSCAN parameter values for forming the word clusters used in ontology construction. Overall, the ontology built with Word2Vec and DBSCAN performs POS tagging with a highest accuracy of 0.62, a highest precision of 0.79, a highest recall of 0.62, and a highest F1-score of 0.67.
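The clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors below are hand-made stand-ins for Word2Vec embeddings, the word list is hypothetical, the use of cosine distance is an assumption, and the values `eps=0.05` and `min_samples=2` are chosen only to suit this toy data (the paper tunes its DBSCAN parameters experimentally and the abstract does not report them). The DBSCAN routine is a small pure-Python version written for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumed here as the distance between word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def region_query(vectors, i, eps):
    """Indices of all points within eps of point i (point i itself included)."""
    return [j for j, v in enumerate(vectors)
            if cosine_distance(vectors[i], v) <= eps]

def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(vectors)
    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(vectors, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point is a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(vectors, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical toy "embeddings" (NOT real Word2Vec output).
words = ["buku", "rumah", "mobil", "makan", "minum", "pergi", "dan"]
vectors = [
    (0.90, 0.10, 0.00), (0.80, 0.20, 0.10), (0.85, 0.15, 0.05),
    (0.10, 0.90, 0.00), (0.20, 0.80, 0.10), (0.15, 0.85, 0.05),
    (0.00, 0.10, 0.90),
]

labels = dbscan(vectors, eps=0.05, min_samples=2)

# Group words by cluster label; each group is a candidate word class.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
```

On this toy data the noun-like and verb-like words fall into two separate clusters and "dan" is left as noise. With real embeddings, the choice of `eps` and `min_samples` largely determines how many clusters form, which is consistent with the paper's experiments focusing on DBSCAN parameter selection.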

