Authors: Ahmad Hafidh Ayatullah, Nanik Suciati
Journal: Jurnal Informatika Polinema
Publication date: 2023-02-27
Publication type: Journal Article
DOI: 10.33795/jip.v9i2.1271 (https://doi.org/10.33795/jip.v9i2.1271)
TOPIC GROUPING BASED ON DESCRIPTION TEXT IN MICROSOFT RESEARCH VIDEO DESCRIPTION CORPUS DATA USING FASTTEXT, PCA AND K-MEANS CLUSTERING
This research groups the videos of the Microsoft Research Video Description Corpus (MRVDC) by topic, based on their Indonesian-language text descriptions. MRVDC is a video dataset developed by Microsoft Research that contains paraphrased descriptions of events in English and other languages. The resulting topic groups show the patterns of similarity and the interrelationships between text descriptions of different videos, which is useful for topic-based video retrieval. The grouping process uses fastText for word embedding, PCA for feature reduction, and K-means for clustering. Experiments on 1959 videos with 43753 text descriptions, varying the number of clusters k and running with and without PCA, show that the optimal number of clusters is 180, with a silhouette coefficient of 0.123115.
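The model-selection step described above (run K-means for several values of k and keep the k with the highest silhouette coefficient) can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that each description has already been turned into a numeric vector, so the fastText and PCA stages are omitted, and a handful of made-up 2-D points stand in for the reduced embeddings. The helper names and the deterministic farthest-first initialisation are choices made for this sketch only.

```python
import math


def farthest_first_centers(points, k):
    """Deterministic initialisation (an assumption of this sketch):
    start at the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=20):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette_coefficient(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points, where a is the
    mean distance to the point's own cluster and b is the mean distance to
    the nearest other cluster. Singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up 2-D points standing in for PCA-reduced description embeddings.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

# Sweep candidate values of k and keep the one with the best silhouette.
scores = {k: silhouette_coefficient(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette coefficient ranges from -1 to 1, and values near 0 indicate overlapping clusters, so the reported value of 0.123115 at k = 180 suggests that the topic groups are only weakly separated.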