N. Bastas, George Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris
2019 European Intelligence and Security Informatics Conference (EISIC), published 2019-11-01
DOI: 10.1109/EISIC49498.2019.9108898 (https://doi.org/10.1109/EISIC49498.2019.9108898)
A comparative study of clustering methods using word embeddings
Grouping large amounts of data is critical for various tasks, including the identification of content on a specific topic of interest (such as terrorism-related content) within a collection of material gathered from online sources. Existing approaches typically extract relevant features using topic distributions and/or embedding methods, and subsequently apply clustering techniques in the derived representation space. In this work, we present a comparative study using Latent Dirichlet Allocation (LDA), Paragraph-Vector Distributed Bag-of-Words (PV-DBOW), and Paragraph-Vector Distributed Memory (PV-DM) models as representation methods, in conjunction with five traditional clustering algorithms, namely k-means, spherical k-means, possibilistic fuzzy c-means, agglomerative clustering, and non-negative matrix factorization (NMF), on two publicly available datasets and one proprietary dataset. The resulting fifteen combinations (three representations times five algorithms) are assessed against the available ground truth using external clustering validity measures, such as Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI). Our results indicate that PV-DBOW generally leads to better clustering performance across all datasets.
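To make the evaluation setup concrete, the following is a minimal sketch of how the fifteen representation/algorithm pairs can be enumerated and how ARI can be computed from a contingency table using only the Python standard library. It is not the authors' code: the names `REPRESENTATIONS`, `CLUSTERERS`, and `adjusted_rand_index` are illustrative assumptions, and the function implements the standard pair-counting ARI formula rather than any procedure specific to the paper.

```python
from collections import Counter
from itertools import product
from math import comb

# Method names taken from the abstract; the variable names are hypothetical.
REPRESENTATIONS = ["LDA", "PV-DBOW", "PV-DM"]
CLUSTERERS = [
    "k-means",
    "spherical k-means",
    "possibilistic fuzzy c-means",
    "agglomerative clustering",
    "NMF",
]

# Every representation paired with every clustering algorithm: 3 * 5 = 15.
COMBINATIONS = list(product(REPRESENTATIONS, CLUSTERERS))


def adjusted_rand_index(labels_true, labels_pred):
    """Standard pair-counting ARI between two flat labelings (illustrative)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Contingency table cells plus row and column marginals.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Only happens when both labelings are one cluster or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    print(len(COMBINATIONS))  # 15
    # Same partition with renamed cluster labels scores 1.0.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

ARI depends only on which items are grouped together, so a clustering that matches the ground truth up to a renaming of cluster labels still scores 1.0, while values near 0 indicate agreement no better than chance. In practice a library implementation (for example in scikit-learn) would likely be used for both ARI and AMI.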