Mohamad Almgerbi, Andrea De Mauro, Adham Kahlawi, V. Poggioni
{"title":"通过N-gram去除提高主题建模性能","authors":"Mohamad Almgerbi, Andrea De Mauro, Adham Kahlawi, V. Poggioni","doi":"10.1145/3486622.3493952","DOIUrl":null,"url":null,"abstract":"In recent years, topic modeling has been increasingly adopted for finding conceptual patterns in large corpora of digital documents to organize them accordingly. In order to enhance the performance of topic modeling algorithms, such as Latent Dirichlet Allocation (LDA), multiple preprocessing steps have been proposed. In this paper, we introduce N-gram Removal, a novel preprocessing procedure based on the systematic elimination of a dynamic number of repeated words in text documents. We have evaluated the effects of the utilization of N-gram Removal through four different performance metrics: we concluded that its application is effective at improving the performance of LDA and enhances the human interpretation of topics models.","PeriodicalId":89230,"journal":{"name":"Proceedings. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology","volume":"109 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Improving Topic Modeling Performance through N-gram Removal\",\"authors\":\"Mohamad Almgerbi, Andrea De Mauro, Adham Kahlawi, V. Poggioni\",\"doi\":\"10.1145/3486622.3493952\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, topic modeling has been increasingly adopted for finding conceptual patterns in large corpora of digital documents to organize them accordingly. In order to enhance the performance of topic modeling algorithms, such as Latent Dirichlet Allocation (LDA), multiple preprocessing steps have been proposed. In this paper, we introduce N-gram Removal, a novel preprocessing procedure based on the systematic elimination of a dynamic number of repeated words in text documents. We have evaluated the effects of the utilization of N-gram Removal through four different performance metrics: we concluded that its application is effective at improving the performance of LDA and enhances the human interpretation of topics models.\",\"PeriodicalId\":89230,\"journal\":{\"name\":\"Proceedings. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology\",\"volume\":\"109 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3486622.3493952\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3486622.3493952","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving Topic Modeling Performance through N-gram Removal
In recent years, topic modeling has been increasingly adopted for finding conceptual patterns in large corpora of digital documents and organizing them accordingly. To enhance the performance of topic modeling algorithms such as Latent Dirichlet Allocation (LDA), multiple preprocessing steps have been proposed. In this paper, we introduce N-gram Removal, a novel preprocessing procedure based on the systematic elimination of a dynamic number of repeated words in text documents. We evaluated the effects of applying N-gram Removal through four different performance metrics and concluded that it is effective both at improving the performance of LDA and at enhancing the human interpretability of topic models.
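The abstract does not include implementation details, so the following is only a minimal, illustrative sketch of what removing frequently repeated n-grams before fitting LDA might look like. The bigram size, frequency threshold, toy corpus, and the use of scikit-learn's LDA are assumptions for demonstration, not the authors' actual procedure.

```python
# Illustrative sketch only: parameters and the scikit-learn pipeline are assumptions,
# not the method published in the paper.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def remove_frequent_ngrams(docs, n=2, min_count=2):
    """Drop n-grams that repeat across the corpus at least `min_count` times."""
    tokenized = [doc.lower().split() for doc in docs]

    # Count every n-gram occurrence across all documents.
    ngram_counts = Counter(
        tuple(tokens[i:i + n])
        for tokens in tokenized
        for i in range(len(tokens) - n + 1)
    )
    frequent = {g for g, c in ngram_counts.items() if c >= min_count}

    # Rebuild each document, skipping tokens that form a frequent n-gram.
    cleaned = []
    for tokens in tokenized:
        kept, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + n]) in frequent:
                i += n  # skip the whole repeated n-gram
            else:
                kept.append(tokens[i])
                i += 1
        cleaned.append(" ".join(kept))
    return cleaned


# Hypothetical toy corpus used only to make the sketch runnable.
docs = [
    "machine learning improves topic modeling results",
    "machine learning methods for text analysis",
    "topic modeling finds latent themes in documents",
]
cleaned_docs = remove_frequent_ngrams(docs, n=2, min_count=2)

# Fit a standard LDA model on the cleaned corpus.
X = CountVectorizer(stop_words="english").fit_transform(cleaned_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
```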