Mubashar Mustafa, Feng Zeng, Usama Manzoor, Lin Meng
2023 6th International Conference on Information and Computer Technologies (ICICT), published 2023-03-01. DOI: 10.1109/ICICT58900.2023.00028 (https://doi.org/10.1109/ICICT58900.2023.00028)
Discovering Coherent Topics from Urdu Text: A Comparative Study of Statistical Models, Clustering Techniques and Word Embedding
The volume of data on the internet is continuously expanding due to the abundance of news sources, journals, blogs, and other online content and publications. The use of Urdu online has grown significantly, as it has for other languages. Information retrieval (IR) becomes more challenging as the volume of data grows. Topic modelling (TM), a natural language processing (NLP) technique, is crucial for extracting themes or aspects from text. Although there is a long tradition of TM in English and other Western languages, Urdu lags behind in sophisticated NLP tools and resources for TM. The rich morphology of the Urdu language makes TM a challenging task. In this study, we developed a TM framework and analysed word embedding, statistical models, and clustering techniques for Urdu documents. The aim of this work is to evaluate and compare these three approaches based on the coherence measure of the extracted topics. The findings of a thorough experiment and evaluation demonstrate that word embedding fails to extract coherent topics in the Urdu language, and that the average coherence of topics retrieved by clustering approaches exceeds that of topics discovered through statistical models.
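The abstract does not say which coherence measure the study uses. As an illustration only, the sketch below assumes the UMass document co-occurrence measure and averages it over a model's topics, which is one common way to compare topic models by coherence. The function names, the toy corpus, and the Latin-script placeholder tokens are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, documents):
    """UMass coherence of one topic (assumed measure, not confirmed by the paper).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, where w_j precedes
    w_i in topic_words and D counts the documents containing the word(s).
    """
    doc_sets = [set(doc) for doc in documents]
    score = 0.0
    # combinations() yields (earlier, later) pairs in the order of topic_words.
    for w_j, w_i in combinations(topic_words, 2):
        d_j = sum(1 for d in doc_sets if w_j in d)
        if d_j == 0:
            continue  # word absent from the corpus: skip to avoid dividing by zero
        d_ij = sum(1 for d in doc_sets if w_i in d and w_j in d)
        score += math.log((d_ij + 1) / d_j)
    return score


def mean_coherence(topics, documents):
    """Average coherence over all topics produced by one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


# Hypothetical tokenised corpus; placeholders stand in for Urdu tokens.
docs = [
    ["cricket", "match", "team"],
    ["cricket", "team", "score"],
    ["election", "vote", "party"],
    ["election", "party", "team"],
]
coherent_topic = ["cricket", "team"]  # words that share documents
mixed_topic = ["cricket", "vote"]     # words that never co-occur

print(umass_coherence(coherent_topic, docs))
print(umass_coherence(mixed_topic, docs))
print(mean_coherence([coherent_topic, mixed_topic], docs))
```

Under this measure a higher (less negative) score means the topic's top words tend to appear in the same documents, so the mean score gives a single figure per model for comparing the statistical, clustering, and embedding approaches.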