Uncovering Flat and Hierarchical Topics by Community Discovery on Word Co-occurrence Network.

IF 4.6 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Science and Engineering Pub Date : 2024-01-01 Epub Date: 2024-03-13 DOI:10.1007/s41019-023-00239-2

Eric Austin, Shraddha Makwana, Amine Trabelsi, Christine Largeron, Osmar R Zaïane

Data Science and Engineering, volume 9, issue 1, pages 41-61. Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10980674/pdf/


Abstract

Topic modeling aims to discover latent themes in collections of text documents. It has applications in fields such as sociology, opinion analysis, and media studies, where topics must be easily interpretable, diverse, and coherent. An effective topic modeling technique should accurately identify both flat and hierarchical topics; the latter are especially useful in disciplines where topics can be logically arranged into a tree. In this paper, we propose Community Topic, a novel algorithm that exploits word co-occurrence networks to mine communities and produce topics. We evaluate the proposed approach using several metrics and compare it with standard baselines, confirming its good performance. Community Topic enables quick identification of flat topics and of a topic hierarchy, facilitating on-demand exploration of sub- and super-topics. It also obtains good results on datasets in different languages.
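As a rough illustration of the idea in the abstract, the sketch below builds a word co-occurrence network and treats groups of connected words as topics. It is not the Community Topic algorithm itself, whose details the abstract does not give. Everything here is an assumption made for illustration: the whitespace tokenizer, the document-level co-occurrence window, the hypothetical helper names (`cooccurrence_edges`, `communities`, `subtopics`), and above all the use of thresholded connected components as a stand-in for a real community discovery method.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_edges(docs):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))  # naive tokenizer (assumption)
        counts.update(combinations(words, 2))
    return counts


def communities(edges, min_count):
    """Connected components of the graph that keeps edges with weight >= min_count.

    Stand-in for a proper community discovery algorithm (assumption).
    """
    adj = defaultdict(set)
    for (u, v), weight in edges.items():
        if weight >= min_count:
            adj[u].add(v)
            adj[v].add(u)
    seen, result = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result


def subtopics(edges, topic, min_count):
    """Re-run discovery inside one topic, using a stricter edge threshold."""
    members = set(topic)
    inner = {pair: w for pair, w in edges.items()
             if pair[0] in members and pair[1] in members}
    return communities(inner, min_count)


docs = [
    "goal striker match",
    "goal striker match",
    "goal keeper match",
    "vote senate election",
    "vote senate election",
    "vote ballot election",
]
edges = cooccurrence_edges(docs)
flat = communities(edges, min_count=1)        # flat topics
fine = subtopics(edges, flat[0], min_count=2)  # one level down the hierarchy
print(flat)
print(fine)
```

On this toy corpus the flat pass separates a politics vocabulary from a football vocabulary, and the stricter pass inside the politics topic keeps only the strongly connected words. Raising the threshold repeatedly inside each group is one simple way to obtain a tree of sub-topics on demand.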


Source journal

Data Science and Engineering (Engineering-Computational Mechanics)

CiteScore: 10.40

Self-citation rate: 2.40%

Review time: 12 weeks

About the journal: The journal Data Science and Engineering (DSE) responds to the remarkable change in the focus of information technology development from CPU-intensive computation to data-intensive computation, where the effective application of data, especially big data, becomes vital. The emerging discipline of data science and engineering, an interdisciplinary field integrating theories and methods from computer science, statistics, information science, and other fields, focuses on the foundations and engineering of efficient and effective techniques and systems for data collection and management, for data integration and correlation, for information and knowledge extraction from massive data sets, and for data use in different application domains. Focusing on the theoretical background and advanced engineering approaches, DSE aims to offer a prime forum for researchers, professionals, and industrial practitioners to share their knowledge in this rapidly growing area. It provides in-depth coverage of the latest advances in the closely related fields of data science and data engineering. More specifically, DSE covers four areas: (i) the data itself, i.e., the nature and quality of the data, especially big data; (ii) the principles of information extraction from data, especially big data; (iii) the theory behind data-intensive computing; and (iv) the techniques and systems used to analyze and manage big data. DSE welcomes papers that explore the above subjects. 
Specific topics include, but are not limited to: (a) the nature and quality of data; (b) the computational complexity of data-intensive computing; (c) new methods for the design and analysis of algorithms for solving problems with big data input; (d) collection and integration of data collected from the internet and sensing devices or sensor networks; (e) representation, modeling, and visualization of big data; (f) storage, transmission, and management of big data; (g) methods and algorithms of data-intensive computing, such as mining big data, online analytical processing of big data, big data-based machine learning, big data-based decision-making, statistical computation of big data, graph-theoretic computation of big data, linear algebraic computation of big data, and big data-based optimization; (h) hardware systems and software systems for data-intensive computing; (i) data security, privacy, and trust; and (j) novel applications of big data.