{"title":"Extracting information and inferences from a large text corpus.","authors":"Sandhya Avasthi, Ritu Chauhan, Debi Prasanna Acharjya","doi":"10.1007/s41870-022-01123-4","DOIUrl":null,"url":null,"abstract":"<p><p>The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM is a good model for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications.</p>","PeriodicalId":73455,"journal":{"name":"International journal of information technology : an official journal of Bharati Vidyapeeth's Institute of Computer Applications and Management","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/pdf/","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International journal of information technology : an official journal of Bharati Vidyapeeth's Institute of Computer Applications and Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41870-022-01123-4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
The use of software applications has grown tremendously with the onset of Industry 4.0, leading to the accumulation of data in many forms. Scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need for decision-making of all kinds. Topic models can be applied to text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and compares the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embeddings (ITMWE) is proposed that processes large text data incrementally and extracts the latent topics that best describe a document collection. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial for text mining applications.
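The paper does not include an implementation of ITMWE, but the LDA baseline it is compared against can be sketched to illustrate the basic topic-extraction workflow. The following is a minimal sketch using gensim's LdaModel; the toy documents, topic count, and training parameters are illustrative assumptions, not values from the paper.

```python
# Minimal LDA baseline sketch with gensim (assumed setup, not the
# paper's ITMWE method). Real inputs would be preprocessed corpora
# such as CORD-19, NIPS papers, or tweets.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical stand-in for a tokenized, stopword-filtered corpus.
docs = [
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["word", "embedding", "vector", "semantic", "space"],
    ["incremental", "update", "stream", "document", "collection"],
]

dictionary = Dictionary(docs)                     # token -> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]    # bag-of-words vectors

# Fit LDA with a small, arbitrary topic count for illustration.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```

For the online setting the paper evaluates, new document batches can be folded into a fitted model with gensim's LdaModel.update(), which approximates the incremental workflow against which ITMWE is benchmarked.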