语义增强型主题建模技术：语义-LDA

IF 4.8 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Knowledge Discovery from Data Pub Date : 2024-01-02 DOI:10.1145/3639409

Dakshi Kapugama Geeganage, Yue Xu, Yuefeng Li

{"title":"语义增强型主题建模技术：语义-LDA","authors":"Dakshi Kapugama Geeganage, Yue Xu, Yuefeng Li","doi":"10.1145/3639409","DOIUrl":null,"url":null,"abstract":"<p>Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA which captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model which calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"11 1","pages":""},"PeriodicalIF":4.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Semantics-enhanced Topic Modelling Technique: Semantic-LDA\",\"authors\":\"Dakshi Kapugama Geeganage, Yue Xu, Yuefeng Li\",\"doi\":\"10.1145/3639409\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA which captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model which calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.</p>\",\"PeriodicalId\":49249,\"journal\":{\"name\":\"ACM Transactions on Knowledge Discovery from Data\",\"volume\":\"11 1\",\"pages\":\"\"},\"PeriodicalIF\":4.8000,\"publicationDate\":\"2024-01-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Knowledge Discovery from Data\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3639409\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Knowledge Discovery from Data","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3639409","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

主题建模是一种用于发现文本集合中潜在主题的有效技术。但要正确理解文本内容并生成有意义的主题列表，语义非常重要。如果忽略语义，即不试图把握词语的含义，大多数现有的主题建模方法都会生成一些毫无意义的主题词。即使是现有的基于语义的方法，通常也是在不考虑上下文和相关词语的情况下解释词语的含义。在本文中，我们引入了一种基于语义的主题模型，称为语义-LDA，它使用外部本体中的概念来捕捉文本集合中单词的语义。我们引入了一种新方法，基于将输入文本集中的词与本体中的概念进行匹配来识别和量化概念与词之间的关系，而无需使用本体中预先计算的值来量化词与概念之间的关系。这些预计算值可能无法反映输入文集中单词与概念之间的实际关系，因为这些值是从用于构建本体的数据集而非输入文集本身得出的。相反，在语义捕捉过程中，根据输入集合中的词语分布来量化关系更为现实和有益。此外，本文还引入了一种歧义处理机制来解释未匹配词，即本体中没有匹配概念的词。因此，本文通过引入基于语义的主题模型，直接从输入文本集合中计算词-概念关系，做出了重大贡献。本文提出的基于语义的主题模型和带有消歧义机制的增强版本与一组最先进的系统进行了对比评估，在主题质量和信息过滤评估中，我们的方法都优于基线系统。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Semantics-enhanced Topic Modelling Technique: Semantic-LDA

Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this paper, we introduce a semantic-based topic model called semantic-LDA which captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept–word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this paper makes a significant contribution by introducing a semantic-based topic model which calculates the word–concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Knowledge Discovery from Data COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

6.70

自引率

5.60%

发文量

172

审稿时长

3 months

期刊介绍： TKDD welcomes papers on a full range of research in the knowledge discovery and analysis of diverse forms of data. Such subjects include, but are not limited to: scalable and effective algorithms for data mining and big data analysis, mining brain networks, mining data streams, mining multi-media data, mining high-dimensional data, mining text, Web, and semi-structured data, mining spatial and temporal data, data mining for community generation, social network analysis, and graph structured data, security and privacy issues in data mining, visual, interactive and online data mining, pre-processing and post-processing for data mining, robust and scalable statistical methods, data mining languages, foundations of data mining, KDD framework and process, and novel applications and infrastructures exploiting data mining technology including massively parallel processing and cloud computing platforms. TKDD encourages papers that explore the above subjects in the context of large distributed networks of computers, parallel or multiprocessing computers, or new data devices. TKDD also encourages papers that describe emerging data mining applications that cannot be satisfied by the current data mining technology.