Building term hierarchies using graph-based clustering

Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval. Pub Date: 2022-12-16. DOI: 10.1145/3582768.3582807

Mark Hloch, Markus Van Meegen, M. Kubek, H. Unger

Abstract

Classical tasks of a librarian, such as screening and categorizing new documents based on their content, are increasingly being taken over by search engines and cataloging software. A first overview of a corpus's topical orientation can be obtained by combining graph-based search engines with clustering methods. Existing classical clustering methods, however, often require the desired number of output clusters to be specified a priori and do not consider term relationships in graphs, which is a deficiency from a practical point of view. Fully unsupervised graph-based clustering approaches at the term level therefore offer new possibilities that mitigate these shortcomings. In this work, a set of novel graph-based clustering algorithms has been developed. The hierarchical clustering algorithm (HCA) forms term hierarchies by iteratively isolating nodes of a given co-occurrence graph, based on an evaluation of the edge weights between the nodes. Based on the relationships between terms that are inherent in the co-occurrence graph, a new graph is built agglomeratively, forming individual clusters of related terms. The feasibility of the outlined methods for text analysis is shown.
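
The abstract does not spell out how the co-occurrence graph is built or which edge-weight criterion is used to isolate nodes, so the following is only a minimal sketch of the general idea, not the authors' HCA. It assumes whitespace tokenization, sentence-level co-occurrence counts as edge weights, and a hypothetical splitting rule that repeatedly drops the lightest edges until a term set falls apart. The function names `build_cooccurrence_graph`, `connected_components`, and `split_hierarchy` are illustrative and do not come from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Return {frozenset({a, b}): count} for terms sharing a sentence."""
    weights = defaultdict(int)
    for sentence in sentences:
        # Assumption: lowercase whitespace tokens stand in for "terms".
        terms = sorted(set(sentence.lower().split()))
        for a, b in combinations(terms, 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)


def connected_components(nodes, edges):
    """Split `nodes` into components using the undirected `edges`."""
    adjacency = {n: set() for n in nodes}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def split_hierarchy(nodes, weights):
    """Recursively split a term set by dropping its weakest edges.

    Hypothetical rule (not from the paper): remove every edge carrying the
    current minimum weight until the set falls apart or no edges remain.
    """
    nodes = set(nodes)
    edges = {e: w for e, w in weights.items() if e <= nodes}
    while edges:
        weakest = min(edges.values())
        edges = {e: w for e, w in edges.items() if w > weakest}
        parts = connected_components(nodes, edges)
        if len(parts) > 1:
            children = [split_hierarchy(p, weights) for p in parts]
            return {"terms": sorted(nodes), "children": children}
    return {"terms": sorted(nodes), "children": []}


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    sentences = [
        "graph clustering algorithm",
        "graph clustering",
        "term hierarchy",
        "term hierarchy tree",
        "graph term",
    ]
    weights = build_cooccurrence_graph(sentences)
    all_terms = set().union(*weights)
    print(split_hierarchy(all_terms, weights))
```

Note that, in line with the abstract's motivation, this sketch never takes a target number of clusters as input: the hierarchy emerges from the edge weights alone. A real implementation would need the weighting and isolation criteria described in the full paper.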
