STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering

Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval. Pub Date: 2021-12-17. DOI: 10.1145/3508230.3508247

Maryam Mousavi, Elena Steiner, S. Corman, Scott W. Ruston, Dylan Weber, H. Davulcu


Citations: 0

Abstract


STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering

In this paper, we developed a semi-supervised taxonomy induction framework that uses term embedding and clustering methods, and applied it to a blog corpus of 145,000 posts from 650 Ukraine-related blog domains dated between 2010 and 2020. We extracted 32,429 noun phrases (NPs) and split them into two categories: General/Ambiguous phrases, which might appear under any topic, and Topical/Non-Ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and induced a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, the coders mapped all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures, and determined that GloVe embeddings with K-Means achieved the highest performance (74% purity) on this real-world dataset.
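The abstract reports its headline result as cluster purity against the gold-standard taxonomy. As a rough illustration only, the sketch below computes purity under its common textbook definition: each cluster is credited with its most frequent gold label, and the credited counts are divided by the total number of items. The function name `purity` and the toy labels are hypothetical and do not come from the paper, which does not spell out its exact formula here.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Share of items that carry the majority gold label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty input")

    # Count how often each gold label occurs inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        counts[cluster][label] += 1

    # Each cluster contributes the size of its largest gold-label group.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six noun phrases assigned to three clusters.
clusters = [0, 0, 0, 1, 1, 2]
gold = ["politics", "politics", "economy", "economy", "economy", "military"]
print(f"purity = {purity(clusters, gold):.3f}")  # prints "purity = 0.833"
```

Purity rewards many small clusters (one item per cluster gives a score of 1.0), which is one reason to read it alongside an intrinsic measure such as the Silhouette score, as the abstract's mix of extrinsic and intrinsic measures suggests.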

