Unsupervised Multi-Label Document Classification for Large Taxonomies Using Word Embeddings

2019 International Conference on Computational Science and Computational Intelligence (CSCI). Pub Date: 2019-12-01. DOI: 10.1109/CSCI49370.2019.00241

Stefan Hirschmeier, J. Melsbach, D. Schoder, Sven Stahlmann


Citations: 1

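Abstract

A growing number of businesses need metadata for their documents. Generating that metadata automatically is difficult, however: supervised document classification requires a significant amount of labelled training data, which is not always available in the desired amount or quality. Documents often need to be tagged with a predefined set of company-specific keywords organized in a taxonomy. We present an unsupervised approach to multi-label document classification for large taxonomies using word embeddings and evaluate it on a dataset from a public broadcaster. We point out strengths of the approach compared with supervised classification and statistical approaches such as tf-idf.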
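The abstract does not describe the method in detail, so the following is only a minimal sketch of one common way to tag documents with taxonomy keywords using word embeddings, not the authors' procedure. It assumes that a document and each keyword can be represented by the average of their word vectors and that a keyword is assigned when its cosine similarity to the document reaches a threshold. The vectors in `TOY_VECTORS`, the function names, and the threshold values are all hypothetical and chosen for illustration; a real system would load pretrained embeddings instead.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real system would load pretrained embeddings instead.
TOY_VECTORS = {
    "football":   [0.90, 0.10, 0.00],
    "match":      [0.80, 0.20, 0.10],
    "goal":       [0.85, 0.10, 0.05],
    "election":   [0.10, 0.90, 0.10],
    "vote":       [0.10, 0.85, 0.20],
    "sports":     [0.90, 0.10, 0.10],
    "politics":   [0.05, 0.90, 0.10],
    "weather":    [0.10, 0.10, 0.90],
}


def embed(text):
    """Average the vectors of all known words; return None if none are known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(document, keywords, threshold=0.9):
    """Return every keyword whose similarity to the document meets the
    threshold, most similar first. Several keywords may be returned,
    which is what makes the assignment multi-label."""
    doc_vec = embed(document)
    if doc_vec is None:
        return []
    scored = []
    for keyword in keywords:
        kw_vec = embed(keyword)
        if kw_vec is None:
            continue  # keyword has no known words, so it cannot be scored
        score = cosine(doc_vec, kw_vec)
        if score >= threshold:
            scored.append((keyword, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in scored]


taxonomy = ["sports", "politics", "weather"]
print(classify("the football match ended with a late goal", taxonomy))
print(classify("election vote football match", taxonomy, threshold=0.7))
```

No labelled training data is involved in this sketch: the only inputs are the embeddings and the keyword list, which is the property the abstract contrasts with supervised classification. The threshold controls how many labels a document receives and would need tuning on real data.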



