偿还元数据债务:使用主题模型学习概念的表示

Proceedings of the First ACM International Conference on AI in Finance Pub Date : 2020-10-09 DOI:10.1145/3383455.3422537

Jiahao Chen, M. Veloso

{"title":"偿还元数据债务:使用主题模型学习概念的表示","authors":"Jiahao Chen, M. Veloso","doi":"10.1145/3383455.3422537","DOIUrl":null,"url":null,"abstract":"We introduce a data management problem called metadata debt, to identify the mapping between data concepts and their logical representations. We describe how this mapping can be learned using semisupervised topic models based on low-rank matrix factorizations that account for missing and noisy labels, coupled with sparsity penalties to improve localization and interpretability. We introduce a gauge transformation approach that allows us to construct explicit associations between topics and concept labels, and thus assign meaning to topics. We also show how to use this topic model for semisupervised learning tasks like extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing features. We show results from this topic model in predicting subject tags on over 25,000 datasets from Kaggle.com, demonstrating the ability to learn semantically meaningful features.","PeriodicalId":447950,"journal":{"name":"Proceedings of the First ACM International Conference on AI in Finance","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Paying down metadata debt: learning the representation of concepts using topic models\",\"authors\":\"Jiahao Chen, M. Veloso\",\"doi\":\"10.1145/3383455.3422537\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce a data management problem called metadata debt, to identify the mapping between data concepts and their logical representations. We describe how this mapping can be learned using semisupervised topic models based on low-rank matrix factorizations that account for missing and noisy labels, coupled with sparsity penalties to improve localization and interpretability. We introduce a gauge transformation approach that allows us to construct explicit associations between topics and concept labels, and thus assign meaning to topics. We also show how to use this topic model for semisupervised learning tasks like extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing features. We show results from this topic model in predicting subject tags on over 25,000 datasets from Kaggle.com, demonstrating the ability to learn semantically meaningful features.\",\"PeriodicalId\":447950,\"journal\":{\"name\":\"Proceedings of the First ACM International Conference on AI in Finance\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the First ACM International Conference on AI in Finance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3383455.3422537\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the First ACM International Conference on AI in Finance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3383455.3422537","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

我们引入了一个称为元数据债务的数据管理问题，以确定数据概念及其逻辑表示之间的映射。我们描述了如何使用基于低秩矩阵分解的半监督主题模型来学习这种映射，该模型考虑了缺失和噪声标签，并结合稀疏性惩罚来提高定位和可解释性。我们引入了一种规范转换方法，它允许我们在主题和概念标签之间构建明确的关联，从而为主题分配意义。我们还展示了如何将该主题模型用于半监督学习任务，如从已知标签推断，评估现有标签中的可能错误以及预测缺失特征。我们展示了这个主题模型在Kaggle.com上超过25000个数据集上预测主题标签的结果，展示了学习语义上有意义的特征的能力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Paying down metadata debt: learning the representation of concepts using topic models

We introduce a data management problem called metadata debt, to identify the mapping between data concepts and their logical representations. We describe how this mapping can be learned using semisupervised topic models based on low-rank matrix factorizations that account for missing and noisy labels, coupled with sparsity penalties to improve localization and interpretability. We introduce a gauge transformation approach that allows us to construct explicit associations between topics and concept labels, and thus assign meaning to topics. We also show how to use this topic model for semisupervised learning tasks like extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing features. We show results from this topic model in predicting subject tags on over 25,000 datasets from Kaggle.com, demonstrating the ability to learn semantically meaningful features.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the First ACM International Conference on AI in Finance

自引率

0.00%

发文量