Similarity-Based Synthetic Document Representations for Meta-Feature Generation in Text Classification

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Pub Date: 2019-07-18. DOI: 10.1145/3331184.3331239

Sérgio D. Canuto, Thiago Salles, Thierson Couto, Marcos André Gonçalves

Citations: 15

Abstract

We propose new solutions that enhance and extend the already very successful application of meta-features to text classification. Our newly proposed meta-features are capable of: (1) improving the correlation of small pieces of evidence shared by neighbors with labeled categories by means of synthetic document representations and (local and global) hyperplane distances; and (2) estimating the level of error introduced by both these newly proposed meta-features and the existing ones in the literature, especially for hard-to-classify regions of the feature space. Our experiments with a large and representative number of datasets show that our new solutions produce the best results in all tested scenarios, achieving gains of up to 12% over the strongest meta-feature proposal in the literature.
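
As a rough illustration of what a similarity-based meta-feature can look like, the sketch below maps a document to, for each category, its cosine similarities to the k nearest labeled neighbors of that category. This is a minimal sketch of the general neighbor-based idea under stated assumptions, not the authors' method: it leaves out the synthetic document representations, hyperplane distances, and error estimation that the abstract describes, and the function names, toy vectors, labels, and value of `k` are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_meta_features(doc, train_docs, train_labels, k=2):
    """Hypothetical helper: for each category (in sorted label order), emit the
    document's k highest cosine similarities to training documents of that
    category, padded with 0.0 when a category has fewer than k documents."""
    sims_by_label = defaultdict(list)
    for vec, label in zip(train_docs, train_labels):
        sims_by_label[label].append(cosine(doc, vec))

    features = []
    for label in sorted(sims_by_label):
        top = sorted(sims_by_label[label], reverse=True)[:k]
        top += [0.0] * (k - len(top))  # pad small categories
        features.extend(top)
    return features


# Toy term-weight vectors and labels, made up purely for illustration.
train_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
train_labels = ["sports", "sports", "politics", "politics"]

meta = knn_meta_features([1.0, 0.0, 0.0], train_docs, train_labels, k=2)
```

Here `meta` holds two values per category. In a typical meta-feature setup, a vector like this would be passed to a second-stage classifier instead of, or alongside, the original term weights.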
