分级分类结构中词权对关键词提取的影响

Boonthida Chiraratanasopha, Salin Boonbrahm, T. Theeramunkong
{"title":"分级分类结构中词权对关键词提取的影响","authors":"Boonthida Chiraratanasopha, Salin Boonbrahm, T. Theeramunkong","doi":"10.31577/cai_2021_1_57","DOIUrl":null,"url":null,"abstract":"While there have been several studies related to the effect of term weighting on classification accuracy, relatively few works have been conducted on how term weighting affects the quality of keywords extracted for characterizing a document or a category (i.e., document collection). Moreover, many tasks require more complicated category structure, such as hierarchical and network category structure, rather than a flat category structure. This paper presents a qualitative and quantitative study on how term weighting affects keyword extraction in the hierarchical category structure, in comparison to the flat category structure. A hierarchical structure triggers special characteristic in assigning a set of keywords or tags to represent a document or a document collection, with support of statistics in a hierarchy, including category itself, its parent category, its child categories, and sibling categories. An enhancement of term weighting is proposed particularly in the form of a series of modified TFIDF's, for improving keyword extraction. A text collection of public-hearing opinions is used to evaluate variant TFs and IDFs to identify which types of information in hierarchical category structure are useful. By experiments, we found that the most effective IDF family, namely TF-IDFr, is identity>sibling>child>parent in order. The TF-IDFr outperforms the vanilla version of TFIDF with a centroid-based classifier.","PeriodicalId":345268,"journal":{"name":"Comput. Informatics","volume":"s3-32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Effect of Term Weighting on Keyword Extraction in Hierarchical Category Structure\",\"authors\":\"Boonthida Chiraratanasopha, Salin Boonbrahm, T. Theeramunkong\",\"doi\":\"10.31577/cai_2021_1_57\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While there have been several studies related to the effect of term weighting on classification accuracy, relatively few works have been conducted on how term weighting affects the quality of keywords extracted for characterizing a document or a category (i.e., document collection). Moreover, many tasks require more complicated category structure, such as hierarchical and network category structure, rather than a flat category structure. This paper presents a qualitative and quantitative study on how term weighting affects keyword extraction in the hierarchical category structure, in comparison to the flat category structure. A hierarchical structure triggers special characteristic in assigning a set of keywords or tags to represent a document or a document collection, with support of statistics in a hierarchy, including category itself, its parent category, its child categories, and sibling categories. An enhancement of term weighting is proposed particularly in the form of a series of modified TFIDF's, for improving keyword extraction. A text collection of public-hearing opinions is used to evaluate variant TFs and IDFs to identify which types of information in hierarchical category structure are useful. By experiments, we found that the most effective IDF family, namely TF-IDFr, is identity>sibling>child>parent in order. The TF-IDFr outperforms the vanilla version of TFIDF with a centroid-based classifier.\",\"PeriodicalId\":345268,\"journal\":{\"name\":\"Comput. Informatics\",\"volume\":\"s3-32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Comput. Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.31577/cai_2021_1_57\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Comput. Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31577/cai_2021_1_57","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

虽然已经有一些关于术语加权对分类准确性影响的研究,但关于术语加权如何影响为描述文档或类别(即文档集合)而提取的关键词质量的研究相对较少。此外,许多任务需要更复杂的类别结构,如分层和网络类别结构,而不是扁平的类别结构。本文从定性和定量两方面研究了在分层分类结构中,与平面分类结构相比,词权对关键词提取的影响。层次结构在分配一组关键字或标记以表示文档或文档集合时触发了特殊的特性,并支持层次结构中的统计信息,包括类别本身、其父类别、子类别和兄弟类别。提出了一种增强术语权重的方法,特别是以一系列修改的TFIDF的形式,用于改进关键字提取。公开听证意见的文本集合用于评估不同的tf和idf,以确定分层类别结构中哪些类型的信息是有用的。通过实验,我们发现最有效的IDF家族,即TF-IDFr,依次为身份>兄弟姐妹>孩子>父母。使用基于质心的分类器,TF-IDFr优于普通版本的TFIDF。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Effect of Term Weighting on Keyword Extraction in Hierarchical Category Structure
While there have been several studies related to the effect of term weighting on classification accuracy, relatively few works have been conducted on how term weighting affects the quality of keywords extracted for characterizing a document or a category (i.e., document collection). Moreover, many tasks require more complicated category structure, such as hierarchical and network category structure, rather than a flat category structure. This paper presents a qualitative and quantitative study on how term weighting affects keyword extraction in the hierarchical category structure, in comparison to the flat category structure. A hierarchical structure triggers special characteristic in assigning a set of keywords or tags to represent a document or a document collection, with support of statistics in a hierarchy, including category itself, its parent category, its child categories, and sibling categories. An enhancement of term weighting is proposed particularly in the form of a series of modified TFIDF's, for improving keyword extraction. A text collection of public-hearing opinions is used to evaluate variant TFs and IDFs to identify which types of information in hierarchical category structure are useful. By experiments, we found that the most effective IDF family, namely TF-IDFr, is identity>sibling>child>parent in order. The TF-IDFr outperforms the vanilla version of TFIDF with a centroid-based classifier.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信