A Simple Approach to Classify Fictional and Non-Fictional Genres

Mohammed Rameez Qureshi, Sidharth Ranjan, Rajakrishnan Rajkumar, Kushal Shah
Proceedings of the Second Workshop on Storytelling, published 2019-08-01. DOI: 10.18653/v1/W19-3409 (https://doi.org/10.18653/v1/W19-3409)
Citation count: 6

Abstract

In this work, we use a logistic regression classifier to determine whether a given document belongs to the fiction or non-fiction genre. For genre identification, previous work proposed three classes of features: low-level features (character-level and token counts), high-level features (lexical and syntactic information), and derived features (type-token ratio, average word length, or average sentence length). Using the recursive feature elimination with cross-validation (RFECV) algorithm, we perform feature selection experiments on an exhaustive set of nineteen features (drawn from all three classes) extracted from Brown corpus text. Two simple features turn out to be the most significant: the ratio of adverbs to adjectives and the ratio of adjectives to pronouns. Our subsequent classification experiments on documents from the Brown and Baby BNC corpora show that a classifier using just these two features performs on par with a classifier using the exhaustive feature set.
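The two ratio features named in the abstract can be illustrated with a minimal sketch. It assumes each document is already POS-tagged with coarse tags `ADV`, `ADJ`, and `PRON`; the abstract does not say which tagset the authors used, so the tag names are an assumption. The function name `genre_ratio_features`, the sample sentence, and the zero-denominator fallback are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Iterable, Tuple


def genre_ratio_features(
    tagged_tokens: Iterable[Tuple[str, str]],
) -> Tuple[float, float]:
    """Return (adverb/adjective ratio, adjective/pronoun ratio) for one document.

    Assumes coarse POS tags "ADV", "ADJ", and "PRON" (an assumption of this sketch).
    """
    # Count how often each POS tag occurs in the document.
    counts = Counter(tag for _, tag in tagged_tokens)
    adv, adj, pron = counts["ADV"], counts["ADJ"], counts["PRON"]
    # Falling back to 0.0 on an empty denominator is an arbitrary choice here.
    adv_adj = adv / adj if adj else 0.0
    adj_pron = adj / pron if pron else 0.0
    return adv_adj, adj_pron


# Hypothetical hand-tagged sentence: 2 adverbs, 1 adjective, 2 pronouns.
doc = [
    ("She", "PRON"), ("quietly", "ADV"), ("left", "VERB"), ("the", "DET"),
    ("old", "ADJ"), ("house", "NOUN"), ("and", "CONJ"), ("he", "PRON"),
    ("never", "ADV"), ("returned", "VERB"),
]
print(genre_ratio_features(doc))  # (2.0, 0.5)
```

The resulting pair of values per document could then serve as the input vector to a logistic regression classifier, as the abstract describes.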