Structural Classifiers of Text Types: Towards a Novel Model of Text Representation

LDV Forum Pub Date : 2007-07-01 DOI:10.21248/jlcl.22.2007.95

Alexander Mehler, Peter Geibel, O. Pustylnikov


Citations: 32

Abstract

Texts can be distinguished in terms of their content, function, structure or layout (Brinker, 1992; Bateman et al., 2001; Joachims, 2002; Power et al., 2003). These reference points do not necessarily open orthogonal perspectives on text classification. As part of explorative data analysis, text classification aims at automatically dividing sets of textual objects into classes of maximum internal homogeneity and external heterogeneity. This paper deals with classifying texts into text types whose instances serve more or less homogeneous functions. Unlike mainstream approaches, which rely on the vector space model (Sebastiani, 2002) or some of its descendants (Baeza-Yates and Ribeiro-Neto, 1999) and, thus, on content-related lexical features, we refer solely to structural differentiae. That is, we explore patterns of text structure as determinants of class membership. Our starting point is tree-like text representations, which induce feature vectors and tree kernels. These kernels are used in supervised learning, with cross-validation as the method of model selection (Hastie et al., 2001), using a corpus of press communication as an example. For a subset of categories we show that classification can be performed very well by structural differentiae alone.
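To make the idea of tree-like representations inducing feature vectors and tree kernels more concrete, the following is a minimal sketch in Python. It is not the representation or the kernel used in the paper: the nested-tuple tree format, the labels (`text`, `div`, `p`), the function names, and the choice of a simple complete-subtree counting kernel are all assumptions made for illustration. The point is only that both the feature vector and the kernel are computed from structure alone, with no lexical content.

```python
from collections import Counter

# Hypothetical format: a tree is (label, [children]).
# Labels name structural units only; no words of the text are stored.

def structural_features(tree):
    """Return (node count, depth, leaf count) as a tiny structural feature vector."""
    _label, children = tree
    if not children:
        return (1, 1, 1)
    stats = [structural_features(child) for child in children]
    nodes = 1 + sum(s[0] for s in stats)
    depth = 1 + max(s[1] for s in stats)
    leaves = sum(s[2] for s in stats)
    return (nodes, depth, leaves)

def _collect(tree, bag):
    """Record a canonical string for every complete subtree (child order matters)."""
    label, children = tree
    signature = label + "(" + ",".join(_collect(c, bag) for c in children) + ")"
    bag[signature] += 1
    return signature

def subtree_counts(tree):
    """Count how often each complete subtree occurs in the tree."""
    bag = Counter()
    _collect(tree, bag)
    return bag

def subtree_kernel(t1, t2):
    """Dot product of the two subtree-count vectors (an illustrative tree kernel)."""
    c1, c2 = subtree_counts(t1), subtree_counts(t2)
    return sum(n * c2[sig] for sig, n in c1.items())

# Two toy documents that differ only in structure.
doc_a = ("text", [("div", [("p", []), ("p", [])]), ("div", [("p", [])])])
doc_b = ("text", [("div", [("p", []), ("p", [])])])

print(structural_features(doc_a))   # (6, 3, 3)
print(subtree_kernel(doc_a, doc_b)) # 7
```

Because the kernel is a dot product of count vectors, it is symmetric and positive semi-definite, so a matrix of pairwise kernel values could in principle be passed to any kernel-based learner and tuned by cross-validation, which is the general workflow the abstract describes.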
