Towards automatic Web genre identification: a corpus-based approach in the domain of academia by example of the Academic's Personal Homepage

Georg Rehm
Proceedings of the 35th Annual Hawaii International Conference on System Sciences. Published: 2002-08-07. DOI: 10.1109/HICSS.2002.994036 (https://doi.org/10.1109/HICSS.2002.994036)
Citations: 63

Abstract

We argue for a systematic analysis of one particular, well-structured domain, academic Web pages, with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately hold more than 3,000,000 HTML documents written in German, which form the empirical basis for our research. We introduce the notion of the Web genre type, which constitutes the basic framework for a certain Web genre, and the notions of compulsory and optional Web genre modules. These modules act as building blocks that together make up the structure characterised by the Web genre type and, furthermore, operate as modifiers for the default assignment involved. The analysis of a 200-document sample illustrates our notion of a Web genre hierarchy, into which Web genre types and modules are embedded. The analysis of four different documents of the Web genre Academic's Personal Homepage illustrates not only our approach but also our long-term goal of automatically extracting the contents of Web genre modules in order to build structured XML documents from groups of unstructured HTML documents.
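
As a rough illustration of the ideas in the abstract, the following Python sketch models a Web genre type as a set of compulsory and optional modules and serialises extracted module contents as XML. It is not the system described in the paper: the class `WebGenreType`, the function `to_xml`, the module names (`name`, `contact`, `publications`) and the XML layout are all hypothetical, chosen only to make the relationship between genre types, modules and the XML target format concrete.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class WebGenreType:
    """Hypothetical model: a genre type with compulsory and optional modules."""
    name: str
    compulsory: frozenset
    optional: frozenset = frozenset()

    def matches(self, found_modules):
        # A document fits the type only if every compulsory module is present.
        return self.compulsory <= set(found_modules)


def to_xml(genre, modules):
    """Serialise extracted module contents (module name -> text) as XML."""
    root = ET.Element("document", {"genre": genre.name})
    for module_name, text in modules.items():
        if module_name in genre.compulsory:
            kind = "compulsory"
        elif module_name in genre.optional:
            kind = "optional"
        else:
            continue  # not part of this genre type, so it is skipped
        child = ET.SubElement(root, "module", {"name": module_name, "kind": kind})
        child.text = text
    return ET.tostring(root, encoding="unicode")


# Assumed module inventory for the Academic's Personal Homepage (illustrative only).
homepage = WebGenreType(
    name="academics-personal-homepage",
    compulsory=frozenset({"name", "contact"}),
    optional=frozenset({"publications"}),
)

# Module contents as they might come out of an (unspecified) extraction step.
extracted = {
    "name": "Jane Doe",
    "contact": "jane@example.org",
    "publications": "List of publications",
}

if homepage.matches(extracted):
    print(to_xml(homepage, extracted))
```

In this sketch the extraction step itself is left out; only the check for compulsory modules and the mapping to a structured XML document are shown.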