Strigil: A Framework for Data Extraction in Semi-Structured Web Documents

International Conference on Information Integration and Web-based Applications & Services. Pub Date: 2013-12-02. DOI: 10.1145/2539150.2539170

J. Stárka, I. Holubová, M. Nečaský


Citations: 4

Abstract

In this paper we introduce Strigil, a framework for automated data extraction. It is an easily configurable tool for retrieving data from textual or weakly structured documents. The paper describes the framework architecture and its important components. Additionally, we propose a scraping language, inspired by XSL transformations, designed to extract data from different kinds of documents. Although there are many different approaches focused on various aspects of data scraping, they are usually highly specialized to a concrete domain or data source. We compare these solutions and discuss their advantages and disadvantages. Our scraping language is designed to work with an ontology to map scraped data directly to classes and attributes.
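The idea of a template-driven scraping language that maps extracted values to ontology classes and attributes can be illustrated with a minimal sketch. This is not Strigil's actual language or API, which the abstract does not show. The template layout, the `ex:` property names, the sample page, and the `apply_template` helper are all assumptions made for illustration, and the sketch uses Python's standard `xml.etree.ElementTree` on well-formed markup only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed input. Real web pages would usually need a
# tolerant HTML parser before this step.
PAGE = """
<html><body>
  <div class="tender"><h2>Road repair</h2><span class="price">1200</span></div>
  <div class="tender"><h2>School roof</h2><span class="price">850</span></div>
</body></html>
""".strip()

# Hypothetical template, loosely in the spirit of an XSLT rule: "match"
# selects nodes, "class" names an ontology class, and "attributes" pairs
# each ontology property with a path relative to the matched node.
TEMPLATE = {
    "match": ".//div[@class='tender']",
    "class": "ex:Tender",
    "attributes": {
        "ex:title": "h2",
        "ex:price": "span[@class='price']",
    },
}


def apply_template(root, template):
    """Return one dict per matched node, keyed by ontology property."""
    records = []
    for node in root.findall(template["match"]):
        record = {"rdf:type": template["class"]}
        for prop, path in template["attributes"].items():
            found = node.find(path)
            # A missing or empty element yields None rather than a guess.
            has_text = found is not None and found.text
            record[prop] = found.text.strip() if has_text else None
        records.append(record)
    return records


records = apply_template(ET.fromstring(PAGE), TEMPLATE)
for record in records:
    print(record)
```

Keeping the selectors and the ontology terms together in one declarative template means the extraction logic stays generic, and a new data source only needs a new template.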
