从布局到语义:将Web文档映射到中介XML表示的重排序模型

RIAO Conference Pub Date : 2007-05-30 DOI:10.5555/1931390.1931432

Guillaume Wisniewski, P. Gallinari

{"title":"从布局到语义:将Web文档映射到中介XML表示的重排序模型","authors":"Guillaume Wisniewski, P. Gallinari","doi":"10.5555/1931390.1931432","DOIUrl":null,"url":null,"abstract":"Many documents on the Web are formated in a weakly structured format. Because of their weak semantic and because of the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format which will be more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem where the transformation is learned automatically from a set of document examples manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally candidates are reranked using a perceptron like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.","PeriodicalId":120472,"journal":{"name":"RIAO Conference","volume":"94 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"From Layout to Semantic: a Reranking Model for Mapping Web Documents to Mediated XML Representations\",\"authors\":\"Guillaume Wisniewski, P. Gallinari\",\"doi\":\"10.5555/1931390.1931432\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many documents on the Web are formated in a weakly structured format. Because of their weak semantic and because of the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format which will be more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem where the transformation is learned automatically from a set of document examples manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally candidates are reranked using a perceptron like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.\",\"PeriodicalId\":120472,\"journal\":{\"name\":\"RIAO Conference\",\"volume\":\"94 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-05-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"RIAO Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5555/1931390.1931432\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"RIAO Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5555/1931390.1931432","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

Web上的许多文档都是以弱结构格式格式化的。由于它们的语义较弱和格式的异构性，它们的结构所传递的信息不能被直接利用。我们在这里考虑将这些文档转换为预定义的中介半结构化格式，这种格式将更适合于文档内容的自动处理。我们开发了一种机器学习方法来解决这个转换问题，其中从一组手动转换为目标结构的文档示例中自动学习转换。我们的方法分三步进行。给定一个输入文档，首先用目标模式的标签注释文档元素。然后使用广义概率上下文无关解析算法生成结构化候选文档。最后，使用类似感知器的排序算法对候选对象进行重新排序。在两个不同的数据集上进行的实验表明，该方法在不同的环境下都具有良好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

From Layout to Semantic: a Reranking Model for Mapping Web Documents to Mediated XML Representations

Many documents on the Web are formated in a weakly structured format. Because of their weak semantic and because of the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format which will be more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem where the transformation is learned automatically from a set of document examples manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally candidates are reranked using a perceptron like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

RIAO Conference

自引率

0.00%

发文量