为个性化表示提取web内容

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2014-09-16 DOI:10.1145/2644866.2644871

Rodrigo Chamun, Daniele Pinheiro, Diego Jornada, J. B. Oliveira, I. Manssour

{"title":"为个性化表示提取web内容","authors":"Rodrigo Chamun, Daniele Pinheiro, Diego Jornada, J. B. Oliveira, I. Manssour","doi":"10.1145/2644866.2644871","DOIUrl":null,"url":null,"abstract":"Printing web pages is usually a thankless task as the result is often a document with many badly-used pages and poor layout. Besides the actual content, superfluous web elements like menus and links are often present and in a printed version they are commonly perceived as an annoyance. Therefore, a solution for obtaining cleaner versions for printing is to detect parts of the page that the reader wants to consume, eliminating unnecessary elements and filtering the \"true\" content of the web page. In addition, the same solution may be used online to present cleaner versions of web pages, discarding any elements that the user wishes to avoid.\n In this paper we present a novel approach to implement such filtering. The method is interactive at first: The user samples items that are to be preserved on the page and thereafter everything that is not similar to the samples is removed from the page. This is achieved by comparing the path of all elements on the DOM representation of the page with the path of the elements sampled by the user and preserving only elements that have a path \"similar\" to the sample. The introduction of a similarity measure adds an important degree of adaptability to the needs of different users and applications.\n This approach is quite general and may be applied to any XML tree that has labeled nodes. We use HTML as a case study and present a Google Chrome extension that implements the approach as well as a user study comparing our results with commercial results.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"72 1","pages":"157-164"},"PeriodicalIF":0.0000,"publicationDate":"2014-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Extracting web content for personalized presentation\",\"authors\":\"Rodrigo Chamun, Daniele Pinheiro, Diego Jornada, J. B. Oliveira, I. Manssour\",\"doi\":\"10.1145/2644866.2644871\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Printing web pages is usually a thankless task as the result is often a document with many badly-used pages and poor layout. Besides the actual content, superfluous web elements like menus and links are often present and in a printed version they are commonly perceived as an annoyance. Therefore, a solution for obtaining cleaner versions for printing is to detect parts of the page that the reader wants to consume, eliminating unnecessary elements and filtering the \\\"true\\\" content of the web page. In addition, the same solution may be used online to present cleaner versions of web pages, discarding any elements that the user wishes to avoid.\\n In this paper we present a novel approach to implement such filtering. The method is interactive at first: The user samples items that are to be preserved on the page and thereafter everything that is not similar to the samples is removed from the page. This is achieved by comparing the path of all elements on the DOM representation of the page with the path of the elements sampled by the user and preserving only elements that have a path \\\"similar\\\" to the sample. The introduction of a similarity measure adds an important degree of adaptability to the needs of different users and applications.\\n This approach is quite general and may be applied to any XML tree that has labeled nodes. We use HTML as a case study and present a Google Chrome extension that implements the approach as well as a user study comparing our results with commercial results.\",\"PeriodicalId\":91385,\"journal\":{\"name\":\"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering\",\"volume\":\"72 1\",\"pages\":\"157-164\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2644866.2644871\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2644866.2644871","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

打印网页通常是一件吃力不讨好的事情，因为打印出来的文件往往有很多没用的页面和糟糕的布局。除了实际的内容外，菜单和链接等多余的网络元素也经常出现，在印刷版本中，它们通常被认为是一种烦恼。因此，获得更清晰的打印版本的解决方案是检测读者想要消费的页面部分，消除不必要的元素并过滤网页的“真实”内容。此外，同样的解决方案也可以用于在线呈现更干净的网页版本，丢弃用户希望避免的任何元素。本文提出了一种实现这种滤波的新方法。该方法最初是交互式的:用户将在页面上保存样例项目，然后从页面中删除与样例不相似的所有内容。这是通过将页面的DOM表示上的所有元素的路径与用户采样的元素的路径进行比较，并只保留路径与样本“相似”的元素来实现的。相似性度量的引入增加了对不同用户和应用程序需求的重要程度的适应性。这种方法非常通用，可以应用于任何带有标记节点的XML树。我们使用HTML作为案例研究，并提供了一个实现该方法的Google Chrome扩展，以及将我们的结果与商业结果进行比较的用户研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Extracting web content for personalized presentation

Printing web pages is usually a thankless task as the result is often a document with many badly-used pages and poor layout. Besides the actual content, superfluous web elements like menus and links are often present and in a printed version they are commonly perceived as an annoyance. Therefore, a solution for obtaining cleaner versions for printing is to detect parts of the page that the reader wants to consume, eliminating unnecessary elements and filtering the "true" content of the web page. In addition, the same solution may be used online to present cleaner versions of web pages, discarding any elements that the user wishes to avoid. In this paper we present a novel approach to implement such filtering. The method is interactive at first: The user samples items that are to be preserved on the page and thereafter everything that is not similar to the samples is removed from the page. This is achieved by comparing the path of all elements on the DOM representation of the page with the path of the elements sampled by the user and preserving only elements that have a path "similar" to the sample. The introduction of a similarity measure adds an important degree of adaptability to the needs of different users and applications. This approach is quite general and may be applied to any XML tree that has labeled nodes. We use HTML as a case study and present a Google Chrome extension that implements the approach as well as a user study comparing our results with commercial results.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering

自引率

0.00%

发文量