Development of Browser Extension for HTML Web Page Content Extraction

2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA). Pub Date: 2020-06-01. DOI: 10.1109/HORA49412.2020.9152891

Murat Karabulut, İslam Mayda

Citations: 2

Abstract

As the amount of content on websites grows, automatic content extraction from Web pages becomes more important. Although the problem has been studied extensively in the literature, no method fully solves it, owing to the flexible structure of HTML. Methods that achieve some degree of success also tend to lose performance over time as the structure of the Web changes and evolves. In this study, a browser extension was developed to automatically download the text content of Web pages. The extension uses a parser built on the Document Object Model (DOM) structure to strip all tags and code from the page's text content, producing output with a 100% recall rate. The extension operates independently of language; it was tested on different types of popular websites in Turkey and shown to work successfully.
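
The abstract does not describe the parser's implementation, so the following is only a minimal sketch of what DOM-based text extraction might look like, not the authors' method. It walks a DOM tree, collects the text nodes, and skips elements whose content is code rather than visible text. The names `DomLikeNode`, `SKIPPED_TAGS`, and `extractText` are hypothetical, and the choice of which tags to skip is an assumption about what "codes" refers to.

```typescript
// Minimal structural view of a DOM node. Browser `Node` objects should
// satisfy this shape; the interface name is hypothetical.
interface DomLikeNode {
  nodeType: number; // 1 = element, 3 = text (values defined by the DOM standard)
  nodeName: string;
  nodeValue: string | null;
  childNodes: ArrayLike<DomLikeNode>;
}

const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Assumption: these elements hold code or non-rendered content, not page text.
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT", "TEMPLATE"]);

// Depth-first walk that returns the text of every text node under `root`,
// one line per non-empty text node, with internal whitespace collapsed.
function extractText(root: DomLikeNode): string {
  const parts: string[] = [];

  const visit = (node: DomLikeNode): void => {
    if (node.nodeType === TEXT_NODE) {
      const text = (node.nodeValue ?? "").replace(/\s+/g, " ").trim();
      if (text.length > 0) {
        parts.push(text);
      }
      return;
    }
    if (
      node.nodeType === ELEMENT_NODE &&
      SKIPPED_TAGS.has(node.nodeName.toUpperCase())
    ) {
      return; // drop the whole subtree of a skipped element
    }
    // Other node types (comments, for example) have no children,
    // so this loop simply passes over them.
    for (let i = 0; i < node.childNodes.length; i++) {
      visit(node.childNodes[i]);
    }
  };

  visit(root);
  return parts.join("\n");
}
```

In an extension's content script, a call such as `extractText(document.body)` would be the likely entry point. Because the walk keeps every text node outside the skipped elements, it favors recall over precision: navigation menus and footers are kept along with the main content.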
