DOMdiff: Identification and Classification of Inter-DOM Modifications
Authors: Manuel Leithner, D. Simos
Published in: 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2018-12-01
DOI: https://doi.org/10.1109/WI.2018.00-81
Citations: 6
Abstract
Current web crawlers, document databases and change monitoring systems for web sites are commonly limited to static content and to analysis of code as retrieved from the server, an approach that is not suitable for modern dynamic web applications. The canonical representation of the contents of a single web page at any given time is an instance of the Document Object Model (DOM), a tree structure that forms the basis for rendering and processing the page within the browser and is updated whenever content is modified. This work presents DOMdiff, an algorithm to identify changes between two different DOM instances, as well as a method to classify these changes in terms of a ranking that represents the distance between the two trees. We compare a manually derived classifier with the results of three alternatives: PRank, a ranked version of the Perceptron algorithm; a simple machine learning approach that generates a multiclass classifier based on formulae in a constrained predicate logic; and the established statistical classifier C5.0. Our results indicate that DOMdiff is suitable for large-scale change identification and that entropy-based statistical classifiers are more accurate than our simple predicate-based classifier for the problem at hand, but require a larger decision tree. We additionally identify a shortcoming of PRank when handling features with low information gain (high entropy).
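
To make the idea of identifying changes between two DOM instances concrete, the following is a minimal sketch of a naive positional tree comparison. It is not the DOMdiff algorithm from the paper, whose details the abstract does not give. The `Node` class, the `diff` function, the change labels and the path format are all hypothetical names chosen for illustration, and children are matched by index only, which a real differ would likely refine.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a DOM node (assumption)."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def diff(old, new, path=""):
    """Return a list of (kind, path) records describing how `new` differs from `old`.

    Assumption: children are compared position by position, so an insertion
    in the middle of a child list is reported as a series of replacements.
    """
    if old is None:
        return [("inserted", f"{path}/{new.tag}")]
    if new is None:
        return [("removed", f"{path}/{old.tag}")]

    here = f"{path}/{old.tag}"
    if old.tag != new.tag:
        # Different element types: report the subtree as replaced, do not descend.
        return [("replaced", here)]

    changes = []
    if old.attrs != new.attrs:
        changes.append(("attrs_changed", here))
    if old.text != new.text:
        changes.append(("text_changed", here))

    # Walk both child lists in parallel; a missing child on either side is None.
    for i in range(max(len(old.children), len(new.children))):
        a = old.children[i] if i < len(old.children) else None
        b = new.children[i] if i < len(new.children) else None
        changes.extend(diff(a, b, f"{here}[{i}]"))
    return changes


before = Node("body", children=[Node("p", text="Hello")])
after = Node("body", children=[
    Node("p", text="Hello, world"),
    Node("div", {"class": "ad"}),
])
changes = diff(before, after)
for kind, where in changes:
    print(kind, where)
```

A list of change records like this could then be turned into features (for example, counts per change kind) for a classifier that ranks how far apart the two trees are; that mapping is likewise an assumption and is not specified in the abstract.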