Web Content Outlier Mining using Machine Learning and Mathematical Approaches
Authors: Thinzar Tun, Khin Mo Mo Tun
Published in: 2019 International Conference on Advanced Information Technologies (ICAIT), 2019-11-01
DOI: 10.1109/AITC.2019.8921085 (https://doi.org/10.1109/AITC.2019.8921085)
Citations: 0
Abstract
Because the web is massive, dynamic and heterogeneous, discovering outliers on the web is more demanding than discovering them in a numeric dataset. When users search for information on the web, inappropriate, irrelevant and redundant information may be returned to them. It is therefore a major challenge to access high-quality information on the web effectively and efficiently without also retrieving irrelevant and redundant information. Web content outlier mining focuses on identifying inappropriate, duplicate and irrelevant web pages among the other web pages in the same category. Removing outliers from the web improves the accuracy of search results, reduces the time complexity of indexing, and saves the user time and effort. We applied Latent Dirichlet Allocation, a machine learning method, and a mathematical approach, the linear correlation method, to remove web content outliers. The system tends to improve F1-measure and accuracy and to reduce time complexity.
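The abstract does not spell out how the linear correlation step is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes pages in one category are compared through the Pearson correlation of their raw term-count vectors: a page that correlates very strongly with an earlier page is treated as redundant, and a page that correlates weakly with the average of the other pages is treated as irrelevant. The function names, the thresholds `low=0.2` and `high=0.95`, and the sample pages are all hypothetical, and the Latent Dirichlet Allocation step is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def pearson(x, y):
    """Pearson linear correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sx * sy)


def flag_outliers(pages, low=0.2, high=0.95):
    """Label each page as 'keep', 'redundant' or 'irrelevant'.

    low and high are illustrative thresholds (assumptions), not values
    taken from the paper.
    """
    counts = [Counter(tokenize(p)) for p in pages]
    vocab = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocab] for c in counts]

    labels = []
    for i, v in enumerate(vectors):
        # Redundant: nearly perfectly correlated with an earlier page.
        if any(pearson(v, vectors[j]) >= high for j in range(i)):
            labels.append("redundant")
            continue
        others = [u for j, u in enumerate(vectors) if j != i]
        if not others:
            labels.append("keep")  # nothing to compare against
            continue
        # Irrelevant: weakly correlated with the average of the other pages.
        centroid = [sum(col) / len(others) for col in zip(*others)]
        labels.append("irrelevant" if pearson(v, centroid) < low else "keep")
    return labels


# Hypothetical pages from a single "machine learning" category.
pages = [
    "python machine learning tutorial python code",
    "machine learning tutorial with python code examples",
    "python machine learning tutorial python code",
    "cheap flights hotel booking discount travel deals",
]
labels = flag_outliers(pages)
print(labels)
```

In this toy input the third page duplicates the first and the fourth shares no vocabulary with the rest, so they are the two candidates for removal. In a fuller pipeline the term-count vectors could be replaced by per-page topic distributions from a topic model, which is one way the two methods named in the abstract might be combined.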