Clustering Alexa Internet Data using Auto Encoder Network and Affinity Propagation

2020 10th International Conference on Computer and Knowledge Engineering (ICCKE). Pub Date: 2020-10-29. DOI: 10.1109/ICCKE50421.2020.9303705

Ali Risheh, Ashkan Goharfar, N. T. Javan


Citations: 0

Abstract

Non-linear mapping is one of the most popular approaches to clustering data with complex structure and distinct patterns. Auto Encoder Networks (AENs) are widely used in clustering because they improve data representation. In this paper, we collect Alexa.com data by crawling the profiles of popular websites; the resulting dataset has 84 columns of numeric and word-array types. Next, an AEN architecture is presented to identify specific websites with exceptional patterns; the encoded data express a new feature space for the original data. The encoded data are clustered with Affinity Propagation, a partitioning algorithm that does not require the number of clusters to be specified in advance. The resulting 194 clusters and their exemplars are filtered and analyzed, yielding important results. One remarkable finding is that the first 11 columns, taken as raw data, are not clustered by Affinity Propagation; they are all treated as outliers. The results are summarized by selecting the best option with respect to the statistics and charts. Some of the results are useful for website owners and provide suggestions and solutions for Search Engine Optimization (SEO). Finally, we present our crawler application, which crawls and records data over 4 days. The proposed web crawler faced several challenges, and the solutions to them may be helpful for the most commonly used web crawling algorithms and libraries.
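To make the clustering step concrete, the following is a minimal sketch of Affinity Propagation in plain Python. It is not the authors' implementation: the function name `affinity_propagation`, the damping value, the iteration count, the negative squared Euclidean similarity, and the median preference are all assumptions chosen for illustration, and the toy 2-D points merely stand in for AEN-encoded website features. The point it illustrates is that the number of clusters is an output of the message passing rather than an input.

```python
from statistics import median


def affinity_propagation(points, damping=0.5, iterations=200):
    """Illustrative sketch: return (exemplar indices, exemplar index per point)."""
    n = len(points)
    # Similarity s(i, k): negative squared Euclidean distance (an assumption).
    S = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    # Preference s(k, k): median off-diagonal similarity, a common default.
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        S[i][i] = pref

    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                update = S[i][k] - best_other
                R[i][k] = damping * R[i][k] + (1 - damping) * update
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k), i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    update = total
                else:
                    update = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * update

    # A point is an exemplar when r(k, k) + a(k, k) is positive.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [-1] * n  # no exemplar emerged; -1 marks unassigned points
    labels = []
    for i in range(n):
        if i in exemplars:
            labels.append(i)
        else:
            labels.append(max(exemplars, key=lambda e: S[i][e]))
    return exemplars, labels


# Hypothetical stand-in for encoded features: two well-separated groups.
toy_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
              (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
toy_exemplars, toy_labels = affinity_propagation(toy_points)
print(toy_exemplars, toy_labels)
```

On this toy input the sketch should settle on the middle point of each group as its exemplar, without ever being told that there are two groups. For real data, a maintained implementation such as scikit-learn's `AffinityPropagation` class would be the more practical choice; whether the paper used it is not stated in the abstract.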

