Clustering Alexa Internet Data using Auto Encoder Network and Affinity Propagation

Ali Risheh, Ashkan Goharfar, N. T. Javan
Published in: 2020 10th International Conference on Computer and Knowledge Engineering (ICCKE), 2020-10-29
DOI: 10.1109/ICCKE50421.2020.9303705 (https://doi.org/10.1109/ICCKE50421.2020.9303705)
Citations: 0

Abstract

Non-linear mapping is one of the most popular approaches to clustering data with complex structure and distinct patterns. Autoencoder Networks (AENs) are widely used in clustering because they improve data representation. In this paper, we collect Alexa.com data by crawling the profiles of popular websites; the resulting dataset has 84 columns, whose types are numbers and arrays of words. Next, we present an AEN architecture to identify specific websites with exceptional patterns; the encoded data forms a new feature space for the original data. The encoded data is clustered with Affinity Propagation, a partitioning algorithm that does not require the number of clusters to be specified. Important results are obtained from 194 clusters and their exemplars, which are filtered and analyzed. One remarkable finding is that the first 11 columns, taken as raw data, are not clustered by Affinity Propagation, which treats all of them as outliers. The results are summarized by selecting the best option with respect to the statistics and charts. Some of the results are useful for website owners and offer suggestions and solutions for Search Engine Optimization (SEO). Finally, we present our crawler application, which crawls and records data over 4 days. The proposed web crawler faces several challenges, and their solutions can be helpful for most commonly used web crawling algorithms and libraries.
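The two-stage pipeline described in the abstract (encode with an autoencoder, then cluster the codes with Affinity Propagation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives neither the AEN architecture nor any hyperparameters, so the single tanh hidden layer, the code size of 2, the learning rate, the damping factor, the median preference, and the synthetic 6-column data standing in for the 84-column Alexa dataset are all assumptions made for illustration.

```python
import numpy as np

def train_autoencoder(X, hidden=2, epochs=3000, lr=0.01, seed=0):
    """Fit a one-hidden-layer autoencoder (tanh encoder, linear decoder).

    Hypothetical architecture; the paper's actual AEN is not specified here.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                  # encoded representation
        err = (H @ W2 + b2 - X) / n               # d(loss)/d(reconstruction)
        losses.append(0.5 * n * np.sum(err ** 2)) # mean squared reconstruction loss
        dH = (err @ W2.T) * (1.0 - H ** 2)        # backpropagate through tanh
        W2 -= lr * (H.T @ err); b2 -= lr * err.sum(axis=0)
        W1 -= lr * (X.T @ dH);  b1 -= lr * dH.sum(axis=0)
    encode = lambda data: np.tanh(data @ W1 + b1)
    return encode, losses

def affinity_propagation(Z, damping=0.7, iters=300):
    """Cluster rows of Z by message passing; the cluster count is not an input."""
    n = Z.shape[0]
    rows = np.arange(n)
    # Similarity = negative squared Euclidean distance.
    S = -np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    # Shared preference (assumed: median similarity) goes on the diagonal.
    S[rows, rows] = np.median(S[~np.eye(n, dtype=bool)])
    R, A = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(iters):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # Availabilities: a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()           # a(k,k) is not clipped at 0
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new
    score = np.diag(A + R)
    exemplars = np.flatnonzero(score > 0)
    if exemplars.size == 0:                       # guard: keep the strongest candidate
        exemplars = np.array([score.argmax()])
    labels = S[:, exemplars].argmax(axis=1)       # assign each row to nearest exemplar
    labels[exemplars] = np.arange(exemplars.size) # exemplars label themselves
    return exemplars, labels

# Synthetic stand-in data: 60 rows, 6 numeric columns, three loose groups.
rng = np.random.default_rng(42)
centers = rng.normal(0, 4, (3, 6))
X = np.vstack([c + rng.normal(0, 0.5, (20, 6)) for c in centers])
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize columns

encode, losses = train_autoencoder(X)
Z = encode(X)                                     # new 2-D feature space
exemplars, labels = affinity_propagation(Z)
print(f"{exemplars.size} clusters found; exemplar rows: {exemplars.tolist()}")
```

In this sketch the number of clusters emerges from the preference value placed on the diagonal of the similarity matrix rather than from a user-supplied count, which is the property the abstract highlights. In practice a library implementation such as scikit-learn's `AffinityPropagation` would likely be preferable to hand-written message passing.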