Title: Big Data Infrastructure Design Optimizes Using Hadoop Technologies Based on Application Performance Analysis
Authors: S. Shafiyah, Ahmad Ahsan, Rengga Asmara
Journal: Sistemasi Jurnal Sistem Informasi
Publication date: 2022-01-09
Publication type: Journal Article
DOI: 10.32520/stmsi.v11i1.1510 (https://doi.org/10.32520/stmsi.v11i1.1510)
Citations: 0
Abstract
Big Data Infrastructure Design Optimizes Using Hadoop Technologies Based on Application Performance Analysis
SISTEMASI: Jurnal Sistem Informasi, Volume 11, Number 1, January 2022: 55-72, ISSN: 2302-8149, e-ISSN: 2540-9719, http://sistemasi.ftik.unisi.ac.id

Big data infrastructure is the technology that provides the ability to store, process, analyze, and visualize large volumes of data. Choosing the tools and applications is one of the challenges in building such infrastructure. In this study, we propose a new strategy for optimizing big data infrastructure design, an essential part of big data processing, by analyzing the performance of the applications used at each stage of processing. The process starts with collecting online news data using the web crawlers Scrapy and Apache Nutch. Next, Hadoop technologies are implemented to facilitate distributed big data storage and computing. The NoSQL databases MongoDB and HBase make the data easier to query, after which search engines are built using Elasticsearch and Apache Solr. The stored data is then analyzed using Apache Hive, Pig, and Spark. The analyzed data is displayed on a website using Zeppelin, Metabase, Kibana, and Tableau. The test scenarios vary the number of servers and the number of files used. The test parameters include processing speed, memory usage, CPU usage, and throughput, among others. The performance test results for each application are compared and analyzed to identify its strengths and weaknesses, as a reference for building an optimal infrastructure design that meets user needs. This research produced two alternative big data infrastructure designs. The suggested infrastructure has been implemented on computer nodes in the big data pens for processing big data from online media and has proven to run well.
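The abstract names the test parameters (processing speed, memory usage, CPU usage, throughput) but does not describe the measurement harness. The following is a minimal sketch of how such a comparison could be recorded for two candidate implementations of the same task; it is not the authors' tooling. The names `Measurement`, `measure`, `rank_by_throughput`, and the two word-count tasks are hypothetical stand-ins, and the memory figure covers only Python allocations in the local process, not the resource usage of a Hadoop, MongoDB, or Elasticsearch server.

```python
import time
import tracemalloc
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Measurement:
    """One benchmark run of one candidate application (hypothetical record)."""
    name: str
    records: int
    wall_seconds: float      # processing speed (elapsed time)
    cpu_seconds: float       # CPU time consumed by this process
    peak_memory_bytes: int   # peak Python heap allocation during the run

    @property
    def throughput(self) -> float:
        """Records processed per wall-clock second."""
        if self.wall_seconds <= 0:
            return float("inf")
        return self.records / self.wall_seconds


def measure(name: str, task: Callable[[List[str]], object],
            records: List[str]) -> Measurement:
    """Run `task` once over `records` and collect the four test parameters."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task(records)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return Measurement(name, len(records), wall_seconds, cpu_seconds, peak)


def rank_by_throughput(results: List[Measurement]) -> List[Measurement]:
    """Order candidates from highest to lowest throughput."""
    return sorted(results, key=lambda m: m.throughput, reverse=True)


# Two stand-in tasks that play the role of competing applications.
def count_words_counter(docs: List[str]) -> Counter:
    counts: Counter = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def count_words_dict(docs: List[str]) -> dict:
    counts: dict = {}
    for doc in docs:
        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    docs = ["online news crawled for big data processing"] * 10_000
    results = [
        measure("counter", count_words_counter, docs),
        measure("dict", count_words_dict, docs),
    ]
    for m in rank_by_throughput(results):
        print(f"{m.name}: {m.wall_seconds:.4f}s wall, {m.cpu_seconds:.4f}s CPU, "
              f"{m.peak_memory_bytes} bytes peak, {m.throughput:.0f} records/s")
```

In a setup like the one the abstract describes, each stand-in task would be replaced by a call into the application under test, and the run would be repeated for each combination of server count and file count in the test scenarios.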