Denish K Kalariya, Shubham Vyas, Dev Savasni, Samir Patel
Big data analysis on Yelp user-generated reviews
2022 International Conference for Advancement in Technology (ICONAT), published 2022-01-21
DOI: 10.1109/ICONAT53423.2022.9726108
Abstract
The goal of this project is to demonstrate the use of PySpark and Spark SQL to query and analyze the Yelp Open Dataset. Specifically, we analyze the Yelp Reviews dataset, which consists of 8.6 million user-generated reviews of businesses on Yelp. We also perform JOIN operations with the Yelp Business and Yelp User datasets to describe relations between review ratings and characteristics of the business, such as geographic location. To perform some of these queries, we demonstrate the use of user-defined functions (UDFs) in Spark SQL queries. Lastly, we briefly examine how the partitioning of the underlying data abstraction affects computational speed.