Big Data Analysis in Healthcare: Apache Hadoop, Apache Spark and Apache Flink

Frontiers in Health Informatics. Pub Date: 2019-07-27. DOI: 10.30699/FHI.V8I1.180

Elham Nazari, M. Shahriari, H. Tabesh


Citations: 25

Abstract




Introduction: The volume of health care data is increasing. Correct analysis of such data can improve the quality of care and reduce costs. This kind of data is characterized by high volume, variety, and high-speed production, among other features, which makes it impossible to analyze with ordinary hardware and software platforms. Choosing the right platform for managing this kind of data is therefore very important. The purpose of this study is to introduce and compare the most popular and most widely used platform for processing big data, Apache Hadoop MapReduce, with two platforms that have recently gained great prominence, Apache Spark and Apache Flink.

Material and Methods: This study is a survey whose content is based on subject searches of the Proquest, PubMed, Google Scholar, Science Direct, Scopus, IranMedex, Irandoc, Magiran, ParsMedline and Scientific Information Database (SID) databases, as well as Web reviews and specialized books, using related and standard keywords. In total, 80 articles related to the subject of the study were reviewed.

Results: The findings showed that the studied platforms can be compared on features such as data processing, support for different languages, processing speed, computational model, memory management, optimization, latency, fault tolerance, scalability, performance, compatibility, and security. Overall, the findings showed that the Apache Hadoop environment offers simplicity, error detection, and cluster-based scalability management; however, because its processing is batch-based, it is slow for complex analyses and does not support stream processing. Apache Spark is a distributed computational platform that can process a big data set in memory with a very fast response time. Apache Flink allows users to keep data in memory and load it multiple times, and provides a complex fault-tolerance mechanism that continuously retrieves the state of the data flow.

Conclusion: The application of big data analysis and processing platforms varies according to need. In other words, the technologies are complementary: each is applicable in a particular field, and they cannot be separated from one another. Depending on the purpose and the expected outcome, either a platform must be selected for the analysis or custom tools must be designed on top of these platforms.
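The distinction the Results draw between batch processing and stream processing can be sketched in a few lines of plain Python. This is only an illustration of the two computational models, not code for Hadoop, Spark, or Flink: the sample readings, the patient identifiers, and the function names `batch_mean` and `streaming_mean` are all hypothetical and do not come from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample data: (patient_id, heart_rate) readings.
READINGS: List[Tuple[str, int]] = [
    ("p1", 72), ("p2", 88), ("p1", 76), ("p2", 92), ("p1", 80),
]


def batch_mean(readings: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Batch model (MapReduce style): the full data set is read before any result exists."""
    # Map phase: emit (key, (value, 1)) pairs.
    mapped = [(pid, (hr, 1)) for pid, hr in readings]
    # Shuffle phase: group the emitted pairs by key.
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for pid, pair in mapped:
        grouped[pid].append(pair)
    # Reduce phase: combine each group into a single mean per key.
    return {
        pid: sum(v for v, _ in pairs) / sum(c for _, c in pairs)
        for pid, pairs in grouped.items()
    }


def streaming_mean(readings: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, float]]:
    """Stream model: per-key state is updated and a result emitted for every event."""
    state: Dict[str, Tuple[int, int]] = {}  # patient_id -> (running sum, count)
    for pid, hr in readings:
        total, count = state.get(pid, (0, 0))
        state[pid] = (total + hr, count + 1)
        yield pid, state[pid][0] / state[pid][1]


if __name__ == "__main__":
    print(batch_mean(READINGS))
    for update in streaming_mean(READINGS):
        print(update)
```

The batch function answers only after it has seen every reading, whereas the streaming function yields an updated mean after each one. The running `state` dictionary is the kind of information a stream processor's fault-tolerance mechanism would need to preserve in order to resume after a failure.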


Source journal

Frontiers in Health Informatics

CiteScore

1.20

Self-citation rate

0.00%

Number of articles published