Frontiers in Big Data: latest articles

Ontology extension by online clustering with large language model agents.
IF 2.4
Frontiers in Big Data Pub Date : 2024-10-07 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1463543
Guanchen Wu, Chen Ling, Ilana Graetz, Liang Zhao
Abstract: An ontology is a structured framework that categorizes entities, concepts, and relationships within a domain to facilitate shared understanding, and it is important in computational linguistics and knowledge representation. In this paper, we propose a novel framework to automatically extend an existing ontology from streaming data in a zero-shot manner. Specifically, the zero-shot ontology extension framework uses online and hierarchical clustering to integrate new knowledge into existing ontologies without substantial annotated data or domain-specific expertise. Focusing on the medical field, this approach leverages Large Language Models (LLMs) for two key tasks: Symptom Typing and Symptom Taxonomy among breast and bladder cancer survivors. Symptom Typing involves identifying and classifying medical symptoms from unstructured online patient forum data, while Symptom Taxonomy organizes and integrates these symptoms into an existing ontology. The combined use of online and hierarchical clustering enables real-time and structured categorization and integration of symptoms. The dual-phase model employs multiple LLMs to ensure accurate classification and seamless integration of new symptoms with minimal human oversight. The paper details the framework's development, experiments, quantitative analyses, and data visualizations, demonstrating its effectiveness in enhancing medical ontologies and advancing knowledge-based systems in healthcare.
Volume 7, article 1463543. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11491333/pdf/
Citations: 0
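The abstract above mentions online clustering of streaming items but does not describe the algorithm itself. As a rough illustration only, the sketch below shows one common form of online clustering: each incoming vector joins the most similar existing cluster if the cosine similarity reaches a cut-off, and otherwise opens a new cluster. The class name `OnlineClusterer`, the `threshold` value, and the use of plain numeric vectors as stand-ins for symptom embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class OnlineClusterer:
    """Threshold-based online clustering (hypothetical sketch).

    Items are processed one at a time; only centroids and counts are kept.
    """

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # running mean vector per cluster
        self.counts = []            # number of items per cluster

    def add(self, vec):
        """Assign vec to the closest cluster or open a new one; return its index."""
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            # Update the running mean of the matched cluster.
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + v) / (n + 1) for c, v in zip(self.centroids[best], vec)
            ]
            self.counts[best] = n + 1
            return best
        # No cluster is similar enough: start a new one.
        self.centroids.append(list(vec))
        self.counts.append(1)
        return len(self.centroids) - 1
```

In a pipeline like the one the abstract outlines, the resulting clusters would presumably be passed to a separate step that places them in the ontology; that step is not sketched here.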
Machine learning-based remission prediction in rheumatoid arthritis patients treated with biologic disease-modifying anti-rheumatic drugs: findings from the Kuwait rheumatic disease registry.
IF 2.4
Frontiers in Big Data Pub Date : 2024-10-03 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1406365
Ahmad R Alsaber, Adeeba Al-Herz, Balqees Alawadhi, Iyad Abu Doush, Parul Setiya, Ahmad T Al-Sultan, Khulood Saleh, Adel Al-Awadhi, Eman Hasan, Waleed Al-Kandari, Khalid Mokaddem, Aqeel A Ghanem, Yousef Attia, Mohammed Hussain, Naser AlHadhood, Yaser Ali, Hoda Tarakmeh, Ghaydaa Aldabie, Amjad AlKadi, Hebah Alhajeri
Abstract:
Background: Rheumatoid arthritis (RA) is a common condition treated with biological disease-modifying anti-rheumatic medicines (bDMARDs). However, many patients exhibit resistance, necessitating the use of machine learning models to predict remissions in patients treated with bDMARDs, thereby reducing healthcare costs and minimizing negative effects.
Objective: The study aims to develop machine learning models using data from the Kuwait Registry for Rheumatic Diseases (KRRD) to identify clinical characteristics predictive of remission in RA patients treated with biologics.
Methods: The study collected follow-up data from 1,968 patients treated with bDMARDs from four public hospitals in Kuwait from 2013 to 2022. Machine learning techniques like lasso, ridge, support vector machine, random forest, XGBoost, and Shapley additive explanation were used to predict remission at a 1-year follow-up.
Results: The study used the Shapley plot in explainable Artificial Intelligence (XAI) to analyze the effects of predictors on remission prognosis across different types of bDMARDs. Top clinical features were identified for patients treated with bDMARDs, each associated with specific mean SHAP values. The findings highlight the importance of clinical assessments and specific treatments in shaping treatment outcomes.
Conclusion: The proposed machine learning model system effectively identifies clinical features predicting remission in bDMARDs, potentially improving treatment efficacy in rheumatoid arthritis patients.
Volume 7, article 1406365. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484091/pdf/
Citations: 0
Unsupervised machine learning model for detecting anomalous volumetric modulated arc therapy plans for lung cancer patients.
IF 2.4
Frontiers in Big Data Pub Date : 2024-10-03 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1462745
Peng Huang, Jiawen Shang, Yuhan Fan, Zhihui Hu, Jianrong Dai, Zhiqiang Liu, Hui Yan
Abstract:
Purpose: Volumetric modulated arc therapy (VMAT) is a new treatment modality in modern radiotherapy. To ensure the quality of the radiotherapy plan, a physics plan review is routinely conducted by senior clinicians; however, this process is less efficient and less accurate. In this study, a multi-task AutoEncoder (AE) is proposed to automate anomaly detection of VMAT plans for lung cancer patients.
Methods: The feature maps are first extracted from a VMAT plan. Then, a multi-task AE is trained based on the input of a feature map, and its output is the two targets (beam aperture and prescribed dose). Based on the distribution of reconstruction errors on the training set, a detection threshold value is obtained. For a testing sample, its reconstruction error is calculated using the AE model and compared with the threshold value to determine its classes (anomaly or regular). The proposed multi-task AE model is compared to the other existing AE models, including Vanilla AE, Contractive AE, and Variational AE. The area under the receiver operating characteristic curve (AUC) and the other statistics are used to evaluate the performance of these models.
Results: Among the four tested AE models, the proposed multi-task AE model achieves the highest values in AUC (0.964), accuracy (0.821), precision (0.471), and F1 score (0.632), and the lowest value in FPR (0.206).
Conclusion: The proposed multi-task AE model using two-dimensional (2D) feature maps can effectively detect anomalies in radiotherapy plans for lung cancer patients. Compared to the other existing AE models, the multi-task AE is more accurate and efficient. The proposed model provides a feasible way to carry out automated anomaly detection of VMAT plans in radiotherapy.
Volume 7, article 1462745. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484413/pdf/
Citations: 0
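The Methods text above describes deriving a detection threshold from the training-set reconstruction errors and comparing each test error against it. The abstract does not say how the threshold is computed, so the sketch below simply assumes a quantile of the training errors. The function names `fit_threshold` and `classify`, the default `quantile` of 0.95, and the nearest-rank rule are hypothetical choices for illustration; the autoencoder that would produce the errors is left out.

```python
import math


def fit_threshold(train_errors, quantile=0.95):
    """Return a quantile of the training reconstruction errors.

    Uses the nearest-rank rule: the smallest observed error such that at
    least `quantile` of the training errors are less than or equal to it.
    The quantile itself is an assumed parameter, not taken from the paper.
    """
    if not train_errors:
        raise ValueError("need at least one training error")
    ordered = sorted(train_errors)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def classify(error, threshold):
    """Label a sample by comparing its reconstruction error to the threshold."""
    return "anomaly" if error > threshold else "regular"
```

A lower quantile flags more plans for review and raises the false positive rate; a higher one does the opposite, so the choice is a trade-off rather than a fixed rule.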
Prediction and classification of obesity risk based on a hybrid metaheuristic machine learning approach.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-30 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1469981
Zarindokht Helforoush, Hossein Sayyad
Abstract:
Introduction: As the global prevalence of obesity continues to rise, it has become a major public health concern requiring more accurate prediction methods. Traditional regression models often fail to capture the complex interactions between genetic, environmental, and behavioral factors contributing to obesity.
Methods: This study explores the potential of machine-learning techniques to improve obesity risk prediction. Various supervised learning algorithms, including the novel ANN-PSO hybrid model, were applied following comprehensive data preprocessing and evaluation.
Results: The proposed ANN-PSO model achieved a remarkable accuracy rate of 92%, outperforming traditional regression methods. SHAP was employed to analyze feature importance, offering deeper insights into the influence of various factors on obesity risk.
Discussion: The findings highlight the transformative role of advanced machine-learning models in public health research, offering a pathway for personalized healthcare interventions. By providing detailed obesity risk profiles, these models enable healthcare providers to tailor prevention and treatment strategies to individual needs. The results underscore the need to integrate innovative machine-learning approaches into global public health efforts to combat the growing obesity epidemic.
Volume 7, article 1469981. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471553/pdf/
Citations: 0
Making the most of big qualitative datasets: a living systematic review of analysis methods.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-25 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1455399
Abinaya Chandrasekar, Sigrún Eyrúnardóttir Clark, Sam Martin, Samantha Vanderslott, Elaine C Flores, David Aceituno, Phoebe Barnett, Cecilia Vindrola-Padros, Norha Vera San Juan
Abstract:
Introduction: Qualitative data provides deep insights into an individual's behaviors and beliefs, and the contextual factors that may shape these. Big qualitative data analysis is an emerging field that aims to identify trends and patterns in large qualitative datasets. The purpose of this review was to identify the methods used to analyse large bodies of qualitative data, their cited strengths and limitations and comparisons between manual and digital analysis approaches.
Methods: A multifaceted approach has been taken to develop the review relying on academic, gray and media-based literature, using approaches such as iterative analysis, frequency analysis, text network analysis and team discussion.
Results: The review identified 520 articles that detailed analysis approaches of big qualitative data. From these publications a diverse range of methods and software used for analysis were identified, with thematic analysis and basic software being most common. Studies were most commonly conducted in high-income countries, and the most common data sources were open-ended survey responses, interview transcripts, and first-person narratives.
Discussion: We identified an emerging trend to expand the sources of qualitative data (e.g., using social media data, images, or videos), and develop new methods and software for analysis. As the qualitative analysis field may continue to change, it will be necessary to conduct further research to compare the utility of different big qualitative analysis methods and to develop standardized guidelines to raise awareness and support researchers in the use of more novel approaches for big qualitative analysis.
Systematic review registration: https://osf.io/hbvsy/?view_only=.
Volume 7, article 1455399. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11461344/pdf/
Citations: 0
Data-driven classification and explainable-AI in the field of lung imaging.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-19 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1393758
Syed Taimoor Hussain Shah, Syed Adil Hussain Shah, Iqra Iqbal Khan, Atif Imran, Syed Baqir Hussain Shah, Atif Mehmood, Shahzad Ahmad Qureshi, Mudassar Raza, Angelo Di Terlizzi, Marco Cavaglià, Marco Agostino Deriu
Abstract: Detecting lung diseases in medical images can be quite challenging for radiologists. In some cases, even experienced experts may struggle with accurately diagnosing chest diseases, leading to potential inaccuracies due to complex or unseen biomarkers. This review paper delves into various datasets and machine learning techniques employed in recent research for lung disease classification, focusing on pneumonia analysis using chest X-ray images. We explore conventional machine learning methods, pretrained deep learning models, customized convolutional neural networks (CNNs), and ensemble methods. A comprehensive comparison of different classification approaches is presented, encompassing data acquisition, preprocessing, feature extraction, and classification using machine vision, machine and deep learning, and explainable-AI (XAI). Our analysis highlights the superior performance of transfer learning-based methods using CNNs and ensemble models/features for lung disease classification. In addition, our comprehensive review offers insights for researchers in other medical domains too who utilize radiological images. By providing a thorough overview of various techniques, our work enables the establishment of effective strategies and identification of suitable methods for a wide range of challenges. Currently, beyond traditional evaluation metrics, researchers emphasize the importance of XAI techniques in machine and deep learning models and their applications in classification tasks. This incorporation helps in gaining a deeper understanding of their decision-making processes, leading to improved trust, transparency, and overall clinical decision-making. Our comprehensive review serves as a valuable resource for researchers and practitioners seeking not only to advance the field of lung disease detection using machine learning and XAI but also from other diverse domains.
Volume 7, article 1393758. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446784/pdf/
Citations: 0
Current state of data stewardship tools in life science.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-16 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1428568
Anna Aksenova, Anoop Johny, Tim Adams, Phil Gribbon, Marc Jacobs, Martin Hofmann-Apitius
Abstract: In today's data-centric landscape, effective data stewardship is critical for facilitating scientific research and innovation. This article provides an overview of essential tools and frameworks for modern data stewardship practices. Over 300 tools were analyzed in this study, assessing their utility, relevance to data stewardship, and applicability within the life sciences domain.
Volume 7, article 1428568. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11439729/pdf/
Citations: 0
Navigating pathways to automated personality prediction: a comparative study of small and medium language models.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-13 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1387325
Fatima Habib, Zeeshan Ali, Akbar Azam, Komal Kamran, Fahad Mansoor Pasha
Abstract:
Introduction: Recent advancements in Natural Language Processing (NLP) and widely available social media data have made it possible to predict human personalities in various computational applications. In this context, pre-trained Large Language Models (LLMs) have gained recognition for their exceptional performance in NLP benchmarks. However, these models require substantial computational resources, escalating their carbon and water footprint. Consequently, a shift toward more computationally efficient smaller models is observed.
Methods: This study compares a small model ALBERT (11.8M parameters) with a larger model, RoBERTa (125M parameters) in predicting big five personality traits. It utilizes the PANDORA dataset comprising Reddit comments, processing them on a Tesla P100-PCIE-16GB GPU. The study customized both models to support multi-output regression and added two linear layers for fine-grained regression analysis.
Results: Results are evaluated on Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), considering the computational resources consumed during training. While ALBERT consumed lower levels of system memory with lower heat emission, it took higher computation time compared to RoBERTa. The study produced comparable levels of MSE, RMSE, and training loss reduction.
Discussion: This highlights the influence of training data quality on the model's performance, outweighing the significance of model size. Theoretical and practical implications are also discussed.
Volume 7, article 1387325. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427259/pdf/
Citations: 0
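The Results text above reports MSE and RMSE for a multi-output regression task. For readers unfamiliar with the two metrics, the sketch below computes them over rows of several outputs each (for example, one score per personality trait). The function names and the choice to average over all samples and all outputs together are assumptions for this illustration; the abstract does not say how the study aggregated errors across traits.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error averaged over every sample and every output."""
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += (t - p) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to compare")
    return total / count


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))
```

RMSE is in the same units as the predicted scores, which makes it easier to read than MSE when comparing models on the same scale.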
When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-10 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1441869
Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos
Abstract: Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this "no consensus" stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular "V" characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the "V" characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.
Volume 7, article 1441869. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11420115/pdf/
Citations: 0
SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.
IF 2.4
Frontiers in Big Data Pub Date : 2024-09-09 eCollection Date: 2024-01-01 DOI: 10.3389/fdata.2024.1446071
Nicholas Kofi Akortia Hagan, John R Talburt
Abstract: Data volume has been one of the fast-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring entities are referring to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet the rising data needs. This research aims to refactor a working proof-of-concept entity resolution system called the Data Washing Machine to be highly scalable using Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We prove that our systems achieve the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets from a few thousand to millions. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.
Volume 7, article 1446071. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416992/pdf/
Citations: 0