Frontiers in Big DataPub Date : 2024-09-10eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1441869
Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos
{"title":"When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data.","authors":"Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos","doi":"10.3389/fdata.2024.1441869","DOIUrl":"https://doi.org/10.3389/fdata.2024.1441869","url":null,"abstract":"<p><p>Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this \"no consensus\" stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular \"V\" characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the \"V\" characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1441869"},"PeriodicalIF":2.4,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11420115/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142332189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-09-09eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1446071
Nicholas Kofi Akortia Hagan, John R Talburt
{"title":"SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.","authors":"Nicholas Kofi Akortia Hagan, John R Talburt","doi":"10.3389/fdata.2024.1446071","DOIUrl":"10.3389/fdata.2024.1446071","url":null,"abstract":"<p><p>Data volume has been one of the fast-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring entities are referring to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet the rising data needs. This research aims to refactor a working proof-of-concept entity resolution system called the Data Washing Machine to be highly scalable using Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We prove that our systems achieve the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets from a few thousand to millions. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1446071"},"PeriodicalIF":2.4,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416992/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142309124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-09-04eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1400024
Enes Altuncu, Virginia N L Franqueira, Shujun Li
{"title":"Deepfake: definitions, performance metrics and standards, datasets, and a meta-review.","authors":"Enes Altuncu, Virginia N L Franqueira, Shujun Li","doi":"10.3389/fdata.2024.1400024","DOIUrl":"https://doi.org/10.3389/fdata.2024.1400024","url":null,"abstract":"<p><p>Recent advancements in AI, especially deep learning, have contributed to a significant increase in the creation of new realistic-looking synthetic media (video, image, and audio) and manipulation of existing media, which has led to the creation of the new term \"deepfake.\" Based on both the research literature and resources in English, this paper gives a comprehensive overview of deepfake, covering multiple important aspects of this emerging concept, including (1) different definitions, (2) commonly used performance metrics and standards, and (3) deepfake-related datasets. In addition, the paper also reports a meta-review of 15 selected deepfake-related survey papers published since 2020, focusing not only on the mentioned aspects but also on the analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfake in terms of the aspects covered.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1400024"},"PeriodicalIF":2.4,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11408348/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142300698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-08-29eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1348030
Charles X Ling, Ganyu Wang, Boyu Wang
{"title":"Sparse and Expandable Network for Google's Pathways.","authors":"Charles X Ling, Ganyu Wang, Boyu Wang","doi":"10.3389/fdata.2024.1348030","DOIUrl":"https://doi.org/10.3389/fdata.2024.1348030","url":null,"abstract":"<p><strong>Introduction: </strong>Recently, Google introduced Pathways as its next-generation AI architecture. Pathways must address three critical challenges: learning one general model for several continuous tasks, ensuring tasks can leverage each other without forgetting old tasks, and learning from multi-modal data such as images and audio. Additionally, Pathways must maintain sparsity in both learning and deployment. Current lifelong multi-task learning approaches are inadequate in addressing these challenges.</p><p><strong>Methods: </strong>To address these challenges, we propose SEN, a Sparse and Expandable Network. SEN is designed to handle multiple tasks concurrently by maintaining sparsity and enabling expansion when new tasks are introduced. The network leverages multi-modal data, integrating information from different sources while preventing interference between tasks.</p><p><strong>Results: </strong>The proposed SEN model demonstrates significant improvements in multi-task learning, successfully managing task interference and forgetting. It effectively integrates data from various modalities and maintains efficiency through sparsity during both the learning and deployment phases.</p><p><strong>Discussion: </strong>SEN offers a straightforward yet effective solution to the limitations of current lifelong multi-task learning methods. By addressing the challenges identified in the Pathways architecture, SEN provides a promising approach for developing AI systems capable of learning and adapting over time without sacrificing performance or efficiency.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1348030"},"PeriodicalIF":2.4,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11390433/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142300699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient use of binned data for imputing univariate time series data.","authors":"Jay Darji, Nupur Biswas, Vijay Padul, Jaya Gill, Santosh Kesari, Shashaanka Ashili","doi":"10.3389/fdata.2024.1422650","DOIUrl":"10.3389/fdata.2024.1422650","url":null,"abstract":"<p><p>Time series data are recorded in various sectors, resulting in a large amount of data. However, the continuity of these data is often interrupted, resulting in periods of missing data. Several algorithms are used to impute the missing data, and the performance of these methods is widely varied. Apart from the choice of algorithm, the effective imputation depends on the nature of missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated the missing data for different time spans and imputed using different algorithms with binned data of different sizes. The performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. We found that RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with greater reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends precisely on the span of missing data, sampling frequency of the data, and fluctuation within data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can impute a wide variety of data, including biological heart rate data derived from the Internet of Things (IoT) device smartwatch and non-biological data such as household power consumption data.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1422650"},"PeriodicalIF":2.4,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11371617/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142134416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-08-16eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1420344
Vasundhara Kaul, Tamalika Mukherjee
{"title":"Equitable differential privacy.","authors":"Vasundhara Kaul, Tamalika Mukherjee","doi":"10.3389/fdata.2024.1420344","DOIUrl":"10.3389/fdata.2024.1420344","url":null,"abstract":"<p><p>Differential privacy (DP) has been in the public spotlight since the announcement of its use in the 2020 U.S. Census. While DP algorithms have substantially improved the confidentiality protections provided to Census respondents, concerns have been raised about the accuracy of the DP-protected Census data. The extent to which the use of DP distorts the ability to draw inferences that drive policy about small-populations, especially marginalized communities, has been of particular concern to researchers and policy makers. After all, inaccurate information about marginalized populations can often engender policies that exacerbate rather than ameliorate social inequities. Consequently, computer science experts have focused on developing mechanisms that help achieve equitable privacy, i.e., mechanisms that mitigate the data distortions introduced by privacy protections to ensure equitable outcomes and benefits for all groups, particularly marginalized groups. Our paper extends the conversation on equitable privacy by highlighting the importance of inclusive communication in ensuring equitable outcomes for all social groups through all the stages of deploying a differentially private system. We conceptualize Equitable DP as the design, communication, and implementation of DP algorithms that ensure equitable outcomes. Thus, in addition to adopting computer scientists' recommendations of incorporating equity parameters within DP algorithms, we suggest that it is critical for an organization to also facilitate inclusive communication throughout the design, development, and implementation stages of a DP algorithm to ensure it has an equitable impact on social groups and does not hinder the redressal of social inequities. To demonstrate the importance of communication for Equitable DP, we undertake a case study of the process through which DP was adopted as the newest disclosure avoidance system for the 2020 U.S. Census. Drawing on the Inclusive Science Communication (ISC) framework, we examine the extent to which the Census Bureau's communication strategies encouraged engagement across the diverse groups of users that employ the decennial Census data for research and policy making. Our analysis provides lessons that can be used by other government organizations interested in incorporating the Equitable DP approach in their data collection practices.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1420344"},"PeriodicalIF":2.4,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363707/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142114688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-08-14eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1287442
Philipp Brandt
{"title":"Data science's cultural construction: qualitative ideas for quantitative work.","authors":"Philipp Brandt","doi":"10.3389/fdata.2024.1287442","DOIUrl":"https://doi.org/10.3389/fdata.2024.1287442","url":null,"abstract":"<p><strong>Introduction: </strong>\"Data scientists\" quickly became ubiquitous, often infamously so, but they have struggled with the ambiguity of their novel role. This article studies data science's collective definition on Twitter.</p><p><strong>Methods: </strong>The analysis responds to the challenges of studying an emergent case with unclear boundaries and substance through a cultural perspective and complementary datasets ranging from 1,025 to 752,815 tweets. It brings together relations between accounts that tweeted about data science, the hashtags they used, indicating purposes, and the topics they discussed.</p><p><strong>Results: </strong>The first results reproduce familiar commercial and technical motives. Additional results reveal concerns with new practical and ethical standards as a distinctive motive for constructing data science.</p><p><strong>Discussion: </strong>The article provides a sensibility for local meaning in usually abstract datasets and a heuristic for navigating increasingly abundant datasets toward surprising insights. For data scientists, it offers a guide for positioning themselves vis-à-vis others to navigate their professional future.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1287442"},"PeriodicalIF":2.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11349665/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142114687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-07-31eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1374980
Wenjun Meng, Lili Chen, Zhaomin Dong
{"title":"The development and application of a novel E-commerce recommendation system used in electric power B2B sector.","authors":"Wenjun Meng, Lili Chen, Zhaomin Dong","doi":"10.3389/fdata.2024.1374980","DOIUrl":"https://doi.org/10.3389/fdata.2024.1374980","url":null,"abstract":"<p><p>The advent of the digital era has transformed E-commerce platforms into critical tools for industry, yet traditional recommendation systems often fall short in the specialized context of the electric power industry. These systems typically struggle with the industry's unique challenges, such as infrequent and high-stakes transactions, prolonged decision-making processes, and sparse data. This research has developed a novel recommendation engine tailored to these specific conditions, such as to handle the low frequency and long cycle nature of Business-to-Business (B2B) transactions. This approach includes algorithmic enhancements to better process and interpret the limited data available, and data pre-processing techniques designed to enrich the sparse datasets characteristic of this industry. This research also introduces a methodological innovation that integrates multi-dimensional data, combining user E-commerce activities, product specifics, and essential non-tendering information. The proposed engine employs advanced machine learning techniques to provide more accurate and relevant recommendations. The results demonstrate a marked improvement over traditional models, offering a more robust and effective tool for facilitating B2B transactions in the electric power industry. This research not only addresses the sector's unique challenges but also provides a blueprint for adapting recommendation systems to other industries with similar B2B characteristics.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1374980"},"PeriodicalIF":2.4,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11322496/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141983886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-07-02eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1382144
Yan Wu, Yunzhi Jin
{"title":"Efficient enhancement of low-rank tensor completion via thin QR decomposition.","authors":"Yan Wu, Yunzhi Jin","doi":"10.3389/fdata.2024.1382144","DOIUrl":"10.3389/fdata.2024.1382144","url":null,"abstract":"<p><p>Low-rank tensor completion (LRTC), which aims to complete missing entries from tensors with partially observed terms by utilizing the low-rank structure of tensors, has been widely used in various real-world issues. The core tensor nuclear norm minimization (CTNM) method based on Tucker decomposition is one of common LRTC methods. However, the CTNM methods based on Tucker decomposition often have a large computing cost due to the fact that the general factor matrix solving technique involves multiple singular value decompositions (SVDs) in each loop. To address this problem, this article enhances the method and proposes an effective CTNM method based on thin QR decomposition (CTNM-QR) with lower computing complexity. The proposed method extends the CTNM by introducing tensor versions of the auxiliary variables instead of matrices, while using the thin QR decomposition to solve the factor matrix rather than the SVD, which can save the computational complexity and improve the tensor completion accuracy. In addition, the CTNM-QR method's convergence and complexity are analyzed further. Numerous experiments in synthetic data, real color images, and brain MRI data at different missing rates demonstrate that the proposed method not only outperforms in terms of completion accuracy and visualization, but also conducts more efficiently than most state-of-the-art LRTC methods.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1382144"},"PeriodicalIF":2.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11250652/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141629268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-07-01eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1402384
Patchanok Srisuradetchai, Korn Suksrikran
{"title":"Random kernel k-nearest neighbors regression.","authors":"Patchanok Srisuradetchai, Korn Suksrikran","doi":"10.3389/fdata.2024.1402384","DOIUrl":"10.3389/fdata.2024.1402384","url":null,"abstract":"<p><p>The k-nearest neighbors (KNN) regression method, known for its nonparametric nature, is highly valued for its simplicity and its effectiveness in handling complex structured data, particularly in big data contexts. However, this method is susceptible to overfitting and fit discontinuity, which present significant challenges. This paper introduces the random kernel k-nearest neighbors (RK-KNN) regression as a novel approach that is well-suited for big data applications. It integrates kernel smoothing with bootstrap sampling to enhance prediction accuracy and the robustness of the model. This method aggregates multiple predictions using random sampling from the training dataset and selects subsets of input variables for kernel KNN (K-KNN). A comprehensive evaluation of RK-KNN on 15 diverse datasets, employing various kernel functions including Gaussian and Epanechnikov, demonstrates its superior performance. When compared to standard KNN and the random KNN (R-KNN) models, it significantly reduces the root mean square error (RMSE) and mean absolute error, as well as improving R-squared values. The RK-KNN variant that employs a specific kernel function yielding the lowest RMSE will be benchmarked against state-of-the-art methods, including support vector regression, artificial neural networks, and random forests.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1402384"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11246867/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141622134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}