{"title":"Computer aided technology based on graph sample and aggregate attention network optimized for soccer teaching and training","authors":"Guanghui Yang, Xinyuan Feng","doi":"10.1186/s40537-024-00893-x","DOIUrl":"https://doi.org/10.1186/s40537-024-00893-x","url":null,"abstract":"<p>Football is the most popular game in the world and has significant influence on various aspects including politics, economy and culture. The experience of the football developed nation has shown that the steady growth of youth football is crucial for elevating a nation's overall football proficiency. It is essential to develop techniques and create strategies that adapt to their individual physical features to resolve the football players’ problem of lacking exercise in various topics. In this manuscript, Computer aided technology depending on the Graph Sample and Aggregate Attention Network Optimized for Soccer Teaching and Training (CAT-GSAAN-STT) is proposed to improve the efficiency of Soccer teaching and training effectively. The proposed method contains four stages, like data collection, data preprocessing, prediction and optimization. Initially the input data are collected by Microsoft Kinect V2 smart camera. Then the collected data are preprocessed by using Improving graph collaborative filtering. After preprocessing the data is given for motion recognition layer here prediction is done using Graph Sample and Aggregate Attention Network (GSAAN) for improving the effectiveness of Soccer Teaching and Training. To enhance the accuracy of the system, the GSAAN are optimized by using Artificial Rabbits Optimization. The proposed CAT-GSAAN-STT method is executed in Python and the efficiency of the proposed technique is examined with different metrics, like accuracy, computation time, learning activity analysis, student performance ratio and teaching evaluation analysis. The simulation outcomes proves that the proposed technique attains provides28.33%, 31.60%, 25.63% higherRecognition accuracy and33.67%, 38.12% and 27.34%lesser evaluation time while compared with existing techniques like computer aided teaching system based upon artificial intelligence in football teaching with training (STT-IOT-CATS), Computer Aided Teaching System for Football Teaching and Training Based on Video Image (CAT-STT-VI) and method for enhancing the football coaching quality using artificial intelligence and meta verse-empowered in mobile internet environment (SI-STQ-AI-MIE) respectively.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"38 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting transformer-based language models for heart disease detection and risk factors extraction","authors":"Essam H. Houssein, Rehab E. Mohamed, Gang Hu, Abdelmgeid A. Ali","doi":"10.1186/s40537-024-00903-y","DOIUrl":"https://doi.org/10.1186/s40537-024-00903-y","url":null,"abstract":"<p>Efficiently treating cardiac patients before the onset of a heart attack relies on the precise prediction of heart disease. Identifying and detecting the risk factors for heart disease such as diabetes mellitus, Coronary Artery Disease (CAD), hyperlipidemia, hypertension, smoking, familial CAD history, obesity, and medications is critical for developing effective preventative and management measures. Although Electronic Health Records (EHRs) have emerged as valuable resources for identifying these risk factors, their unstructured format poses challenges for cardiologists in retrieving relevant information. This research proposed employing transfer learning techniques to automatically extract heart disease risk factors from EHRs. Leveraging transfer learning, a deep learning technique has demonstrated a significant performance in various clinical natural language processing (NLP) applications, particularly in heart disease risk prediction. This study explored the application of transformer-based language models, specifically utilizing pre-trained architectures like BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, BioClinicalBERT, XLNet, and BioBERT for heart disease detection and extraction of related risk factors from clinical notes, using the i2b2 dataset. These transformer models are pre-trained on an extensive corpus of medical literature and clinical records to gain a deep understanding of contextualized language representations. Adapted models are then fine-tuned using annotated datasets specific to heart disease, such as the i2b2 dataset, enabling them to learn patterns and relationships within the domain. These models have demonstrated superior performance in extracting semantic information from EHRs, automating high-performance heart disease risk factor identification, and performing downstream NLP tasks within the clinical domain. This study proposed fine-tuned five widely used transformer-based models, namely BERT, RoBERTa, BioClinicalBERT, XLNet, and BioBERT, using the 2014 i2b2 clinical NLP challenge dataset. The fine-tuned models surpass conventional approaches in predicting the presence of heart disease risk factors with impressive accuracy. The RoBERTa model has achieved the highest performance, with micro F1-scores of 94.27%, while the BERT, BioClinicalBERT, XLNet, and BioBERT models have provided competitive performances with micro F1-scores of 93.73%, 94.03%, 93.97%, and 93.99%, respectively. Finally, a simple ensemble of the five transformer-based models has been proposed, which outperformed the most existing methods in heart disease risk fan, achieving a micro F1-Score of 94.26%. 
This study demonstrated the efficacy of transfer learning using transformer-based models in enhancing risk prediction and facilitating early intervention for heart disease prevention.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"24 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
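The fine-tuning recipe described here follows the standard sequence-classification pattern. Below is a minimal sketch using the HuggingFace Transformers API for multi-label risk-factor tagging; the i2b2 corpus is access-controlled, so the two example notes and the RISK_FACTORS label set are invented placeholders, not the paper's data.

```python
# Minimal multi-label risk-factor tagging sketch with a pre-trained transformer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

RISK_FACTORS = ["CAD", "diabetes", "hyperlipidemia", "hypertension", "obesity", "smoking"]

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=len(RISK_FACTORS),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

# Invented placeholder notes (the real i2b2 records cannot be redistributed).
notes = ["Pt with T2DM and HTN, denies tobacco use.",
         "Known CAD, on statin for hyperlipidemia."]
labels = torch.tensor([[0, 1, 0, 1, 0, 0],
                       [1, 0, 1, 0, 0, 0]], dtype=torch.float)

batch = tok(notes, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=labels)        # forward pass returns the BCE loss
out.loss.backward()                        # an optimizer step would follow here
print(out.loss.item(), torch.sigmoid(out.logits).shape)  # scalar loss, (2, 6)
```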
{"title":"Gene selection via improved nuclear reaction optimization algorithm for cancer classification in high-dimensional data","authors":"","doi":"10.1186/s40537-024-00902-z","DOIUrl":"https://doi.org/10.1186/s40537-024-00902-z","url":null,"abstract":"<h3>Abstract</h3> <p>RNA Sequencing (RNA-Seq) has been considered a revolutionary technique in gene profiling and quantification. It offers a comprehensive view of the transcriptome, making it a more expansive technique in comparison with micro-array. Genes that discriminate malignancy and normal can be deduced using quantitative gene expression. However, this data is a high-dimensional dense matrix; each sample has a dimension of more than 20,000 genes. Dealing with this data poses challenges. This paper proposes RBNRO-DE (Relief Binary NRO based on Differential Evolution) for handling the gene selection strategy on (rnaseqv2 illuminahiseq rnaseqv2 un edu Level 3 RSEM genes normalized) with more than 20,000 genes to pick the best informative genes and assess them through 22 cancer datasets. The <em>k</em>-nearest Neighbor (<em>k</em>-NN) and Support Vector Machine (SVM) are applied to assess the quality of the selected genes. Binary versions of the most common meta-heuristic algorithms have been compared with the proposed RBNRO-DE algorithm. In most of the 22 cancer datasets, the RBNRO-DE algorithm based on <em>k</em>-NN and SVM classifiers achieved optimal convergence and classification accuracy up to 100% integrated with a feature reduction size down to 98%, which is very evident when compared to its counterparts, according to Wilcoxon’s rank-sum test (5% significance level).</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"82 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The role of classifiers and data complexity in learned Bloom filters: insights and recommendations","authors":"Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, Marco Frasca","doi":"10.1186/s40537-024-00906-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00906-9","url":null,"abstract":"<p>Bloom filters, since their introduction over 50 years ago, have become a pillar to handle membership queries in small space, with relevant application in Big Data Mining and Stream Processing. Further improvements have been recently proposed with the use of Machine Learning techniques: learned Bloom filters. Those latter make considerably more complicated the proper parameter setting of this multi-criteria data structure, in particular in regard to the choice of one of its key components (the classifier) and accounting for the classification complexity of the input dataset. Given this State of the Art, our contributions are as follows. (1) A novel methodology, supported by software, for designing, analyzing and implementing learned Bloom filters that account for their own multi-criteria nature, in particular concerning classifier type choice and data classification complexity. Extensive experiments show the validity of the proposed methodology and, being our software public, we offer a valid tool to the practitioners interested in using learned Bloom filters. (2) Further contributions to the advancement of the State of the Art that are of great practical relevance are the following: (a) the classifier inference time should not be taken as a proxy for the filter reject time; (b) of the many classifiers we have considered, only two offer good performance; this result is in agreement with and further strengthens early findings in the literature; (c) Sandwiched Bloom filter, which is already known as being one of the references of this area, is further shown here to have the remarkable property of robustness to data complexity and classifier performance variability.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"5 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140313130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods","authors":"Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar","doi":"10.1186/s40537-024-00905-w","DOIUrl":"https://doi.org/10.1186/s40537-024-00905-w","url":null,"abstract":"<p>In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison in model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"6 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140298523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-sample $$zeta $$ -mixup: richer, more realistic synthetic samples from a p-series interpolant","authors":"","doi":"10.1186/s40537-024-00898-6","DOIUrl":"https://doi.org/10.1186/s40537-024-00898-6","url":null,"abstract":"<h3>Abstract</h3> <p>Modern deep learning training procedures rely on model regularization techniques such as data augmentation methods, which generate training samples that increase the diversity of data and richness of label information. A popular recent method, <em>mixup</em>, uses convex combinations of pairs of original samples to generate new samples. However, as we show in our experiments, <em>mixup</em> can produce undesirable synthetic samples, where the data is sampled off the manifold and can contain incorrect labels. We propose <span> <span>(zeta )</span> </span>-<em>mixup</em>, a generalization of <em>mixup</em> with provably and demonstrably desirable properties that allows convex combinations of <span> <span>({T} ge 2)</span> </span> samples, leading to more realistic and diverse outputs that incorporate information from <span> <span>({T})</span> </span> original samples by using a <em>p</em>-series interpolant. We show that, compared to <em>mixup</em>, <span> <span>(zeta )</span> </span>-<em>mixup</em> better preserves the intrinsic dimensionality of the original datasets, which is a desirable property for training generalizable models. Furthermore, we show that our implementation of <span> <span>(zeta )</span> </span>-<em>mixup</em> is faster than <em>mixup</em>, and extensive evaluation on controlled synthetic and 26 diverse real-world natural and medical image classification datasets shows that <span> <span>(zeta )</span> </span>-<em>mixup</em> outperforms <em>mixup</em>, CutMix, and traditional data augmentation techniques. The code will be released at https://github.com/kakumarabhishek/zeta-mixup.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"9 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning manifolds from non-stationary streams","authors":"Suchismit Mahapatra, Varun Chandola","doi":"10.1186/s40537-023-00872-8","DOIUrl":"https://doi.org/10.1186/s40537-023-00872-8","url":null,"abstract":"<p>Streaming adaptations of manifold learning based dimensionality reduction methods, such as <i>Isomap</i>, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of data increases. We then show that a Gaussian Process Regression (GPR) model, that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size, can closely approximate the state-of-art streaming Isomap algorithms, and the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn lower dimensional representation of high dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a Gas sensor array data set show that our method can detect changes in the underlying data stream, triggered due to real-world factors, such as introduction of a new gas in the system, while efficiently mapping data on a low-dimensional manifold.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"26 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140197220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An adaptive hybrid african vultures-aquila optimizer with Xgb-Tree algorithm for fake news detection","authors":"Amr A. Abd El-Mageed, Amr A. Abohany, Asmaa H. Ali, Khalid M. Hosny","doi":"10.1186/s40537-024-00895-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00895-9","url":null,"abstract":"<p>Online platforms and social networking have increased in the contemporary years. They are now a major news source worldwide, leading to the online proliferation of Fake News (FNs). These FNs are alarming because they fundamentally reshape public opinion, which may cause customers to leave these online platforms, threatening the reputations of several organizations and industries. This rapid dissemination of FNs makes it imperative for automated systems to detect them, encouraging many researchers to propose various systems to classify news articles and detect FNs automatically. In this paper, a Fake News Detection (FND) methodology is presented based on an effective IBAVO-AO algorithm, which stands for hybridization of African Vultures Optimization (AVO) and Aquila Optimization (AO) algorithms, with an extreme gradient boosting Tree (Xgb-Tree) classifier. The suggested methodology involves three main phases: Initially, the unstructured FNs dataset is analyzed, and the essential features are extracted by tokenizing, encoding, and padding the input news words into a sequence of integers utilizing the GLOVE approach. Then, the extracted features are filtered using the effective Relief algorithm to select only the appropriate ones. Finally, the recovered features are used to classify the news items using the suggested IBAVO-AO algorithm based on the Xgb-Tree classifier. Hence, the suggested methodology is distinguished from prior models in that it performs automatic data pre-processing, optimization, and classification tasks. The proposed methodology is carried out on the ISOT-FNs dataset, containing more than 44 thousand multiple news articles divided into truthful and fake. We validated the proposed methodology’s reliability by examining numerous evaluation metrics involving accuracy, fitness values, the number of selected features, Kappa, Precision, Recall, F1-score, Specificity, Sensitivity, ROC_AUC, and MCC. Then, the proposed methodology is compared against the most common meta-heuristic optimization algorithms utilizing the ISOT-FNs. The experimental results reveal that the suggested methodology achieved optimal classification accuracy and F1-score and successfully categorized more than 92.5% of news articles compared to its peers. This study will assist researchers in expanding their understanding of meta-heuristic optimization algorithms applications for FND.</p><h3 data-test=\"abstract-sub-heading\">Graphical Abstract</h3>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"38 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140168840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integration of transcriptomic analysis and multiple machine learning approaches identifies NAFLD progression-specific hub genes to reveal distinct genomic patterns and actionable targets","authors":"Jing Sun, Run Shi, Yang Wu, Yan Lou, Lijuan Nie, Chun Zhang, Yutian Cao, Qianhua Yan, Lifang Ye, Shu Zhang, Xuanbin Wang, Qibiao Wu, Xuehua Jiao, Jiangyi Yu, Zhuyuan Fang, Xiqiao Zhou","doi":"10.1186/s40537-024-00899-5","DOIUrl":"https://doi.org/10.1186/s40537-024-00899-5","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>Nonalcoholic fatty liver disease (NAFLD) is a leading public health problem worldwide. Approximately one fourth of patients with nonalcoholic fatty liver (NAFL) progress to nonalcoholic steatohepatitis (NASH), an advanced stage of NAFLD. Hence, there is an urgent need to make a better understanding of NAFLD heterogeneity and facilitate personalized management of high-risk NAFLD patients who may benefit from more intensive surveillance and preventive intervene.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>In this study, a series of bioinformatic methods were performed to identify NAFLD progression-specific pathways and genes, and three machine learning approaches were combined to construct a risk-stratification gene signature to quantify risk assessment. In addition, bulk RNA-seq, single-cell RNA-seq (scRNA-seq) transcriptome profiling data and whole-exome sequencing (WES) data were comprehensively analyzed to reveal the genomic alterations and altered pathways between distinct molecular subtypes.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>Two distinct subtypes of NAFL were identified with the NAFLD progression-specific genes, and one subtype has a high similarity of the inflammatory pattern and fibrotic potential with NASH. The established risk-stratification gene signature could discriminate advanced samples from overall NAFLD. COL1A2, one key gene closely related to NAFLD progression, is specifically expressed in fibroblasts involved in hepatocellular carcinoma (HCC), and significantly correlated with EMT and angiogenesis in pan-cancer. Moreover, the β-catenin/COL1A2 axis might play a critical role in fibrosis severity and inflammatory response during NAFLD-HCC progression.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>In summary, our study provided evidence for the necessity of molecular classification and established a risk-stratification gene signature to quantify risk assessment of NAFLD, aiming to identify different risk subsets and to guide personalized treatment.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"2 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140154958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Where you go is who you are: a study on machine learning based semantic privacy attacks","authors":"Nina Wiedemann, Krzysztof Janowicz, Martin Raubal, Ourania Kounadi","doi":"10.1186/s40537-024-00888-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00888-8","url":null,"abstract":"<p>Concerns about data privacy are omnipresent, given the increasing usage of digital applications and their underlying business model that includes selling user data. Location data is particularly sensitive since they allow us to infer activity patterns and interests of users, e.g., by categorizing visited locations based on nearby points of interest (POI). On top of that, machine learning methods provide new powerful tools to interpret big data. In light of these considerations, we raise the following question: What is the actual risk that realistic, machine learning based privacy attacks can obtain meaningful semantic information from raw location data, subject to inaccuracies in the data? In response, we present a systematic analysis of two attack scenarios, namely location categorization and user profiling. Experiments on the Foursquare dataset and tracking data demonstrate the potential for abuse of high-quality spatial information, leading to a significant privacy loss even with location inaccuracy of up to 200 m. With location obfuscation of more than 1 km, spatial information hardly adds any value, but a high privacy risk solely from temporal information remains. The availability of public context data such as POIs plays a key role in inference based on spatial information. Our findings point out the risks of ever-growing databases of tracking data and spatial context data, which policymakers should consider for privacy regulations, and which could guide individuals in their personal location protection measures.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"19 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140125539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}