{"title":"Recent Developments in Privacy-Preserving Mining of Clinical Data.","authors":"Chance Desmet, Diane J Cook","doi":"10.1145/3447774","DOIUrl":"10.1145/3447774","url":null,"abstract":"<p><p>With the dramatic increases in both the capability to collect personal data and the capability to analyze large amounts of data, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this paper, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.</p>","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8746818/pdf/nihms-1678257.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39814074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on the Role of Centrality as Seed Nodes for Information Propagation in Large Scale Network","authors":"Paramita Dey, Subhayan Bhattacharya, Sarbani Roy","doi":"10.1145/3465374","DOIUrl":"https://doi.org/10.1145/3465374","url":null,"abstract":"From the popular concept of six-degree separation, social networks are generally analyzed in the perspective of small world networks where centrality of nodes play a pivotal role in information propagation. However, working with a large dataset of a scale-free network (which follows power law) may be different due to the nature of the social graph. Moreover, the derivation of centrality may be difficult due to the computational complexity of identifying centrality measures. This study provides a comprehensive and extensive review and comparison of seven centrality measures (clustering coefficients, Node degree, K-core, Betweenness, Closeness, Eigenvector, PageRank) using four information propagation methods (Breadth First Search, Random Walk, Susceptible-Infected-Removed, Forest Fire). Five benchmark similarity measures (Tanimoto, Hamming, Dice, Sorensen, Jaccard) have been used to measure the similarity between the seed nodes identified using the centrality measures with actual source seeds derived through Google's LargeStar-SmallStar algorithm on Twitter Stream Data. MapReduce has been utilized for identifying the seed nodes based on centrality measures and for information propagation simulation. It is observed that most of the centrality measures perform well compared to the actual source in the initial stage but are saturated after a certain level of influence maximization in terms of both affected nodes and propagation level.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2021-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46862515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning","authors":"Jie Song, Qiang He, Feifei Chen, Ye Yuan, Ge Yu","doi":"10.1145/3465375","DOIUrl":"https://doi.org/10.1145/3465375","url":null,"abstract":"In big data query processing, there is a trade-off between query accuracy and query efficiency, for example, sampling query approaches trade-off query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly losing the possibility of query completeness, that is, the chance that a query is complete. To quantify the possibility, we define a new concept, Probability of query Completeness (hereinafter referred to as PC). For example, If a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among 100 results. Leveraging the probabilistic data placement and scanning, we trade off PC for query performance. In the article, we propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. The experimental results conducted on HiBench prove that PoBery can significantly accelerate queries while ensuring the PC. Specifically, it is guaranteed that the percentage of complete queries is larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is 1.7 ×, 1.1 ×, and 1.5 × faster on average than Drill, Impala, and Hive, respectively, on possibly-complete queries.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 28"},"PeriodicalIF":0.0,"publicationDate":"2021-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43046454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DataStorm: Coupled, Continuous Simulations for Complex Urban Environments","authors":"H. Behrens, K. Candan, Xilun Chen, Yash Garg, Mao-Lin Li, Xinsheng Li, Sicong Liu, M. Sapino, M. Shadab, Dalton Turner, Magesh Vijayakumaren","doi":"10.1145/3447572","DOIUrl":"https://doi.org/10.1145/3447572","url":null,"abstract":"Urban systems are characterized by complexity and dynamicity. Data-driven simulations represent a promising approach in understanding and predicting complex dynamic processes in the presence of shifting demands of urban systems. Yet, today’s silo-based, de-coupled simulation engines fail to provide an end-to-end view of the complex urban system, preventing informed decision-making. In this article, we present DataStorm to support integration of existing simulation, analysis and visualization components into integrated workflows. DataStorm provides a flow engine, DataStorm-FE, for coordinating data and decision flows among multiple actors (each representing a model, analytic operation, or a decision criterion) and enables ensemble planning and optimization across cloud resources. DataStorm provides native support for simulation ensemble creation through parameter space sampling to decide which simulations to run, as well as distributed instantiation and parallel execution of simulation instances on cluster resources. Recognizing that simulation ensembles are inherently sparse relative to the potential parameter space, we also present a density-boosting partition-stitch sampling scheme to increase the effective density of the simulation ensemble through a sub-space partitioning scheme, complemented with an efficient stitching mechanism that leverages partial and imperfect knowledge from partial dynamical systems to effectively obtain a global view of the complex urban process being simulated.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 37"},"PeriodicalIF":0.0,"publicationDate":"2021-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3447572","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46756618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TabReformer: Unsupervised Representation Learning for Erroneous Data Detection","authors":"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader","doi":"10.1145/3447541","DOIUrl":"https://doi.org/10.1145/3447541","url":null,"abstract":"Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models either require user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. Also, the model utilizes a Gaussian Error Linear Unit activation function with the Masked Data Model objective to achieve deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. The model applies a data augmentation module to generate more erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can enhance the recall values by 32.95% on average compared with state-of-the-art techniques while reducing the manual effort by up to 48.86%.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 29"},"PeriodicalIF":0.0,"publicationDate":"2021-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3447541","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48123450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata","authors":"Ru Yang, Yuhui Deng, Yi Zhou, Ping Huang","doi":"10.1145/3437261","DOIUrl":"https://doi.org/10.1145/3437261","url":null,"abstract":"Restoring data is the main purpose of data backup in storage systems. The fragmentation issue, caused by physically scattering logically continuous data across a variety of disk locations, poses a negative impact on the restoring performance of a deduplication system. Rewriting algorithms are used to alleviate the fragmentation problem by improving the restoring speed of a deduplication system. However, rewriting methods give birth to a big sacrifice in terms of deduplication ratio, leading to a huge storage space waste. Furthermore, traditional backup approaches treat file metadata and chunk metadata as the same, which causes frequent on-disk metadata accesses. In this article, we start by analyzing storage characteristics of backup metadata. An intriguing finding shows that with 10 million files, the file metadata merely takes up approximately 340 MB. Motivated by this finding, we propose a Classified-Metadata based Restoring method (CMR) that classifies backup metadata into file metadata and chunk metadata. Because the file metadata merely takes up a meager amount of space, CMR maintains all file metadata in memory, whereas chunk metadata are aggressively prefetched to memory in a greedy manner. A deduplication system with CMR in place exhibits three salient features: (i) It avoids rewriting algorithms’ additional overhead by reducing the number of disk reads in a restoring process, (ii) it increases the restoring throughput without sacrificing the deduplication ratio, and (iii) it thoroughly leverages the hardware resources to boost the restoring performance. To quantitatively evaluate the performance of CMR, we compare our CMR against two state-of-the-art approaches, namely, a history-aware rewriting method (HAR) and a context-based rewriting scheme (CAP). The experimental results show that compared to HAR and CAP, CMR reduces the restoring time by 27.2% and 29.3%, respectively. Moreover, the deduplication ratio is improved by 1.91% and 4.36%, respectively.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 16"},"PeriodicalIF":0.0,"publicationDate":"2021-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3437261","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41530445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Learning-based Support for Data Science Applications","authors":"Ke Zhou, Jingkuan Song","doi":"10.1145/3450751","DOIUrl":"https://doi.org/10.1145/3450751","url":null,"abstract":"This issue of ACM/IMS Transactions on Data Science (TDS) contains a collection of 6 articles from 25 submissions to the TDS journal. TDS is a Gold Open Access journal that publishes articles on cross-disciplinary innovative research ideas, algorithms, systems, theory, and applications for data science. Articles that address challenges at every stage, from acquisition on, through data cleaning, transformation, representation, integration, indexing, modeling, analysis, visualization, and interpretation, while retaining privacy, fairness, provenance, transparency, and provision of social benefit, within the context of big data, fall within the scope of the journal. The six accepted articles for this issue are representative in their respective fields, including communication, image processing, natural language processing, dark data, text recognition, and data deduplication. According to the traits of their own fields, these articles apply appropriate machine learning technologies and have achieved convincing results. We believe that these achievements can provide machine learning–based solutions for applications in real life and inspire more practical problems to be solved via machine learning techniques. In “Deep Hash–based Relevance-aware Data Quality Assessment for Image Dark Data,” authors propose a deep hash– based framework called DHR-DQA to lighten and assess image dark data. This framework combines deep learning, hashing technique, graph technique, and CV technique to explore a very advanced application, which is very enlightening. In “Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata,” authors utilize machine learning techniques to complete a classic task in the storage field, and the results show the universality of machine learning techniques. From these articles, we observe that machine learning can solve almost all the application problems under rational condition. We hope readers enjoy this special issue and that these articles can enlighten their work.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 1"},"PeriodicalIF":0.0,"publicationDate":"2021-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3450751","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41765295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Temporal Patterns of Cyberbullying Detection with Hierarchical Attention Networks","authors":"Lu Cheng, Ruocheng Guo, Yasin N. Silva, Deborah L. Hall, Huan Liu","doi":"10.1145/3441141","DOIUrl":"https://doi.org/10.1145/3441141","url":null,"abstract":"Cyberbullying is rapidly becoming one of the most serious online risks for adolescents. This has motivated work on machine learning methods to automate the process of cyberbullying detection, which have so far mostly viewed cyberbullying as one-off incidents that occur at a single point in time. Comparatively less is known about how cyberbullying behavior occurs and evolves over time. This oversight highlights a crucial open challenge for cyberbullying-related research, given that cyberbullying is typically defined as intentional acts of aggression via electronic communication that occur repeatedly and persistently. In this article, we center our discussion on the challenge of modeling temporal patterns of cyberbullying behavior. Specifically, we investigate how temporal information within a social media session, which has an inherently hierarchical structure (e.g., words form a comment and comments form a session), can be leveraged to facilitate cyberbullying detection. Recent findings from interdisciplinary research suggest that the temporal characteristics of bullying sessions differ from those of non-bullying sessions and that the temporal information from users’ comments can improve cyberbullying detection. The proposed framework consists of three distinctive features: (1) a hierarchical structure that reflects how a social media session is formed in a bottom-up manner; (2) attention mechanisms applied at the word- and comment-level to differentiate the contributions of words and comments to the representation of a social media session; and (3) the incorporation of temporal features in modeling cyberbullying behavior at the comment-level. Quantitative and qualitative evaluations are conducted on a real-world dataset collected from Instagram, the social networking site with the highest percentage of users reporting cyberbullying experiences. Results from empirical evaluations show the significance of the proposed methods, which are tailored to capture temporal patterns of cyberbullying detection.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 23"},"PeriodicalIF":0.0,"publicationDate":"2021-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3441141","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46255488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data","authors":"Yu Liu, Yangtao Wang, Lianli Gao, Chan Guo, Yanzhao Xie, Zhili Xiao","doi":"10.1145/3420038","DOIUrl":"https://doi.org/10.1145/3420038","url":null,"abstract":"Data mining can hardly solve but always faces a problem that there is little meaningful information within the dataset serving a given requirement. Faced with multiple unknown datasets, to allocate data mining resources to acquire more desired data, it is necessary to establish a data quality assessment framework based on the relevance between the dataset and requirements. This framework can help the user to judge the potential benefits in advance, so as to optimize the resource allocation to those candidates. However, the unstructured data (e.g., image data) often presents dark data states, which makes it tricky for the user to understand the relevance based on content of the dataset in real time. Even if all data have label descriptions, how to measure the relevance between data efficiently under semantic propagation remains an urgent problem. Based on this, we propose a Deep Hash-based Relevance-aware Data Quality Assessment framework, which contains off-line learning and relevance mining parts as well as an on-line assessing part. In the off-line part, we first design a Graph Convolution Network (GCN)-AutoEncoder hash (GAH) algorithm to recognize the data (i.e., lighten the dark data), then construct a graph with restricted Hamming distance, and finally design a Cluster PageRank (CPR) algorithm to calculate the importance score for each node (image) so as to obtain the relevance representation based on semantic propagation. In the on-line part, we first retrieve the importance score by hash codes and then quickly get the assessment conclusion in the importance list. On the one hand, the introduction of GCN and co-occurrence probability in the GAH promotes the perception ability for dark data. On the other hand, the design of CPR utilizes hash collision to reduce the scale of graph and iteration matrix, which greatly decreases the consumption of space and computing resources. We conduct extensive experiments on both single-label and multi-label datasets to assess the relevance between data and requirements as well as test the resources allocation. Experimental results show our framework can gain the most desired data with the same mining resources. Besides, the test results on Tencent1M dataset demonstrate the framework can complete the assessment with a stability for given different requirements.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2021-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3420038","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46805752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simultaneous Image Reconstruction and Feature Learning with 3D-CNNs for Image Set–Based Classification","authors":"Xinyu Zhang, Xiaocui Li, X.-Y. Jing, Li Cheng","doi":"10.1145/3420037","DOIUrl":"https://doi.org/10.1145/3420037","url":null,"abstract":"Image set–based classification has attracted substantial research interest because of its broad applications. Recently, lots of methods based on feature learning or dictionary learning have been developed to solve this problem, and some of them have made gratifying achievements. However, most of them transform the image set into a 2D matrix or use 2D convolutional neural networks (CNNs) for feature learning, so the spatial and temporal information is missing. At the same time, these methods extract features from original images in which there may exist huge intra-class diversity. To explore a possible solution to these issues, we propose a simultaneous image reconstruction with deep learning and feature learning with 3D-CNNs (SIRFL) for image set classification. The proposed SIRFL approach consists of a deep image reconstruction network and a 3D-CNN-based feature learning network. The deep image reconstruction network is used to reduce the diversity of images from the same set, and the feature learning network can effectively retain spatial and temporal information by using 3D-CNNs. Extensive experimental results on five widely used datasets show that our SIRFL approach is a strong competitor for the state-of-the-art image set classification methods.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 13"},"PeriodicalIF":0.0,"publicationDate":"2021-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3420037","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42723418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}