{"title":"Improved Automated CASH Optimization with Tree Parzen Estimators for Class Imbalance Problems","authors":"D. Nguyen, Jiawen Kong, Hao Wang, S. Menzel, B. Sendhoff, Anna V. Kononova, Thomas Bäck","doi":"10.1109/DSAA53316.2021.9564147","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564147","url":null,"abstract":"The imbalanced classification problem is very relevant in both academic and industrial applications. The task of finding the best machine learning model to use for a specific imbalanced dataset is complicated due to a large number of existing algorithms, each with its own hyperparameters. The Combined Algorithm Selection and Hyperparameter optimization (CASH) has been introduced to tackle both aspects at the same time. However, CASH has not been studied in detail in the class imbalance domain, where the best combination of resampling technique and classification algorithm is searched for, together with their optimized hyperparameters. Thus, we target the CASH problem for imbalanced classification. We experiment with a search space of 5 classification algorithms, 21 resampling approaches and 64 relevant hyperparameters in total. Moreover, we investigate performance of 2 well-known optimization approaches: Random search and Tree Parzen Estimators approach which is a kind of Bayesian optimization. For comparison, we also perform grid search on all combinations of resampling techniques and classification algorithms with their default hyperparameters. 
Our experimental results show that a Bayesian optimization approach outperforms the other approaches for CASH in this application domain.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128516136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Anytime Subgroup Discovery in High Dimensional Numerical Data","authors":"Romain Mathonat, Diana Nurbakova, Jean-François Boulicaut, Mehdi Kaytoue, bon","doi":"10.1109/DSAA53316.2021.9564223","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564223","url":null,"abstract":"Subgroup discovery (SD) enables one to elicit patterns that strongly discriminate a class label. When it comes to numerical data, most of the existing SD approaches perform data discretizations and thus suffer from information loss. A few algorithms avoid such a loss by considering the search space of every interval pattern built on the dataset numerical values and provide an “anytime” property: at any moment, they are able to provide a result that improves over time. Given a sufficient time/memory budget, they may eventually complete an exhaustive search. However, such approaches are often intractable when dealing with high-dimensional numerical data, for instance, when extracting features from real-life multivariate time series. To overcome such limitations, we propose MonteCloPi, an approach based on a bottom-up exploration of numerical patterns with a Monte Carlo Tree Search. It enables a better exploration-exploitation trade-off when sampling huge search spaces. Our extensive set of experiments proves the efficiency of MonteCloPi on high-dimensional data with hundreds of attributes. 
We finally discuss the actionability of discovered subgroups when looking for skill analysis from Rocket League action logs.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121553966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning User Preferences Without Feedbacks","authors":"Wei Zhang, Chris Challis","doi":"10.1109/DSAA53316.2021.9564131","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564131","url":null,"abstract":"Recommending relevant data is vital for helping users to navigate through the ocean of data. We developed a service that learns user preferences through natural user interactions, without asking for user feedback, so users are not distracted from their regular workflow. Our approach has few parameters and very low time and space complexities, making it suitable for large scale applications. We demonstrate through experiments how it converges to user preferences and adapts to user behavior changes.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126970295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decoupling Autoencoders for Robust One-vs-Rest Classification","authors":"Max Lübbering, M. Gebauer, Rajkumar Ramamurthy, C. Bauckhage, R. Sifa","doi":"10.1109/DSAA53316.2021.9564136","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564136","url":null,"abstract":"One-vs-Rest (OVR) classification aims to distinguish a single class of interest from other classes. The concept of novelty detection and robustness to dataset shift becomes crucial in OVR when the scope of the rest class extends from the classes observed during training to unseen and possibly unrelated classes. In this work, we propose a novel architecture, namely Decoupling Autoencoder (DAE) to tackle the common issue of robustness w.r.t. out-of-distribution samples which is prevalent in classifiers such as multi-layer perceptrons (MLP) and ensemble architectures. Experiments on plain classification, outlier detection, and dataset shift tasks show DAE to achieve robust performance across these tasks compared to the baselines, which tend to fail completely, when exposed to dataset shift. While DAE and the baselines yield rather uncalibrated predictions on the outlier detection and dataset shift task, we found that DAE calibration is more stable across all tasks. 
Therefore, calibration measures applied to the classification task could also improve the calibration of the outlier detection and dataset shift scenarios for DAE.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128919940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DaskDB: Scalable Data Science with Unified Data Analytics and In Situ Query Processing","authors":"A. Watson, Suvam Kumar Das, S. Ray","doi":"10.1109/DSAA53316.2021.9564218","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564218","url":null,"abstract":"Due to the rapidly rising data volume, there is a need to analyze this data efficiently and produce results quickly. However, data scientists today need to use different systems, since presently relational databases are primarily used for SQL querying and data science frameworks for complex data analysis. This may incur significant movement of data across multiple systems, which is expensive. Furthermore, with relational databases, the data must be completely loaded into the database before performing any analysis. We believe that data scientists would prefer to use a single system to perform both data analysis tasks and SQL querying, without requiring data movement between different systems. Ideally, this system would offer adequate performance, scalability, built-in data analysis functionalities, and usability. We present DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking Python APIs as User-Defined Functions (UDF), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a distributed index join algorithm and a novel distributed learned index to improve join performance. Our experimental evaluation involves the TPC-H benchmark and a custom UDF benchmark, which we developed, for data analytics. 
And, we demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131202666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Fashion Similarity Based on Hierarchical Attribute Embedding","authors":"Cairong Yan, Anan Ding, Yanting Zhang, Zijian Wang","doi":"10.1109/DSAA53316.2021.9564236","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564236","url":null,"abstract":"Embedding items directly into a common feature space, and then measuring the similarity by calculating the feature distance in this space, has become the main method for similarity learning in current fashion retrieval tasks. The method is simple and efficient, but it ignores the correlation among fashion attributes and the impact of these correlations on the feature space, thereby reducing the accuracy of retrieval. Since the number of fashion attributes is large and the semantic granularity is also different, how to capture the relationship between fashion attributes and perform refined embedding to accurately represent fashion items is a challenge. In this paper, by constructing an attribute tree, we propose a hierarchical attribute embedding method for representing fashion items to enhance the relationship between attributes and use masking technology to disentangle different attributes. Based on these modules, we propose a hierarchical attribute-aware embedding network (HAEN) which takes images and attributes as input, learns multiple attribute-specific embedding spaces, and measures fine-grained similarity in the corresponding spaces. 
The extensive experimental result on two fashion-related public datasets FashionAI and DARN shows the superiority (+5.11% and +3.09% in MAP, respectively) of our proposed HAEN compared with state-of-the-art methods.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114763695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Tensor based Feature Extraction and Outlier Detection for Multivariate Time Series","authors":"Kiyotaka Matsue, M. Sugiyama","doi":"10.1109/DSAA53316.2021.9564117","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564117","url":null,"abstract":"Although finding a useful feature vector representation is one of the crucial tasks in data analysis for multivariate time series, it is still challenging because both time-wise and variable-wise associations should be taken into account. To overcome this issue, we present an unsupervised feature extraction algorithm for multivariate time series, called UFEKT (Unsupervised Feature Extraction using Kernel Method and Tucker Decomposition). Our algorithm (1) constructs a kernel matrix from subsequences of each time series to account for time-wise association and (2) constructs a single tensor from the kernel matrices and performs Tucker decomposition to account for variable-wise association. Feature representation is obtained as rows of the factor matrix of the decomposed tensor in a fully unsupervised manner, which can be used for subsequent machine learning problems. 
Our experimental results using synthetic and real-world multivariate time series datasets in the unsupervised outlier detection scenario show that our algorithm improves detection accuracy when it is used as pre-processing for outlier detection algorithms.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130036213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Neural Network Architecture with an Attention-based Layer for Spatial Prediction of Fine Particulate Matter","authors":"Luis E. Colchado, Edwin Villanueva, José Eduardo Ochoa Luna","doi":"10.1109/DSAA53316.2021.9564200","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564200","url":null,"abstract":"Several epidemiological studies indicate that fine particulate matter $PM_{2.5}$ affects human health, provoking cardiovascular and respiratory diseases, among others. It is therefore important to assess the spatial distribution of this pollutant. Air quality monitoring (AQM) networks are used to this end. However, they are usually spatially sparse due to their high costs, leaving large areas without monitoring. Numerical models have traditionally been proposed to infer the spatial distribution of air pollutants by simulating the diffusion and reaction process of air pollutants. However, such models usually need highly precise emission data and high-end computing hardware. In this paper, we propose a novel neural network architecture for $PM_{2.5}$ spatial estimation. This model uses a recently proposed attention layer to build a structured graph of the AQM stations (nodes) and to weight the k nearest neighbors for certain nodes based on attention kernels. The learned attention layer can generate a transformed feature representation for a testing node, which is further processed by a fully connected neural network (FCNN) to infer the pollutant concentration. Results on data from the Sao Paulo AQM network showed that our approach has better predictive performance than classical methods like Inverse Distance Weighting (IDW), Ordinary Kriging (OK), and FCNN without attention layer, according to different performance metrics. Additionally, the normalized attention weights computed by our model showed that in some cases, the attention given to the nearest nodes is independent of their spatial distances. 
This shows that the model is more flexible, since it can learn to interpolate $PM_{2.5}$ concentration levels based on the available data of the AQM network and some context information. As context information, we supply the model with variables such as vegetation index (NDVI), surface elevation data, Nighttime Lights (NTL) information and meteorological information.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133473369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A generic framework for forecasting short-term traffic conditions on urban highways","authors":"Seif-Eddine Attoui, Maroua Meddeb","doi":"10.1109/DSAA53316.2021.9564192","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564192","url":null,"abstract":"With the emergence of Connected and Smart Cities, the need to predict traffic conditions has led to the development of a large variety of forecasting algorithms. In spite of various research efforts, the choice of models and techniques strongly depends on the use case, the highway infrastructure as well as the provided dataset. This study is launched as part of a project which aims to design an Intelligent Transport System (ITS) dedicated to highway supervisors to regulate traffic. This system needs to be supplied with continuous, real-time forecasts of short-term traffic congestion in order to make decisions accordingly. In this paper, we propose a general framework that, first, performs different data preprocessing techniques to improve data quality, and second, provides real-time multi-horizon predictions. Our framework uses different models combining Machine learning and Deep learning algorithms. Experimental results confirmed the necessity of the data preprocessing step, especially with highly dynamic data and heterogeneous mobility contexts. 
In addition, our methodology is tested in a real case study and shows very encouraging results.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125008338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Centrality-based Interpretability Measures for Graph Embeddings","authors":"Shima Khoshraftar, Sedigheh Mahdavi, Aijun An","doi":"10.1109/DSAA53316.2021.9564221","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564221","url":null,"abstract":"Many real-world data are considered as graphs, such as computer networks, social networks and protein-protein interaction networks. Graph embedding methods are powerful tools for representing large graphs in various domains. A graph embedding method projects the components of a graph, such as its nodes or edges, into a vector space with a lower dimensionality than the adjacency matrix of the graph, and aims to preserve the characteristics of the graph. The generated embedding vectors have been utilized in various graph mining applications such as node classification, link prediction and anomaly detection. Despite the wide success of the graph embedding methods, little study has been done to facilitate a better understanding of the graph embeddings. In this paper, inspired by advancements in interpreting word embeddings, we propose two interpretability measures to quantify the interpretability of graph embeddings by leveraging useful network centrality properties and perform comparisons of different graph embedding methods. 
Using these scores, we can provide insights into the representational power of graph embedding methods.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123834405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}