{"title":"SMOTEBoost for Regression: Improving the Prediction of Extreme Values","authors":"Nuno Moniz, Rita P. Ribeiro, Vítor Cerqueira, N. Chawla","doi":"10.1109/DSAA.2018.00025","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00025","url":null,"abstract":"Supervised learning with imbalanced domains is one of the biggest challenges in machine learning. Such tasks differ from standard learning tasks by assuming a skewed distribution of target variables, and user domain preference towards under-represented cases. Most research has focused on imbalanced classification tasks, where a wide range of solutions has been tested. Still, little work has been done concerning imbalanced regression tasks. In this paper, we propose an adaptation of the SMOTEBoost approach for the problem of imbalanced regression. Originally designed for classification tasks, it combines boosting methods and the SMOTE resampling strategy. We present four variants of SMOTEBoost and provide an experimental evaluation using 30 datasets with an extensive analysis of results in order to assess the ability of SMOTEBoost methods in predicting extreme target values, and their predictive trade-off concerning baseline boosting methods. SMOTEBoost is publicly available in a software package.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122548464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multivariate Time Series Early Classification Using Multi-Domain Deep Neural Network","authors":"Huai-Shuo Huang, Chien-Liang Liu, V. Tseng","doi":"10.1109/DSAA.2018.00019","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00019","url":null,"abstract":"Early classification on multivariate time series is an important research topic in data mining with wide applications to various domains like medical diagnosis, motion detection and financial prediction, etc. Shapelet is probably one of the most commonly used approaches to tackle the early classification problem, but one drawback of shapelet is its inefficiency. More importantly, the extracted shapelets may not be applicable to every test case at any time point. This work focuses on early classification of multivariate time series and proposes a novel framework named Multi-Domain Deep Neural Network (MDDNN), in which convolutional neural network (CNN) and long short-term memory (LSTM) are incorporated to learn feature representation and relationship embedding in the long sequences with long time lags. The proposed model can make predictions at any time point of a multivariate time series with the help of a truncation process. We conducted experiments on four real datasets and compared with state-of-the-art algorithms. The experimental results indicate that the proposed method outperforms the alternatives significantly on both earliness and accuracy. Detailed analysis about the proposed model is also provided in this work. To the best of our knowledge, this is the first work that incorporates deep neural network methods (CNN and LSTM) and multi-domain approach to address the problem of early classification on multivariate time series.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"35 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115984439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DSAA 2018 Special Session: Data Science for Social Good","authors":"D. Paolotti, M. Tizzoni","doi":"10.1109/DSAA.2018.00060","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00060","url":null,"abstract":"We provide an overview of the DSAA 2018 Data Science for Social Good special session, its aims and contributions.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127406482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opportunities and Risks for Data Science in Organizations: Banking, Finance, and Policy - Special Session Overview","authors":"A. Azzini, S. Marrara, Amir Topalovic, M. P. Bach, Matthew J. Rattigan","doi":"10.1109/DSAA.2018.00078","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00078","url":null,"abstract":"In this paper, the DSAA 2018 special session \"Opportunities and Risks for Data Science in Organizations: Banking, Finance, and Policy\" is presented. This session is focused on discussing how banking and finance organizations can benefit from the huge amount of data they own and continue to gather. Moreover, the session aims at identifying and exploring the challenges of applying data science to financial policy questions. It is also planned to promote a special issue of the ACM Journal of Data and Information Quality on the workshop topics.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125660109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Fusion to Describe and Quantify Search and Rescue Operations in the Mediterranean Sea","authors":"K. H. Pham, Jeremy Boy, M. Luengo-Oroz","doi":"10.1109/DSAA.2018.00066","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00066","url":null,"abstract":"The Mediterranean Sea is the stage of one of the biggest humanitarian crises to affect Europe. Since 2014, thousands of migrants and refugees have died or gone missing in dangerous attempts to cross into the continent. However, there is relatively little structured information available on how they attempt the crossing. Such information could be used to better target maritime rescue efforts or to anticipate smuggling patterns, which could potentially save lives. In this article, we provide an overview of data sources available for the study of migration in the Central Mediterranean. We describe how these data can be structured, combined, and analyzed to provide quantitative insights on the situation in the region. We define a quantified rescue framework for fusing different data sources around individual rescue operations, and we explore the potential of machine learning to perform automated rescue detection based on vessel trajectory information. We conclude with technical research questions, and potential policy and operational implications related to the use of these data sources.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130481836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DSAA 2018 Keynotes","authors":"","doi":"10.1109/dsaa.2018.00009","DOIUrl":"https://doi.org/10.1109/dsaa.2018.00009","url":null,"abstract":"","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130322228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepClean: Data Cleaning via Question Asking","authors":"Xinyang Zhang, Yujie Ji, Chanh Nguyen, Ting Wang","doi":"10.1109/DSAA.2018.00039","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00039","url":null,"abstract":"As one critical task in the data analysis pipeline, data cleaning is notoriously human labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not at all impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains, and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"32 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134484105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pattern-Based Automatic Parallelization of Representative-Based Clustering Algorithms","authors":"Saiyedul Islam, S. Balasubramaniam, Shruti Gupta, Shikhar Brajesh, Rohan Badlani, Nitin Labhishetty, Abhinav Baid, Poonam Goyal, Navneet Goyal","doi":"10.1109/DSAA.2018.00020","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00020","url":null,"abstract":"Ease of programming and optimal parallel performance have historically been on the opposite side of a tradeoff, forcing the user to choose. With the advent of the Big Data era and rapid evolution of sequential algorithms, the data analytics community can no longer afford the tradeoff. We observed that several clustering algorithms often share common traits - particularly, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observation on domain patterns in Representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. Our experiments on different state-of-the-art parallelization frameworks show that our system is able to achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133566676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Simulation-Data Science – A Case Study on Material Failures","authors":"Holger Trittenbach, M. Gauch, Klemens Böhm, K. Schulz","doi":"10.1109/DSAA.2018.00058","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00058","url":null,"abstract":"Simulations let scientists study properties of complex systems. At first sight, data mining is a good choice when evaluating large numbers of simulations. But it is currently unclear whether there are general principles that might guide the deployment of respective methods to simulation data. In other words, is it worthwhile to target simulation-data science as a distinct subdiscipline of data science? To identify a respective research agenda and to structure the research questions, we conduct a case study from the domain of materials science. One insight is that simulation data may be different from other data regarding its structure and quality, which entails focal points different from the ones of conventional data-analysis projects. It also turns out that interpretability and usability are important notions in our context as well. More attention is needed to gather the various meanings of these terms to align them with the needs and priorities of domain scientists. Finally, we propose extensions to our case study which we deem necessary to generalize our insights towards the guidelines envisioned for simulation-data science.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121849579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowdsourcing Landforms for Open GIS Enrichment","authors":"Rocio Nahime Torres, Darian Frajberg, P. Fraternali, Sergio Luis Herrera Gonzales","doi":"10.1109/DSAA.2018.00077","DOIUrl":"https://doi.org/10.1109/DSAA.2018.00077","url":null,"abstract":"Open Source Geographical Information Systems, such as OpenStreetMap (OSM), offer a valuable alternative to proprietary solutions for the development of voluntary environment monitoring systems. However, the quantity and quality of information stored in such systems must be carefully evaluated and the contributions of volunteers must be boosted by means of effective engagement methods. This paper reports the results of the assessment of the quality and quantity of OpenStreetMap mountain information: different types of information and world regions have different gaps and improvement requirements. To address this issue, we propose a hybrid approach, in which an open Digital Elevation Model data set is processed with a heuristic algorithm to find candidate mountain information and uncertainty in the automatically extracted candidates is reduced by means of voluntary expert crowdsourcing. The improvement of landform information (not only about mountains, but also about orography and hydrography in general) can support the development of environment monitoring applications.","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129788710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}