{"title":"Robust Binary Classification via ℓ0-SVM","authors":"Jianxiong Tang, N. Zhang, Qia Li","doi":"10.1109/ICDMW.2018.00180","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00180","url":null,"abstract":"Binary classification is one of the fundamental problems in data mining, and the support vector machine (SVM) has been successfully used in binary classification problems. In many applications, the data sets are often polluted by label noise, especially when human experts are involved. The performance of the classical SVM may not be satisfactory for data sets with noisy labels. From the view of maximum likelihood estimation, the 0-1 loss function is an appropriate loss function for data with noisy labels. However, the existence of minimizers of the corresponding optimization problem is not guaranteed. In this paper, we combine the ideas of the 0-1 loss and the hinge loss and propose the ℓ0-norm hinge loss. The function value of the ℓ0-norm hinge loss is 1 when the product of the label and the projection of a sample is less than some small positive number. Otherwise, the function value is 0. Based on the ℓ0-norm hinge loss, we first propose the linear ℓ0-SVM and then design the nonlinear ℓ0-SVM by introducing the kernel function. Compared with the classical SVM, the piecewise constant property of the ℓ0-norm hinge loss makes it robust to label noise. The optimization problems in both linear and nonlinear ℓ0-SVMs are guaranteed to have minimizers. To solve the corresponding optimization problem, we first utilize the penalty method to decompose the ℓ0-norm and the corresponding linear mapping. Then, the block coordinate descent method, which converges in the sense that the objective function value decreases and converges, can be adapted to solve the penalty problem.
Experiments show that the proposed ℓ0-SVM performs well in applications.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126305909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online and Semi-Online Vector Scheduling on A Single Machine with Rejection","authors":"Qianna Cui, Haiwei Pan","doi":"10.1109/ICDMW.2018.00141","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00141","url":null,"abstract":"In this paper, we design an online algorithm for vector scheduling on a single machine with rejection whose competitive ratio is d, where d is the dimension of the vectors. In addition, we consider two versions of semi-online vector scheduling on a single machine with rejection. In the first version, semi-online with rearrangement, at most one job may be reassigned after all jobs have been scheduled; for this version we give a semi-online algorithm with competitive ratio 1/2 d + 2 for d > 3. The second version is semi-online with a rejection buffer of length 1, which can hold one job. For d > 3, we also give an algorithm with competitive ratio 1/2 d + 2.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129871153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Supervised Psychometric Scoring of Document Collections","authors":"Burak Suyunu, Gonul Ayci, Mine Ögretir, A. Cemgil, S. Uskudarli, Hamza Zeytinoglu, Bülent Özel, Arman Boyaci","doi":"10.1109/ICDMW.2018.00194","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00194","url":null,"abstract":"We describe a generic computational approach that can be used in developing methods for psychometric profiling. Our approach is based on semi-supervised analysis of document collections using topic modeling. The method depends on a supervisor providing a set of seed documents, grouped by abstract themes, such as Schwartz values or personality traits; and possibly a separate background document corpus. Instead of casting the problem into a standard classification framework, we interpret the group labels as a guide for finding distinguishing features. During training, we train each group of documents associated with a theme separately by using nonnegative matrix factorization to obtain theme specific topic distributions. In the analysis, we decompose a new document using the model learned during training to arrive at the theme scores. 
We demonstrate our approach on two psychometric profiling theories (Schwartz and Big Five) and evaluate our Schwartz scores with leave-one-out cross-validation method and compare Big Five scores to independent surveys, which are much more costly to carry out.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124581832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Network-Based Approach to Enhance Electricity Load Forecasting","authors":"Etienne Gael Tajeuna, M. Bouguessa, Shengrui Wang","doi":"10.1109/ICDMW.2018.00046","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00046","url":null,"abstract":"In the field of energy analysis, time series forecasting techniques are widely used to predict customer electricity consumptions. To enhance the electricity forecasting accuracy, in current approaches, clustering techniques are first applied to identify groups of customers exhibiting the same electricity load profile, from which a representative consumption pattern can be extracted. This pattern is later used to predict customers' subsequent electricity consumption. In the vast majority of clustering approaches, authors use the entire data set as input to identify customer consumption groups. However, electricity load data vary extremely rapidly and can thus be dominated by outdated historical information which may influence the effective cluster status at a given time-stamp. To overcome this constraint, instead of using the entire data set, we propose an adaptive process which involves tracking the evolution of identified customer consumption groups at different time-stamps. A network structure is used to model the interrelation between customer electricity load profiles. The network is then split into subnetworks that are treated as customer electricity consumption clusters. Representative subseries, called master subseries, are extracted to track the evolution of clusters over time. Finally, the master subseries are used as a knowledge base for forecasting customers' electricity consumption at later time-stamps and automatically predicting future cluster status. 
The load forecasting is done using a seasonal autoregressive integrated moving average model, which is compared to a multi-layer perceptron, support vector regression, lasso regression, bayesian ridge regression and K-nearest neighbor regression models.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126957976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LiSense: Monitoring City Street Lighting During Night using Smartphone Sensors","authors":"Munshi Yusuf Alam, Shahrukh Imam, H. Anurag, Sujoy Saha, S. Nandi, M. Saha","doi":"10.1109/ICDMW.2018.00092","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00092","url":null,"abstract":"Adequate illumination of city streets during night hours is essential to ensure road safety. However, even in developed cities, monitoring streetlights still remains a tedious task that relies on manual inspection reports. Existing systems mostly rely on vehicle-mounted cameras or on sensors fitted at every light post, which are neither cost-effective nor scalable. In contrast, in this paper, we develop a novel cost-effective system, LiSense, to monitor the illumination levels of street lights and to detect and localize malfunctioning light posts. The system utilizes ambient light and GPS sensors and uses crowdsourcing. Sensor trails collected by our app from two-wheelers covering 160 km of suburban city roads detect all malfunctioning street lights with more than 96% accuracy and a mean localization error of 6 meters. To the best of our knowledge, this is the first approach of its kind to monitoring street light condition that is cost-effective, scalable and suitable for developing regions.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126008574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel k-Means Clustering of Geospatial Data Sets Using Manycore CPU Architectures","authors":"R. Mills, Vamsi Sripathi, J. Kumar, S. Sreepathi, F. Hoffman, W. Hargrove","doi":"10.1109/ICDMW.2018.00118","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00118","url":null,"abstract":"The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery and mining of weather, climate, ecological, and other geoscientific data sets fused from disparate sources. Many of the standard tools used on individual workstations are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. Here, we describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatial and geospatiotemporal data, and discuss algorithmic modifications and code optimizations we have made to enable it to effectively use parallel machines based on novel CPU architectures—such as the Intel Knights Landing Xeon Phi and Skylake Xeon processors—with many cores and hardware threads, and employing significant single instruction, multiple data (SIMD) parallelism. 
We outline some applications of the code in ecology and climate science contexts and present a detailed discussion of the performance of the code for one such application, LiDAR-derived vertical vegetation structure classification.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124099587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardening Encrypted Patient Names Against Cryptographic Attacks Using Cellular Automata","authors":"R. Schnell, C. Borgs","doi":"10.1109/ICDMW.2018.00082","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00082","url":null,"abstract":"Linking information across different databases enables new research in the medical sciences. Recent EU privacy regulations recommend encrypting personal identifiers used for linking. In this contribution, a new method for hardening such a privacy-preserving record linkage technique (PPRL) against attacks is presented. The new hardening method prevents re-identifications and cryptographic attacks while still delivering acceptable linkage quality. Using real-world mortality data, we compare clear-text and several current PPRL methods with our newly proposed method. While all PPRL methods will have to balance security and quality, the use of a cellular automata transformation to protect against attacks will decrease the linkage quality only slightly, while preventing all currently known methods of decrypting Bloom filter-based private linkage keys.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127456555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Dimensional Clustering: A Strongly Connected Component Clustering Solution (SCCC)","authors":"Mihir Shekhar, Lini T. Thomas, K. Karlapalem","doi":"10.1109/ICDMW.2018.00159","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00159","url":null,"abstract":"High dimensional data is often challenging to cluster due to the curse of dimensionality leading to challenges in identifying clusters. The key challenge in high dimensional clustering is to develop a solution that identifies clusters which are as complete as they can be, while not merging well-separated clusters. We propose core points which represent local compact regions. The strongly connected component from the k-nearest neighbor graph of core points provides for a group of points that are strongly mutually connected. These mutually connected regions represent the core structure of the clusters. Our empirical analysis and experimental results present the rationale behind our solution and validate the goodness of the clusters against the state of the art high dimensional clustering algorithms. The novelty of our solution is to use the concept of reverse nearest neighbors to generate natural clusters in high dimensions.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"49 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133685495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preliminary Case Study on Data Utilization and Collaboration on the Web","authors":"Daiji Iwasa","doi":"10.1109/ICDMW.2018.00033","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00033","url":null,"abstract":"Recently, data holders have been collecting various kinds of data owing to the improvement of Internet of Things (IoT) technologies. On the other hand, data analysts can examine even large data sets owing to the spread of analysis methods and tools and to strong computing power. The critical point for accelerating data utilization is communication among data stakeholders. Data analysts should consider the purpose for which data holders collect data. However, communication among stakeholders is hard when some of them are not familiar with data. Innovators Marketplace on Data Jackets (IMDJ) [5] is a workshop method to tackle this problem. In an IMDJ workshop, participants state their requirements and create a scenario for satisfying these requirements based on Data Jackets. Data Jacket (DJ) [4] is a framework for describing structured information about data in natural language, which enables those who are not familiar with data to take part in data-based discussion. In this paper, we introduce a platform called Web-IMDJ for conducting IMDJ workshops on the web. Web-IMDJ not only reduces the burden of workshops but also enables remote participation.
By conducting a workshop on Web-IMDJ as a case study, we found that the number of ideas is as large as in previous IMDJ workshops and that the capacity for participants is greater in Web-IMDJ.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130743447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Connections Between Domains through Latent Space Mapping","authors":"Yingjing Lu","doi":"10.1109/ICDMW.2018.00157","DOIUrl":"https://doi.org/10.1109/ICDMW.2018.00157","url":null,"abstract":"Exploring ways to connect data is crucial to building knowledge graphs that associate data from different domains. Humans, for example, can learn to associate flour with bread because bread is made of flour, so they can recall information about flour given a piece of bread even though bread and flour have few common features. In data mining, this ability translates into connecting images, text, and audio from different classes or domains. Most work so far assumes shared feature representations between the domains to be connected. Another limitation is that for each defined mapping scheme, a new model often has to be trained end-to-end on all sample data, which is often expensive. In this work, we present a model that aims to address both limitations simultaneously. We use unconditionally trained Variational Autoencoders (VAEs) to project high-dimensional data into the latent space and present a novel generative model that transfers the latent representation of data from one domain to another according to any custom schema. The model makes no assumption of any shared representation among different domains.
The VAEs that encode entire datasets, which account for the largest training overhead in this model, can be reused to support any new mapping schema without any retraining.","PeriodicalId":259600,"journal":{"name":"2018 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117236938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}