{"title":"SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling","authors":"Astha Agrawal, H. Viktor, E. Paquet","doi":"10.5220/0005595502260234","DOIUrl":"https://doi.org/10.5220/0005595502260234","url":null,"abstract":"Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class imbalance in datasets that contains multiple classes, with varying degree of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and incorrectly classify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracies. Further, there is a need to handle both the imbalances between classes as well as address the selection of examples within a class (i.e. the so-called within class imbalance). In this paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes. In addition, it handles both within-class and between-class imbalance. Our experimental results against a number of multi-class problems show that, when the SCUT method is used for pre-processing the data before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128493972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Customer tracking systems based on identifiers of mobile phones","authors":"D. Mikhaylov, A. Zuykov, S. Kharkov, S. V. Ponomarev, S. Dvoryankin, A. Tolstaya","doi":"10.5220/0005628004510455","DOIUrl":"https://doi.org/10.5220/0005628004510455","url":null,"abstract":"Gathering statistics about visitors finds more and more applications in various fields of business and commerce. This paper describes the system of impersonal counting of unique visitors by their mobile identifiers. Counting is carried out using a non-functioning communication cell (the system does not provide communication services to users of mobile networks). The system masks itself as the base station of a mobile operator. Mobile devices automatically connect to the system even in case of a strong signal from the towers of mobile operators. Once connected, the user identification data is received. The proposed solution allows to compare data about the number of visits to a particular site in various periods of time and to identify the re-occurrence of the visitors. The system is inexpensive and shows 99% accuracy in the identification of users (compared to the real data about the visitors).","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132872865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple behavioral models: A Divide and Conquer strategy to fraud detection in financial data streams","authors":"Roberto Saia, Ludovico Boratto, S. Carta","doi":"10.5220/0005637104960503","DOIUrl":"https://doi.org/10.5220/0005637104960503","url":null,"abstract":"The exponential and rapid growth of the E-commerce based both on the new opportunities offered by the Internet, and on the spread of the use of debit or credit cards in the online purchases, has strongly increased the number of frauds, causing large economic losses to the involved businesses. The design of effective strategies able to face this problem is however particularly challenging, due to several factors, such as the heterogeneity and the non-stationary distribution of the data stream, as well as the presence of an imbalanced class distribution. To complicate the problem, there is the scarcity of public datasets for confidentiality issues, which does not allow researchers to verify the new strategies in many data contexts. Differently from the canonical state-of-the-art strategies, instead of defining a unique model based on the past transactions of the users, we follow a Divide and Conquer strategy, by defining multiple models (user behavioral patterns), which we exploit to evaluate a new transaction, in order to detect potential attempts of fraud. We can act on some parameters of this process, in order to adapt the models sensitivity to the operating environment. Considering that our models do not need to be trained with both the past legitimate and fraudulent transactions of a user, since they use only the legitimate ones, we can operate in a proactive manner, by detecting fraudulent transactions that have never occurred in the past. Such a way to proceed also overcomes the data imbalance problem that afflicts the machine learning approaches. The evaluation of the proposed approach is performed by comparing it with one of the most performant approaches at the state of the art as Random Forests, using a real-world credit card dataset.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123007370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large scale web-content classification","authors":"L. Deri, M. Martinelli, Daniele Sartiano, Loredana Sideri","doi":"10.5220/0005635605450554","DOIUrl":"https://doi.org/10.5220/0005635605450554","url":null,"abstract":"Web classification is used in many security devices for preventing users to access selected web sites that are not allowed by the current security policy, as well for improving web search and for implementing contextual advertising. There are many commercial web classification services available on the market and a few publicly available web directory services. Unfortunately they mostly focus on English-speaking web sites, making them unsuitable for other languages in terms of classification reliability and coverage. This paper covers the design and implementation of a web-based classification tool for TLDs (Top Level Domain). Each domain is classified by analysing the main domain web site, and classifying it in categories according to its content. The tool has been successfully validated by classifying all the registered it. Internet domains, whose results are presented in this paper.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123274327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KISS MIR: Keep it semantic and social music information retrieval","authors":"Amna Dridi, Mouna Kacimi","doi":"10.5220/0005616704330439","DOIUrl":"https://doi.org/10.5220/0005616704330439","url":null,"abstract":"While content-based approaches for music information retrieval (MIR) have been heavily investigated, user-centric approaches are still in their early stage. Existing user-centric approaches use either music-context or user-context to personalize the search. However, none of them give the possibility to the user to choose the suitable context for his needs. In this paper we propose KISS MIR, a versatile approach for music information retrieval. It consists in combining both music-context and user-context to rank search results. The core contribution of this work is the investigation of different types of contexts derived from social networks. We distinguish semantic and social information and use them to build semantic and social profiles for music and users. The different contexts and profiles can be combined and personalized by the user. We have assessed the quality of our model using a real dataset from Last.fm. The results show that the use of user-context to rank search results is two times better than the use of music-context. More importantly, the combination of semantic and social information is crucial for satisfying user needs.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115806277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal-based feature selection and transfer learning for text categorization","authors":"Fumiyo Fukumoto, Yoshimi Suzuki","doi":"10.5220/0005593100170026","DOIUrl":"https://doi.org/10.5220/0005593100170026","url":null,"abstract":"This paper addresses text categorization problem that training data may derive from a different time period from the test data. We present a method for text categorization that minimizes the impact of temporal effects. Like much previous work on text categorization, we used feature selection. We selected two types of informative terms according to corpus statistics. One is temporal independent terms that are salient across full temporal range of training documents. Another is temporal dependent terms which are important for a specific time period. For the training documents represented by independent/dependent terms, we applied boosting based transfer learning to learn accurate model for timeline adaptation. The results using Japanese data showed that the method was comparable to the current state-of-the-art biased-SVM method, as the macro-averaged F-score obtained by our method was 0.688 and that of biased-SVM was 0.671. Moreover, we found that the method is effective, especially when the creation time period of the test data differs greatly from that of the training data.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114838815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards reusability of computational experiments: Capturing and sharing Research Objects from knowledge discovery processes","authors":"A. Lefebvre, M. Spruit, Wienand A. Omta","doi":"10.5220/0005631604560462","DOIUrl":"https://doi.org/10.5220/0005631604560462","url":null,"abstract":"Calls for more reproducible research by sharing code and data are released in a large number of fields from biomedical science to signal processing. At the same time, the urge to solve data analysis bottlenecks in the biomedical field generates the need for more interactive data analytics solutions. These interactive solutions are oriented towards wet lab users whereas bioinformaticians favor custom analysis tools. In this position paper we elaborate on why Reproducible Research, by presenting code and data sharing as a gold standard for reproducibility misses important challenges in data analytics. We suggest new ways to design interactive tools embedding constraints of reusability with data exploration. Finally, we seek to integrate our solution with Research Objects as they are expected to bring promising advances in reusability and partial reproducibility of computational work.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124000131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid sentiment analyser for Arabic tweets using R","authors":"S. Alhumoud, Tarfa Albuhairi, Wejdan Alohaideb","doi":"10.5220/0005616204170424","DOIUrl":"https://doi.org/10.5220/0005616204170424","url":null,"abstract":"Harvesting meaning out of massively increasing data could be of great value for organizations. Twitter is one of the biggest public and freely available data sources. This paper presents a Hybrid learning implementation to sentiment analysis combining lexicon and supervised approaches. Analysing Arabic, Saudi dialect Twitter tweets to extract sentiments toward a specific topic. This was done using a dataset consisting of 3000 tweets collected in three domains. The obtained results confirm the superiority of the hybrid learning approach over the supervised and unsupervised approaches.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121030611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The NOESIS open source framework for network data mining","authors":"Víctor Martínez, Fernando Berzal Galiano, J. Cubero","doi":"10.5220/0005610103160321","DOIUrl":"https://doi.org/10.5220/0005610103160321","url":null,"abstract":"NOESIS is a software framework for the development of data mining techniques for networked data. As an open source project, released under a BSD license, NOESIS intends to provide the necessary infrastructure for solving complex network data mining problems. Currently, it includes a large collection of popular network-related data mining techniques, including the analysis of network structural properties, community detection algorithms, link scoring and prediction methods, and network visualization techniques. The design of NOESIS tries to facilitate the development of parallel algorithms using solid object-oriented design principles and structured parallel programming. NOESIS can be used as a stand-alone application, as many other network analysis packages, and can be included, as a lightweight library, in domain-specific data mining applications and systems.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126331496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks","authors":"Luke B. Godfrey, Michael S. Gashler","doi":"10.5220/0005635804810486","DOIUrl":"https://doi.org/10.5220/0005635804810486","url":null,"abstract":"We present the soft exponential activation function for artificial neural networks that continuously interpolates between logarithmic, linear, and exponential functions. This activation function is simple, differentiable, and parameterized so that it can be trained as the rest of the network is trained. We hypothesize that soft exponential has the potential to improve neural network learning, as it can exactly calculate many natural operations that typical neural networks can only approximate, including addition, multiplication, inner product, distance, and sinusoids.","PeriodicalId":102743,"journal":{"name":"2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121751977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}