{"title":"BioHCDP: A Hybrid Constituency-Dependency Parser for Biological NLP information extraction","authors":"K. Taha, M. Alzaabi","doi":"10.1109/CIDM.2014.7008151","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008151","url":null,"abstract":"One of the key goals of biological Natural Language Processing (NLP) is the automatic information extraction from biomedical publications. Most current constituency and dependency parsers overlook the semantic relationships between the constituents comprising a sentence and may not be well suited for capturing complex long-distance dependencies. We propose in this paper a hybrid constituency-dependency parser for biological NLP information extraction called BioHCDP. BioHCDP aims at enhancing the state of the art of biological text mining by applying novel linguistic computational techniques that overcome the limitations of current constituency and dependency parsers outlined above, as follows: (1) it determines the semantic relationship between each pair of constituents in a sentence using novel semantic rules, and (2) it applies semantic relationship extraction models that represent the relationships of different patterns of usage in different contexts. BioHCDP can be used to extract various classes of data from biological texts, including protein function assignments, genetic networks, and protein-protein interactions. We compared BioHCDP experimentally with three systems. Results showed marked improvement.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"294 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132703739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpolation and extrapolation: Comparison of definitions and survey of algorithms for convex and concave hulls","authors":"Tobias Ebert, Julian Belz, O. Nelles","doi":"10.1109/CIDM.2014.7008683","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008683","url":null,"abstract":"Any data-based method is vulnerable to the problem of extrapolation; nonetheless, there exists no unified theory on handling it. The main topic of this publication is to point out the differences in definitions of extrapolation and related methods. There are many different interpretations of extrapolation and a multitude of methods and algorithms, which address the problem of extrapolation detection in different fields of study. We examine popular definitions of extrapolation, compare them to each other and list related literature and methods. It becomes apparent that opinions on what extrapolation is and how to handle it differ greatly. We categorize existing literature and give guidelines for choosing an appropriate definition of extrapolation for a given problem. We also present hull algorithms, from classic approaches to recent advances. The presented guidelines and categorized literature enable the reader to categorize a given problem, inspect relevant literature, and apply suitable methods and algorithms to solve a problem that is affected by extrapolation.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114257720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-SNR model order selection using exponentially embedded family and its applications to curve fitting and clustering","authors":"Quan Ding, S. Kay, Xiaorong Zhang","doi":"10.1109/CIDM.2014.7008708","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008708","url":null,"abstract":"The exponentially embedded family (EEF) of probability density functions was originally proposed in [1] for model order selection. The performance of the original EEF deteriorates somewhat when nuisance parameters are present, especially in the case of high signal-to-noise ratio (SNR). Therefore, we propose a new EEF for model order selection in the case of high SNR. It is shown that without nuisance parameters, the new EEF is the same as the original EEF. However, with nuisance parameters, the new EEF takes a different form. The new EEF is applied to problems of polynomial curve fitting and clustering. Simulation results show that, with nuisance parameters, the new EEF outperforms the original EEF and Bayesian information criterion (BIC) at high SNR.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115456066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting and profiling sedentary young men using machine learning algorithms","authors":"Pekka Siirtola, Riitta Pyky, Riikka Ahola, Heli Koskimäki, T. Jämsä, R. Korpelainen, J. Röning","doi":"10.1109/CIDM.2014.7008681","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008681","url":null,"abstract":"Many governments and institutions have guidelines for health-enhancing physical activity. Additionally, according to recent studies, the amount of time spent sitting is a highly important determinant of health and wellbeing. In fact, a sedentary lifestyle can lead to many diseases and, what is more, it has even been found to be associated with increased mortality. In this study, a data set consisting of a self-reported questionnaire, medical diagnoses and fitness tests was studied to detect sedentary young men from a large population and to create a profile of a sedentary person. The data set was collected from 595 young men and contained altogether 678 features. Most of these are answers to multi-choice close-ended questions. More precisely, features were mostly integers with a scale from 1 to 5 or from 1 to 2, and therefore, there was only a little variability in the values of features. In order to detect and profile a sedentary young man, machine learning algorithms were applied to the data set. The performance of five algorithms is compared (quadratic discriminant analysis (QDA), linear discriminant analysis (LDA), C4.5, random forests, and nearest neighbours (kNN)) to find the most accurate algorithm. The results of this study show that when the aim is to detect a sedentary person based on medical records and fitness tests, LDA performs better than the other algorithms, but still the accuracy is not high. In the second part of the study, the differences between highly sedentary and non-sedentary young men are examined; here, recognition can be obtained with high accuracy with each algorithm.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125165295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted feature-based classification of time series data","authors":"Penugonda Ravikumar, V. Devi","doi":"10.1109/CIDM.2014.7008671","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008671","url":null,"abstract":"Classification is one of the most popular techniques in the data mining area. In supervised learning, a new pattern is assigned a class label based on a training set whose class labels are already known. This paper proposes a novel classification algorithm for time series data. In our algorithm, we use four parameters and, based on their significance on different benchmark datasets, we have assigned the weights using a simulated annealing process. We have taken the combination of these parameters as a performance metric to find the accuracy and time complexity. We have experimented with 6 benchmark datasets, and the results show that our novel algorithm is computationally fast and accurate in several cases when compared with the 1NN classifier.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125399127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning energy consumption profiles from data","authors":"J. Andreoli","doi":"10.1109/CIDM.2014.7008704","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008704","url":null,"abstract":"A first step in the optimisation of the power consumption of a device infrastructure is to detect the power consumption signature of the involved devices. In this paper, we are especially interested in devices which spend most of their time waiting for a job to execute, as is often the case of shared devices in a networked infrastructure, like multi-function printing devices in an office or transaction processing terminals in a public service. We formulate the problem as an instance of power disaggregation in non-intrusive load monitoring (NILM), with strong prior assumptions on the sources but with specific constraints: in particular, the aggregation is occlusive rather than additive. We use a specific variant of Hidden Semi Markov Models (HSMM) to build a generative model of the data, and adapt the Expectation-Maximisation (EM) algorithm to that model, in order to learn, from daily operation data, the physical characteristics of the device, separated from those linked to the job load or the device configurations. Finally, we show some experimental results on a multifunction printing device.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130379107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Takagi-Sugeno-Kang type collaborative fuzzy rule based system","authors":"Kuang-Pen Chou, M. Prasad, Yang-Yin Lin, Sudhanshu Joshi, Chin-Teng Lin, J. Chang","doi":"10.1109/CIDM.2014.7008684","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008684","url":null,"abstract":"In this paper, a Takagi-Sugeno-Kang (TSK) type collaborative fuzzy rule based system is proposed with the help of the knowledge learning ability of collaborative fuzzy clustering (CFC). The proposed method splits a huge dataset into several small datasets and applies a collaborative mechanism so that they interact with each other; this process could help to address the big data issue. The proposed method applies the collective knowledge of CFC as input variables, and the consequent part is a linear combination of the input variables. Intensive experimental tests on a prediction problem show that the performance of the proposed method is as high as that of other methods. The proposed method uses only one half of the information in the given dataset for the training process and provides an accurate modeling platform, while other methods use all of the information in the given dataset for training.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133845141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massively parallelized support vector machines based on GPU-accelerated multiplicative updates","authors":"C. Kou, Chao-Hui Huang","doi":"10.1109/CIDM.2014.7008700","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008700","url":null,"abstract":"In this paper, we present multiple parallelized support vector machines (MPSVMs), which aim to deal with the situation in which multiple SVMs must be run concurrently. The proposed MPSVM is based on an optimization procedure for nonnegative quadratic programming (NQP), called multiplicative updates. By using graphics processing units (GPUs) to parallelize the numerical procedure of SVMs, the proposed MPSVM showed good performance for a certain range of data size and dimension. In the experiments, we compared the proposed MPSVM with other cutting-edge implementations of GPU-based SVMs and it showed competitive performance. Furthermore, the proposed MPSVM is designed to perform multiple SVMs in parallel. As a result, when multiple operations of SVM are required, MPSVM can be one of the best options in terms of time consumption.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127362450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised source extraction methodology for the nosological imaging of glioblastoma response to therapy","authors":"S. Ortega-Martorell, I. Olier, T. Delgado-Goñi, M. Ciezka, M. Julià-Sapé, P. Lisboa, C. Arús","doi":"10.1109/CIDM.2014.7008653","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008653","url":null,"abstract":"Glioblastomas are one of the most aggressive brain tumors. Their usual bad prognosis is due to the heterogeneity of their response to treatment and the lack of early and robust biomarkers to decide whether the tumor is responding to therapy. In this work, we propose the use of a semi-supervised methodology for source extraction to identify the sources representing tumor response to therapy, untreated/unresponsive tumor, and normal brain; and create nosological images of the response to therapy based on those sources. Fourteen mice were used to calculate the sources, and an independent test set of eight mice was used to further evaluate the proposed approach. The preliminary results obtained indicate that it was possible to discriminate response and untreated/unresponsive areas of the tumor, and that the color-coded images allowed convenient tracking of response, especially throughout the course of therapy.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127417970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate and interpretable regression trees using oracle coaching","authors":"U. Johansson, Cecilia Sönströd, Rikard König","doi":"10.1109/CIDM.2014.7008667","DOIUrl":"https://doi.org/10.1109/CIDM.2014.7008667","url":null,"abstract":"In many real-world scenarios, predictive models need to be interpretable, thus ruling out many machine learning techniques known to produce very accurate models, e.g., neural networks, support vector machines and all ensemble schemes. Most often, tree models or rule sets are used instead, typically resulting in significantly lower predictive performance. The overall purpose of oracle coaching is to reduce this accuracy vs. comprehensibility trade-off by producing interpretable models optimized for the specific production set at hand. The method requires production set inputs to be present when generating the predictive model, a demand fulfilled in most, but not all, predictive modeling scenarios. In oracle coaching, a highly accurate, but opaque, model is first induced from the training data. This model (“the oracle”) is then used to label both the training instances and the production instances. Finally, interpretable models are trained using different combinations of the resulting data sets. In this paper, the oracle coaching produces regression trees, using neural networks and random forests as oracles. The experiments, using 32 publicly available data sets, show that the oracle coaching leads to significantly improved predictive performance, compared to standard induction. In addition, it is also shown that a highly accurate opaque model can be successfully used as a pre-processing step to reduce the noise typically present in data, even in situations where production inputs are not available. In fact, just augmenting or replacing training data with another copy of the training set, but with the predictions from the opaque model as targets, produced significantly more accurate and/or more compact regression trees.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127004927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}