{"title":"Weighted Frequent Subgraph Mining in Weighted Graph Databases","authors":"Masaki Shinoda, Tomonobu Ozaki, T. Ohkawa","doi":"10.1109/ICDMW.2009.12","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.12","url":null,"abstract":"We focus on the problem of pattern discovery from externally and internally weighted labeled graphs because the target data can be modeled more naturally and in detail by using weighted graphs. For example, while external weight can be used for representing a degree of importance and reliability of a graph itself, internal weight reflects utility and significance of each component in a graph. Therefore, we can expect to realize more precise knowledge discovery by employing weighted graphs. From these backgrounds, in this paper, we discuss two pattern mining problems with external and internal weighted frequencies, and propose two algorithms to solve them efficiently.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"175 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115989284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Similarity Join Algorithms Using Fuzzy Clustering Technique","authors":"L. Tan, F. Fotouhi, W. Grosky, Horia F. Pop, N. Mouaddib","doi":"10.1109/ICDMW.2009.50","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.50","url":null,"abstract":"In this paper, we propose a pre-processing technique to improve existing string similarity join algorithms using fuzzy clustering. Our approach first identifies groups of related attributes and then, using this information, we apply existing string similarity join algorithms on these attributes. To identify the clustered attributes we use fuzzy techniques. This approach can be applied to the integration of knowledge bases and databases, as well as handle inconsistent values and naming conventions, incorrect or missing data values, and incomplete information from multiple sources with semi-compatible attributes or homogenous attributes. Using an experimental study, we have shown our preprocessing approach improves existing string similarity join algorithms by about 10 percent on precision and recall.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114230900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bucket Learning: Improving Model Quality through Enhancing Local Patterns","authors":"Guangzhi Qu, Hui Wu","doi":"10.1016/j.knosys.2011.09.013","DOIUrl":"https://doi.org/10.1016/j.knosys.2011.09.013","url":null,"abstract":"","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"118166096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSRA-MM 2.0: A Large-Scale Web Multimedia Dataset","authors":"Hao Li, Meng Wang, Xiansheng Hua","doi":"10.1109/ICDMW.2009.46","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.46","url":null,"abstract":"In this paper, we introduce the second version of Microsoft Research Asia Multimedia (MSRA-MM), a dataset that aims to facilitate research in multimedia information retrieval and related areas. The images and videos in the dataset are collected from a commercial search engine with more than 1000 queries. It contains about 1 million images and 20,000 videos. We also provide the surrounding texts that are obtained from more than 1 million web pages. The images and videos have been comprehensively annotated, including their relevance levels to corresponding queries, semantic concepts of images, and category and quality information of videos. We define six standard tasks on the dataset: (1) image search reranking; (2) image annotation; (3) query-by-example image search; (4) video search reranking; (5) video categorization; and (6) video quality assessment.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121476195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Differential Privacy for Clinical Trial Data: Preliminary Evaluations","authors":"Duy Vu, A. Slavkovic","doi":"10.1109/ICDMW.2009.52","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.52","url":null,"abstract":"The concept of differential privacy as a rigorous definition of privacy has emerged from the cryptographic community. However, further careful evaluation is needed before we can apply these theoretical results to privacy preservation in everyday data mining and statistical analysis. In this paper we demonstrate how to integrate a differential privacy framework with the classical statistical hypothesis testing in the domain of clinical trials where personal information is sensitive. We develop concrete methodology that researchers can use. We derive rules for the sample size adjustment whereby both statistical efficiency and differential privacy can be achieved for the specific tests for binomial random variables and in contingency tables.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"263 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115595374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilayer Scene Similarity Assessment","authors":"A. Stefanidis, Caixia Wang, Xu Lu, Kevin M. Curtin","doi":"10.1109/ICDMW.2009.117","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.117","url":null,"abstract":"As we move increasingly towards multi-source data analysis, the assessment of similarity of complex, multilayer scenes is becoming increasingly important for spatial data mining. In this paper, we present a content-based approach for scene similarity assessment. The proposed approach is based on a graph-matching scheme that models linear feature networks (road network) as graphs and additional GIS information (e.g. buildings) as layer content. This allows us to combine diverse but co-located pieces of information (e.g. roads and buildings) in an integrated similarity assessment process. In the paper we present key theoretical concepts and provide experimental results to demonstrate the capability and robustness of the proposed approach.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123340611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining of Attribute Interactions Using Information Theoretic Metrics","authors":"P. Chanda, Young-Rae Cho, A. Zhang, M. Ramanathan","doi":"10.1109/ICDMW.2009.51","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.51","url":null,"abstract":"Knowledge of the statistical interactions between the attributes in a data set provides insight into the underlying structure of the data and explains the relationships (independence, synergy, redundancy) between the attributes. In a supervised learning problem, normally, a small subset of the classifying attributes are actually associated with the class label. Interaction information among the attributes captures the multivariate dependencies (synergy and redundancy) among the attributes and the class label. Mining the significant statistical interactions that contain information about the class label is a computationally challenging task - the number of possible interactions increases exponentially and most of these interactions contain redundant information when a number of correlated attributes are present. In this paper, we present a data mining method (named IM or Interaction Mining) to mine non-redundant attribute sets that have significant interactions with the class label. We further demonstrate that the mined statistical interactions are useful for improved feature selection as they successfully capture the multivariate inter-dependencies among the attributes.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129038185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Measure of Feature Selection Algorithms' Stability","authors":"J. Novovicová, P. Somol, P. Pudil","doi":"10.1109/ICDMW.2009.32","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.32","url":null,"abstract":"Stability or robustness of feature selection methods is a topic of recent interest. A new stability measure based on the Shannon entropy is proposed in this paper to evaluate the overall occurrence of individual features in selected subsets of possibly varying cardinality. We compare the new measure to stability measures proposed recently by Somol et al. The new measure is computationally very efficient and adds another type of insight into the stability problem. All considered measures have been used to compare the stability of several feature selection methods (individually best ranking, sequential forward selection, sequential forward floating selection and dynamic oscillating search) on a set of examples.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121062290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovery of Quantitative Sequential Patterns from Event Sequences","authors":"Fumiya Nakagaito, Tomonobu Ozaki, T. Ohkawa","doi":"10.1109/ICDMW.2009.13","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.13","url":null,"abstract":"In this paper, we consider the problem of frequent pattern mining in databases of temporal events with intervals. Since quantitative temporal information might play important roles in many application domains, it is critical to discover patterns to which numerical attributes are associated. To this end, we consider two kinds of temporal patterns with quantitative information on the durations and time differences of events, and propose corresponding algorithms by incorporating numerical clustering techniques into existing temporal pattern miners. The effectiveness of the proposed algorithms was assessed by using real world datasets.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121739112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video2Text: Learning to Annotate Video Content","authors":"H. Aradhye, G. Toderici, J. Yagnik","doi":"10.1109/ICDMW.2009.79","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.79","url":null,"abstract":"This paper discusses a new method for automatic discovery and organization of descriptive concepts (labels) within large real-world corpora of user-uploaded multimedia, such as YouTube.com. Conversely, it also provides validation of existing labels, if any. While training, our method does not assume any explicit manual annotation other than the weak labels already available in the form of video title, description, and tags. Prior work related to such auto-annotation assumed that a vocabulary of labels of interest (e.g., indoor, outdoor, city, landscape) is specified a priori. In contrast, the proposed method begins with an empty vocabulary. It analyzes audiovisual features of 25 million YouTube.com videos -- nearly 150 years of video data -- effectively searching for consistent correlation between these features and text metadata. It autonomously extends the label vocabulary as and when it discovers concepts it can reliably identify, eventually leading to a vocabulary with thousands of labels and growing. We believe that this work significantly extends the state of the art in multimedia data mining, discovery, and organization based on the technical merit of the proposed ideas as well as the enormous scale of the mining exercise in a very challenging, unconstrained, noisy domain.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116200073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}