A mixture model framework for class discovery and outlier detection in mixed labeled/unlabeled data sets
Authors: David J. Miller, J. Browning
Venue: 2003 IEEE XIII Workshop on Neural Networks for Signal Processing (IEEE Cat. No.03TH8718)
DOI: https://doi.org/10.1109/NNSP.2003.1318048
Several authors have addressed learning a classifier given a mixed labeled/unlabeled training set. These works assume that each unlabeled sample originates from one of the (known) classes. This work considers the scenario in which unlabeled points may belong either to known/predefined classes or to heretofore undiscovered classes. There are several practical situations where such data may arise. We earlier proposed a novel statistical mixture model to fit this mixed data. In this paper we review the method and introduce an alternative model. Our fundamental strategy is to view as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each point. Two types of mixture components are used to explain label presence/absence. "Predefined" components generate both labeled and unlabeled points and assume that labels are missing at random. These components represent the known classes. "Non-predefined" components generate only unlabeled points. In localized regions, they capture data subsets that are exclusively unlabeled. Such subsets may represent an outlier distribution or new classes. The components' predefined/non-predefined natures are data-driven, learned along with the other parameters via an algorithm based on expectation-maximization (EM). Three natural applications are presented: 1) robust classifier design, given a mixed training set with outliers; 2) classification with rejections; and 3) identification of the unlabeled points (and their representative components) that originate from unknown classes, i.e. new class discovery. The effectiveness of our models in discovering purely unlabeled data components (potential new classes) is evaluated on both synthetic and real data sets. Although each of our models has its own advantages, the original model is found to achieve the best class discovery results.
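
The idea of treating label presence/absence as observed data can be illustrated with a small sketch. This is not the authors' model: it assumes 1-D Gaussian components, ignores the class labels themselves, and attaches to each component a single Bernoulli probability that a point it generates carries a label. The function name `fit_label_aware_mixture`, the parameter `p_labeled`, and the 0.05 threshold are hypothetical choices made for illustration only. Components whose learned label probability is near zero play the role of "non-predefined" components.

```python
import numpy as np

def fit_label_aware_mixture(x, labeled, n_components=2, n_iter=50):
    """EM for a 1-D mixture whose components emit (x, labeled-flag) pairs.

    Illustrative assumption: component k has a Gaussian over x and a
    Bernoulli probability p_labeled[k] that a point it generates is labeled.
    """
    x = np.asarray(x, dtype=float)
    m = np.asarray(labeled, dtype=float)  # 1.0 if labeled, 0.0 if unlabeled
    eps = 1e-9
    # Deterministic initialization: means at evenly spaced quantiles of x.
    qs = (np.arange(n_components) + 0.5) / n_components
    mu = np.quantile(x, qs)
    var = np.full(n_components, x.var() + 1e-6)
    pi = np.full(n_components, 1.0 / n_components)
    p_labeled = np.full(n_components, m.mean())
    for _ in range(n_iter):
        # E-step: log joint of (x_i, m_i) with each component k.
        log_gauss = -0.5 * (np.log(2 * np.pi * var)
                            + (x[:, None] - mu) ** 2 / var)
        p = np.clip(p_labeled, eps, 1 - eps)
        log_bern = m[:, None] * np.log(p) + (1 - m[:, None]) * np.log(1 - p)
        log_joint = np.log(pi) + log_gauss + log_bern
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of all parameters.
        nk = resp.sum(axis=0) + eps
        pi = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        p_labeled = (resp * m[:, None]).sum(axis=0) / nk
    return pi, mu, var, p_labeled

# Synthetic example: a known class (about half labeled) near 0 and a
# purely unlabeled cluster near 10 standing in for a new class.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 100)])
labeled = np.concatenate([rng.random(200) < 0.5, np.zeros(100, dtype=bool)])

pi, mu, var, p_labeled = fit_label_aware_mixture(x, labeled)
non_predefined = p_labeled < 0.05  # hypothetical threshold
print("means:", mu)
print("P(labeled | component):", p_labeled)
print("flagged as non-predefined:", non_predefined)
```

In this toy setting the component covering the purely unlabeled cluster should end up with a label probability close to zero, which is the signal the abstract describes for candidate outlier distributions or new classes. A fuller treatment would also model the class label for labeled points and would need a principled way to choose the number of components.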