An evaluation of self-training styles for domain adaptation on the task of splice site prediction

Nic Herndon, Doina Caragea

2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), published 2015-08-25. DOI: 10.1145/2808797.2808809
We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the task of splice site prediction. For this task, annotating a genome using machine learning requires a large amount of labeled data, whereas for non-model organisms there is only a small amount of labeled data alongside a large amount of unlabeled data. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three ways of incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We study these for splice site prediction in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction indicating that using soft labels only can lead to a better classifier than the other two approaches.
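The three label-incorporation styles contrasted in the abstract can be sketched with a toy iterative learner. This is a hypothetical illustration, not the classifier evaluated in the paper: it uses a simple nearest-centroid model, where "soft" weights unlabeled points by predicted class probabilities (EM-style), "hard" assigns them one-hot predicted labels (self-training), and "both" averages the two. All function and variable names are illustrative.

```python
import numpy as np

def predict_proba(X, centroids):
    """Softmax over negative squared distances to each class centroid."""
    d = -((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    e = np.exp(d - d.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def iterate(X_lab, y_lab, X_unlab, style="soft", n_iter=10, n_classes=2):
    """Iteratively re-estimate class centroids from labeled data plus
    unlabeled data weighted according to the chosen style."""
    # initialize centroids from the labeled target data only
    centroids = np.stack([X_lab[y_lab == c].mean(axis=0)
                          for c in range(n_classes)])
    for _ in range(n_iter):
        p = predict_proba(X_unlab, centroids)
        hard = np.eye(n_classes)[p.argmax(axis=1)]
        if style == "soft":      # EM: fractional responsibility weights
            w = p
        elif style == "hard":    # self-training: one-hot predicted labels
            w = hard
        else:                    # both: mix soft and hard weights
            w = 0.5 * (p + hard)
        # weighted re-estimation: labeled points count fully,
        # unlabeled points contribute according to w
        lab_onehot = np.eye(n_classes)[y_lab]
        num = lab_onehot.T @ X_lab + w.T @ X_unlab
        den = lab_onehot.sum(axis=0) + w.sum(axis=0)
        centroids = num / den[:, None]
    return centroids

# usage on separable toy data
rng = np.random.default_rng(0)
X0 = rng.normal(-2.0, 1.0, (20, 2))
X1 = rng.normal(2.0, 1.0, (20, 2))
X_lab = np.vstack([X0[:5], X1[:5]])          # small labeled target sample
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([X0[5:], X1[5:]])        # large unlabeled target sample
centroids = iterate(X_lab, y_lab, X_unlab, style="soft")
```

The design point the sketch makes concrete is that the soft variant never commits to a label for an unlabeled point, so early mistakes stay down-weighted, whereas self-training's hard assignments can lock in errors across iterations.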