Data Evaluation and Enhancement for Quality Improvement of Machine Learning
Authors: Haihua Chen, Jiangping Chen, Junhua Ding
Venue: 2020 IEEE 20th International Conference on Software Quality, Reliability and Security (QRS)
Publication date: 2020-12-01
DOI: https://doi.org/10.1109/QRS51102.2020.00014
A poor-quality dataset may produce a low-quality machine learning system. Transfer learning, which has been shown to be an effective approach to data quality improvement, is therefore widely used to improve the quality of machine learning. However, the "quality improvement" attributed to transfer learning in some studies was not rigorously validated, or was even misleading. In this paper, we first investigate the quality problems of the datasets used to build a machine learning system that was claimed to achieve the best performance among existing work on a machine learning task. We find that the "best performance" resulted from the poor quality of the datasets and an incorrect validation process. We then describe an experimental study that demonstrates the effectiveness of transfer learning for improving the quality of datasets. The experimental results also show, however, that quality improvement from transfer learning is not guaranteed, and that a set of requirements must be met before the approach is applied. Based on the investigation and the experimental results, we propose a group of data quality criteria and evaluation approaches for improving the quality of machine learning. We investigate the research problem and explain the results by studying a machine learning system that normalizes medical concepts in social media text, using open datasets.