{"title":"An Introduction to Text Classification with Applications to Medical Records","authors":"Yingqiu Zhou","doi":"10.1109/ITCA52113.2020.00105","DOIUrl":null,"url":null,"abstract":"We proposed and completed a real-life data mining problem, aka, text corpus classification. We extracted 15,500 medical documents relevant to ten different diseases and used the bag-of-words model to create a word occurrence vector for each document. The latent semantic analysis (LSA) method is then used to reduce the occurrence vector’s dimensionality to a feature vector of dimension 200. We selected a multilayer perceptron (MLP) neural network to do the final classification and report the performance comparison with the other six classifiers. We also completed the grid search for the best feature subspace dimensionality.","PeriodicalId":103309,"journal":{"name":"2020 2nd International Conference on Information Technology and Computer Application (ITCA)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 2nd International Conference on Information Technology and Computer Application (ITCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ITCA52113.2020.00105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 0
Abstract
We posed and solved a real-world data mining problem: text corpus classification. We extracted 15,500 medical documents relevant to ten different diseases and used the bag-of-words model to create a word occurrence vector for each document. We then applied latent semantic analysis (LSA) to reduce each occurrence vector to a feature vector of dimension 200. We selected a multilayer perceptron (MLP) neural network for the final classification and report a performance comparison with six other classifiers. We also ran a grid search to find the best feature subspace dimensionality.
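The pipeline described in the abstract (bag-of-words counts, LSA dimensionality reduction, MLP classification, and a grid search over the subspace dimensionality) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the use of scikit-learn, the eight-sentence toy corpus, the two labels, the hidden layer size, and the grid values are all assumptions made for the example. The paper's actual setup uses 15,500 documents, ten disease classes, and a 200-dimensional feature vector.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the medical documents.
docs = [
    "patient reports chest pain and shortness of breath",
    "chest pain radiating to left arm with elevated troponin",
    "ecg shows arrhythmia and patient has chest tightness",
    "shortness of breath and palpitations after exertion",
    "elevated blood glucose and frequent urination reported",
    "patient on insulin with high fasting glucose levels",
    "increased thirst and high blood sugar on screening",
    "insulin dose adjusted after glucose monitoring",
]
labels = ["cardiac"] * 4 + ["diabetes"] * 4

pipeline = Pipeline([
    # Bag-of-words: one word occurrence (count) vector per document.
    ("bow", CountVectorizer()),
    # LSA: truncated SVD of the term-count matrix.
    ("lsa", TruncatedSVD(random_state=0)),
    # MLP classifier; the hidden layer size is an assumed value.
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=0)),
])

# Grid search over the LSA subspace dimensionality. The values are tiny
# because the toy corpus is tiny; the paper settles on dimension 200.
param_grid = {"lsa__n_components": [2, 3]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

pred = search.predict(["patient has chest pain"])
print("best dimensionality:", search.best_params_["lsa__n_components"])
print("predicted label:", pred[0])
```

Wrapping all three stages in a single `Pipeline` means the vocabulary and the SVD projection are refit on each training fold during the grid search, so the held-out fold does not influence the learned feature space.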