Semantic Embeddings for Medical Providers and Fraud Detection
Justin M. Johnson, T. Khoshgoftaar
2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI 2020), virtual conference, 11-13 August 2020
Published: 2020-08-01
DOI: 10.1109/IRI49571.2020.00039 (https://doi.org/10.1109/IRI49571.2020.00039)
A medical provider’s specialty is a significant predictor for detecting fraudulent providers with machine learning algorithms. When the specialty variable is encoded using a one-hot representation, however, models are subjected to sparse and uninformative feature vectors. We explore three techniques for representing medical provider types with dense, semantic embeddings that capture specialty similarities. The first two methods (GloVe and Med-Word2Vec) use pre-trained word embeddings to convert provider specialty descriptions to short phrase embeddings. The third is a method we propose for constructing semantic provider type embeddings from the procedure-level activity within each specialty group. For each embedding technique, we use Principal Component Analysis to compare the performance of embedding sizes between 32 and 128. Each embedding technique is evaluated on a highly imbalanced Medicare fraud prediction task using Logistic Regression (LR), Random Forest (RF), Gradient Boosted Tree (GBT), and Multilayer Perceptron (MLP) learners. Experiments are repeated 30 times, and confidence intervals show that all three semantic embeddings significantly outperform one-hot representations when using RF and GBT learners. Our contributions include a novel method for embedding medical specialties from procedure codes and a comparison of three semantic embedding techniques for Medicare fraud detection.
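The contrast between one-hot and dense specialty representations can be illustrated with a minimal sketch. The abstract does not say how word vectors are combined into a phrase embedding, so the mean pooling used here is an assumption, and the 4-dimensional word vectors are hypothetical toy values standing in for real pre-trained GloVe or Med-Word2Vec vectors. The function names are illustrative only.

```python
import math

# Hypothetical toy word vectors (assumption). Real pre-trained vectors would be
# loaded from a GloVe or Med-Word2Vec file and have many more dimensions.
TOY_WORD_VECTORS = {
    "cardiac": [0.9, 0.1, 0.0, 0.2],
    "thoracic": [0.8, 0.2, 0.1, 0.1],
    "surgery": [0.1, 0.8, 0.1, 0.0],
    "dermatology": [0.0, 0.1, 0.9, 0.3],
}


def one_hot(specialty, vocabulary):
    """Return a sparse indicator vector with a single 1.0 for the specialty."""
    return [1.0 if specialty == name else 0.0 for name in vocabulary]


def phrase_embedding(description, word_vectors):
    """Average the vectors of known words in the description (assumed pooling)."""
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        raise ValueError(f"no known words in description: {description!r}")
    dim = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


specialties = ["Cardiac Surgery", "Thoracic Surgery", "Dermatology"]
one_hot_vectors = {s: one_hot(s, specialties) for s in specialties}
dense_vectors = {s: phrase_embedding(s, TOY_WORD_VECTORS) for s in specialties}

# One-hot vectors of distinct specialties are always orthogonal.
sim_one_hot = cosine(one_hot_vectors["Cardiac Surgery"], one_hot_vectors["Thoracic Surgery"])

# Dense phrase embeddings can place related specialties closer together.
sim_related = cosine(dense_vectors["Cardiac Surgery"], dense_vectors["Thoracic Surgery"])
sim_unrelated = cosine(dense_vectors["Cardiac Surgery"], dense_vectors["Dermatology"])

print(f"one-hot, cardiac vs thoracic surgery: {sim_one_hot:.3f}")
print(f"dense,   cardiac vs thoracic surgery: {sim_related:.3f}")
print(f"dense,   cardiac surgery vs dermatology: {sim_unrelated:.3f}")
```

With these toy values, the one-hot similarity between any two distinct specialties is zero, while the dense embeddings rank the two surgical specialties as closer to each other than to dermatology. This is the kind of specialty similarity a learner cannot recover from one-hot features alone. Reducing such embeddings to a chosen size with Principal Component Analysis, as the abstract describes, would be a separate step not shown here.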