Hcpcs2Vec:医疗保险欺诈预测的医疗保健程序嵌入

2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC) Pub Date : 2020-12-01 DOI:10.1109/CIC50333.2020.00026

Justin M. Johnson, T. Khoshgoftaar

{"title":"Hcpcs2Vec:医疗保险欺诈预测的医疗保健程序嵌入","authors":"Justin M. Johnson, T. Khoshgoftaar","doi":"10.1109/CIC50333.2020.00026","DOIUrl":null,"url":null,"abstract":"This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embed-dings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.","PeriodicalId":265435,"journal":{"name":"2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Hcpcs2Vec: Healthcare Procedure Embeddings for Medicare Fraud Prediction\",\"authors\":\"Justin M. Johnson, T. Khoshgoftaar\",\"doi\":\"10.1109/CIC50333.2020.00026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embed-dings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.\",\"PeriodicalId\":265435,\"journal\":{\"name\":\"2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIC50333.2020.00026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIC50333.2020.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

本研究使用公开可用的大数据评估医疗保险欺诈分类问题上的语义医疗程序代码嵌入。传统上，为了监督学习的目的，分类医疗保险特征是一次性编码的。一次编码数千个唯一的过程代码导致高维向量，增加了模型的复杂性，并且无法捕获代码之间的内在关系。我们通过使用捕获不同相似性维度的低秩连续向量来表示过程代码来解决这些缺点。我们利用医疗保险和医疗补助服务中心的公开数据，其中有超过5600万份索赔记录，并根据医疗保健通用程序编码系统(HCPCS)的共发生代码序列训练Word2Vec模型。使用一定范围的嵌入和窗口大小来训练连续词袋和跳格嵌入。使用极端梯度增强学习器对医疗保险欺诈分类问题进行了实证评估。利用接收器工作特征曲线下的面积和几何平均度量，将结果与相关工作中的单热编码和预训练嵌入进行比较。统计测试表明，所提出的嵌入在95%的置信度下显著优于单热编码。除了我们的经验分析之外，我们还通过在向量空间中探索最近邻来简要评估学习到的嵌入的质量。据我们所知，这是第一个在大医疗保险数据上训练和评估HCPCS程序嵌入的研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Hcpcs2Vec: Healthcare Procedure Embeddings for Medicare Fraud Prediction

This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embed-dings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)

自引率

0.00%

发文量