NAICS Code Prediction Using Supervised Methods

Impact factor: 1.5 (JCR Q2, Social Sciences, Mathematical Methods)
C. Oehlert, Evan T. Schulz, Anne Parker
Journal: Statistics and Public Policy
Published: 2022-01-24
DOI: 10.1080/2330443X.2022.2033654 (https://doi.org/10.1080/2330443X.2022.2033654)
Citations: 3

Abstract

When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, the codes are self-reported on tax forms, and reporting an incorrect code or even leaving the code blank has no tax consequences, so they are often unusable. The IRS's Statistics of Income (SOI) program validates NAICS codes for businesses in the statistical samples used to produce official tax statistics for various filing populations, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Forms 1120). In this article we leverage these samples to explore ways to improve NAICS code reporting for all filers in the relevant populations. For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS code values as ground truth, we trained classification-tree-based models (randomForest) to predict NAICS industry sector from other tax return data, including text descriptions, for businesses that did or did not initially report a valid NAICS code. For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sector and to correctly identify the sector for over half of businesses with no informative reported NAICS code.
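The evaluation described in the abstract compares sectors rather than full six-digit codes, and it reports results separately for filers with and without an informative reported code. The sketch below illustrates that bookkeeping only; it is not the authors' code and does not train a model. It assumes that the sector is given by the two-digit NAICS prefix (with the official merged sectors 31-33, 44-45, and 48-49), that a code with any other prefix counts as uninformative, and that a model's sector predictions are already available. The function names, field names, and all rows are invented for illustration.

```python
# Two-digit prefixes used by NAICS sectors (assumption: the prefix alone
# decides whether a reported code is "informative" in this sketch).
VALID_PREFIXES = {
    "11", "21", "22", "23", "31", "32", "33", "42", "44", "45", "48", "49",
    "51", "52", "53", "54", "55", "56", "61", "62", "71", "72", "81", "92",
}

# Prefixes that NAICS groups into a single multi-prefix sector.
MERGED = {
    "31": "31-33", "32": "31-33", "33": "31-33",
    "44": "44-45", "45": "44-45",
    "48": "48-49", "49": "48-49",
}


def naics_sector(code):
    """Return the sector label for a NAICS code, or None if uninformative."""
    if code is None:
        return None
    digits = str(code).strip()
    if len(digits) < 2 or not digits.isdigit():
        return None
    prefix = digits[:2]
    if prefix not in VALID_PREFIXES:
        return None
    return MERGED.get(prefix, prefix)


def accuracy(rows, sector_of):
    """Share of rows whose sector (via sector_of) matches the validated sector."""
    if not rows:
        return None
    hits = sum(1 for r in rows if sector_of(r) == naics_sector(r["validated"]))
    return hits / len(rows)


# Invented toy rows: self-reported code, hypothetical model output,
# and the validated code treated as ground truth.
rows = [
    {"reported": "541110", "predicted": "54", "validated": "541110"},
    {"reported": "531210", "predicted": "23", "validated": "236118"},
    {"reported": "336411", "predicted": "31-33", "validated": "311811"},
    {"reported": "453998", "predicted": "44-45", "validated": "448140"},
    {"reported": "", "predicted": "23", "validated": "236118"},
    {"reported": "999999", "predicted": "44-45", "validated": "722511"},
    {"reported": None, "predicted": "62", "validated": "621111"},
]

# Split filers by whether the self-reported code maps to a sector at all.
informative = [r for r in rows if naics_sector(r["reported"]) is not None]
uninformative = [r for r in rows if naics_sector(r["reported"]) is None]

reported_acc = accuracy(informative, lambda r: naics_sector(r["reported"]))
predicted_acc_informative = accuracy(informative, lambda r: r["predicted"])
predicted_acc_uninformative = accuracy(uninformative, lambda r: r["predicted"])

print("self-reported, informative codes:", reported_acc)
print("predicted, informative codes:", predicted_acc_informative)
print("predicted, uninformative codes:", predicted_acc_uninformative)
```

Keeping the two groups separate mirrors the two claims in the abstract: the first two figures correspond to improving on valid self-reported sectors, and the third to recovering a sector where none was usable.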
Source journal: Statistics and Public Policy (Social Sciences, Mathematical Methods)
CiteScore: 3.20
Self-citation rate: 6.20%
Articles published: 13
Review time: 32 weeks