Using machine learning algorithms to identify farms on the 2022 Census of Agriculture

Q3 Decision Sciences

Statistical Journal of the IAOS. Pub Date: 2024-05-08. DOI: 10.3233/sji-230089

Gavin Corral, Luca Sartore, Katherine Vande Pol, Denise A. Abreu, Linda J. Young

Citations: 0

Abstract

As is the case for many National Statistics Institutes, the United States Department of Agriculture’s (USDA’s) National Agricultural Statistics Service (NASS) has observed dwindling survey response rates, and the requests for more information at finer temporal and spatial scales have led to increased response burdens. Non-survey data are becoming increasingly abundant and accessible. Consequently, NASS is exploring the potential to complete some or all of a survey record using non-survey data, which would reduce respondent burden and potentially lead to increased response rates. In this paper, the focus is on a large set of records associated with potential farms, which are operations with undetermined farm status (farm/non-farm) and are referred to here as operations with unknown status (OUS). Although they usually have some agriculture, most OUS records are eventually classified as non-farms. Those OUS that are classified as farms tend to have higher proportions of producers from under-represented groups compared to other records. Determining the probability that an OUS record is a farm is an important step in the imputation process. The OUS records that responded to the 2017 U.S. Census of Agriculture were used to develop models to predict farm status using multiple data sources. Evaluated models include bootstrap random forest (RF), logistic regression (LR), neural network (NN), and support vector machine (SVM). Although the SVM had the best outcomes for three of the five metrics, the sensitivity for identifying farms was the lowest (13.8%). The NN model had a sensitivity of 80.5%, which was substantially higher than the other models, and its specificity of 45.3% was the lowest of all models. Because sensitivity was the primary metric of interest and the NN performed reasonably well on the other metrics, the NN was selected as the preferred model.
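The two metrics that drive the model choice above, sensitivity and specificity, can be illustrated with a minimal sketch. It assumes the usual definitions (sensitivity is the share of true farms predicted as farms; specificity is the share of true non-farms predicted as non-farms) and a label coding of 1 = farm, 0 = non-farm. The labels below are made up for illustration and are not data or results from the paper; the names `ConfusionCounts`, `confusion_counts`, `sensitivity`, and `specificity` are hypothetical helpers, not the authors' code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary farm (1) / non-farm (0) classifier."""
    tp: int  # farms predicted as farms
    fn: int  # farms predicted as non-farms
    tn: int  # non-farms predicted as non-farms
    fp: int  # non-farms predicted as farms


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally the four confusion-matrix cells from paired labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    return ConfusionCounts(tp=tp, fn=fn, tn=tn, fp=fp)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of actual farms that the model identifies as farms."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of actual non-farms that the model identifies as non-farms."""
    return c.tn / (c.tn + c.fp)


# Illustrative labels only (assumed, not taken from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(f"sensitivity = {sensitivity(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
```

A model that flags more records as farms will generally raise sensitivity at the cost of specificity, which is the pattern the abstract reports when contrasting the NN with the SVM.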

Source journal

Statistical Journal of the IAOS (Economics, Econometrics and Finance: Economics and Econometrics)

CiteScore

1.30

Self-citation rate

0.00%

Articles published

116

About the journal: This is the flagship journal of the International Association for Official Statistics and is expected to be widely circulated and subscribed to by individuals and institutions in all parts of the world. The main aim of the Journal is to support the IAOS mission by publishing articles to promote the understanding and advancement of official statistics and to foster the development of effective and efficient official statistical services on a global basis. Papers are expected to be of wide interest to readers. Such papers may or may not contain strictly original material. All papers are refereed.