What is in a food store name? Leveraging large language models to enhance food environment data.

Impact factor: 3.0 | JCR quartile: Q2 | Category: Computer Science, Artificial Intelligence

Frontiers in Artificial Intelligence, volume 7, article 1476950. Pub Date: 2024-12-06. eCollection Date: 2024-01-01. DOI: 10.3389/frai.2024.1476950

Analee J Etheredge, Samuel Hosmer, Aldo Crossa, Rachel Suss, Mark Torrey

Abstract

Introduction: Health departments and researchers commonly repurpose administrative food data to create food environment datasets; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is a rate-limiting step for essential operations and analysis. We show that such categorization should be viewed as a classification problem, one well addressed by recent advances in natural language processing and deep learning, in particular the advent of large language models (LLMs).

Methods: To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.
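The "best guess by store name" setup can be sketched as a sequence-classification problem. The snippet below is a minimal illustration, not the authors' pipeline: it assumes the Hugging Face `transformers` and `torch` packages, the `bert-base-uncased` checkpoint (the abstract names BERT but not a specific checkpoint), and placeholder class names, since the abstract does not list the 10 classes. The helper names `build_label_maps` and `classify_store_names` are hypothetical.

```python
def build_label_maps(class_names):
    """Map class names to integer ids and back, as a classification head expects."""
    label2id = {name: i for i, name in enumerate(class_names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


# Placeholder names (assumption): the abstract gives the count, not the classes.
CLASS_NAMES = [f"class_{i}" for i in range(10)]


def classify_store_names(names, checkpoint="bert-base-uncased"):
    """Return one predicted class name per store name (hypothetical helper)."""
    # Imported lazily so build_label_maps works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(CLASS_NAMES)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(CLASS_NAMES),
        id2label=id2label,
        label2id=label2id,
    )
    model.eval()

    # Store names are short strings, so padding to the longest name is enough.
    batch = tokenizer(names, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [id2label[i] for i in logits.argmax(dim=-1).tolist()]
```

Loaded this way, the classification head is randomly initialized, so the predictions carry no meaning until the model has been fine-tuned on labeled store names.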

Results: Based on this rubric, the model was tuned and evaluated (F1_micro = 0.710, F1_macro = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complementary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.
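To make the two reported metrics concrete, here is a small sketch of how micro- and macro-averaged F1 differ for a single-label, multiclass task. The labels and predictions are invented toy data, not values from the paper, and `f1_micro_macro` is a hypothetical helper name. Micro-averaging pools the counts over all classes (for single-label data it equals accuracy), while macro-averaging takes the unweighted mean of per-class F1 scores.

```python
from collections import Counter


def f1_micro_macro(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Micro: pool counts across classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average with equal class weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example with made-up class names (assumption, not the paper's rubric).
y_true = ["grocery", "grocery", "grocery", "bakery", "deli", "deli"]
y_pred = ["grocery", "grocery", "deli", "bakery", "deli", "grocery"]
micro, macro = f1_micro_macro(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

When the two averages are close, as in the reported 0.710 and 0.709, performance is roughly even across classes rather than dominated by a few frequent ones.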

Discussion: This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.
