Implementing NLP in industrial process modeling: Addressing categorical variables

IF 3.9 | Zone 2 (Engineering & Technology) | JCR Q2, Computer Science, Interdisciplinary Applications
Eleni D. Koronaki, Geremy Loachamín-Suntaxi, Paris Papavasileiou, Dimitrios G. Giovanis, Martin Kathrein, Christoph Czettl, Andreas G. Boudouvis, Stéphane P.A. Bordas
Computers & Chemical Engineering, Volume 199, Article 109146. Published 2025-04-21.
DOI: 10.1016/j.compchemeng.2025.109146
https://www.sciencedirect.com/science/article/pii/S0098135425001504

Abstract

Important process variables are often categorical: names or labels that represent, for example, categories of inputs, types of reactors, or a sequence of steps. In this work, we use Natural Language Processing models to derive embeddings of such inputs that represent their actual meaning, or reflect the “distances” between categories, i.e. how similar or dissimilar they are. This differs markedly from the current standard practice of binary, or one-hot, encoding, which replaces categorical variables with sequences of ones and zeros. Combined with dimensionality reduction techniques, either linear, such as Principal Component Analysis, or nonlinear, such as Uniform Manifold Approximation and Projection, the proposed approach leads to a meaningful, low-dimensional feature space. The significance of obtaining meaningful embeddings is illustrated in the context of an industrial coating process for cutting tools that includes both numerical and categorical inputs. In this industrial process, subject matter expertise suggests that the categorical inputs are critical for determining the final outcome, but this cannot be taken into account with the current state of the art. The proposed approach enables feature importance analysis, a marked improvement over the current state of the art in the encoding of categorical variables. The proposed approach is not limited to the case study presented here and is suitable for applications with a similar mix of critical categorical and numerical inputs.
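
The contrast between one-hot encoding and embedding-based encoding can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three category labels, the four-dimensional vectors standing in for the output of an NLP model, and the helper names pairwise_dist and pca_reduce. It shows that one-hot vectors place every pair of categories at the same distance, while embedding vectors can place similar categories closer together, and that a linear reduction (PCA, computed here with a plain SVD) yields a low-dimensional feature space.

```python
import numpy as np

# Hypothetical category labels; not taken from the paper's dataset.
labels = ["reactor A", "reactor B", "furnace C"]

def pairwise_dist(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pca_reduce(X, k):
    """Project the rows of X onto their first k principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # scores in the reduced space

# One-hot encoding: one indicator column per category.
one_hot = np.eye(len(labels))

# Invented stand-ins for the vectors an NLP model would return.
# The first two rows are deliberately similar, the third is not.
emb = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.2],
    [0.0, 0.9, 0.7, 0.1],
])

d_onehot = pairwise_dist(one_hot)   # every off-diagonal entry is identical
d_emb = pairwise_dist(emb)          # entries differ with category similarity
low = pca_reduce(emb, 2)            # 2-D feature space
d_low = pairwise_dist(low)

print("one-hot distances:\n", d_onehot.round(3))
print("embedding distances:\n", d_emb.round(3))
print("2-D PCA coordinates:\n", low.round(3))
```

With only three categories, the centered data has rank at most two, so the two-component projection preserves the pairwise distances exactly; with more categories, some information is generally lost. A nonlinear method such as UMAP could replace pca_reduce, but it requires a separate library and is not shown here.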
Source journal
Computers & Chemical Engineering (Engineering & Technology: Chemical Engineering)
CiteScore: 8.70
Self-citation rate: 14.00%
Articles published: 374
Review time: 70 days
About the journal: Computers & Chemical Engineering is primarily a journal of record for new developments in the application of computing and systems technology to chemical engineering problems.