Implementing NLP in industrial process modeling: Addressing categorical variables

Impact factor 3.9; Zone 2 (Engineering & Technology); JCR Q2 (Computer Science, Interdisciplinary Applications)

Computers & Chemical Engineering; Pub Date: 2025-04-21; DOI: 10.1016/j.compchemeng.2025.109146

Eleni D. Koronaki, Geremy Loachamín-Suntaxi, Paris Papavasileiou, Dimitrios G. Giovanis, Martin Kathrein, Christoph Czettl, Andreas G. Boudouvis, Stéphane P.A. Bordas

Volume 199, Article 109146. https://www.sciencedirect.com/science/article/pii/S0098135425001504


Abstract

Important process variables are often categorical, i.e., names or labels that represent, for example, categories of inputs, types of reactors, or a sequence of steps. In this work, we use Natural Language Processing models to derive embeddings of such inputs that represent their actual meaning or reflect the “distances” between categories, i.e., how similar or dissimilar they are. This differs markedly from the current standard practice of binary, or one-hot, encoding, which replaces categorical variables with sequences of ones and zeros. Combined with dimensionality reduction techniques, either linear, such as Principal Component Analysis, or nonlinear, such as Uniform Manifold Approximation and Projection, the proposed approach leads to a meaningful, low-dimensional feature space. The significance of obtaining meaningful embeddings is illustrated in the context of an industrial coating process for cutting tools that includes both numerical and categorical inputs. In this industrial process, subject matter expertise suggests that the categorical inputs are critical for determining the final outcome, but the current state of the art cannot take this into account. The proposed approach enables feature importance analysis, a marked improvement over the current state of the art in the encoding of categorical variables. The proposed approach is not limited to the case study presented here and is suitable for applications with a similar mix of critical categorical and numerical inputs.
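The contrast the abstract draws between one-hot encoding and meaning-aware embeddings can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' pipeline: the category labels are hypothetical, character trigram counts stand in for a real NLP embedding model, and only the first principal component is computed (by power iteration) as a simple instance of the linear dimensionality reduction step. UMAP is not shown.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical category labels (illustrative only, not taken from the paper).
LABELS = ["TiN coating", "TiCN coating", "batch reactor", "plug flow reactor"]

def one_hot(labels):
    """One indicator column per category; assumes the labels are distinct."""
    k = len(labels)
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

def toy_embedding(label, n=3):
    """Stand-in for an NLP model: character n-gram counts of the label text."""
    s = f" {label.lower()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def to_dense(counters):
    """Turn sparse n-gram counts into equal-length numeric rows."""
    vocab = sorted(set().union(*counters))
    return [[float(c[g]) for g in vocab] for c in counters]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pca_first_component(rows, iters=200):
    """Scores on the leading principal component, found by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - mean[j] for j in range(d)] for r in rows]  # centered data
    w = X[0][:]  # deterministic start vector
    for _ in range(iters):
        scores = [sum(x * wj for x, wj in zip(row, w)) for row in X]  # X w
        w = [sum(X[i][j] * scores[i] for i in range(n)) for j in range(d)]  # X^T (X w)
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break
        w = [v / norm for v in w]
    return [sum(x * wj for x, wj in zip(row, w)) for row in X]

pairs = list(combinations(range(len(LABELS)), 2))

# One-hot: every pair of categories is equally far apart.
onehot_rows = one_hot(LABELS)
onehot_dists = [euclidean(onehot_rows[i], onehot_rows[j]) for i, j in pairs]

# Text-derived embedding: distances depend on how similar the labels are.
dense_rows = to_dense([toy_embedding(label) for label in LABELS])
embed_dists = {(i, j): euclidean(dense_rows[i], dense_rows[j]) for i, j in pairs}

# One-dimensional feature obtained from the embedding by PCA.
pc1 = pca_first_component(dense_rows)

for (i, j), d_onehot in zip(pairs, onehot_dists):
    print(f"{LABELS[i]!r} vs {LABELS[j]!r}: "
          f"one-hot {d_onehot:.3f}, embedding {embed_dists[(i, j)]:.3f}")
print("PC1 scores:", [round(s, 3) for s in pc1])
```

In this toy setting the one-hot distances are identical for every pair, so the encoding carries no notion of similarity, whereas the text-derived vectors place the two coating labels close together and the two reactor labels close together. In practice the trigram counts would be replaced by embeddings from a pretrained language model, and the reduction step by a library implementation of PCA or UMAP.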


Source journal

Computers & Chemical Engineering (Engineering & Technology: Chemical Engineering)

CiteScore: 8.70

Self-citation rate: 14.00%

Articles published: 374

Review time: 70 days

Journal description: Computers & Chemical Engineering is primarily a journal of record for new developments in the application of computing and systems technology to chemical engineering problems.