Leveraging Feature Headers to Learn Sparse and Semantically Pertinent Linear Models
Sasin Janpuangtong
2022 26th International Computer Science and Engineering Conference (ICSEC), published 2022-12-21
DOI: 10.1109/ICSEC56337.2022.10049377 (https://doi.org/10.1109/ICSEC56337.2022.10049377)
Citations: 0
Abstract
Readily available data and software tools have turned "analytics" into a game anyone can play. But genuine, serious modeling demands prudence: domain experts routinely use their knowledge to assess the relevance of input features and to be judicious with model selection criteria. While engaged in analysis, they marshal knowledge to consider the meaning of the data involved. Seeking to automate and reproduce such judgment, this paper proposes a framework that exploits the semantics latent in feature headers to help produce sparse and semantically pertinent linear models, rather than relying on mere correlations or (potentially spurious) patterns. The framework enables a model builder to use both the features' data and semantic information derived from their headers to search for an optimal feature subset, improving the generalization of the linear model being built. To do so, a quantity called "semantic inconsistency" is formulated to measure the degree of conflict between the weights learned from data and the strength of the relationship, in the semantic space, between the input features and the output being predicted. Through this quantity, semantic information can be incorporated into a regularization procedure in a manner that is quite general and can be computed from any form of background knowledge. Validation on four datasets indicates that taking feature semantics into account can improve model generalization: the approach outperforms classic linear regression and regularization techniques that consider only the complexity of the learned models.
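The abstract's core idea — penalizing weights that conflict with the semantic relatedness of a feature to the prediction target — can be illustrated with a minimal sketch. This is not the paper's actual formulation: the `relevance` scores here are hypothetical stand-ins for whatever semantic signal is extracted from the feature headers, and the penalty scheme (an L1 penalty scaled up for semantically irrelevant features, solved by coordinate descent) is one plausible instance of semantics-aware regularization, assumed for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    # Standard lasso soft-thresholding operator.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semantic_weighted_lasso(X, y, relevance, lam=0.1, n_iter=500):
    """Illustrative semantics-weighted lasso (NOT the paper's method).

    Each coefficient j gets an L1 penalty lam * (1 - relevance[j]):
    features whose headers look semantically unrelated to the target
    (relevance near 0) are penalized heavily and driven toward zero,
    while clearly relevant features (relevance near 1) are left
    nearly unpenalized. Solved by cyclic coordinate descent.
    """
    penalties = lam * (1.0 - np.asarray(relevance, dtype=float))
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # per-column squared norms
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, n * penalties[j]) / col_sq[j]
    return beta
```

In practice, a relevance score of this kind might be derived from, e.g., embedding similarity between a feature's header and the target's name — an assumption here, since the paper computes its "semantic inconsistency" from background knowledge in its own way.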