Leveraging Feature Headers to Learn Sparse and Semantically Pertinent Linear Models
Sasin Janpuangtong
2022 26th International Computer Science and Engineering Conference (ICSEC), published 2022-12-21
DOI: 10.1109/ICSEC56337.2022.10049377 (https://doi.org/10.1109/ICSEC56337.2022.10049377)
Citations: 0
Abstract
Readily available data and software tools have turned "analytics" into a game anyone can play. But genuine, serious modeling demands prudence: domain experts routinely use their knowledge to assess the relevance of input features and to be judicious with model selection criteria. While engaged in analysis, they marshal knowledge to consider the meaning of the data involved. Seeking to automate and reproduce such judgment, this paper proposes a framework that exploits the semantics latent in feature headers to help produce sparse and semantically pertinent linear models, rather than relying on mere correlations or (potentially spurious) patterns. The framework enables a model builder to use both the features' data and semantic information derived from their headers to search for an optimal feature subset, improving the generalization of the linear model being built. To do so, a quantity called "semantic inconsistency" is formulated to measure the degree of conflict between the weights learned from data and the strength of the relationship, in the semantic space, between the input features and the output being predicted. Through this quantity, semantic information can be incorporated into a regularization procedure in a manner that is quite general and can be computed from any form of background knowledge. Validation on four datasets indicates that taking feature semantics into account can improve model generalization: the approach outperforms classic linear regression and regularization techniques that consider only the complexity of the learned models.
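The abstract's core idea — penalizing weights that conflict with the semantic relatedness of a feature to the prediction target — can be illustrated with a minimal sketch. This is not the paper's actual formulation: the `relevance` scores here are hypothetical stand-ins for whatever semantic signal is extracted from the feature headers, and the penalty scheme (an L1 penalty scaled up for semantically irrelevant features, solved by coordinate descent) is one plausible instance of semantics-aware regularization, assumed for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    # Standard lasso soft-thresholding operator.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semantic_weighted_lasso(X, y, relevance, lam=0.1, n_iter=500):
    """Illustrative semantics-weighted lasso (NOT the paper's method).

    Each coefficient j gets an L1 penalty lam * (1 - relevance[j]):
    features whose headers look semantically unrelated to the target
    (relevance near 0) are penalized heavily and driven toward zero,
    while clearly relevant features (relevance near 1) are left
    nearly unpenalized. Solved by cyclic coordinate descent.
    """
    penalties = lam * (1.0 - np.asarray(relevance, dtype=float))
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # per-column squared norms
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, n * penalties[j]) / col_sq[j]
    return beta
```

In practice, a relevance score of this kind might be derived from, e.g., embedding similarity between a feature's header and the target's name — an assumption here, since the paper computes its "semantic inconsistency" from background knowledge in its own way.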