Modelling a Multilevel Data Structure Using a Composite Index

American Journal of Applied Mathematics and Statistics. Pub Date: 2021-07-23. DOI: 10.12691/AJAMS-9-3-1

Prabath Badullahewage, D. Attygalle

Pages: 75-82


Abstract

When modelling complex data structures related to a social phenomenon, the data units are often nested within one another across several hierarchical levels. Each level may carry several variables, and those variables may not be unique to each case or record, which makes the data structure more complex still. Multilevel modelling has been used for decades to handle such data structures, but it may not always capture the structure fully, owing to the extent of the complexity and to inherent issues of the procedure. Conversely, ignoring the multilevel structure when modelling can lead to incorrect estimates, so the model may fail to achieve acceptable accuracy. This research explains a simple approach in which a complex multilevel structure is compressed to a single level by combining higher-level variables into a composite index. The composite index also substantially reduces the number of variables considered in the modelling process. The process is illustrated using a primary data set on household education expenditure gathered through a systematic sampling survey. Several variables were collected on each household, together with another set of variables on each school-going child in the household, creating a multilevel data structure. The composite index, named the "Household Level Education Index", is developed through factor analysis, and the detailed process of its construction is explained. LASSO regression was performed to illustrate the use of the proposed composite index by predicting monthly household education expenditure through a single-level regression model. Finally, a Random Forest model was used to examine feature importance; the proposed "Household Level Education Index" was the most important feature in predicting monthly household education expenditure.
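The workflow described in the abstract can be sketched on synthetic data. This is only an illustration under stated assumptions, not the authors' procedure. The data are simulated. The names `household_vars`, `n_children`, `irrelevant` and `expenditure` are hypothetical. The index is built with a one-factor principal-component extraction, one simple variant of factor analysis that may differ from the extraction used in the paper. The LASSO is fitted with a plain coordinate-descent loop rather than a library routine. The Random Forest step is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_households = 200

# --- Synthetic stand-in data (hypothetical; not the survey from the paper) ---
latent = rng.normal(size=n_households)                 # unobserved household trait
true_loadings = np.array([0.9, 0.8, 0.7, 0.6])
household_vars = latent[:, None] * true_loadings + 0.5 * rng.normal(size=(n_households, 4))
n_children = rng.integers(1, 5, size=n_households).astype(float)  # child-derived summary
irrelevant = rng.normal(size=n_households)             # pure-noise predictor
expenditure = 2.0 * latent + 1.5 * n_children + 0.5 * rng.normal(size=n_households)


def standardise(a):
    """Centre each column and scale it to unit sample variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def composite_index(x):
    """One-factor principal-component extraction: return (scores, loadings)."""
    z = standardise(x)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigenvalues in ascending order
    top_val, top_vec = eigvals[-1], eigvecs[:, -1]
    if top_vec.sum() < 0:                              # eigenvector sign is arbitrary
        top_vec = -top_vec
    loadings = top_vec * np.sqrt(top_val)
    scores = z @ top_vec / np.sqrt(top_val)            # unit-variance index scores
    return scores, loadings


def lasso_cd(x, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - x b||^2 + alpha*||b||_1."""
    n, p = x.shape
    beta = np.zeros(p)
    col_scale = (x ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial_resid = y - x @ beta + x[:, j] * beta[j]
            rho = x[:, j] @ partial_resid / n
            # Soft-thresholding: small correlations are shrunk to exactly zero.
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_scale[j]
    return beta


index, loadings = composite_index(household_vars)

# Single-level design matrix: one row per household, with the four
# household variables replaced by the single composite index.
design = standardise(np.column_stack([index, n_children, irrelevant]))
target = expenditure - expenditure.mean()
beta = lasso_cd(design, target, alpha=0.2)

for name, coef in zip(["index", "n_children", "irrelevant"], beta):
    print(f"{name:>12s}: {coef: .3f}")
```

In this sketch the four correlated household variables enter the regression as one column, which is the variable-reduction effect the abstract describes. The penalty `alpha=0.2` is an arbitrary choice for the simulated data; in practice it would be tuned, for example by cross-validation.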


