Influence of pre-processing methods on the automatic priority prediction of native-language end-users’ maintenance requests through machine learning methods
Authors: M. D’Orazio, G. Bernardini, E. Di Giuseppe
Journal: Journal of Information Technology in Construction (JCR Q1, Engineering, Civil; impact factor 3.6)
Published: 2024-03-15
DOI: https://doi.org/10.36680/j.itcon.2024.006
Abstract
Feedback and requests from occupants are valuable sources of data for improving building management and maintenance. Most predictable faults can be identified directly by occupants and reported to facility managers in messages written in the end-users’ native language. Natural language processing methods can therefore support request identification and attribution, provided they are robust enough to extract useful information from these unstructured texts. Machine learning (ML) can help assess and manage these data, especially when many requests arrive at the same time. Pre-processing and ML methods have been widely applied to English-language databases, whereas work on other native languages remains limited, which restricts real-world applicability. Moreover, the performance of different combinations of pre-processing methods, ML algorithms and priority class schemes has rarely been compared across languages. To fill this gap, this work explores how the performance of automatic priority assignment for end-users’ maintenance requests depends on the combined influence of: (a) different natural language pre-processing methods, (b) several supervised ML algorithms, (c) two priority classification rules (2-class versus 4-class), and (d) the database language (the original database written in Italian, the end-users’ native language, and a version translated into English as the standard reference). Analyses are performed on a database of about 12,000 maintenance requests written in Italian concerning a stock of 23 buildings open to the public. A random sample of the sentences is labelled by 20 expert annotators, who attribute a priority score following the best-worst method. The labelled sentences are then pre-processed using four different approaches that progressively reduce the number of unique words (potential predictors). Five established ML methods are applied and compared in terms of accuracy, precision, recall and F1-score for each combination of pre-processing approach, ML method and number of priority classes. Results show that, within each ML algorithm, the choice of pre-processing method has limited impact on the final accuracy and average F1-score. For both the Italian and the English database, the best performance is obtained with the NN, LR and SVM methods, while NB generally fails, and with the 2-class priority classification scale. These results confirm that ML methods can effectively support facility managers in preliminary priority assessment for building maintenance, even when the request database is written in the end-users’ native language.
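The comparison metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the "average F1-score" is a macro average over classes, that the four priority classes are coded 1 to 4, and that the 2-class scale is obtained by splitting them at a hypothetical threshold of 3; the abstract states none of these details, and the toy labels are invented for illustration.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def accuracy(y_true, y_pred):
    """Fraction of requests whose predicted priority equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging rule)."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def to_two_classes(labels, threshold=3):
    """Hypothetical 4-to-2 class mapping: 1-2 -> 'low', 3-4 -> 'high'."""
    return ["high" if x >= threshold else "low" for x in labels]


# Invented toy labels on the assumed 4-class scale (not data from the study).
y_true4 = [1, 2, 3, 4, 4, 2, 1, 3]
y_pred4 = [1, 3, 3, 4, 3, 2, 2, 4]

y_true2 = to_two_classes(y_true4)
y_pred2 = to_two_classes(y_pred4)

print(accuracy(y_true4, y_pred4))   # 0.5
print(accuracy(y_true2, y_pred2))   # 0.875
print(per_class_scores(y_true2, y_pred2))
print(macro_f1(y_true2, y_pred2))
```

In this toy example, merging adjacent classes removes errors between neighbouring priority levels, which is one plausible reason a coarser scale can score higher. The numbers say nothing about the study's actual results.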