Benchmarking pre-trained text embedding models in aligning built asset information.

IF 3.9, Zone 2 (multidisciplinary journals), JCR Q1 MULTIDISCIPLINARY SCIENCES

Scientific Reports, Pub Date: 2025-07-04, DOI: 10.1038/s41598-025-09052-5

Mehrzad Shahinmoghadam, Ali Motamedi

Scientific Reports, volume 15, issue 1, article 23866. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12227769/pdf/


Abstract

Accurate mapping of built asset information to data classification systems and taxonomies is crucial for effective asset management, whether for compliance at project handover or for ad-hoc data integration. Because built asset data is complex and consists predominantly of technical text, this process remains largely manual and reliant on domain expert input. Recent breakthroughs in contextual text representation learning (text embedding), particularly through pre-trained large language models, offer promising ways to automate the cross-mapping of built asset data. However, no comprehensive evaluation has yet assessed how well these models represent the complex semantics of built asset technical terminology. This study presents a comparative benchmark of state-of-the-art text embedding models to evaluate their effectiveness in aligning built asset information with domain-specific technical concepts. The proposed datasets are derived from two renowned built asset data classification dictionaries. Benchmarking across six proposed datasets, covering clustering, retrieval, and reranking tasks, showed performance variations among models that deviate from the common trend of larger models achieving higher scores. The results underscore the importance of domain-specific evaluations and of future research into domain adaptation techniques, with instruction-tuning as a promising direction. The benchmarking resources are published as an open-source library, which will be maintained and extended to support future evaluations in this field.
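To make the retrieval task mentioned in the abstract concrete, the following is a minimal sketch of embedding-based alignment: each asset description is embedded, compared against every classification entry by cosine similarity, and the ranking is scored with mean reciprocal rank. Everything in it is an assumption made for illustration. `toy_embed` is a character n-gram stand-in, not a pre-trained model, and the class labels, queries, and function names are hypothetical; none of this reproduces the authors' datasets, metrics, or open-source library.

```python
import math
from collections import Counter


def toy_embed(text, n=3):
    """Placeholder embedding: character n-gram counts.

    ASSUMPTION: this stands in for a pre-trained text embedding model
    so the sketch runs without external dependencies.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_classes(query, class_labels, embed=toy_embed):
    """Return (label, score) pairs sorted from most to least similar."""
    q = embed(query)
    scored = [(label, cosine(q, embed(label))) for label in class_labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def mean_reciprocal_rank(queries, gold, class_labels, embed=toy_embed):
    """Average of 1/rank of the correct class over all queries."""
    total = 0.0
    for query, target in zip(queries, gold):
        ranked = [label for label, _ in rank_classes(query, class_labels, embed)]
        total += 1.0 / (ranked.index(target) + 1)
    return total / len(queries)


# Hypothetical classification entries and asset descriptions.
CLASSES = ["Air handling unit", "Centrifugal pump", "Fire alarm panel"]
QUERIES = [
    "AHU-01 air handling unit, roof",
    "pump, centrifugal, chilled water",
    "fire alarm control panel",
]
GOLD = ["Air handling unit", "Centrifugal pump", "Fire alarm panel"]

if __name__ == "__main__":
    for query in QUERIES:
        best_label, best_score = rank_classes(query, CLASSES)[0]
        print(f"{query!r} -> {best_label} ({best_score:.2f})")
    print("MRR:", mean_reciprocal_rank(QUERIES, GOLD, CLASSES))
```

Passing a different `embed` callable (for example, a wrapper around a real embedding model that returns a comparable vector type) would let the same ranking and scoring code compare models, which is the kind of comparison a benchmark of this sort performs at much larger scale.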


Source journal

Scientific Reports (Natural Science Disciplines)

CiteScore: 7.50

Self-citation rate: 4.30%

Articles published: 19567

Review time: 3.9 months

Journal description: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor of 4.380 (2021) and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).

• Engineering: Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.

• Physical sciences: Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature, often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.

• Earth and environmental sciences: Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. The field also considers the interactions between humans and these systems.

• Biological sciences: Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms to animals and plants.

• Health sciences: The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.