Challenges in Machine Learning for Document Classification in the Real Estate Industry

26th Annual European Real Estate Society Conference, DOI: 10.15396/eres2019_370

Mario Bodenbender, Björn-Martin Kurzrock



Abstract

Data rooms are becoming increasingly important for the real estate industry. They permit the creation of protected areas in which a variety of relevant documents are typically made available to interested parties. In addition to supporting purchase and sales processes, they are used primarily in larger construction projects.

The structures and index designations of data rooms have not yet been uniformly regulated on an international basis. Data room indices are created using different approaches, so they diverge both in their depth of detail and in the range of topics they cover. In practice, rules already exist for structuring documentation for individual phases, as well as for transferring data between these phases. Since all of the documentation must be transferable when changing to another life cycle phase or participant, the information must always be clearly identified and structured so that it can be protected, accessed and administered at all times. This poses a challenge for companies because the documents are subject to several rounds of restructuring during their life cycle, which are not only costly but also always entail the risk of data loss. The goal of current research is therefore seamless storage as well as a permanent and unambiguous classification of the documents across the individual life cycle phases.

In the field of text classification, machine learning offers considerable potential in terms of reduced workload, process acceleration and quality improvement. In data rooms, machine learning (in particular document classification) is used to automatically classify the documents contained in the data room, or the documents to be imported, and assign them to a suitable index point. In this manner, a document is always assigned to the class to which it belongs with the greatest probability (e.g., based on word frequency). An essential prerequisite for the success of machine learning for document classification is the quality of the document classes as well as of the training data. When defining the document classes, it must be guaranteed on the one hand that they do not overlap in content, so that documents can be clearly allocated thematically. On the other hand, it must also be possible to accommodate documents that may appear later and to scale the model according to the requirements. For the training and test sets, as well as for the documents to be analyzed later, the quality of the respective documents and their readability are also decisive factors. In order to analyze the documents effectively, the content must also be standardized, and it must be possible to remove non-relevant content in advance.

Based on the empirical analysis of 8,965 digital documents of fourteen properties from eight different owners, the paper presents a model with more than 1,300 document classes as a basis for the automated structuring and migration of documents in the life cycle of real estate. To validate these classes, machine learning algorithms were trained and analyzed to determine how, and under which conditions, the highest possible classification accuracy can be achieved. Stemmer and stop word lists used specifically for these analyses were also developed for this purpose. Using these lists further increases the accuracy of the machine learning classification, since they were specifically aligned to terms used in the real estate industry.

The paper also shows which aspects have to be taken into account at an early stage when digitizing extensive data and document inventories, since automation using machine learning can only be as good as the quality, legibility and interpretability of the data allow.
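The idea of assigning a document to its most probable class based on word frequency, after stop word removal and stemming, can be illustrated with a small sketch. The abstract does not name the algorithms, the stemmer or the stop word lists that were used, so everything below is an assumption chosen for illustration: a multinomial Naive Bayes classifier with add-one smoothing, a toy suffix-stripping stemmer, a made-up stop word set, and four invented training documents in two hypothetical classes (`lease`, `permit`).

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop words and suffixes; the paper's real-estate-specific
# lists are not reproduced in the abstract.
STOP_WORDS = {"the", "of", "for", "and", "a", "is", "to", "in", "this"}
SUFFIXES = ("ings", "ing", "ments", "ment", "s")


def stem(token):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split into alphabetic words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        tokens = tokenize(doc)
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods per token
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        # The document goes to the class with the greatest probability.
        return max(scores, key=scores.get)


# Invented training documents for two hypothetical classes.
train_docs = [
    "Lease agreement for the office tenant",
    "Tenant lease renewal and rent payments",
    "Building permit for construction works",
    "Permit approval for the building extension",
]
train_labels = ["lease", "lease", "permit", "permit"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("rent payment of the tenant"))
print(model.predict("construction permit for a new building"))
```

With more than 1,300 classes, as in the paper's model, class overlap and sparse training data per class become the dominant concerns; this sketch only shows the mechanics of the probability-based assignment, not how to handle that scale.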
