{"title":"开发一种基于规则和机器学习方法的公司实体匹配基准的法律形式分类和提取方法","authors":"Felix Kruse, Jan-Philipp Awick, J. Gómez, P. Loos","doi":"10.52825/bis.v1i.44","DOIUrl":null,"url":null,"abstract":"This paper explores the data integration process step record linkage. Thereby we focus on the entity company. For the integration of company data, the company name is a crucial attribute, which often includes the legal form. This legal form is not concise and consistent represented among different data sources, which leads to considerable data quality problems for the further process steps in record linkage. To solve these problems, we classify and ex-tract the legal form from the attribute company name. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid approach combining a rule set and a supervised machine learning model. With our developed hybrid approach, any company data sets from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.","PeriodicalId":56020,"journal":{"name":"Business & Information Systems Engineering","volume":"08 1","pages":"13-26"},"PeriodicalIF":7.4000,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Developing a Legal Form Classification and Extraction Approach for Company Entity Matching Benchmark of Rule-Based and Machine Learning Approaches\",\"authors\":\"Felix Kruse, Jan-Philipp Awick, J. Gómez, P. Loos\",\"doi\":\"10.52825/bis.v1i.44\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper explores the data integration process step record linkage. Thereby we focus on the entity company. For the integration of company data, the company name is a crucial attribute, which often includes the legal form. This legal form is not concise and consistent represented among different data sources, which leads to considerable data quality problems for the further process steps in record linkage. To solve these problems, we classify and ex-tract the legal form from the attribute company name. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid approach combining a rule set and a supervised machine learning model. With our developed hybrid approach, any company data sets from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.\",\"PeriodicalId\":56020,\"journal\":{\"name\":\"Business & Information Systems Engineering\",\"volume\":\"08 1\",\"pages\":\"13-26\"},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2021-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Business & Information Systems Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.52825/bis.v1i.44\",\"RegionNum\":3,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Business & Information Systems Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.52825/bis.v1i.44","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Developing a Legal Form Classification and Extraction Approach for Company Entity Matching Benchmark of Rule-Based and Machine Learning Approaches
This paper explores the data integration process step record linkage. Thereby we focus on the entity company. For the integration of company data, the company name is a crucial attribute, which often includes the legal form. This legal form is not concise and consistent represented among different data sources, which leads to considerable data quality problems for the further process steps in record linkage. To solve these problems, we classify and ex-tract the legal form from the attribute company name. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid approach combining a rule set and a supervised machine learning model. With our developed hybrid approach, any company data sets from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.
期刊介绍:
Business & Information Systems Engineering (BISE) is a double-blind peer-reviewed journal with a primary focus on the design and utilization of information systems for social welfare. The journal aims to contribute to the understanding and advancement of information systems in ways that benefit societal well-being.