Chinese address standardisation via hybrid approach combining statistical and rule-based methods

Xi Chen, Cheng Fang, J. Chang, Yanjiang Yang, Yuan Hong, Haibing Lu
Int. J. Internet Enterp. Manag., vol. 15, no. 1. Journal Article. Published 2019-10-22. DOI: https://doi.org/10.1504/ijiem.2019.10024752

Abstract

This paper is derived from a research project on cleansing customer address data for the State Grid Corporation of China (SGCC), the largest electric utility company in the world, ranked 2nd in the 2016 Fortune Global 500. Address standardisation involves developing a standard address format for data integration, de-duplication, and automatic address correction/completion, and is widely considered a very challenging data cleansing task. It is critical for routine business tasks, customer relationship management, business intelligence for customer-oriented corporations, and more. Address standardisation is particularly difficult for the Chinese language. The underlying reasons include: 1) the current address standard in China is only realised at the city/town level; 2) for a number of reasons, many hand-written addresses are incomplete or contain errors; 3) the characteristics of the Chinese language make it difficult to process by machine. To tackle these challenges, we propose a hybrid approach combining statistical and rule-based methods, the two mainstream address standardisation approaches. Our hybrid approach utilises the merits of both methods and can complete the address standardisation task with little human effort and computational time, while achieving high accuracy.
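The abstract does not describe the method in detail, so the following is only an illustrative sketch of what combining a rule-based step with a statistical step can look like, not the authors' implementation. The suffix list, the frequency table `COUNTS`, the similarity threshold, and all function names are assumptions made for this example. Rules split an address at administrative suffixes; a frequency-weighted string similarity then corrects components that are not in the known vocabulary.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Rule-based step: cut the address after each administrative suffix.
# This suffix list is a simplified, hypothetical rule set.
COMPONENT_RE = re.compile(r".+?(?:省|市|区|县|镇|路|街|号)")

# Statistical step: toy frequency table of components seen in clean data.
# These counts are invented for illustration only.
COUNTS = Counter({"浙江省": 80, "杭州市": 50, "西湖区": 20, "文三路": 5})


def split_by_rules(address: str) -> list[str]:
    """Split an address into components; keep any unmatched tail."""
    parts = COMPONENT_RE.findall(address)
    tail = address[sum(len(p) for p in parts):]
    return parts + [tail] if tail else parts


def correct_component(component: str, counts: Counter, min_ratio: float = 0.6) -> str:
    """Replace an unknown component with a similar, frequent known one."""
    if not component or component in counts:
        return component
    best, best_score = component, 0.0
    for candidate, freq in counts.items():
        # Only compare components of the same level (same suffix character).
        if candidate[-1] != component[-1]:
            continue
        ratio = SequenceMatcher(None, component, candidate).ratio()
        if ratio < min_ratio:
            continue
        # Weight similarity by how often the candidate occurs.
        score = ratio * math.log(1 + freq)
        if score > best_score:
            best, best_score = candidate, score
    return best


def standardise(address: str, counts: Counter = COUNTS) -> str:
    """Apply the rule-based split, then the statistical correction."""
    return "".join(correct_component(c, counts) for c in split_by_rules(address))
```

For example, `standardise("浙江省抗州市西湖区文三路90号")` splits the input into five components and corrects the mistyped `抗州市` to `杭州市`, leaving components with no close known match (such as `90号`) unchanged. A real system would need a far larger gazetteer and more careful rules.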