A Probabilistic Address Parser Using Conditional Random Fields and Stochastic Regular Grammar

2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). Pub Date: 2016-12-12. DOI: 10.1109/ICDMW.2016.0039

Minlue Wang, Valeriia Haberland, Amos Yeo, Andrew O. Martin, J. Howroyd, J. M. Bishop


Citations: 13

Abstract

Automatic semantic annotation of data from databases or the web is an important preprocessing step for data cleansing and record linkage. It can be used to resolve imperfect field alignment in a database or to identify comparable fields for matching records from multiple sources. Annotation is not trivial because data values may be noisy, containing abbreviations, variations, or misspellings. In particular, a lexicon-based approach usually yields overlapping features. In this work, we present a probabilistic address parser based on linear-chain conditional random fields (CRFs), which allow more expressive token-level features than hidden Markov models (HMMs). In addition, we propose two general enhancement techniques to improve performance. The first takes the original semi-structure of the data into account. The second post-processes the output sequences of the parser by combining the parser's conditional probability with a score function based on a learned stochastic regular grammar (SRG) that captures segment-level dependencies. Experiments compared the CRF parser to an HMM parser and a semi-Markov CRF parser on two real-world datasets. The CRF parser outperformed the HMM parser and the semi-Markov CRF on both datasets in terms of classification accuracy. Leveraging the structure of the data and combining the linear-chain CRF with the SRG further improved the parser, which achieved an accuracy of 97% on a postal dataset and 96% on a company dataset.
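The post-processing idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact score function, so everything below is an assumption made for illustration: the candidate label sequences and their probabilities stand in for an n-best list from a CRF, the SRG table is a hand-written set of segment-to-segment transition probabilities (a simple special case of a stochastic regular grammar), and the combination is a weighted sum of log scores. The names SRG, segments, srg_log_score, and rerank are hypothetical.

```python
import math
from itertools import groupby

# Hypothetical segment-level transition probabilities standing in for a
# learned SRG. "<s>" and "</s>" mark the start and end of an address.
SRG = {
    ("<s>", "NUMBER"): 0.7,
    ("<s>", "STREET"): 0.3,
    ("NUMBER", "STREET"): 0.9,
    ("NUMBER", "CITY"): 0.1,
    ("STREET", "CITY"): 0.8,
    ("STREET", "</s>"): 0.2,
    ("CITY", "POSTCODE"): 0.6,
    ("CITY", "</s>"): 0.4,
    ("POSTCODE", "</s>"): 1.0,
}


def segments(labels):
    """Collapse token-level labels into the sequence of segment labels."""
    return [label for label, _ in groupby(labels)]


def srg_log_score(labels, floor=1e-6):
    """Log-probability of the segment sequence under the toy grammar.

    Transitions missing from the table get a small floor probability
    (an assumption, so that unseen transitions are penalised, not fatal).
    """
    segs = ["<s>"] + segments(labels) + ["</s>"]
    return sum(math.log(SRG.get(pair, floor)) for pair in zip(segs, segs[1:]))


def rerank(candidates, weight=0.5):
    """Pick the best candidate under a weighted sum of log scores.

    candidates: list of (labels, crf_probability) pairs.
    weight: share given to the grammar score (assumed, not from the paper).
    """
    def combined(candidate):
        labels, prob = candidate
        return (1 - weight) * math.log(prob) + weight * srg_log_score(labels)

    return max(candidates, key=combined)[0]


# Tokens: "10", "Downing", "Street", "London" (made-up example).
candidates = [
    (["NUMBER", "CITY", "STREET", "CITY"], 0.55),    # top CRF guess, odd structure
    (["NUMBER", "STREET", "STREET", "CITY"], 0.45),  # second guess, plausible structure
]
best = rerank(candidates)
```

In this toy case the grammar score overturns the parser's top guess, because a CITY segment followed by a STREET segment is not in the table and receives the floor probability. Setting weight to 0 falls back to the parser's own ranking.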
