Cross domain web information extraction with multi-level feature model

Qian Chen, Wenhao Zhu, Chaoyou Ju, Wu Zhang
Published in: 2014 10th International Conference on Natural Computation (ICNC)
Publication date: 2014-12-08
DOI: 10.1109/ICNC.2014.6975936 (https://doi.org/10.1109/ICNC.2014.6975936)
Citations: 0

Abstract

Cross domain web information extraction with multi-level feature model
One of the key problems of information extraction is to design a cross-domain extraction procedure that can adapt to different domain topics and text formats. However, most information extraction methods focus on specific areas or have only limited scalability for semi-structured texts. We argue that the problem of cross-domain information extraction is basically introduced by domain-related features. For example, the features used for price extraction on e-commerce websites cannot be directly applied to extracting salaries from recruiting websites. In the worst case, a whole new extraction model has to be implemented even though price and salary share common characteristics. In this paper we propose a cross-domain solution that decomposes domain-relevant features into sub-features that are less domain related. The sub-features include composite features (those that can be represented as a combination of several other features) and atomic features (features that cannot be decomposed further). To manage the features effectively, we propose a multi-level feature model that organizes the features as well as their relations. With this model, we give an information extraction method that can be quickly adapted when the extraction domain changes.
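
The abstract gives no implementation details, so the following is only a minimal sketch of how atomic and composite features might be organized so that price and salary extraction share most of their sub-features. The class names (`AtomicFeature`, `CompositeFeature`), the specific features (`has_number`, `has_currency`, `has_period`), the rule that a composite matches only when all of its parts match, and the `extract` helper are all assumptions made for illustration; none of them is taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AtomicFeature:
    """A feature that is not decomposed further (hypothetical representation)."""
    name: str
    test: Callable[[str], bool]

    def matches(self, text: str) -> bool:
        return self.test(text)


@dataclass
class CompositeFeature:
    """A feature built from other features; assumed here to require all parts."""
    name: str
    parts: Tuple  # AtomicFeature or CompositeFeature instances

    def matches(self, text: str) -> bool:
        return all(part.matches(text) for part in self.parts)


# Lowest level: atomic features with little domain dependence.
has_number = AtomicFeature("has_number", lambda t: re.search(r"\d", t) is not None)
has_currency = AtomicFeature("has_currency", lambda t: re.search(r"[$€£¥]", t) is not None)
has_period = AtomicFeature(
    "has_period",
    lambda t: re.search(r"(per\s+|/\s*)(year|month|hour)", t, re.I) is not None,
)

# Middle level: a composite feature shared by both domains.
money_amount = CompositeFeature("money_amount", (has_number, has_currency))

# Top level: domain-specific features assembled from shared sub-features.
price = CompositeFeature("price", (money_amount,))
salary = CompositeFeature("salary", (money_amount, has_period))


def extract(feature, candidates: List[str]) -> List[str]:
    """Return the candidate strings that satisfy the given feature."""
    return [c for c in candidates if feature.matches(c)]


candidates = ["$19.99", "$60,000 per year", "Free shipping", "3 days left"]
money_hits = extract(money_amount, candidates)
salary_hits = extract(salary, candidates)
```

In this sketch, moving from the e-commerce domain to the recruiting domain only requires defining a new top-level composite (`salary`) on top of the existing `money_amount` feature, rather than building a new extractor from scratch.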