Cross domain web information extraction with multi-level feature model

2014 10th International Conference on Natural Computation (ICNC) Pub Date: 2014-12-08 DOI: 10.1109/ICNC.2014.6975936

Qian Chen, Wenhao Zhu, Chaoyou Ju, Wu Zhang


Citations: 0

Abstract


Cross domain web information extraction with multi-level feature model

One of the key problems of information extraction is to design a cross-domain extraction procedure that can adapt to different domain topics and text formats. However, most information extraction methods focus on specific areas or have only limited scalability for semi-structured texts. We argue that the difficulty of cross-domain information extraction stems mainly from domain-related features. For example, the features used for price extraction on e-commerce websites cannot be directly applied to extracting salaries on recruiting websites. In the worst case, a whole new extraction model has to be implemented, even though price and salary share common characteristics. In this paper we propose a cross-domain solution that decomposes domain-relevant features into sub-features that are less domain related. The sub-features include composite features (those that can be represented as a combination of several other features) and atomic features (features that cannot be decomposed further). To manage the features effectively, we propose a multi-level feature model that organizes the features as well as their relations. With this model, we give an information extraction method that can be adapted quickly when the extraction domain changes.
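The decomposition into atomic and composite features can be illustrated with a minimal sketch. The abstract does not give the paper's actual model, so everything below is an assumption made for illustration: the class names `AtomicFeature` and `CompositeFeature`, the regular expressions, and the choice to combine sub-features by logical conjunction are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class AtomicFeature:
    """Hypothetical atomic feature: a single test that is not decomposed further."""
    name: str
    test: Callable[[str], bool]

    def matches(self, text: str) -> bool:
        return self.test(text)


@dataclass
class CompositeFeature:
    """Hypothetical composite feature: holds when all of its parts hold.

    Conjunction is an assumption; the paper may combine features differently.
    """
    name: str
    parts: List[Union[AtomicFeature, "CompositeFeature"]] = field(default_factory=list)

    def matches(self, text: str) -> bool:
        return all(part.matches(text) for part in self.parts)


# Lowest level: atomic features that carry little domain meaning.
has_digits = AtomicFeature("has_digits", lambda t: re.search(r"\d", t) is not None)
has_currency = AtomicFeature("has_currency", lambda t: re.search(r"[$€¥£]", t) is not None)
has_period = AtomicFeature(
    "has_period",
    lambda t: re.search(r"(?:per|/)\s*(?:year|month|hour)", t, re.I) is not None,
)

# Middle level: a composite feature shared by both domains.
money_amount = CompositeFeature("money_amount", [has_digits, has_currency])

# Top level: domain-related features built by reusing the shared composite.
price = CompositeFeature("price", [money_amount])
salary = CompositeFeature("salary", [money_amount, has_period])

if __name__ == "__main__":
    for text in ["Now only $19.99", "Salary: $50,000 per year", "Free shipping"]:
        print(text, "| price:", price.matches(text), "| salary:", salary.matches(text))
```

In this sketch, moving from the e-commerce domain to the recruiting domain only requires defining `salary` on top of the existing `money_amount` feature instead of building a new extractor. Note that `price` as written also matches salary text; a realistic model would need further features to tell the two apart.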

