Cross domain web information extraction with multi-level feature model
Qian Chen, Wenhao Zhu, Chaoyou Ju, Wu Zhang
2014 10th International Conference on Natural Computation (ICNC), published 2014-12-08
DOI: https://doi.org/10.1109/ICNC.2014.6975936
Citations: 0
Abstract
One of the key problems in information extraction is designing a cross-domain extraction procedure that can adapt to different domain topics and text formats. However, most information extraction methods focus on specific areas or scale only to a limited extent on semi-structured texts. We argue that the difficulty of cross-domain information extraction stems mainly from domain-related features. For example, the features used for price extraction on e-commerce websites cannot be directly applied to extracting salaries on recruiting websites. In the worst case, an entire extraction model has to be implemented again, even though price and salary share common characteristics. In this paper we propose a cross-domain solution that decomposes domain-relevant features into sub-features that are less domain-related. The sub-features include composite features (those that can be represented as a combination of several other features) and atomic features (features that cannot be decomposed further). To manage the features effectively, we propose a multi-level feature model that organizes the features and the relations between them. With this model, we present an information extraction method that can be adapted quickly when the extraction domain changes.
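The distinction between atomic and composite features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `AtomicFeature` and `CompositeFeature`, the helper `keyword`, the specific features, and the choice to combine sub-features by logical conjunction are all assumptions made for illustration. The sketch only shows how a shared, less domain-specific sub-feature (here a hypothetical `money_amount`) could be reused for both price and salary.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class AtomicFeature:
    """A feature that is not decomposed further (hypothetical design)."""
    name: str
    test: Callable[[str], bool]

    def matches(self, text: str) -> bool:
        return self.test(text)


@dataclass
class CompositeFeature:
    """A feature built from other features.

    Assumption: sub-features are combined by conjunction (all must match).
    The abstract does not specify how combinations are formed.
    """
    name: str
    parts: List[Union[AtomicFeature, "CompositeFeature"]]

    def matches(self, text: str) -> bool:
        return all(part.matches(text) for part in self.parts)


def keyword(word: str) -> AtomicFeature:
    """Hypothetical helper: atomic feature for a case-insensitive keyword."""
    return AtomicFeature(f"keyword:{word}", lambda t: word.lower() in t.lower())


# Atomic features with little domain dependence.
has_number = AtomicFeature("has_number", lambda t: re.search(r"\d", t) is not None)
has_currency = AtomicFeature("has_currency", lambda t: any(c in t for c in "$€£¥"))

# A mid-level composite feature shared across domains.
money_amount = CompositeFeature("money_amount", [has_number, has_currency])

# Domain-specific features differ only in the keyword sub-feature.
price = CompositeFeature("price", [money_amount, keyword("price")])
salary = CompositeFeature("salary", [money_amount, keyword("salary")])

print(price.matches("Price: $19.99"))              # True
print(salary.matches("Price: $19.99"))             # False
print(salary.matches("Salary: $5,000 per month"))  # True
```

In this sketch, moving from the e-commerce domain to the recruiting domain only requires swapping the keyword sub-feature, while `money_amount` and its atomic parts are reused unchanged.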