Extracting information from semi-structured Internet sources
Jong-Seok Jeong, Dong-Ik Oh
ISIE 2001. 2001 IEEE International Symposium on Industrial Electronics Proceedings (Cat. No.01TH8570)
Published: 2001-06-12
DOI: https://doi.org/10.1109/ISIE.2001.931683
Citations: 1
Abstract
Information Harvest Warehouse (IHWA) is a web-based information search system. It is designed under the Component Based Software Engineering (CBSE) paradigm, in which applications are developed by integrating server-side EJB and client-side JCC components. The search system is undergoing a major reconstruction to make it more general and robust and to prepare it for evolving electronic commerce demands. In this paper, we describe the development of IHWA's meta-information gathering service (the meta gatherer), which collects and extracts information from semi-structured or unstructured data sources. The focus is on the gatherer's information extraction service for semi-structured Internet information sources, that is, XML data whose DTD is unknown. The implemented information extraction module provides clean Java programming interfaces, so it can be easily integrated with other applications. The implementation is also efficient: it analyzes a source XML file in a single pass, whereas most other systems use a two-pass approach.
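To illustrate what a one-pass analysis of DTD-unknown XML can look like, here is a minimal sketch built on the standard Java SAX API. It is not the IHWA extraction module, whose interfaces the abstract does not describe. The class name `OnePassExtractor`, the `extract` method, and the choice to key extracted text by slash-separated element path are all assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical single-pass extractor for XML with no known DTD (illustration only). */
class OnePassExtractor extends DefaultHandler {
    // Names of the currently open elements, root first.
    private final Deque<String> path = new ArrayDeque<>();
    // One text buffer per open element, parallel to `path`.
    private final Deque<StringBuilder> text = new ArrayDeque<>();
    // Element path -> non-empty text values found at that path, in document order.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.addLast(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append rather than assign.
        if (!text.isEmpty()) {
            text.peekLast().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.removeLast().toString().trim();
        if (!value.isEmpty()) {
            // The structure is discovered as the document streams by; no schema is consulted.
            fields.computeIfAbsent(String.join("/", path), k -> new ArrayList<>()).add(value);
        }
        path.removeLast();
    }

    /** Parses the XML string once and returns the text found at each element path. */
    static Map<String, List<String>> extract(String xml) throws Exception {
        OnePassExtractor handler = new OnePassExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }
}
```

Calling `OnePassExtractor.extract(xml)` on a small document returns a map from element paths such as `item/name` to the text values found there. Because the handler only reacts to streaming events, the document is read once and never held in memory as a tree, which is the general property a single-pass design relies on.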