Categorizing and extracting information from multilingual HTML documents

9th International Database Engineering & Application Symposium (IDEAS'05), Pub Date: 2005-07-25, DOI: 10.1109/IDEAS.2005.15

S. Lim, Yiu-Kai Ng


Citations: 3

Abstract

The amount of online information written in different natural languages and the number of non-English-speaking Internet users have increased tremendously during the past decade. To provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that: (i) analyzes, identifies, and categorizes the languages used in HTML documents; (ii) extracts information from HTML documents of interest written in different languages; (iii) allows the user to submit queries for retrieving extracted information in the same natural language provided by the query engine of DatAQs, using a menu-driven user interface; and (iv) processes the user's queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, and university catalogs. The average F-measure for correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure for categorizing HTML documents belonging to the car-ads application domain is 94%.
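For readers unfamiliar with the metric, the sketch below shows how an F-measure of the kind reported above is typically computed. It assumes the balanced form (the harmonic mean of precision and recall); the abstract does not state which variant the authors used, and the function name `f_measure` and the sample counts are hypothetical, not taken from the paper.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall.

    Assumption: the balanced (F1) form; the abstract does not specify the variant.
    """
    predicted = true_positives + false_positives   # documents the classifier labeled positive
    actual = true_positives + false_negatives      # documents that are truly positive
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
score = f_measure(true_positives=90, false_positives=10, false_negatives=10)
print(f"F-measure: {score:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high score requires both precision and recall to be high, which makes it a stricter summary than a simple average.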
