{"title":"Using HTML Tags to Improve Parallel Resources Extraction","authors":"Yanhui Feng, Yu Hong, Wei Tang, Jianmin Yao, Qiaoming Zhu","doi":"10.1109/IALP.2011.23","DOIUrl":null,"url":null,"abstract":"This paper proposes a new approach to extract parallel resources (including bilingual sentences and bilingual terms) from bilingual web pages, which have a primary language and a secondary language (the second language is often the translation to primary language). Our method is composed of four tasks: 1) parsing the web page into a DOM tree and segmenting inner texts of each node into series of monolingual snippets; 2) selecting adjacent snippet pairs in different languages and with higher translation scores as seeds for the next task; 3) constructing comprehensive wrappers from selected seeds, which save both HTML and surface formatting styles; 4) mining candidate instances and selecting good instances by their similarities with seeds. In this paper, we first propose to segment text by HTML tags, and select potential parallel resources by ranking all extracted candidates. According to the experimental results, our method can be applied to bilingual pages written in any other pair of languages. Experimental results also show that our approaches are effective in improving the parallel resources extraction.","PeriodicalId":297167,"journal":{"name":"2011 International Conference on Asian Language Processing","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference on Asian Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IALP.2011.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5
Abstract
This paper proposes a new approach to extract parallel resources (including bilingual sentences and bilingual terms) from bilingual web pages, which have a primary language and a secondary language (the second language is often the translation to primary language). Our method is composed of four tasks: 1) parsing the web page into a DOM tree and segmenting inner texts of each node into series of monolingual snippets; 2) selecting adjacent snippet pairs in different languages and with higher translation scores as seeds for the next task; 3) constructing comprehensive wrappers from selected seeds, which save both HTML and surface formatting styles; 4) mining candidate instances and selecting good instances by their similarities with seeds. In this paper, we first propose to segment text by HTML tags, and select potential parallel resources by ranking all extracted candidates. According to the experimental results, our method can be applied to bilingual pages written in any other pair of languages. Experimental results also show that our approaches are effective in improving the parallel resources extraction.