A Hybrid Approach to Sentence Alignment Using Genetic Algorithm
M. Gautam, R. Sinha
2007 International Conference on Computing: Theory and Applications (ICCTA'07)
Published: 2007-03-05
DOI: 10.1109/ICCTA.2007.9 (https://doi.org/10.1109/ICCTA.2007.9)
Citations: 6
Abstract
Sentence alignment in bilingual corpora has been an active research topic in the machine translation community. Previous work has aligned sentences in bilingual corpora of English and European languages, as well as some Asian languages such as Chinese and Japanese. This work introduces a novel approach to sentence alignment in bilingual corpora that combines lexical and statistical information about the language pair using a genetic algorithm. The only lexical information used is a restricted, incomplete bilingual dictionary. The algorithm is based on a weighted sum of a set of statistical parameters and a parameter denoting the degree of dictionary match. No other lexical information, such as part-of-speech tagging, chunking, or n-gram statistics, is used. The approach has been tested on the structurally dissimilar language pair English-Hindi and is shown to perform well even under noisy conditions. We compare our results with those of the Microsoft alignment tool on the same corpus and find our results to be superior.
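To make the weighted-sum idea concrete, the following is a minimal sketch of how a candidate sentence pair might be scored. It is not the authors' implementation: the abstract does not name the statistical parameters, so the length-ratio feature, the function names, the equal weights, and the toy transliterated dictionary are all assumptions chosen for illustration. The genetic algorithm itself is not sketched, since the abstract does not say how candidate alignments are encoded or evolved.

```python
def length_ratio_score(src, tgt):
    """Assumed statistical feature: shorter/longer word count, in [0, 1]."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def dictionary_match_score(src, tgt, dictionary):
    """Fraction of source words with a dictionary translation found in tgt.

    `dictionary` maps a source word to a set of target words and may be
    incomplete; missing words simply count as non-matches.
    """
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if tgt_words & dictionary.get(w, set()))
    return hits / len(src_words)


def pair_score(src, tgt, dictionary, weights=(0.5, 0.5)):
    """Weighted sum of the two features; the weights here are hypothetical."""
    w_len, w_dict = weights
    return (w_len * length_ratio_score(src, tgt)
            + w_dict * dictionary_match_score(src, tgt, dictionary))


# Toy, hypothetical English-to-Hindi (transliterated) dictionary.
toy_dictionary = {"big": {"bada"}, "house": {"ghar"}}
score = pair_score("the big house", "bada ghar", toy_dictionary)
```

In a setup like this, a search procedure such as a genetic algorithm could use `pair_score` as part of its fitness function, preferring alignments whose sentence pairs score highly overall.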