A filter for syntactically incomparable parallel sentences
Martin Kroon, S. Barbiers, J. Odijk, S. V. D. Pas
Linguistics in the Netherlands, published 2019-11-05
DOI: 10.1075/avt.00029.kro (https://doi.org/10.1075/avt.00029.kro)
Massive automatic comparison of languages in parallel corpora
will greatly speed up and enhance comparative syntactic research. Automatically
extracting and mining syntactic differences from parallel corpora requires a
pre-processing step that filters out sentence pairs that cannot be compared
syntactically, for example because they involve “free” translations. In this
paper we explore four possible filters: the Damerau-Levenshtein distance between
POS-tags, the sentence-length ratio, the graph-edit distance between dependency
parses, and a combination of the three in a logistic regression model. Results
suggest that the dependency-parse filter is the most stable across language
pairs, while the combination filter achieves the best results.
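
To make the first two filters concrete, here is a minimal sketch, not the authors' implementation. It assumes each sentence is already available as a list of POS tags, uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, and normalizes the distance by the longer sentence. The function names and both thresholds are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    between two tag sequences: insertions, deletions, substitutions and
    transpositions of adjacent tags each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length divided by longer length, in [0, 1]."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def keep_pair(
    src_tags: Sequence[str],
    tgt_tags: Sequence[str],
    max_norm_dist: float = 0.5,  # hypothetical threshold, not from the paper
    min_ratio: float = 0.6,      # hypothetical threshold, not from the paper
) -> bool:
    """Keep a sentence pair only if both illustrative filters accept it."""
    longest = max(len(src_tags), len(tgt_tags))
    if longest == 0:
        return False
    norm_dist = osa_distance(src_tags, tgt_tags) / longest
    return norm_dist <= max_norm_dist and length_ratio(src_tags, tgt_tags) >= min_ratio
```

For example, `keep_pair(["DET", "NOUN", "VERB"], ["DET", "VERB", "NOUN"])` accepts the pair, since the two tag sequences differ by a single transposition and have equal length. In practice the thresholds would have to be tuned per language pair on annotated data.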
About the journal:
Linguistics in the Netherlands is a series of annual publications, sponsored by the Dutch Linguistics Association (Algemene Vereniging voor Taalwetenschap) and published by John Benjamins Publishing Company since Volume 8 in 1991. Each volume contains a careful selection through peer review of papers presented at the annual meeting of the society. The aim of the annual meeting is to provide members with an opportunity to report on their work in progress. Each volume presents an overview of research in different fields of linguistics in the Netherlands containing articles on phonetics, phonology, morphology, syntax and semantics.