Veronique Hoste, Klaar Vanopstal, Ayla Rigouts Terryn, Els Lefever
Across Languages and Cultures, vol. 20, pp. 197-211. Published 2019-12-01. DOI: https://doi.org/10.1556/084.2019.20.2.3
The Trade-off between Quantity and Quality. Comparing a Large Crawled Corpus and a Small Focused Corpus for Medical Terminology Extraction
We investigate the cost-effectiveness of special-purpose crawled corpora versus more focused corpora for automatic terminology extraction (ATE). Our focus is on medical terminology on heart failure for two languages: English, for which we have more web and specialized resources at our disposal, and the less-resourced Dutch. We show that, although term density is higher in the dedicated corpora for both languages, the potential for term extraction is higher in the crawled corpora than in the dedicated corpora. Furthermore, in a set of experiments in which we evaluate both types of corpora while keeping size constant, we observe that more Gold Standard (GS) terms are covered by the "noisy" crawled corpus than by a dedicated corpus of the same size.
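The two quantities contrasted in the abstract, term density and Gold Standard coverage, can be illustrated with a minimal sketch. The definitions used here are assumptions made for illustration, not the authors' published measures: density is taken as gold-term occurrences per corpus token, and coverage as the fraction of gold terms that occur at least once. The function names, the simple regex tokenizer, and the toy texts are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens, keeping internal hyphens."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def count_occurrences(tokens, term_tokens):
    """Count how often a (possibly multi-word) term occurs in a token list."""
    n = len(term_tokens)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == term_tokens
    )


def term_density(corpus_text, gold_terms):
    """Assumed definition: gold-term occurrences divided by corpus tokens."""
    tokens = tokenize(corpus_text)
    if not tokens:
        return 0.0
    hits = sum(count_occurrences(tokens, tokenize(t)) for t in gold_terms)
    return hits / len(tokens)


def gold_coverage(corpus_text, gold_terms):
    """Assumed definition: share of gold terms found at least once."""
    if not gold_terms:
        return 0.0
    tokens = tokenize(corpus_text)
    covered = sum(
        1 for t in gold_terms if count_occurrences(tokens, tokenize(t)) > 0
    )
    return covered / len(gold_terms)


# Toy data, invented for illustration only.
gold = ["heart failure", "diuretics", "ejection fraction"]
dedicated = (
    "Heart failure is treated with diuretics. "
    "Heart failure patients need diuretics."
)
crawled = (
    "Our clinic blog covers many topics. Today we discuss heart failure, "
    "diuretics and what a low ejection fraction means for your daily life."
)

for name, text in [("dedicated", dedicated), ("crawled", crawled)]:
    print(name, term_density(text, gold), gold_coverage(text, gold))
```

In this toy setting the dedicated text is denser in gold terms, while the longer, noisier text mentions more distinct gold terms. That mirrors the kind of contrast the abstract describes, but it says nothing about the actual figures reported in the paper.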
About the journal:
Across Languages and Cultures publishes original articles and reviews on all sub-disciplines of Translation and Interpreting (T/I) Studies: general T/I theory, descriptive T/I studies and applied T/I studies. Special emphasis is laid on questions of multilingualism, language policy and translation policy. Publications on new research methods and models are encouraged. The journal also publishes book reviews, news, announcements and advertisements.