{"title":"More than Words: Using Token Context to Improve Canonicalization of Historical German","authors":"Bryan Jurish","doi":"10.21248/jlcl.25.2010.127","DOIUrl":"https://doi.org/10.21248/jlcl.25.2010.127","url":null,"abstract":"Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a fixed lexicon accessed by orthographic form, such as information retrieval systems (Sokirko, 2003; Cafarella and Cutting, 2004), part-of-speech taggers (DeRose, 1988; Brill, 1992; Schmid, 1994), simple word stemmers (Lovins, 1968; Porter, 1980), or more sophisticated morphological analyzers (Geyken and Hanneforth, 2006; Zielinski et al., 2009).1 Traditional approaches to the problems arising from an attempt to incorporate historical text into such a system rely on the use of additional specialized (often application-specific) lexical resources to explicitly encode known historical variants. Such specialized lexica are not only costly and time-consuming to create, but also – in their simplest form of static finite word lists – necessarily incomplete in the case of a morphologically productive language like German, since a simple finite lexicon cannot account for highly productive morphological processes such as nominal composition (cf. Kempken et al., 2006). To facilitate the extension of synchronically-oriented natural language processing techniques to historical text while minimizing the need for specialized lexical resources, one may first attempt an automatic canonicalization of the input text. Canonicalization approaches (Jurish, 2008, 2010a; Gotscharek et al., 2009a) treat orthographic variation phenomena in historical text as instances of an error-correction problem (Shannon, 1948; Kukich, 1992; Brill and Moore, 2000), seeking to map each (unknown) word of the input text to one or more extant canonical cognates: synchronically active types which preserve both the root and morphosyntactic features of the associated historical form(s). To the extent that the canonicalization was successful, application-specific processing can then proceed normally using the returned canonical forms as input, without any need for additional modifications to the application lexicon. I distinguish between type-wise canonicalization techniques which process each input word independently and token-wise techniques which make use of the context in which a given instance of a word occurs. In this paper, I present a token-wise canonicalization","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"48 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132708535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Who Can See the Forest for the Trees? Extracting Multiword Negative Polarity Items from Dependency-Parsed Text","authors":"F. Richter, Fabienne Fritzinger, Marion Weller","doi":"10.21248/jlcl.25.2010.130","DOIUrl":"https://doi.org/10.21248/jlcl.25.2010.130","url":null,"abstract":"Ever since the groundbreaking work by Fauconnier (1975) and Ladusaw (1980), research on negative polarity items (npis) has been dominated by two fundamental assumptions about the licensing contexts of npis and their inherent semantic-pragmatic properties. The contexts in which npis may occur felicitously are said to have the semantic property of being downward entailing (which we will briefly explain below), and the elements themselves are often said to be located at the end of a pragmatically motivated scale, typically signalling a minimal amount, a smallest size, or a similar concept. While the Ladusaw-Fauconnier theory has been substantially refined over time, and while there are very diverse variations on how the technical details of the theory are spelled out, its core insights are currently widely accepted and remain a point of reference for practically any ‘formal’ theory of npis. Some theories are syntactic in nature and formulate the relevant scope constraints relative to (possibly quite abstract) syntactic configurations, others are semantic and define hierarchies of negations of varying strength, and yet another group of theories is predominantly pragmatic, relying heavily on scalar implicatures, domain widening, and related concepts. There are, of course, also approaches in which syntax, semantics, and pragmatics all play a role. Overall, the number of papers and books that have been published on the subject of npis over the last 40 years is nothing short of intimidating.1 Given the sheer volume of the npi literature, it is all the more surprising and striking that much of the discussion revolves around a very small set of items. Some of the most sophisticated and influential papers in particular, such as Kadmon and Landman (1993), Krifka (1995), and Chierchia (2006), discuss hardly more than a handful of items, and some studies focus almost exclusively on a single one, viz. English any, which can be regarded as the classic example of a minimizer, with its variants anything, anyone, anybody, anywhere, etc. Since any, one of the most prominent items of interest, is a minimizer, investigations into the significance of this particular property for the entire class of npis have become a dominant topic and occasionally even push aside the observation that being a minimizer is neither a necessary nor a sufficient property of npis. The tendency to build a very comprehensive theory on an extremely small, carefully chosen but deeply researched set of examples is thus characteristic of large parts of the literature on npis. As a result of this narrow empirical focus, current theories may treat only a fraction of the properties and behavior of npis.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"62 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124349298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAT und MÜ - Getrennte Welten?","authors":"Dino Azzano","doi":"10.21248/jlcl.24.2009.120","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.120","url":null,"abstract":"This article examines the relationship between computer-assisted translation (CAT) and machine translation (MT). The focus is on computer-assisted translation systems and their integrability with machine translation systems. First, some terminological distinctions are drawn in order to clarify the most important concepts. The main differences between CAT and MT are also noted. An overview of the central components of a CAT system and of the most common products on the market serves as the basis for a description of the integration options. Four example processes illustrate the concrete workflow. Finally, advantages and disadvantages of an integration of CAT and MT are discussed.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128415324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integration von regel- und statistikbasierten Methoden in der maschinellen Übersetzung","authors":"K. Eberle","doi":"10.21248/jlcl.24.2009.121","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.121","url":null,"abstract":"Warren Weaver's appeal to the academic world to investigate to what extent it is possible to translate texts automatically is commonly regarded as the beginning of machine translation (Weaver (2003); Hutchins (1995)). Roughly 60 years have passed since then, and the problem of automatically translating texts is by no means solved, yet it currently stands in the focus of computational linguistics research like hardly any other. At the beginning of this research, computing problems were in the foreground, along with, architecturally, the so-called direct translation architecture, often characterized in slogan form as word-for-word translation. Thereafter, in the second generation of machine translation, the so-called rule-based translation systems took center stage; for all the diversity that emerged over the years, their common basic principle is the idea of assigning abstract structural analyses to sentences and translating on that basis. (These systems are grouped under the label RBMT, for Rule-Based Machine Translation.) In the third generation, statistical models stand in the foreground (these are instances of so-called SMT, for Statistics-Based Machine Translation). Without yet founding a genuine fourth generation, research today centers on attempts to derive as much knowledge as possible from language data and to combine methods from different translation traditions as efficiently as possible in so-called hybrid approaches. One of the greatest problems for machine translation, presumably the central problem altogether, was and is ambiguity. This property allows natural languages to achieve maximal expressiveness with the smallest possible number of signs and sign combinations. Confusion is nevertheless avoided because contextual knowledge is exploited extremely efficiently to single out the correct meaning and filter out the wrong interpretations. This, however, is the greatest obstacle to the success of simple conceptions of translation. Because of ambiguity, it is not sufficient to set up translation rules as isolated one-to-one word correspondences; they must be defined as context-sensitive n:m relations, and truly high-quality translation ultimately requires taking the whole text and its purpose into view in order to capture the contextual constraints completely.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129841532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Label Approaches to Web Genre Identification","authors":"Vedrana Vidulin, M. Luštrek, M. Gams","doi":"10.21248/jlcl.24.2009.115","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.115","url":null,"abstract":"A web page is a complex document which can share conventions of several genres, or contain several parts, each belonging to a different genre. To properly address this genre interplay, a recent proposal in automatic web genre identification is multi-label classification. The dominant approach to such classification is to transform one multi-label machine learning problem into several sub-problems of learning binary single-label classifiers, one for each genre. In this paper we explore multi-class transformation, where each combination of genres is labeled with a single distinct label. This approach is then compared to the binary approach to determine which one better captures the multi-label aspect of web genres. Experimental results show that both approaches failed to properly address multi-genre web pages; the differences obtained resulted from variations in the recognition of single-genre web pages.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128124969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"METIS-II: Low-Resource MT for German to English","authors":"M. Carl","doi":"10.21248/jlcl.24.2009.122","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.122","url":null,"abstract":"METIS-II was an EU-FET MT project running from October 2004 to September 2007 which aimed at translating free text input without resorting to parallel corpora. The idea was to use ‘basic’ linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project had four partners, translating from their ‘home’ languages Greek, Dutch, German, and Spanish into English. The paper outlines the basic ideas of the project, their implementation, the resources used, and the results obtained, with emphasis on the German implementation.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123320176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Textsortenbezogene linguistische Untersuchungen zum Einsatz von Translation-Memory-Systemen an einem Korpus deutscher und spanischer Patentschriften","authors":"Heribert Härtinger","doi":"10.21248/jlcl.24.2009.123","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.123","url":null,"abstract":"Patent specifications are a frequently translated text type, yet despite their high degree of linguistic standardization they have so far not been among the typical fields of application for CAT tools. The study presented here used a corpus of German and Spanish patent specifications to investigate the relationship between linguistic text-type features and the practical benefit of integrated translation systems. The investigation focused on the analysis of text-type-typical recurrence patterns with regard to the expected consequences for the retrieval effectiveness of commercial translation memory systems, as well as on text-type characteristics that may affect the usability of the retrieved matches. The bilingual corpus, compiled according to the requirements of the research question, consisted of 60 complete texts and served both to register text-internal and text-external recurrences and to assess their retrieval relevance by means of exemplary comparisons of sentence content. The analysis was carried out from the perspective of an integrated translation environment offering concordance search and an attached terminological-phraseographic or textographic database, so that text-type-typical recurrences below the sentence level could also be taken into account in the results. The test software was the Translator's Workbench by SDL/Trados.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116211563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Psychological and Computational Study of Sub-Sentential Genre Recognition","authors":"Philip M. McCarthy, John C. Myers, Stephen W. Briner, A. Graesser, D. McNamara","doi":"10.21248/jlcl.24.2009.112","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.112","url":null,"abstract":"Genre recognition is a critical facet of text comprehension and text classification. In three experiments, we assessed the minimum number of words in a sentence needed for genre recognition to occur, the distribution of genres across text, and the relationship between reading ability and genre recognition. We also propose and demonstrate a computational model for genre recognition. Using corpora of narrative, history, and science sentences, we found that readers could recognize the genre of over 80% of the sentences and that recognition generally occurred within the first three words of a sentence; in fact, 51% of the sentences could be correctly identified by the first word alone. We also report findings that many texts are heterogeneous in terms of genre: around 20% of text appears to include sentences from other genres. In addition, our computational model closely fits the human judgments. This study offers a novel approach to genre identification at the sub-sentential level and has important implications for fields as diverse as reading comprehension and computational text classification.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127247305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Sensitive Feature Extraction and Selection in Genre Classification","authors":"Ryan Levering, M. Cutler","doi":"10.21248/jlcl.24.2009.113","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.113","url":null,"abstract":"Automatic genre classification of Web pages is currently young compared to other Web classification tasks. Corpora are just starting to be collected and organized in a systematic way, feature extraction techniques are inconsistent and not well detailed, genres are constantly in dispute, and novel applications have not been implemented. This paper attempts to review and make progress in the area of feature extraction, an area that we believe can benefit all Web page classification, and genre classification in particular. We first present a framework for the extraction of various Web-specific feature groups from distinct data models, based on a tree of potential models and the transformations that create them. Then we introduce the concept of cost-sensitivity to this tree and provide an algorithm for performing wrapper-based feature selection on it. Finally, we apply the cost-sensitive feature selection algorithm to two genre corpora and analyze the performance of the classification results.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116850864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Web Genre Benchmark Under Construction","authors":"Marina Santini, S. Sharoff","doi":"10.21248/jlcl.24.2009.117","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.117","url":null,"abstract":"The project presented in this article focuses on the creation of web genre benchmarks (a.k.a. web genre reference corpora or web genre test collections), i.e. newly conceived test collections against which it will be possible to judge the performance of future genre-enabled web applications. The creation of web genre benchmarks is of key importance for the next generation of web applications because, at present, it is impossible to evaluate existing and in-progress genre-enabled prototypes. We suggest focusing on the following key points: (1) propose a characterisation of genre suitable for digital environments and empirical approaches, shared by a number of genre experts working in automatic genre identification; (2) define the criteria for the construction of web genre benchmarks and draw up annotation guidelines; (3) create web genre benchmarks in several languages; (4) validate the methodology and evaluate the results. We describe work in progress and our plans for future development. Since it is often difficult to anticipate the problems that arise when developing a large resource, we present our ideas, our current views on genre issues and our first results with the aim of stimulating a proactive discussion, so that the stakeholders, i.e. the researchers who will ultimately benefit from the resource, can contribute to its design.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132948516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}