{"title":"More than Words: Using Token Context to Improve Canonicalization of Historical German","authors":"Bryan Jurish","doi":"10.21248/jlcl.25.2010.127","DOIUrl":"https://doi.org/10.21248/jlcl.25.2010.127","url":null,"abstract":"Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a fixed lexicon accessed by orthographic form, such as information retrieval systems (Sokirko, 2003; Cafarella and Cutting, 2004), part-of-speech taggers (DeRose, 1988; Brill, 1992; Schmid, 1994), simple word stemmers (Lovins, 1968; Porter, 1980), or more sophisticated morphological analyzers (Geyken and Hanneforth, 2006; Zielinski et al., 2009).1 Traditional approaches to the problems arising from an attempt to incorporate historical text into such a system rely on the use of additional specialized (often application-specific) lexical resources to explicitly encode known historical variants. Such specialized lexica are not only costly and time-consuming to create, but also – in their simplest form of static finite word lists – necessarily incomplete in the case of a morphologically productive language like German, since a simple finite lexicon cannot account for highly productive morphological processes such as nominal composition (cf. Kempken et al., 2006). To facilitate the extension of synchronically-oriented natural language processing techniques to historical text while minimizing the need for specialized lexical resources, one may first attempt an automatic canonicalization of the input text. Canonicalization approaches (Jurish, 2008, 2010a; Gotscharek et al., 2009a) treat orthographic variation phenomena in historical text as instances of an error-correction problem (Shannon, 1948; Kukich, 1992; Brill and Moore, 2000), seeking to map each (unknown) word of the input text to one or more extant canonical cognates: synchronically active types which preserve both the root and morphosyntactic features of the associated historical form(s). To the extent that the canonicalization was successful, application-specific processing can then proceed normally using the returned canonical forms as input, without any need for additional modifications to the application lexicon. I distinguish between type-wise canonicalization techniques which process each input word independently and token-wise techniques which make use of the context in which a given instance of a word occurs. In this paper, I present a token-wise canonicalization","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"48 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132708535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Who Can See the Forest for the Trees? Extracting Multiword Negative Polarity Items from Dependency-Parsed Text","authors":"F. Richter, Fabienne Fritzinger, Marion Weller","doi":"10.21248/jlcl.25.2010.130","DOIUrl":"https://doi.org/10.21248/jlcl.25.2010.130","url":null,"abstract":"Ever since the groundbreaking work by Fauconnier (1975) and Ladusaw (1980), research on negative polarity items (npis) has been dominated by two fundamental assumptions about the licensing contexts of npis and their inherent semantic-pragmatic properties. The contexts in which npis may occur felicitously are said to have the semantic property of being downward entailing (which we will briefly explain below), and the elements themselves are often said to be located at the end of a pragmatically motivated scale, typically signalling a minimal amount, a smallest size, or a similar concept. While the Ladusaw-Fauconnier theory has been substantially refined over time, and while there are very diverse variations on how the technical details of the theory are spelled out, its core insights are currently widely accepted and remain a point of reference for practically any ‘formal’ theory of npis. Some theories are syntactic in nature and formulate the relevant scope constraints relative to (possibly quite abstract) syntactic configurations, others are semantic and define hierarchies of negations of varying strength, and yet another group of theories is predominantly pragmatic, relying heavily on scalar implicatures, domain widening, and related concepts. There are, of course, also approaches in which syntax, semantics, and pragmatics all play a role. Overall, the number of papers and books that have been published on the subject of npis over the last 40 years is nothing short of intimidating.1 Given the sheer volume of the npi literature, it is all the more surprising and striking that much of the discussion revolves around a very small set of items. Some of the most sophisticated and influential papers in particular, such as Kadmon and Landman (1993), Krifka (1995), and Chierchia (2006), discuss hardly more than a handful of items, and some studies focus almost exclusively on a single one, viz. English any, which can be regarded as the classic example of a minimizer, with its variants anything, anyone, anybody, anywhere, etc. Since any, one of the most prominent items of interest, is a minimizer, investigations into the significance of this particular property for the entire class of npis have become a dominant topic and occasionally even push aside the observation that being a minimizer is neither a necessary nor a sufficient property of npis. The tendency to build a very comprehensive theory on an extremely small, carefully chosen but deeply researched set of examples is thus characteristic of large parts of the literature on npis. As a result of this narrow empirical focus, current theories may treat only a fraction of the properties and behavior of npis.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"62 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124349298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAT und MÜ - Getrennte Welten?","authors":"Dino Azzano","doi":"10.21248/jlcl.24.2009.120","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.120","url":null,"abstract":"This article examines the relationship between computer-assisted translation (CAT) and machine translation (MT). The focus is on computer-assisted translation systems and their integrability with machine translation systems. First, some terminological distinctions are drawn in order to clarify the most important concepts. The main differences between CAT and MT are also noted. An overview of the central components of a CAT system and of the most common products on the market serves as the basis for a description of the integration options. Four example processes illustrate the concrete workflow. Finally, advantages and disadvantages of an integration of CAT and MT are discussed.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128415324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integration von regel- und statistikbasierten Methoden in der maschinellen Übersetzung","authors":"K. Eberle","doi":"10.21248/jlcl.24.2009.121","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.121","url":null,"abstract":"Warren Weaver's appeal to the academic world to investigate to what extent it is possible to translate texts automatically is commonly regarded as the beginning of machine translation (Weaver (2003); Hutchins (1995)). Roughly 60 years have passed since then, and the problem of automatically translating texts is by no means solved, yet it currently stands in the focus of computational linguistics research like hardly any other. At the beginning of this research, computing problems were in the foreground, along with, architecturally, the so-called direct translation architecture, often characterized in slogan form as word-for-word translation. Thereafter, in the second generation of machine translation, the so-called rule-based translation systems took center stage; for all the diversity that emerged over the years, their common basic principle is the idea of assigning abstract structural analyses to sentences and translating on that basis. (These systems are grouped under the label RBMT, for Rule-Based Machine Translation.) In the third generation, statistical models stand in the foreground (these are instances of so-called SMT, for Statistics-Based Machine Translation). Without yet founding a genuine fourth generation, research today centers on attempts to derive as much knowledge as possible from language data and to combine methods from different translation traditions as efficiently as possible in so-called hybrid approaches. One of the greatest problems for machine translation, presumably the central problem altogether, was and is ambiguity. This property allows natural languages to achieve maximal expressiveness with the smallest possible number of signs and sign combinations. Confusion is nevertheless avoided because contextual knowledge is exploited extremely efficiently to single out the correct meaning and filter out the wrong interpretations. This, however, is the greatest obstacle to the success of simple conceptions of translation. Because of ambiguity, it is not sufficient to set up translation rules as isolated one-to-one word correspondences; they must be defined as context-sensitive n:m relations, and truly high-quality translation ultimately requires taking the whole text and its purpose into view in order to capture the contextual constraints completely.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129841532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Label Approaches to Web Genre Identification","authors":"Vedrana Vidulin, M. Luštrek, M. Gams","doi":"10.21248/jlcl.24.2009.115","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.115","url":null,"abstract":"A web page is a complex document which can share conventions of several genres, or contain several parts, each belonging to a different genre. To properly address this genre interplay, a recent proposal in automatic web genre identification is multi-label classification. The dominant approach to such classification is to transform one multi-label machine learning problem into several sub-problems of learning binary single-label classifiers, one for each genre. In this paper we explore multi-class transformation, where each combination of genres is labeled with a single distinct label. This approach is then compared to the binary approach to determine which one better captures the multi-label aspect of web genres. Experimental results show that both approaches failed to properly address multi-genre web pages; the differences obtained resulted from variations in the recognition of single-genre web pages.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128124969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"METIS-II: Low-Resource MT for German to English","authors":"M. Carl","doi":"10.21248/jlcl.24.2009.122","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.122","url":null,"abstract":"METIS-II was an EU-FET MT project running from October 2004 to September 2007 which aimed at translating free text input without resorting to parallel corpora. The idea was to use ‘basic’ linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project had four partners, translating from their ‘home’ languages Greek, Dutch, German, and Spanish into English. The paper outlines the basic ideas of the project, their implementation, the resources used, and the results obtained, with emphasis on the German implementation.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123320176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Textsortenbezogene linguistische Untersuchungen zum Einsatz von Translation-Memory-Systemen an einem Korpus deutscher und spanischer Patentschriften","authors":"Heribert Härtinger","doi":"10.21248/jlcl.24.2009.123","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.123","url":null,"abstract":"Patent specifications are a frequently translated text type, yet despite their high degree of linguistic standardization they have so far not been among the typical fields of application for CAT tools. The study presented here used a corpus of German and Spanish patent specifications to investigate the relationship between linguistic text-type features and the practical benefit of integrated translation systems. The investigation focused on the analysis of text-type-typical recurrence patterns with regard to the expected consequences for the retrieval effectiveness of commercial translation memory systems, as well as on text-type characteristics that may affect the usability of the retrieved matches. The bilingual corpus, compiled according to the requirements of the research question, consisted of 60 complete texts and served both to register text-internal and text-external recurrences and to assess their retrieval relevance by means of exemplary comparisons of sentence content. The analysis was carried out from the perspective of an integrated translation environment offering concordance search and an attached terminological-phraseographic or textographic database, so that text-type-typical recurrences below the sentence level could also be taken into account in the results. The test software was the Translator's Workbench by SDL/Trados.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116211563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Psychological and Computational Study of Sub-Sentential Genre Recognition","authors":"Philip M. McCarthy, John C. Myers, Stephen W. Briner, A. Graesser, D. McNamara","doi":"10.21248/jlcl.24.2009.112","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.112","url":null,"abstract":"Genre recognition is a critical facet of text comprehension and text classification. In three experiments, we assessed the minimum number of words in a sentence needed for genre recognition to occur, the distribution of genres across text, and the relationship between reading ability and genre recognition. We also propose and demonstrate a computational model for genre recognition. Using corpora of narrative, history, and science sentences, we found that readers could recognize the genre of over 80% of the sentences and that recognition generally occurred within the first three words of a sentence; in fact, 51% of the sentences could be correctly identified by the first word alone. We also report findings that many texts are heterogeneous in terms of genre: around 20% of text appears to include sentences from other genres. In addition, our computational model closely fits the human judgments. This study offers a novel approach to genre identification at the sub-sentential level and has important implications for fields as diverse as reading comprehension and computational text classification.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127247305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Sensitive Feature Extraction and Selection in Genre Classification","authors":"Ryan Levering, M. Cutler","doi":"10.21248/jlcl.24.2009.113","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.113","url":null,"abstract":"Automatic genre classification of Web pages is currently young compared to other Web classification tasks. Corpora are just starting to be collected and organized in a systematic way, feature extraction techniques are inconsistent and not well detailed, genres are constantly in dispute, and novel applications have not been implemented. This paper attempts to review and make progress in the area of feature extraction, an area that we believe can benefit all Web page classification, and genre classification in particular. We first present a framework for the extraction of various Web-specific feature groups from distinct data models, based on a tree of potential models and the transformations that create them. Then we introduce the concept of cost-sensitivity to this tree and provide an algorithm for performing wrapper-based feature selection on it. Finally, we apply the cost-sensitive feature selection algorithm to two genre corpora and analyze the performance of the classification results.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116850864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Web Genre Benchmark Under Construction","authors":"Marina Santini, S. Sharoff","doi":"10.21248/jlcl.24.2009.117","DOIUrl":"https://doi.org/10.21248/jlcl.24.2009.117","url":null,"abstract":"The project presented in this article focuses on the creation of web genre benchmarks (a.k.a. web genre reference corpora or web genre test collections), i.e. newly conceived test collections against which it will be possible to judge the performance of future genre-enabled web applications. The creation of web genre benchmarks is of key importance for the next generation of web applications because, at present, it is impossible to evaluate existing and in-progress genre-enabled prototypes. We suggest focusing on the following key points: (1) propose a characterisation of genre suitable for digital environments and empirical approaches, shared by a number of genre experts working in automatic genre identification; (2) define the criteria for the construction of web genre benchmarks and draw up annotation guidelines; (3) create web genre benchmarks in several languages; (4) validate the methodology and evaluate the results. We describe work in progress and our plans for future development. Since it is often difficult to anticipate the problems that arise when developing a large resource, we present our ideas, our current views on genre issues and our first results with the aim of stimulating a proactive discussion, so that the stakeholders, i.e. the researchers who will ultimately benefit from the resource, can contribute to its design.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132948516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}