{"title":"Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup","authors":"Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, M. Koho, J. Tuominen, Matti La Mela, E. Hyvönen","doi":"10.4230/OASIcs.LDK.2021.8","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2021.8","url":null,"abstract":"This paper presents a knowledge graph created by transforming the plenary debates of the Parliament of Finland (1907–) into Linked Open Data (LOD). The data, totaling over 900 000 speeches, with automatically created semantic annotations and rich ontology-based metadata, are published in a Linked Open Data Service and are used via a SPARQL API and as data dumps. The speech data is part of the larger LOD publication FinnParla, which also includes prosopographical data about the politicians. The data is being used for studying parliamentary language and culture in Digital Humanities in several universities. To serve a wider variety of users, the entirety of this data was also produced using Parla-CLARIN markup. We present the first publication of all Finnish parliamentary debates as data. Technical novelties in our approach include the use of both Parla-CLARIN and an RDF schema developed for representing the speeches, integration of the data into a new Parliament of Finland Ontology for deeper data analyses, and enrichment of the data with a variety of external national and international data sources.","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121751279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards the Detection and Formal Representation of Semantic Shifts in Inflectional Morphology","authors":"Dagmar Gromann, Thierry Declerck","doi":"10.4230/OASIcs.LDK.2019.21","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2019.21","url":null,"abstract":"Semantic shifts caused by derivational morphemes are a common subject of investigation in language modeling, while inflectional morphemes are frequently portrayed as semantically more stable. This study is motivated by the previously established observation that inflectional morphemes can be just as variable as derivational ones. For instance, the English plural “-s” can turn the fabric silk into the garments of a jockey, silks. While humans know that silk in this sense has no plural, it takes more for machines to arrive at this conclusion. Frequently utilized computational language resources, such as WordNet, or models for representing computational lexicons, like OntoLex-Lemon, have no descriptive mechanism to represent such inflectional semantic shifts. To investigate this phenomenon, we extract word pairs of different grammatical number from WordNet that feature additional senses in the plural and evaluate their distribution in vector space, i.e., pre-trained word2vec and fastText embeddings. We then propose an extension of OntoLex-Lemon to accommodate this phenomenon, which we call inflectional morpho-semantic variation, providing a formal representation accessible to algorithms, neural networks, and agents. While the exact scope of the problem is yet to be determined, this first dataset shows that it is not negligible. 2012 ACM Subject Classification Information systems","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123768701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-Based Annotation Engineering: Towards a Gold Corpus for Role and Reference Grammar","authors":"C. Chiarcos, Christian Fäth","doi":"10.4230/OASIcs.LDK.2019.9","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2019.9","url":null,"abstract":"This paper describes the application of annotation engineering techniques for the construction of a corpus for Role and Reference Grammar (RRG). RRG is a semantics-oriented formalism for natural language syntax popular in comparative linguistics and linguistic typology, and predominantly applied for the description of non-European languages which are less-resourced in terms of natural language processing. Because of its crosslinguistic applicability and its conjoint treatment of syntax and semantics, RRG also represents a promising framework for research challenges within natural language processing. At the moment, however, these have not been explored as no RRG corpus data is publicly available. While RRG annotations cannot be easily derived from any single treebank in existence, we suggest that they can be reliably inferred from the intersection of syntactic and semantic annotations as represented by, for example, the Universal Dependencies (UD) and PropBank (PB), and we demonstrate this for the English Web Treebank, a 250,000 token corpus of various genres of English internet text. The resulting corpus is a gold corpus for future experiments in natural language processing in the sense that it is built on existing annotations which have been created manually. A technical challenge in this context is to align UD and PB annotations, to integrate them in a coherent manner, and to distribute and to combine their information on RRG constituent and operator projections. For this purpose, we describe a framework for flexible and scalable annotation engineering based on flexible, unconstrained graph transformations of sentence graphs by means of SPARQL Update. 2012 ACM Subject Classification Computing methodologies → Language resources; Information systems → Semantic web description languages; Computing methodologies → Natural language processing; Computing methodologies → Lexical semantics","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"307 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134299472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Learning Terminological Concept Systems from Multilingual Natural Language Text","authors":"Lennart Wachowiak, Christian Lang, B. Heinisch, Dagmar Gromann","doi":"10.4230/OASIcs.LDK.2021.22","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2021.22","url":null,"abstract":"Terminological Concept Systems (TCS) provide a means of organizing, structuring and representing domain-specific multilingual information and are important to ensure terminological consistency in many tasks, such as translation and cross-border communication. While several approaches to (semi-)automatic term extraction exist, learning their interrelations is vastly underexplored. We propose an automated method to extract terms and relations across natural languages and specialized domains. To this end, we adapt pretrained multilingual neural language models, which we evaluate on term extraction standard datasets with best performing results and a combination of relation extraction standard datasets with competitive results. Code and dataset are publicly available. 2012 ACM Subject Classification Computing methodologies → Information extraction; Computing methodologies → Neural networks; Computing methodologies → Language resources","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132422516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Functional Representation of Technical Artefacts in Ontology-Terminology Models","authors":"L. Giacomini","doi":"10.4230/OASIcs.LDK.2019.5","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2019.5","url":null,"abstract":"The ontological coverage of technical artefacts in terminography should take into account a functional representation of conceptual information. We present a model for a function-based description which enables direct interfacing of ontological properties and terminology, and which was developed in the context of a project on term variation in technical texts. Starting from related research in the field of knowledge engineering, we introduce the components of the ontological function macrocategory and discuss the implementation of the model in lemon. 2012 ACM Subject Classification Information systems → Ontologies","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115358760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Universal Dependencies for Multilingual Open Information Extraction","authors":"Massinissa Atmani, Mathieu Lafourcade","doi":"10.4230/OASIcs.LDK.2021.24","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2021.24","url":null,"abstract":"In this paper, we present our approach for Multilingual Open Information Extraction. Our sequence-labeling-based approach builds only on Universal Dependency representation to capture OpenIE’s regularities and to perform Cross-lingual Multilingual OpenIE. We propose a new two-stage pipeline model for sequence labeling that first identifies all the arguments of the relation and only then classifies them according to their most likely label. This paper also introduces a new benchmark evaluation for French. Experimental evaluation shows that our approach achieves the best results on the available benchmarks (English, French, Spanish and Portuguese).","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121784863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can Computational Meta-Documentary Linguistics Provide for Accountability and Offer an Alternative to \"Reproducibility\" in Linguistics?","authors":"T. Weber","doi":"10.4230/OASIcs.LDK.2019.26","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2019.26","url":null,"abstract":"As an answer to the need for accountability in linguistics, computational methodology and big data approaches offer an interesting perspective to the field of meta-documentary linguistics. The focus of this paper lies on the scientific process of citing published data and the insights this gives into the workings of a discipline. The proposed methodology is intended to help bring out the narratives of linguistic research within the literature. This can be seen as an alternative, philological approach to documentary linguistics. 2012 ACM Subject Classification Applied","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121980006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Computational Simulation of Children's Language Acquisition (Crazy New Idea)","authors":"Ben Ambridge","doi":"10.4230/OASIcs.LDK.2021.4","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2021.4","url":null,"abstract":"Many modern NLP models are already close to simulating children’s language acquisition; the main thing they currently lack is a \"real world\" representation of semantics that allows them to map from form to meaning and vice-versa. The aim of this \"Crazy Idea\" is to spark a discussion about how we might get there.","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121893528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowd-Sourcing A High-Quality Dataset for Metaphor Identification in Tweets","authors":"Omnia Zayed, John P. McCrae, P. Buitelaar","doi":"10.4230/OASIcs.LDK.2019.10","DOIUrl":"https://doi.org/10.4230/OASIcs.LDK.2019.10","url":null,"abstract":"Metaphor is one of the most important elements of human communication, especially in informal settings such as social media. A number of datasets have been created for metaphor identification; however, this task has proven difficult due to the nebulous nature of metaphoricity. In this paper, we present a crowd-sourcing approach for the creation of a dataset for metaphor identification that is able to rapidly achieve large coverage over the different usages of metaphor in a given corpus while maintaining high accuracy. We validate this methodology by creating a set of 2,500 manually annotated tweets in English, for which we achieve inter-annotator agreement scores over 0.8, which is higher than other reported results that did not limit the task. This methodology is based on the use of an existing classifier for metaphor in order to assist in the identification and the selection of the examples for annotation, in a way that reduces the cognitive load for annotators and enables quick and accurate annotation. We selected a corpus of both general language tweets and political tweets relating to Brexit and we compare the resulting corpus on these two domains. As a result of this work, we have published the first dataset of tweets annotated for metaphors, which we believe will be invaluable for the development, training and evaluation of approaches for metaphor identification in tweets.","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127134269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}