Journal for Language Technology and Computational Linguistics: Latest Publications

Empirische Verortung konzeptioneller Nähe/Mündlichkeit inner- und außerhalb schriftsprachlicher Korpora (Empirically Locating Conceptual Proximity/Orality Inside and Outside Written-Language Corpora)
Journal for Language Technology and Computational Linguistics Pub Date : 2023-05-15 DOI: 10.21248/jlcl.36.2023.240
Sarah Broll, Romana Schneider
{"title":"Empirische Verortung konzeptioneller Nähe/Mündlichkeit inner- und außerhalb schriftsprachlicher Korpora","authors":"Sarah Broll, Romana Schneider","doi":"10.21248/jlcl.36.2023.240","DOIUrl":"https://doi.org/10.21248/jlcl.36.2023.240","url":null,"abstract":"Linguistische Studien arbeiten häufig mit einer Differenzierung zwischen gesprochener und geschriebener Sprache bzw. zwischen Kommunikation der Nähe und Distanz. Die Annahme eines Kontinuums zwischen diesen Polen bietet sich für eine Verortungunterschiedlichster Äußerungsformen an, inklusive unkonventioneller Textsorten wie etwa Popsongs. Wir konzipieren, implementieren und evaluieren ein automatisiertes Verfahren, das mithilfe unkorrelierter Entscheidungsbäume entsprechende Vorhersagenauf Textebene durchführt. Für die Identifizierung der Pole definieren wir einen Merkmalskatalog aus Sprachphänomenen, die als Markierer für Nähe/Mündlichkeit bzw. Distanz/Schriftlichkeit diskutiert werden, und wenden diesen auf prototypische Nähe-/Mündlichkeitstexte sowie prototypische Distanz-/Schrifttexte an. Basierend auf der sehr guten Klassifikationsgüte verorten wir anschließend eine Reihe weiterer Textsorten mithilfe der trainierten Klassifikatoren. Dabei erscheinen Popsongs als „mittige Textsorte“, die linguistisch motivierte Merkmale unterschiedlicher Kontinuumsstufen vereint. Weiterhin weisen wir nach, dass unsere Modelle mündlich kommunizierte, aber vorab oder nachträglich verschriftlichte Äußerungen wie Reden oder Interviews vollkommenanders verorten als prototypische Gesprächsdaten und decken Klassifikationsunterschiede für Social-Media-Varianten auf. Ziel ist dabei nicht eine systematisch-verbindliche Einordung im Kontinuum, sondern eine empirische Annäherung an die Frage, welchemaschinell vergleichsweise einfach bestimmbaren Merkmale („shallow features“) nachweisbar Einfluss auf die Verortung haben.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125231883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
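The procedure described above trains tree ensembles over a catalogue of shallow, easily computable features. The sketch below illustrates the general idea with scikit-learn's RandomForestClassifier; the marker lists, feature definitions, and toy texts are invented placeholders, not the catalogue or corpora used in the study.

```python
# Minimal sketch: continuum classification with a random forest over
# hand-defined "shallow" features. Marker sets and texts are illustrative only.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

NEAR_MARKERS = {"ich", "du", "mal", "halt", "ja", "ne"}          # assumed orality cues
DIST_MARKERS = {"somit", "folglich", "diesbezüglich", "gemäß"}   # assumed literacy cues

def shallow_features(text: str) -> np.ndarray:
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(tokens), 1)
    return np.array([
        sum(t in NEAR_MARKERS for t in tokens) / n,   # orality marker rate
        sum(t in DIST_MARKERS for t in tokens) / n,   # literacy marker rate
        np.mean([len(t) for t in tokens]),            # mean word length
        n / max(len(sentences), 1),                   # mean sentence length
    ])

# Toy prototypical texts: 1 = proximity/orality, 0 = distance/literacy.
texts = [
    "Ich sag mal, das ist ja halt echt so, ne?",
    "Du, ich glaub das geht schon, mal sehen.",
    "Gemäß der vorliegenden Untersuchung ist somit von einem Zusammenhang auszugehen.",
    "Folglich wird diesbezüglich eine weitere Analyse erforderlich sein.",
]
labels = [1, 1, 0, 0]

X = np.vstack([shallow_features(t) for t in texts])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Locate a new text on the continuum via the predicted class probability.
print(clf.predict_proba(shallow_features("Ich glaub ja, das passt schon.").reshape(1, -1)))
```

A random forest decorrelates its trees through bootstrapping and random feature subsampling, which is one common reading of the "uncorrelated decision trees" mentioned in the abstract.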
Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts
Journal for Language Technology and Computational Linguistics Pub Date : 2022-12-03 DOI: 10.21248/jlcl.35.2022.232
Tobias Englmeier, F. Fink, U. Springmann, K. Schulz
{"title":"Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts","authors":"Tobias Englmeier, F. Fink, U. Springmann, K. Schulz","doi":"10.21248/jlcl.35.2022.232","DOIUrl":"https://doi.org/10.21248/jlcl.35.2022.232","url":null,"abstract":"Systems for post-correction of OCR-results for historical texts are based on statistical correction models obtained by supervised learning. For training, suitable collections of ground truth materials are needed. In this paper we investigate the dependency of the power of automated OCR post-correction on the form of ground truth data and other training settings used for the computation of a post-correction model. The post-correction system A-PoCoTo considered here is based on a profiler service that computes a statistical profile for an OCR-ed input text. We also look in detail at the influence of the profiler resources and other settings selected for training and evaluation. As a practical result of several fine-tuning steps, a general post-correction model is achieved where experiments for a large and heterogeneous collection of OCR-ed historical texts show a consistent improvement of base OCR accuracy. The results presented are meant to provide insights for libraries that want to apply OCR post-correction to a larger spectrum of distinct OCR-ed historical printings and ask for \"representative\" results.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127752939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
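A-PoCoTo itself builds on a profiler service and supervised correction models, neither of which is reproduced here. As a loose illustration of the underlying idea only, the sketch below proposes and ranks correction candidates for suspicious OCR tokens using a word-frequency lexicon and string similarity; the lexicon and thresholds are toy assumptions.

```python
# Minimal sketch: lexicon-based OCR post-correction by candidate ranking.
# This is NOT the A-PoCoTo profiler; it only illustrates the general idea.
import difflib
from collections import Counter

# Assumed resource: a word-frequency list built from clean (historical) text.
lexicon_counts = Counter({
    "haus": 120, "hause": 80, "herr": 200, "herz": 90, "jahr": 150,
})
lexicon = list(lexicon_counts)

def correct_token(token: str, max_candidates: int = 3) -> str:
    t = token.lower()
    if t in lexicon_counts:       # token is known -> keep it
        return token
    candidates = difflib.get_close_matches(t, lexicon, n=max_candidates, cutoff=0.7)
    if not candidates:
        return token              # no plausible correction -> leave untouched
    # Rank by string similarity first, corpus frequency second.
    return max(
        candidates,
        key=lambda c: (difflib.SequenceMatcher(None, t, c).ratio(), lexicon_counts[c]),
    )

print([correct_token(w) for w in ["Hcrr", "Jahr", "hauf"]])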
Frame Detection in German Political Discourses: How Far Can We Go Without Large-Scale Manual Corpus Annotation?
Journal for Language Technology and Computational Linguistics Pub Date : 2022-07-01 DOI: 10.21248/jlcl.35.2022.227
Qi Yu, Anselm Fliethmann
{"title":"Frame Detection in German Political Discourses: How Far Can We Go Without Large-Scale Manual Corpus Annotation?","authors":"Qi Yu, Anselm Fliethmann","doi":"10.21248/jlcl.35.2022.227","DOIUrl":"https://doi.org/10.21248/jlcl.35.2022.227","url":null,"abstract":"Automated detection of frames in political discourses has gained increasing attention in natural language processing (NLP). Earlier studies in this area however focus heavily on frame detection in English using supervised machine learning approaches. Addressing the difficulty of the lack of annotated data for training and/or evaluating supervised models for low-resource languages, we investigate the potential of two NLP approaches that do not require large-scale manual corpus annotation from scratch: 1) LDA-based topic modelling, and 2) a combination of word2vec embeddings and handcrafted framing keywords based on a novel, expert-curated framing schema. We test these approaches using a novel corpus consisting of German-language news articles on the “Eu-ropean Refugee Crisis” between 2014-2018. We show that while topic modelling is insufficient in detecting frames in a dataset with highly homogeneous vocabulary, our second approach yields intriguing and more humanly interpretable results. This approach offers a promising opportunity to incorporate domain knowledge from political science and NLP techniques for bottom-up, explorative political text analyses.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124712135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
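The second approach described above scores documents against frames defined by handcrafted keywords in word2vec space. A minimal sketch of that idea follows, assuming gensim (version 4 or later) and invented placeholder frames, keywords, and sentences; the expert-curated framing schema and the news corpus from the paper are not reproduced.

```python
# Minimal sketch: keyword-anchored frame scoring with word2vec embeddings.
# Frame names and keyword lists are hypothetical placeholders.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["die", "regierung", "plant", "neue", "kosten", "für", "unterbringung"],
    ["freiwillige", "helfen", "geflüchteten", "familien", "mit", "kleidung"],
    ["polizei", "meldet", "vorfall", "an", "der", "grenze", "kosten", "steigen"],
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1, epochs=50)

frames = {  # hypothetical frames with hand-picked keywords
    "economy": ["kosten", "steigen"],
    "humanitarian": ["helfen", "familien", "kleidung"],
}

def centroid(words):
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0)

def frame_scores(doc_tokens):
    doc_vec = centroid(doc_tokens)
    scores = {}
    for frame, keywords in frames.items():
        frame_vec = centroid(keywords)
        cos = np.dot(doc_vec, frame_vec) / (np.linalg.norm(doc_vec) * np.linalg.norm(frame_vec))
        scores[frame] = float(cos)   # higher cosine = closer to that frame's keywords
    return scores

print(frame_scores(["freiwillige", "helfen", "familien"]))
```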
Share and Shout: Proto-Slogans in Online Political Communities
Journal for Language Technology and Computational Linguistics Pub Date : 2022-07-01 DOI: 10.21248/jlcl.35.2022.228
Irene Russo, G. Comandini, T. Caselli, V. Patti
{"title":"Share and Shout: Proto-Slogans in Online Political Communities","authors":"Irene Russo, G. Comandini, T. Caselli, V. Patti","doi":"10.21248/jlcl.35.2022.228","DOIUrl":"https://doi.org/10.21248/jlcl.35.2022.228","url":null,"abstract":"This paper proposes a methodology for investigating populism on social media by analyzing the emergence of proto-slogans, defined as nominal utterances (NUs) typical of a political community on social media. We extracted more than 700.000 comments from the public Facebook pages of two Italian populist parties’ leaders (Matteo Salvini and Luigi Di Maio) during the week preceding the 2019 European elections (i.e., from May 20 to May 26, 2019). These comments have been automatically clustered and manually annotated to find proto-slogans created by the parties’ supporters. Our manual annotation consists of four layers, namely: Nominal Utterances (NUs), a syntactic device widely used for slogans; Slogans for NUs with a slogan function; Top-down/Bottom-up , to recognize the slogans produced by the politicians and those produced by supporters; Proto-slogans , for NUs devoid of specific political content that nonetheless express partisanship and support for the leaders.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"284 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122962777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
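The abstract states that comments were automatically clustered before manual annotation, but does not name a specific method. A minimal sketch of one plausible clustering setup, assuming TF-IDF vectors and k-means; the comments below are invented placeholders, not data from the study.

```python
# Minimal sketch: grouping similar social-media comments so that recurring
# nominal utterances surface as clusters for manual annotation.
# TF-IDF + k-means is an assumption, not the method claimed by the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "prima il popolo",
    "prima il popolo sempre",
    "forza capitano",
    "forza capitano sempre con te",
    "basta tasse",
    "basta tasse e burocrazia",
]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(comments)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for cluster_id in range(3):
    members = [c for c, label in zip(comments, km.labels_) if label == cluster_id]
    print(cluster_id, members)   # each cluster gathers near-duplicate candidate slogans
```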
UNSC-NE: A Named Entity Extension to the UN Security Council Debates Corpus
Journal for Language Technology and Computational Linguistics Pub Date : 2022-07-01 DOI: 10.21248/jlcl.35.2022.229
Luis Glaser, R. Patz, Manfred Stede
{"title":"UNSC-NE: A Named Entity Extension to the UN Security Council Debates Corpus","authors":"Luis Glaser, R. Patz, Manfred Stede","doi":"10.21248/jlcl.35.2022.229","DOIUrl":"https://doi.org/10.21248/jlcl.35.2022.229","url":null,"abstract":"We present the Named Entity (NE) add-on to the previously published United Nations Security Council (UNSC) Debates corpus (Schoenfeld, Eckhard, Patz, Meegdenburg, & Pires, 2019). Starting from the argument that the annotated classes in Named Entity Recognition (NER) pipelines offer a tagset that is too limited for relevant research questions in political science, we employ Named Entity Linking (NEL), using DBpedia-spotlight to produce the UNSC-NE corpus add-on. The validity of the tagging and the potential for future research are then discussed in the context of UNSC debates on Women, Peace and Security (WPS).","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"703 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116119929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
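The add-on is built with named entity linking via DBpedia Spotlight. Below is a minimal sketch of annotating a sentence against the public DBpedia Spotlight REST endpoint; the endpoint URL, confidence threshold, and example sentence are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: named entity linking with the public DBpedia Spotlight API.
import requests

def spotlight_annotate(text: str, confidence: float = 0.5):
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # Each linked mention carries its surface form, DBpedia URI, and ontology types.
    return [
        {"surface": r["@surfaceForm"], "uri": r["@URI"], "types": r.get("@types", "")}
        for r in resp.json().get("Resources", [])
    ]

print(spotlight_annotate(
    "The Security Council discussed the situation in Afghanistan with UNHCR."
))
```

Linking to DBpedia URIs rather than coarse NER classes is what gives the extension its finer-grained, politically interpretable tagset.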
Explaining Offensive Language Detection
Journal for Language Technology and Computational Linguistics Pub Date : 2020-07-01 DOI: 10.21248/jlcl.34.2020.223
Julian Risch, Robert S. Ruff, Ralf Krestel
{"title":"Explaining Offensive Language Detection","authors":"Julian Risch, Robert S. Ruff, Ralf Krestel","doi":"10.21248/jlcl.34.2020.223","DOIUrl":"https://doi.org/10.21248/jlcl.34.2020.223","url":null,"abstract":"","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115760532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
COLD: Annotation scheme and evaluation data set for complex offensive language in English
Journal for Language Technology and Computational Linguistics Pub Date : 2020-07-01 DOI: 10.21248/jlcl.34.2020.222
Alexis Palmer, Christine Carr, Melissa Robinson, Jordan Sanders
{"title":"COLD: Annotation scheme and evaluation data set for complex offensive language in English","authors":"Alexis Palmer, Christine Carr, Melissa Robinson, Jordan Sanders","doi":"10.21248/jlcl.34.2020.222","DOIUrl":"https://doi.org/10.21248/jlcl.34.2020.222","url":null,"abstract":"This paper presents a new, extensible annotation scheme for offensive language data sets. The annotation scheme expands coverage beyond fairly straightforward cases of offensive language to address several cases of complex, implicit, and/or pragmatically-triggered offensive language. We apply the annotation scheme to create a new Complex Offensive Language Data Set for English ( COLD-EN ). The primary purpose of this data set is to diagnose how well systems for automatic detection of abusive language are able to classify three types of complex offensive language: reclaimed slurs, offensive utterances containing pejorative adjectival nominalizations (and no slur terms), and utterances conveying offense through linguistic distancing. COLD offers a straightforward framework for error analysis. Our vision is that researchers will use this data set to diagnose the strengths and weaknesses of their offensive language detection systems. In this paper, we diagnose some strengths and weaknesses of a top-performing offensive language detection system by: a) using it to classify COLD , and b) investigating its performance on the 10 fine-grained categories supported by our annotation scheme. We evaluate the system’s performance when trained on five different standard data sets for offensive language detection. Systems trained on different data sets have different strengths and weaknesses, with most performing poorly on the phenomena of reclaimed slurs and pejorative nominalizations. NOTE: This paper contains sensitive and offensive material. The offensive materials are part of a complex puzzle we wish to better understand; they appear in the form of lightly-censored slurs and degrading insults. We do not condone this type of language, nor does it reflect the attitudes or beliefs of the authors.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127327801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
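COLD is designed for diagnosing detector behaviour per annotated phenomenon rather than reporting a single aggregate score. A minimal sketch of that kind of per-category error analysis with pandas; the category names, gold labels, and predictions below are toy placeholders, not COLD data.

```python
# Minimal sketch: breaking a detector's accuracy down by annotated phenomenon.
import pandas as pd

df = pd.DataFrame({
    "category": ["reclaimed_slur", "reclaimed_slur", "pejorative_nominalization",
                 "pejorative_nominalization", "distancing", "distancing"],
    "gold":      [0, 0, 1, 1, 1, 0],   # 1 = offensive, 0 = not offensive
    "predicted": [1, 0, 1, 0, 0, 0],
})

df["correct"] = (df["gold"] == df["predicted"]).astype(int)
per_category = df.groupby("category")["correct"].mean()
print(per_category)   # accuracy per complex-offensive-language phenomenon
```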
Automating News Comment Moderation with Limited Resources: Benchmarking in Croatian and Estonian
Journal for Language Technology and Computational Linguistics Pub Date : 2020-07-01 DOI: 10.21248/jlcl.34.2020.224
Ravi Shekhar, M. Pranjic, S. Pollak, Andraz Pelicon, Matthew Purver
{"title":"Automating News Comment Moderation with Limited Resources: Benchmarking in Croatian and Estonian","authors":"Ravi Shekhar, M. Pranjic, S. Pollak, Andraz Pelicon, Matthew Purver","doi":"10.21248/jlcl.34.2020.224","DOIUrl":"https://doi.org/10.21248/jlcl.34.2020.224","url":null,"abstract":"This article describes initial work into the automatic classification of user-generated content in news media to support human moderators. We work with real-world data — comments posted by readers under online news articles — in two less-resourced European languages, Croatian and Estonian. We describe our dataset, and experiments into automatic classification using a range of models. Performance obtained is reasonable but not as good as might be expected given similar work in offensive language classification in other languages; we then investigate possible reasons in terms of the variability and reliability of the data and its annotation.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125684446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
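As one illustration of the kind of baseline benchmarked in such comment-moderation work, the sketch below combines character n-gram TF-IDF features with logistic regression; the choice of model is an assumption, and the example comments and labels are invented English stand-ins for the Croatian and Estonian data.

```python
# Minimal sketch: a baseline comment-moderation classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "Great article, thanks for reporting.",
    "This is complete garbage, you idiots.",
    "Interesting point about the new policy.",
    "Shut up, nobody cares about your lies.",
]
blocked = [0, 1, 0, 1]   # 1 = removed by moderators, 0 = kept

model = make_pipeline(
    # Character n-grams are a common choice for morphologically rich languages.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(comments, blocked)
print(model.predict(["Thanks, very informative.", "You are all idiots."]))
```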
Text Segmentation with Topic Models
Journal for Language Technology and Computational Linguistics Pub Date : 2012-07-01 DOI: 10.21248/jlcl.27.2012.158
Martin Riedl, Chris Biemann
{"title":"Text Segmentation with Topic Models","authors":"Martin Riedl, Chris Biemann","doi":"10.21248/jlcl.27.2012.158","DOIUrl":"https://doi.org/10.21248/jlcl.27.2012.158","url":null,"abstract":"This article presents a general method to use information retrieved from the Latent Dirichlet Allocation (LDA) topic model for Text Segmentation: Using topic assignments instead of words in two well-known Text Segmentation algorithms, namely TextTiling and C99, leads to significant improvements. Further, we introduce our own algorithm called TopicTiling, which is a simplified version of TextTiling (Hearst, 1997). In our study, we evaluate and optimize parameters of LDA and TopicTiling. A further contribution to improve the segmentation accuracy is obtained through stabilizing topic assignments by using information from all LDA inference iterations. Finally, we show that TopicTiling outperforms previous Text Segmentation algorithms on two widely used datasets, while being computationally less expensive than other algorithms.","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130632923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
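TopicTiling replaces words with LDA topic assignments when scoring coherence between adjacent text blocks. The sketch below illustrates a simplified variant that compares per-sentence topic distributions with cosine similarity and treats local minima as boundary candidates; it omits the repeated-inference stabilization described in the paper, and the toy corpus is invented.

```python
# Minimal sketch: topic-based segmentation in the spirit of TopicTiling,
# using per-sentence LDA topic distributions (a simplification of the
# per-word topic assignments used by the original algorithm).
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

sentences = [
    "the goalkeeper saved the penalty in the final match".split(),
    "the team won the league after a dramatic match".split(),
    "the central bank raised interest rates again".split(),
    "inflation and interest rates worry the markets".split(),
]

dictionary = Dictionary(sentences)
corpus = [dictionary.doc2bow(s) for s in sentences]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=200, random_state=0)

def topic_vector(bow):
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

vectors = [topic_vector(bow) for bow in corpus]
for i in range(len(vectors) - 1):
    a, b = vectors[i], vectors[i + 1]
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"similarity between sentence {i} and {i + 1}: {cos:.2f}")  # minima suggest boundaries
```

The original TextTiling/TopicTiling procedure additionally smooths the similarity curve and converts local minima into depth scores before placing boundaries.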
E-Learning and Computational Linguistics: An Introduction
Journal for Language Technology and Computational Linguistics Pub Date : 2011-07-01 DOI: 10.21248/jlcl.26.2011.132
Maja Bärenfänger, Maik Stührenberg
{"title":"E-Learning and Computational Linguistics An Introduction","authors":"Maja Bärenfänger, Maik Stührenberg","doi":"10.21248/jlcl.26.2011.132","DOIUrl":"https://doi.org/10.21248/jlcl.26.2011.132","url":null,"abstract":"","PeriodicalId":137584,"journal":{"name":"Journal for Language Technology and Computational Linguistics","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133758268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0