Slovenscina 2.0 Pub Date: 2023-09-12 DOI: 10.4312/slo2.0.2023.1.275-301
Gregor Donaj, Mirjam Sepesy Maučec
{"title":"Praktični vidiki uporabe podbesednih enot v strojnem prevajanju slovenščina-angleščina","authors":"Gregor Donaj, Mirjam Sepesy Maučec","doi":"10.4312/slo2.0.2023.1.275-301","DOIUrl":"https://doi.org/10.4312/slo2.0.2023.1.275-301","url":null,"abstract":"Most modern machine translation systems are based on neural network architectures. This holds for online machine translation providers, for research systems, and for tools that can assist professional translators in their practice. Although neural network systems can be run on the ordinary central processing units of personal computers and servers, operating at a reasonable speed requires graphics processing units. In this setting we are limited by vocabulary size, which reduces translation quality. The size of a word-level vocabulary is an especially pressing problem for highly inflected languages. We address it by using subword units, which give greater coverage of the language. In this paper we present different methods of segmenting words into subword units with vocabularies of different sizes and compare their use in a machine translation system for the Slovene-English language pair. The comparison also includes a translation system without word segmentation. We present results on translation quality measured with the BLEU metric, on model training speed and translation speed, and on model size. We add an overview of the practical aspects of using subword units in a machine translation system used together with computer-assisted translation tools.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135878152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2023-09-12 DOI: 10.4312/slo2.0.2023.1.161-188
Iztok Kosem, Jaka Čibej, Kaja Dobrovoljc, Taja Kuzman, Nikola Ljubešić
{"title":"Spremljevalni korpus Trendi in avtomatska kategorizacija","authors":"Iztok Kosem, Jaka Čibej, Kaja Dobrovoljc, Taja Kuzman, Nikola Ljubešić","doi":"10.4312/slo2.0.2023.1.161-188","DOIUrl":"https://doi.org/10.4312/slo2.0.2023.1.161-188","url":null,"abstract":"The paper presents the compilation of the Trendi corpus, the first monitor corpus of Slovene. The current version, Trendi 2023-02, covers texts from January 2019 to the end of February 2023 and already contains more than 700 million tokens, or more than 586 million words. The purpose of the corpus is to offer both the expert and the general public data on current language use and to make it possible to track the emergence of new words and the decline or growth in the use of existing ones. In addition to the content itself, we also present the methodology and principles behind the compilation of the corpus. The second part of the paper describes the development of an algorithm for the automatic categorization of texts from news portals, which was prepared for the needs of the Trendi corpus and of other corpora containing such texts. For the purposes of the algorithm, a set of 13 topic categories was created, which overlap to a large extent with international standards and with the categories in comparable corpora of other languages. On texts labeled with these categories, we trained several different language models and, with the most suitable one, achieved high reliability in assigning topics to texts.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135826733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2017-12-30 DOI: 10.4312/SLO2.0.2017.2.85-112
Matejka Grgič
{"title":"Teoretska izhodišča in metodološki okvir pri izdelavi uporabnikom prijaznega spletišča: primer platforme SMeJse – Slovenščina kot manjšinski jezik","authors":"Matejka Grgič","doi":"10.4312/SLO2.0.2017.2.85-112","DOIUrl":"https://doi.org/10.4312/SLO2.0.2017.2.85-112","url":null,"abstract":"This paper aims to present some theoretical and methodological issues related to the online portal SLOVENSCINA KOT MANJSINSKI JEZIK – SMeJse / SLOVENIAN AS A MINORITY LANGUAGE – SMiLe, where existing tools, materials and information for the development of linguistic skills and abilities in Slovenian are collected. The platform was established by SLORI – Slovenski raziskovalni institut / Slovenian research institute of Trieste, Italy, and the Dijaski dom S. Kosovela / Slovenian student’s center of Trieste, Italy. The purpose of the portal is to stimulate different usages of the current Slovenian language in the Slovenian-Italian contact area, particularly in Italy, with the aim of assuring high communication proficiency in all kinds and varieties of the Slovenian language (the so-called “equilingualism”), a balanced bilingualism and also the development of lects, still within the Slovenian linguistic continuum. Specific language policies are particularly successful for the development of linguistic skills which enable proficiency in the minority language, as well as equilingualism and balanced bilingualism among the speakers of the minority group. Such policies are based on the implementation of measures for an increased exposure to different language uses and on creating the need for language use in circles and situations where compensatory strategies are unsuitable. The portal is based on the newest linguistic, sociolinguistic and psycholinguistic studies concerning the Slovenian language in Italy, on the Slovenian-Italian language contact and on the acquisition of the minority language. An analysis of the status of the Slovenian language in Italy, its perception and its phenomena, as well as the overview of some language policies and methodological frames, has shown a gap between the existing tools and the needs of the community of speakers.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"5 1","pages":"85-112"},"PeriodicalIF":0.0,"publicationDate":"2017-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2016-09-27 DOI: 10.4312/SLO2.0.2016.2.189-219
T. Erjavec, Jaka Čibej, Darja Fišer
{"title":"Omogočanje dostopa do korpusov slovenskih spletnih besedil v luči pravnih omejitev","authors":"T. Erjavec, Jaka Čibej, Darja Fišer","doi":"10.4312/SLO2.0.2016.2.189-219","DOIUrl":"https://doi.org/10.4312/SLO2.0.2016.2.189-219","url":null,"abstract":"Web texts are becoming increasingly relevant sources of information, with web corpora useful for corpus linguistic studies and development of language technologies. Even though web texts are directly accessible, which substantially simplifies the collection procedure, compilation of web corpora is still complex, time-consuming and expensive. It is crucial that similar endeavours are not repeated, which is why it is necessary to make the created corpora easily and widely accessible both to researchers and a wider audience. While this is logistically and technically a straightforward procedure, legal constraints, such as copyright, privacy and terms of use, severely hinder the dissemination of web corpora. This paper discusses legal conditions and actual practice in this area, gives an overview of current practices and proposes a range of mitigation measures on the example of the Janes corpus of Slovene user-generated content in order to ensure free and open dissemination of Slovene web corpora.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"4 1","pages":"189-219"},"PeriodicalIF":0.0,"publicationDate":"2016-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2016-09-27 DOI: 10.4312/slo2.0.2016.2.1-37
Špela Arhar Holdt, K. Dobrovoljc
{"title":"The value of the Janes corpus for Slovenian language standardization","authors":"Špela Arhar Holdt, K. Dobrovoljc","doi":"10.4312/slo2.0.2016.2.1-37","DOIUrl":"https://doi.org/10.4312/slo2.0.2016.2.1-37","url":null,"abstract":"The main objective of this article is to assess the value of the Janes corpus for research in the field of language standardization. Unlike the existing reference corpora of written Slovenian, the newly available Janes corpus of user-generated content mostly consists of texts that have not been modified by a proofreading expert; it therefore offers a more realistic insight into the trends of language use, as well as the intuitiveness of existing language rules, within a wider language community. We illustrate this methodological potential in a case study of nominal phrases with nonagreeing premodifiers, such as solo petje and RTV prispevek, by comparing their usage in Janes and the reference Kres corpus. The results reveal: this type of phrase is used more often in Janes and includes a longer list of candidates than in Kres; both corpora include a large number of phrases with variant spelling as either one or two words, irrespective of the premodifier in question; and, somewhat surprisingly, Janes displays a more consistent language use, suggesting that prescriptive regulation actually increases the level of inconsistency in language use. The article, a revised and enhanced extension of a prior conference paper, concludes with a discussion on possible future approaches to this linguistic issue and advocates for inclusion of Janes into Slovenian language standardisation methodology.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"4 1","pages":"1-37"},"PeriodicalIF":0.0,"publicationDate":"2016-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2015-12-01 DOI: 10.4312/slo2.0.2015.2.67-94
Vesna Požgaj Hadži, Tatjana Balažic Bulc
{"title":"(Re)standardization in the Vice of National Identity: the Cases of Croatian, Serbian, Bosnian, and Montenegrin","authors":"Vesna Požgaj Hadži, Tatjana Balažic Bulc","doi":"10.4312/slo2.0.2015.2.67-94","DOIUrl":"https://doi.org/10.4312/slo2.0.2015.2.67-94","url":null,"abstract":"Among different functions of linguistic standardization, the unifying, separatist, and prestige functions play a special role. In this paper, we focus on the separatist function, which calls for a redefinition of the status of standard languages. In addition, politics plays an important role within this process. In such cases we are often dealing with restandardization or, in other words, the reshaping of an already standardized language, albeit on different terms. We present the results of such processes on the four successor languages of the former Serbo-Croatian, i.e. Croatian, Serbian, Bosnian, and Montenegrin. All underwent numerous (necessary as well as unnecessary) changes following the separation, especially in lexis and phonetics, the changes bearing significant symbolic meaning. The reasons for changes are thus external (new sociopolitical order) as well as internal (change in the relation towards the neighboring standard languages, increased interest in linguistic matters, partisanship of individual linguists within institutions, etc.), and in both cases, closely linked to political structures.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"3 1","pages":"67-94"},"PeriodicalIF":0.0,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2015-12-01 DOI: 10.4312/slo2.0.2015.1.59-61
Darja Fišer
{"title":"Internet Slovene Research Summer Camp for Secondary School Pupils","authors":"Darja Fišer","doi":"10.4312/slo2.0.2015.1.59-61","DOIUrl":"https://doi.org/10.4312/slo2.0.2015.1.59-61","url":null,"abstract":"","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"3 1","pages":"59-61"},"PeriodicalIF":0.0,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2014-12-01 DOI: 10.4312/SLO2.0.2014.1.41-61
N. Logar, P. Gantar, Iztok Kosem
{"title":"Collocations and examples of use: a lexical-semantic approach to terminology","authors":"N. Logar, P. Gantar, Iztok Kosem","doi":"10.4312/SLO2.0.2014.1.41-61","DOIUrl":"https://doi.org/10.4312/SLO2.0.2014.1.41-61","url":null,"abstract":"The paper describes the compilation of an online terminological database that also includes a lexical-semantic framework of terms in the form of collocations and examples of use. Both types of information were extracted from a specialised corpus automatically, using Word Sketch and GDEX functions in the Sketch Engine corpus tool. Each entry contains links to two corpora: the LSP corpus of the public relations field KoRP and the Gigafida corpus, a reference corpus of Slovene. Preliminary results of the survey conducted among the target users of the terminological database indicate that the information on the term’s typical collocations is very useful for fully understanding the term, its meaning and role in the context.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"2 1","pages":"41-61"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slovenscina 2.0 Pub Date: 2014-12-01 DOI: 10.4312/SLO2.0.2014.2.15-36
J. Simpson
{"title":"WHAT WOULD DR MURRAY HAVE MADE OF THE OED ONLINE TODAY","authors":"J. Simpson","doi":"10.4312/SLO2.0.2014.2.15-36","DOIUrl":"https://doi.org/10.4312/SLO2.0.2014.2.15-36","url":null,"abstract":"During the final years of the twentieth century the text of the Oxford English Dictionary (OED) was transformed from a print resource to a digital one. Surprisingly, the way in which data was structured in the print version lent itself fairly easily to this transformation. This paper looks briefly at the publishing history of the OED, and then at continuity and change in editorial policy across the two media, and finally at new options (such as data visualisation through graphs, charts, and animations, as well as linking through to other sources) that are opened to users of the dictionary as a result of its availability as a digital resource. The paper concludes that although Dr Murray, the dictionary’s original editor, would have been pleased by the way his text has migrated from the print to the digital medium, the real significance of the development is that the modern user can now begin to analyse language change, and not just the history of individual words, through the functionality of the OED Online web site.","PeriodicalId":36888,"journal":{"name":"Slovenscina 2.0","volume":"2 1","pages":"15-36"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70585701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}