Pierpaolo Basile, A. Caputo, Tommaso Caselli, Pierluigi Cassotti, Rossella Varvara
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020. DOI: 10.4000/books.aaccademia.8245 (https://doi.org/10.4000/books.aaccademia.8245)
Citations: 5
A Diachronic Italian Corpus based on "L'Unità"
In this paper, we describe the creation of a diachronic corpus for Italian that exploits the digital archive of the newspaper “L’Unità”. We automatically clean the corpus and annotate it with PoS tags, lemmas, named entities and syntactic dependencies. Moreover, we compute frequency-based time series for tokens, lemmas and entities. We report corpus statistics that take the temporal dimension into account and describe some examples of how the time series can be used.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Motivation and Background

Diachronic linguistics is one of the two major temporal dimensions of language study proposed by de Saussure in his Cours de linguistique générale and has a long tradition in Linguistics. Recently, the increasing availability of diachronic corpora, together with the development of new NLP techniques for representing word meanings, has boosted the application of computational models to historical language data (Hamilton et al., 2016; Tahmasebi et al., 2018; Tang, 2018). This culminated in the SemEval-2020 task on Unsupervised Lexical Semantic Change Detection (Schlechtweg et al., 2020), the first attempt to systematically evaluate automatic methods for language change detection.

Italian is a Romance language which has undergone many changes in its history. Its official adoption as a national language occurred only after the Unification of Italy (1861), having previously been a literary language. Diachronic corpora of Italian are currently available and accessible to the public (e.g., DiaCORIS and MIDIA). Unfortunately, restrictions on the access to and distribution of these resources limit their use, which hinders the application of more recent NLP methods to the diachronic dimension. To overcome this limit, we collect and make freely available (https://github.com/swapUniba/unita/) a new corpus based on the newspaper “L’Unità”.

Founded by Antonio Gramsci on February 12th, 1924, “L’Unità” was the official newspaper of the Italian Communist Party (Partito Comunista Italiano, PCI henceforth). The newspaper had a troubled history: after the dissolution of the PCI in 1991, it continued as the official newspaper of the new Democratic Party of the Left (PDS/DS) until July 31st, 2014. After that date, it ceased publication until June 30th, 2015, and it was definitively closed on June 3rd, 2017. Since 2017, the historical archive of “L’Unità” has again been visible and available on the Web (https://archivio.unita.news/). One of the main issues with this resource is the lack of information about who owns the rights to the original archive. To our knowledge, the online version of the archive was legally obtained by downloading the original archive before the closure of the newspaper. The archive currently available online does not contain the local editions of the newspaper or the photographic archive.

The main contribution of this work lies in the resource itself and its accessibility to the research community at large. The corpus is distributed in two formats: raw text and pre-processed. The validity of the corpus for the automatic study of language change is currently being tested as part of the DIACR-Ita task (https://diacr-ita.github.io/DIACR-Ita/) at EVALITA 2020. In addition, we illustrate some further potential applications of the corpus.

2 Italian diachronic corpora

Various Italian diachronic corpora are currently available and accessible to the public. DiaCORIS (http://corpora.dslo.unibo.it/DiaCORIS/; Onelli et al., 2006) comprises written Italian texts produced between 1861 and 1945, for a total of 100 million words, while MIDIA (www.corpusmidia.unito.it; Gaeta et al., 2013) covers written documents in Italian from the beginning of the XIII century to the first half of the XX century, for a total of 7.5 million words over 800 texts belonging to different genres. The Corpus OVI dell’Italiano antico (http://gattoweb.ovi.cnr.it) consists of 1,948 texts from the XII to the XIV centuries, for a total of 536,000 words. The LIZ database (https://www.zanichelli.it/ricerca/prodotti/liz-4-0-letteratura-italiana-zanichelli) comprises 1,000 literary texts from the XIII to the XX century. Lastly, the Corpus of Alcide De Gasperi’s public documents (Tonelli et al., 2019) includes 1,762 documents (newspaper articles, propaganda documents, official letters and parliamentary speeches, for a total of 3,000,000 tokens) written by the Italian politician Alcide De Gasperi and published between 1901 and 1954.

These existing resources differ from each other and from the present corpus in several ways. First, they differ in the time span the texts come from. The OVI Corpus considers texts from the early stages of the Italian language, with a time span of three centuries. The MIDIA corpus and the LIZ database cover seven centuries, from the XIII to the first half of the XX century. DiaCORIS, the De Gasperi corpus and the L’Unità corpus contain texts from a shorter and more recent period of time. However, the time span considered in the L’Unità corpus is interesting for the study of the Italian language because of the deep changes that occurred in that period. Indeed, the second half of the XX century saw a wider spread and use of Italian among all social classes.

Second, these corpora differ in the genres represented. The DiaCORIS and MIDIA corpora have been designed as representative and balanced samples of written Italian (considering, among other genres, academic prose, fiction, press and legal texts). The OVI corpus and the LIZ database comprise only literary texts. The De Gasperi corpus is representative of political text from a single author. The L’Unità corpus is representative only of press language, but this restriction may be an advantage in the study of diachronic lexical change. Indeed, observed semantic changes cannot be attributed to attestations from different genres in different periods, and can instead be interpreted as true semantic shifts.

Lastly, even if most of the corpora can be queried online (with the exception of the LIZ database), only the De Gasperi corpus can be freely downloaded. This restriction affects the usability of these resources for the NLP community. With the L’Unità corpus we aim at releasing a new diachronic resource that is freely available and that can be used in the theoretical and computational study of language change.
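As an illustration of the frequency-based time series mentioned in the abstract, the following is a minimal sketch and not the pipeline used to build the corpus. It assumes a hypothetical input of (year, tokens) pairs and computes, for each token, its relative frequency per year. Both the input format and the choice of relative rather than raw frequencies are assumptions made here for the example; the function name `frequency_time_series` and the toy data are likewise invented.

```python
from collections import Counter, defaultdict


def frequency_time_series(docs):
    """Map each token to {year: relative frequency of the token in that year}.

    `docs` is an iterable of (year, tokens) pairs; this input shape is an
    assumption of the sketch, not the distribution format of the corpus.
    """
    counts_by_year = defaultdict(Counter)  # year -> token counts
    for year, tokens in docs:
        counts_by_year[year].update(tokens)

    series = defaultdict(dict)  # token -> {year: relative frequency}
    for year, counter in counts_by_year.items():
        total = sum(counter.values())  # all tokens observed in that year
        for token, n in counter.items():
            series[token][year] = n / total
    return dict(series)


# Toy data, invented purely for illustration.
docs = [
    (1948, ["partito", "lavoro", "partito"]),
    (1948, ["lavoro"]),
    (1990, ["partito", "europa", "europa", "europa"]),
]
series = frequency_time_series(docs)
print(series["partito"])  # {1948: 0.5, 1990: 0.25}
```

The same function could be applied to lemmas or named entities by passing those sequences in place of tokens. Normalising by the yearly total keeps years with different amounts of text comparable.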