Domenico Cantone, S. Cristofaro, S. Faro, Emanuele Giaquinta
Finite State Models for the Generation of Large Corpora of Natural Language Texts
Finite-State Methods and Natural Language Processing, 2009-07-11
DOI: 10.3233/978-1-58603-975-2-175 (https://doi.org/10.3233/978-1-58603-975-2-175)
Citations: 3
Abstract
Natural language text is probably one of the most common types of input for text processing algorithms. It is therefore often desirable to have a large training/testing set of such input, especially when dealing with algorithms tuned for natural language texts. In many cases, the lack of a large corpus of natural language texts can be addressed by simply concatenating a set of collected texts, even ones with heterogeneous contexts and by different authors.
In this note we present a preliminary study of a finite-state model for text generation that preserves statistical and structural characteristics of natural language texts, namely Zipf's law and the inverse-rank power law, thus providing a very good approximation for testing purposes.
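As a rough illustration of the general idea, and not the model studied in the note (whose details the abstract does not give), the sketch below builds a first-order Markov chain over words, which is one simple kind of finite-state text generator, and then tabulates the rank-frequency distribution of its output. Zipf's law says that a word's frequency is roughly proportional to 1/r^s, where r is its frequency rank and s is close to 1. The names `build_model`, `generate`, and `rank_frequency`, and the toy sample text, are assumptions made for this example only.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each word (a state) to the list of words observed right after it."""
    transitions = defaultdict(list)
    for current, following in zip(words, words[1:]):
        transitions[current].append(following)
    return dict(transitions)


def generate(transitions, start, length, seed=0):
    """Walk the chain for `length` words, restarting at `start` on dead ends."""
    rng = random.Random(seed)
    state = start
    output = [state]
    while len(output) < length:
        successors = transitions.get(state)
        # A word seen only at the very end of the sample has no successors.
        state = rng.choice(successors) if successors else start
        output.append(state)
    return output


def rank_frequency(words):
    """Return (rank, word, count) triples, most frequent word first."""
    ranked = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Hypothetical toy sample; a real experiment would train on collected texts.
sample = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog").split()
model = build_model(sample)
text = generate(model, start="the", length=1000)

# Inspect the head of the rank-frequency table of the generated text.
for rank, word, count in rank_frequency(text)[:5]:
    print(rank, word, count)
```

A vocabulary this small cannot show a meaningful power law. With a realistic training corpus, plotting the logarithm of the count against the logarithm of the rank would be the usual way to check whether the generated text stays close to a straight line, as Zipf's law predicts.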