ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts
Simon Gonzalez
arXiv - CS - Digital Libraries, 2024-07-22
https://doi.org/arxiv-2407.15374
Abstract
Social media platforms offer invaluable opportunities for linguistic
research. The availability of up-to-date data from any part of the world,
produced in natural contexts, has allowed researchers to study language in
real time. Corpus Linguistics is one of the fields that has made great use of
social media platforms, and a wide range of projects have successfully
created corpora from social media.
In this paper, we present the development and deployment of a linguistic corpus
built from English-language Twitter posts by 26 news agencies and 27 individuals.
The main goal was to create a fully annotated English corpus for linguistic
analysis. We include information on morphology and syntax, as well as NLP
features such as tokenization, lemmas, and n-grams. The information is
presented through a range of powerful visualisations for users to explore
linguistic patterns in the corpus. With this tool, we aim to contribute to the
area of language technologies applied to linguistic research.
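
As a rough illustration of two of the NLP features named above, tokenization and n-grams, the following minimal Python sketch may help. It is not the pipeline used for the corpus, which the abstract does not describe; the `tokenize` and `ngrams` helpers, the regular expression, and the two sample posts are all assumptions made up for this example. Lemmatization is left out because it would require a lexicon or an external library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into word tokens.

    Hypothetical rule for this sketch: @mentions and #hashtags stay
    attached to their prefix, and punctuation is dropped.
    """
    return re.findall(r"[@#]?\w+(?:'\w+)?", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample posts, not taken from the corpus.
posts = [
    "Breaking news: markets rally as rates hold.",
    "Markets rally again, says @analyst #economy",
]

# Count bigrams across all posts.
bigram_counts = Counter()
for post in posts:
    bigram_counts.update(ngrams(tokenize(post), 2))

print(bigram_counts.most_common(1))
```

Counting n-grams per source (for instance, news agencies versus individuals) would follow the same pattern with one `Counter` per group.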