UOBIT @ TAG-it: Exploring a Multi-faceted Representation for Profiling Age, Topic and Gender in Italian Texts

EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020. DOI: 10.4000/BOOKS.AACCADEMIA.7285

Roberto Labadie Tamayo, Daniel C. Castro, Reynier Ortega Bueno


Citations: 1

Abstract


UOBIT @ TAG-it: Exploring a Multi-faceted Representation for Profiling Age, Topic and Gender in Italian Texts

This paper describes our system for the TAG-it Author Profiling task at EVALITA 2020. The task aims to predict the age and gender of blog users from their posts, as well as the topic they wrote about. Our proposal combines representations learned by RNNs at the word and sentence levels, Transformer neural networks, and hand-crafted stylistic features. All these representations are combined and fed into a fully connected layer of a feed-forward neural network to make predictions for the addressed subtasks. Experimental results show that our model achieves encouraging performance.

The growing integration of social media into people’s daily lives has made this medium a common environment for deploying technologies that retrieve information useful for business activities, social outreach processes, forensic tasks, etc. This is because people frequently upload and share content on these media for various purposes, such as sharing points of view on some topic or promoting a personal business. The analysis of textual information from such data is one of the main reasons this line of research has become a trend in the Natural Language Processing (NLP) field. However, this information varies greatly in format, even when it comes from the same person, and textual sequences are unstructured, which makes analyzing it automatically challenging.

The Author Profiling (AP) task aims at discovering marks or patterns (linguistic or not) in texts that allow a user to be characterized in terms of their age, gender, personality or any other demographic attribute. Because of the applicability of AP, many forums organize shared tasks aimed at mining features that, in general, predict this valuable information. Those tasks commonly focus on popular languages such as English and Spanish. Nevertheless, other languages are explored in important forums too. That is the case of EVALITA¹, which promotes the analysis of NLP tasks in the Italian language. Among the challenges of its previous campaign, EVALITA 2018, was the AP task GxG (Dell’Orletta and Nissim, 2018), which explored gender prediction.

The analysis of age, gender and the topic a text relates to are well-explored tasks, and most approaches employ data representations based on stylistic features, n-gram representations and/or word embeddings combined with Machine Learning (ML) methods such as Support Vector Machines (SVM) and Random Forests (Pizarro, 2019). Some authors have also obtained encouraging performance using Deep Learning (DL) models such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks combined with stylistic features (Aragón and López-Monroy, 2018; Bayot and Gonçalves, 2018).

In this work we address the automatic detection of the gender and age of authors, as well as the identification of the prevailing topic in textual information from blogs. We also describe the model we developed to participate in the TAG-it: Topic, Age and Gender prediction for Italian² task (Cimino A., 2020) at EVALITA 2020 (Basile et al., 2020). Taking into account the proven ability of DL ...

¹ http://www.evalita.it/
² https://sites.google.com/view/

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
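The abstract says that RNN, Transformer and stylistic representations are mixed and fed into a fully connected layer. A minimal sketch of that idea follows, under loud assumptions: the mixing is taken to be plain concatenation, the four stylistic features are hypothetical, the two learned encodings are random stand-in vectors, and the layer weights are untrained. None of these details come from the paper itself.

```python
import math
import random
import string


def stylistic_features(text):
    """Hypothetical hand-crafted style features (not the paper's actual set)."""
    tokens = text.split()
    n_chars = max(len(text), 1)
    n_tokens = max(len(tokens), 1)
    return [
        sum(len(t) for t in tokens) / n_tokens,                # mean token length
        sum(c in string.punctuation for c in text) / n_chars,  # punctuation ratio
        sum(c.isupper() for c in text) / n_chars,              # uppercase ratio
        len({t.lower() for t in tokens}) / n_tokens,           # type-token ratio
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(representations, weights, bias):
    """Concatenate all representations, then apply one fully connected layer and softmax."""
    fused = [x for rep in representations for x in rep]
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


rng = random.Random(0)
post = "Ciao a tutti! Oggi parliamo di calcio e di Serie A."

# Assumption: random vectors stand in for the learned encodings.
rnn_vec = [rng.uniform(-1, 1) for _ in range(8)]          # stand-in for an RNN encoding
transformer_vec = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for a Transformer encoding
style_vec = stylistic_features(post)

n_classes = 2  # assumption: a two-class subtask, chosen only for illustration
dim = len(rnn_vec) + len(transformer_vec) + len(style_vec)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_classes)]  # untrained
bias = [0.0] * n_classes

probs = fuse_and_classify([rnn_vec, transformer_vec, style_vec], weights, bias)
print(probs)  # one probability per class
```

In a real system the stand-in vectors would be replaced by trained encoder outputs and the weights learned jointly; the sketch only shows how heterogeneous representations can share a single classification layer.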
