Lasso-based variable selection methods in text regression: the case of short texts

IF 1.4 | Zone 4, Mathematics | JCR Q2, Statistics & Probability
Marzia Freo, Alessandra Luati
DOI: 10.1007/s10182-023-00472-0
Published: 2023-03-20 (journal article)
Full text: https://link.springer.com/article/10.1007/s10182-023-00472-0
Citations: 0

Abstract

Communication through websites is often characterised by short texts, made of few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences, when they are used as explanatory variables in regression models. To this purpose, we compare different variable selection procedures when text regression models are fitted to real, short, text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second regards open questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts, as, in one case, a concise descriptive text is considered, whereas, in the other case, the text expresses an opinion.
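
To make the idea of lasso selection with inclusion frequencies concrete, here is a minimal sketch in plain Python. It is not the authors' procedure: the item descriptions, the prices, the penalty `lam=1.0` and the helper names (`doc_term_matrix`, `lasso_cd`, `inclusion_frequency`) are all invented for illustration, and the bootstrap loop only mimics the spirit of counting how often a word is selected across resamples.

```python
import random

def doc_term_matrix(texts):
    """Binary bag-of-words matrix: one column per distinct lower-cased word."""
    tokenised = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*tokenised))
    rows = [[1.0 if w in toks else 0.0 for w in vocab] for toks in tokenised]
    return vocab, rows

def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=500, tol=1e-10):
    """Minimise (1/(2n)) * sum((y - b0 - X b)^2) + lam * sum(|b|) by coordinate descent."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]
    for _ in range(n_iter):
        max_change = 0.0
        for j, col in enumerate(cols):
            norm = sum(c * c for c in col) / n
            if norm == 0.0:  # word absent from this sample: coefficient stays 0
                continue
            # scaled inner product of column j with the residual,
            # after adding feature j's own contribution back in
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        shift = sum(resid) / n  # unpenalised intercept update
        b0 += shift
        resid = [r - shift for r in resid]
        max_change = max(max_change, abs(shift))
        if max_change < tol:
            break
    return b0, beta

def inclusion_frequency(X, y, lam, n_boot=50, seed=0):
    """Share of bootstrap resamples in which each word gets a non-zero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        _, beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]

# Invented toy data: short item descriptions and their prices.
texts = ["leather bag", "leather wallet", "bag", "wallet", "vintage leather bag",
         "vintage wallet", "leather belt", "belt", "vintage belt",
         "leather vintage wallet"]
prices = [60.0, 58.0, 20.0, 22.0, 62.0, 21.0, 59.0, 19.0, 20.0, 61.0]

vocab, X = doc_term_matrix(texts)
b0, beta = lasso_cd(X, prices, lam=1.0)
selected = [w for w, b in zip(vocab, beta) if b != 0.0]
freq = inclusion_frequency(X, prices, lam=1.0)

print("selected words:", selected)  # on this toy data only 'leather' survives the penalty
for word, f in zip(vocab, freq):
    print(f"{word:8s} inclusion frequency {f:.2f}")
```

On real short-text data the vocabulary is far larger than in this toy example, so a tested library implementation and a data-driven choice of the penalty (for instance by cross-validation) would be the sensible starting point; the sketch only shows how the soft-thresholding step sets most word coefficients exactly to zero and how resampling turns that into a per-word selection frequency.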

Source journal
Asta-Advances in Statistical Analysis (Mathematics: Statistics & Probability)
CiteScore: 2.20
Self-citation rate: 14.30%
Articles published: 39
Review time: >12 weeks
About the journal: AStA - Advances in Statistical Analysis, a journal of the German Statistical Society, is published quarterly and presents original contributions on statistical methods and applications and review articles.