EPJ Data Science, Pub Date: 2024-06-07, DOI: 10.1140/epjds/s13688-024-00481-2
Marco Bronzini, Carlo Nicolini, Bruno Lepri, Andrea Passerini, Jacopo Staiano
{"title":"Glitter or gold? Deriving structured insights from sustainability reports via large language models","authors":"Marco Bronzini, Carlo Nicolini, Bruno Lepri, Andrea Passerini, Jacopo Staiano","doi":"10.1140/epjds/s13688-024-00481-2","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00481-2","url":null,"abstract":"<p>Over the last decade, several regulatory bodies have started requiring the disclosure of non-financial information from publicly listed companies, in light of the investors’ increasing attention to Environmental, Social, and Governance (ESG) issues. Publicly released information on sustainability practices is often disclosed in diverse, unstructured, and multi-modal documentation. This poses a challenge in efficiently gathering and aligning the data into a unified framework to derive insights related to Corporate Social Responsibility (CSR). Thus, using Information Extraction (IE) methods becomes an intuitive choice for delivering insightful and actionable data to stakeholders. In this study, we employ Large Language Models (LLMs), In-Context Learning, and the Retrieval-Augmented Generation (RAG) paradigm to extract structured insights related to ESG aspects from companies’ sustainability reports. We then leverage graph-based representations to conduct statistical analyses concerning the extracted insights. These analyses revealed that ESG criteria cover a wide range of topics, exceeding 500, often beyond those considered in existing categorizations, and are addressed by companies through a variety of initiatives. Moreover, disclosure similarities emerged among companies from the same region or sector, validating ongoing hypotheses in the ESG literature. 
Lastly, by incorporating additional company attributes into our analyses, we investigated which factors impact the most on companies’ ESG ratings, showing that ESG disclosure affects the obtained ratings more than other financial or company data.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"64 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-06-05, DOI: 10.1140/epjds/s13688-024-00480-3
Pau Muñoz, Alejandro Bellogín, Raúl Barba-Rojas, Fernando Díez
{"title":"Quantifying polarization in online political discourse","authors":"Pau Muñoz, Alejandro Bellogín, Raúl Barba-Rojas, Fernando Díez","doi":"10.1140/epjds/s13688-024-00480-3","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00480-3","url":null,"abstract":"<p>In an era of increasing political polarization, its analysis becomes crucial for the understanding of democratic dynamics. This paper presents a comprehensive study of how to measure political polarization on X (Twitter) during election cycles in Spain, from 2011 to 2019. A broad comparative analysis is performed of algorithms used to identify and measure polarization or controversy on microblogging platforms. This analysis is specifically tailored towards publications made by official political party accounts during pre-campaign, campaign, election day, and the week after the election. Guided by the findings of this comparative evaluation, we propose a novel algorithm better suited to capture polarization in the context of political events, which is validated with real data. As a consequence, our research contributes a significant advancement in the fields of political science, social network analysis, and computational social science overall, by providing a realistic method to capture polarization from online political discourse.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"69 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-05-17, DOI: 10.1140/epjds/s13688-024-00476-z
Oleg Sobchuk, Mason Youngblood, Olivier Morin
{"title":"First-mover advantage in music","authors":"Oleg Sobchuk, Mason Youngblood, Olivier Morin","doi":"10.1140/epjds/s13688-024-00476-z","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00476-z","url":null,"abstract":"<p>Why do some songs and musicians become successful while others do not? We show that one of the reasons may be the “first-mover advantage”: artists that stand at the foundation of new music genres tend to be more successful than those who join these genres later on. To test this hypothesis, we have analyzed a massive dataset of over 920,000 songs, including 110 music genres: 10 chosen intentionally and preregistered, and 100 chosen randomly. For this, we collected the data from two music services: Spotify, which provides detailed information about songs’ success (the precise number of times each song was listened to), and Every Noise at Once, which provides detailed genre tags for musicians. 91 genres, out of 110, show the first-mover advantage—clearly suggesting that it is an important mechanism in music success and evolution.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"14 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141064173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-05-08, DOI: 10.1140/epjds/s13688-024-00473-2
Amir Mehrjoo, Rubén Cuevas, Ángel Cuevas
{"title":"Online advertisement in a pink-colored market","authors":"Amir Mehrjoo, Rubén Cuevas, Ángel Cuevas","doi":"10.1140/epjds/s13688-024-00473-2","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00473-2","url":null,"abstract":"<p>It is surprising that women are often charged more for products and services marketed explicitly to them. This phenomenon, known as the pink tax, is a major issue that questions women’s buying power. Nevertheless, it is not just limited to physical products – even online advertising can be subject to this type of gender-price discrimination. That is where our research comes in. We have developed a new methodology to measure what we call the digital marketing pink tax – the additional expense of delivering advertisements to female audiences. Analyzing data from Facebook advertising platforms across 187 countries and 40 territories shows this issue is systematic. Particularly, the digital marketing pink tax is prevalent in 79% of audiences across the world and 98% of audiences in highly developed countries. Therefore, advertisers incur a median cost of 30% more to display advertisements to women than men. In contrast, advertisers have to pay less digital marketing pink tax in less-developed countries (5%). Our research indicates that countries in the Middle East and Africa with a low Human Development Index (<i>HDI</i>) do not experience this phenomenon. Our comprehensive investigation of 24 industries reveals that advertisers must pay up to 64% of the digital marketing pink tax to target women in some industries. Our findings also suggest a connection between the digital marketing pink tax and the consumer pink tax – the extra charge placed on products marketed to women. Overall, our research sheds light on an important issue affecting women worldwide. 
It raises awareness of the digital marketing pink tax and advocates for better regulation.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"59 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140925708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-05-06, DOI: 10.1140/epjds/s13688-024-00475-0
Peter Mehler, Eva Iris Otto, Anna Sapienza
{"title":"Who makes open source code? The hybridisation of commercial and open source practices","authors":"Peter Mehler, Eva Iris Otto, Anna Sapienza","doi":"10.1140/epjds/s13688-024-00475-0","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00475-0","url":null,"abstract":"<p>While Free and Open Source (F/OSS) coding has traditionally been described as a separate commons linked to values of openness and sharing, recent research suggests an increasing integration of private corporations into F/OSS practices, blurring the boundaries between F/OSS and commodified coding. However, there is a dearth of empirical, and especially quantitative, studies exploring this phenomenon. To address this gap, we model the power dynamics and infrastructural aspects of software production within GitHub, a central hub for F/OSS development, using a large-scale, directed network. Using various network statistics, we detect the ecosystem’s most impactful actors and find a nuanced picture of the influence of individuals, open source organizations, and private corporations in F/OSS practices. We find that the majority of public repositories on GitHub depend on a small core of specialized repositories and users. In accordance with expectations, individuals and open source organizations are more prevalent in this core of elite GitHub users; however, we also find a significant number of private organizations with an indirect, yet consistent, influence within GitHub. In addition, we find that directly influential individuals tend to facilitate sponsorship methods more often than indirectly or non-influential individuals. 
Our research highlights a hybridization of F/OSS and sheds light on the complex interplay between influence, power, and code production in the multi-language dependency ecosystem of GitHub.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"61 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-04-22, DOI: 10.1140/epjds/s13688-024-00466-1
Alex D. Singleton, Seth Spielman
{"title":"Segmentation using large language models: A new typology of American neighborhoods","authors":"Alex D. Singleton, Seth Spielman","doi":"10.1140/epjds/s13688-024-00466-1","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00466-1","url":null,"abstract":"<p>In the United States, recent changes to the National Statistical System have amplified the geographic-demographic resolution trade-off. That is, when working with demographic and economic data from the American Community Survey, as one zooms in geographically one loses resolution demographically due to very large margins of error. In this paper, we present a solution to this problem in the form of an AI based open and reproducible geodemographic classification system for the United States using small area estimates from the American Community Survey (ACS). We employ a partitioning clustering algorithm to a range of socio-economic, demographic, and built environment variables. Our approach utilizes an open source software pipeline that ensures adaptability to future data updates. A key innovation is the integration of GPT4, a state-of-the-art large language model, to generate intuitive cluster descriptions and names. This represents a novel application of natural language processing in geodemographic research and showcases the potential for human-AI collaboration within the geospatial domain.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"24 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-04-19, DOI: 10.1140/epjds/s13688-024-00472-3
Chiara Zappalà, Sandro Sousa, Tiago Cunha, Alessandro Pluchino, Andrea Rapisarda, Roberta Sinatra
{"title":"Early career wins and tournament prestige characterize tennis players’ trajectories","authors":"Chiara Zappalà, Sandro Sousa, Tiago Cunha, Alessandro Pluchino, Andrea Rapisarda, Roberta Sinatra","doi":"10.1140/epjds/s13688-024-00472-3","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00472-3","url":null,"abstract":"<p>Success in sports is a complex phenomenon that has only garnered limited research attention. In particular, we lack a deep scientific understanding of success in sports like tennis and the factors that contribute to it. Here, we study the unfolding of tennis players’ careers to understand the role of early career stages and the impact of specific tournaments on players’ trajectories. We employ a comprehensive approach combining network science and analysis of the Association of Tennis Professionals (ATP) tournament data and introduce a novel method to quantify tournament prestige based on the eigenvector centrality of the co-attendance network of tournaments. Focusing on the interplay between participation in central tournaments and players’ performance, we find that the level of the tournament where players achieve their first win is associated with becoming a top player. This work sheds light on the critical role of the initial stages in the progression of players’ careers, offering valuable insights into the dynamics of success in tennis.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"2 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140623057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-04-19, DOI: 10.1140/epjds/s13688-024-00467-0
Serena Tardelli, Leonardo Nizzoli, Marco Avvenuti, Stefano Cresci, Maurizio Tesconi
{"title":"Multifaceted online coordinated behavior in the 2020 US presidential election","authors":"Serena Tardelli, Leonardo Nizzoli, Marco Avvenuti, Stefano Cresci, Maurizio Tesconi","doi":"10.1140/epjds/s13688-024-00467-0","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00467-0","url":null,"abstract":"<p>Organized attempts to manipulate public opinion during election run-ups have dominated online debates in the last few years. Such attempts require numerous accounts to <i>act in coordination</i> to exert influence. Yet, the ways in which coordinated behavior surfaces during major online political debates is still largely unclear. This study sheds light on coordinated behaviors that took place on Twitter (now X) during the 2020 US Presidential Election. Utilizing state-of-the-art network science methods, we detect and characterize the coordinated communities that participated in the debate. Our approach goes beyond previous analyses by proposing a multifaceted characterization of the coordinated communities that allows obtaining nuanced results. In particular, we uncover three main categories of coordinated users: (<i>i</i>) moderate groups genuinely interested in the electoral debate, (<i>ii</i>) conspiratorial groups that spread false information and divisive narratives, and (<i>iii</i>) foreign influence networks that either sought to tamper with the debate or that exploited it to publicize their own agendas. We also reveal a large use of automation by far-right foreign influence and conspiratorial communities. Conversely, left-leaning supporters were overall less coordinated and engaged primarily in harmless, factual communication. Our results also showed that Twitter was effective at thwarting the activity of some coordinated groups, while it failed on some other equally suspicious ones. 
Overall, this study advances the understanding of online human interactions and contributes new knowledge to mitigate cyber social threats.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"48 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140623004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-04-16, DOI: 10.1140/epjds/s13688-024-00470-5
Mohsen Ghasemizade, Jeremiah Onaolapo
{"title":"Developing a hierarchical model for unraveling conspiracy theories","authors":"Mohsen Ghasemizade, Jeremiah Onaolapo","doi":"10.1140/epjds/s13688-024-00470-5","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00470-5","url":null,"abstract":"<p>A conspiracy theory (CT) suggests covert groups or powerful individuals secretly manipulate events. Not knowing about existing conspiracy theories could make one more likely to believe them, so this work aims to compile a list of CTs shaped as a tree that is as comprehensive as possible. We began with a manually curated ‘tree’ of CTs from academic papers and Wikipedia. Next, we examined 1769 CT-related articles from four fact-checking websites, focusing on their core content, and used a technique called Keyphrase Extraction to label the documents. This process yielded 769 identified conspiracies, each assigned a label and a family name. The second goal of this project was to detect whether an article is a conspiracy theory, so we built a binary classifier with our labeled dataset. This model is based on RoBERTa, a transformer-based language model pre-trained on a large corpus, and achieves an F1 score of 87%. This model helps to identify potential conspiracy theories in new articles. We used a combination of clustering (HDBSCAN) and a dimension reduction technique (UMAP) to assign a label from the tree to these new articles detected as conspiracy theories. We then labeled these groups accordingly to help us match them to the tree. These can lead us to detect new conspiracy theories and expand the tree using computational methods. We successfully generated a tree of conspiracy theories and built a pipeline to detect and categorize conspiracy theories within any text corpus. 
This pipeline gives us valuable insights through any databases formatted as text.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140610910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EPJ Data Science, Pub Date: 2024-04-11, DOI: 10.1140/epjds/s13688-024-00471-4
Rui Chen, Yuming Lin, Huan Yan, Jiazhen Liu, Yu Liu, Yong Li
{"title":"Scaling law of real traffic jams under varying travel demand","authors":"Rui Chen, Yuming Lin, Huan Yan, Jiazhen Liu, Yu Liu, Yong Li","doi":"10.1140/epjds/s13688-024-00471-4","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00471-4","url":null,"abstract":"<p>The escalation of urban traffic congestion has reached a critical extent due to rapid urbanization, capturing considerable attention within urban science and transportation research. Although preceding studies have validated the scale-free distributions in spatio-temporal congestion clusters across cities, the influence of travel demand on that distribution has yet to be explored. Using a unique traffic dataset during the COVID-19 pandemic in Shanghai 2022, we present empirical evidence that travel demand plays a pivotal role in shaping the scaling laws of traffic congestion. We uncover a noteworthy negative linear correlation between the travel demand and the traffic resilience represented by scaling exponents of congestion cluster size and recovery duration. Additionally, we reveal that travel demand broadly dominates the scale of congestion in the form of scaling laws, including the aggregated volume of congestion clusters, the number of congestion clusters, and the number of congested roads. Subsequent micro-level analysis of congestion propagation also unveils that cascade diffusion determines the demand sensitivity of congestion, while other intrinsic components, namely spontaneous generation and dissipation, are rather stable. 
Our findings of traffic congestion under diverse travel demand can profoundly enrich our understanding of the scale-free nature of traffic congestion and provide insights into internal mechanisms of congestion propagation.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"38 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140563982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}