{"title":"Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges.","authors":"Baixiang Huang, Canyu Chen, Kai Shu","doi":"10.1145/3715073.3715076","DOIUrl":"https://doi.org/10.1145/3715073.3715076","url":null,"abstract":"Accurate attribution of authorship is crucial for maintaining the integrity of digital content, improving forensic investigations, and mitigating the risks of misinformation and plagiarism. Addressing the imperative need for proper authorship attribution is essential to uphold the credibility and accountability of authentic authorship. The rapid advancements of Large Language Models (LLMs) have blurred the lines between human and machine authorship, posing significant challenges for traditional methods. We present a comprehensive literature review that examines the latest research on authorship attribution in the era of LLMs. This survey systematically explores the landscape of this field by categorizing four representative problems: (1) Human-written Text Attribution; (2) LLM-generated Text Detection; (3) LLM-generated Text Attribution; and (4) Human-LLM Co-authored Text Attribution. We also discuss the challenges related to ensuring the generalization and explainability of authorship attribution methods. Generalization requires the ability to generalize across various domains, while explainability emphasizes providing transparent and understandable insights into the decisions made by these models. By evaluating the strengths and limitations of existing methods and benchmarks, we identify key open problems and future research directions in this field. This literature review serves as a roadmap for researchers and practitioners interested in understanding the state of the art in this rapidly evolving field.
Additional resources and a curated list of papers are available and regularly updated at https://llm-authorship.github.io/.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"26 2","pages":"21-43"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12019761/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144055709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning for streaming data: state of the art, challenges, and opportunities","authors":"Heitor Murilo Gomes, Jesse Read, A. Bifet, J. P. Barddal, João Gama","doi":"10.1145/3373464.3373470","DOIUrl":"https://doi.org/10.1145/3373464.3373470","url":null,"abstract":"Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state-of-the-art on related fields; and clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"21 1","pages":"6-22"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81600616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking and analyzing dynamics of news-cycles during global pandemics: a historical perspective","authors":"Sorour E. Amiri, Anika Tabassum, E. Ewing, B. Prakash","doi":"10.1145/3373464.3373476","DOIUrl":"https://doi.org/10.1145/3373464.3373476","url":null,"abstract":"How does the tone of reporting during a disease outbreak change in relation to the number of cases, categories of victims, and accumulating deaths? How do newspapers and medical journals contribute to the narrative of a historical pandemic? Can data mining experts help history scholars to scale up the process of examining articles, extracting new insights and understanding the public opinion of a pandemic? We explore these problems in this paper, using the 19th-century Russian Flu epidemic as an example. We study two different types of historical data sources: the US medical discussion and popular reporting during the epidemic, from its outbreak in late 1889 through the successive waves that lasted through 1893. We analyze and compare these articles and reports to answer three major questions. First, we analyze how newspapers and medical journals report the Russian flu and describe the situation. Next, we help historians in understanding the tone of related reports and how they vary across data sources. We also examine the temporal changes in the discussion to get an in-depth understanding of how public opinion changed about the pandemic. Finally, we aggregate all of the algorithms in an easy-to-use framework, GrippeStory, to help history scholars investigate historical pandemic data in general, across chronological periods and locations.
Our extensive experiments and analysis on a large number of historical articles show that GrippeStory gives meaningful and useful results for historians and it outperforms the baselines.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"31 1","pages":"91-100"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81905969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Interview with Dr. Balaji Krishnapuram, Winner of SIGKDD Service Award","authors":"Balaji Krishnapuram","doi":"10.1145/3373464.3373466","DOIUrl":"https://doi.org/10.1145/3373464.3373466","url":null,"abstract":"Dr. Balaji Krishnapuram, director and distinguished engineer at IBM Watson Health, is honored for his contributions to society through the development of machine learning products to improve healthcare. The ACM SIGKDD Service Award is the highest service award in the field of knowledge discovery and data mining. It is conferred on one individual or one group for their outstanding professional services and contributions to the field of knowledge discovery and data mining. Dr. Krishnapuram sat down with SIGKDD Explorations to discuss how he first got involved in the KDD Conference, his work at IBM Watson Health and what excites him about the future of machine learning, data science and artificial intelligence.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"36 1","pages":"1-2"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79496606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Misinformation in Social Media: Definition, Manipulation, and Detection","authors":"Liang Wu, Fred Morstatter, Kathleen M. Carley, Huan Liu","doi":"10.1145/3373464.3373475","DOIUrl":"https://doi.org/10.1145/3373464.3373475","url":null,"abstract":"The widespread dissemination of misinformation in social media has recently received a lot of attention in academia. While the problem of misinformation in social media has been intensively studied, there are seemingly different definitions for the same problem, and inconsistent results in different studies. In this survey, we aim to consolidate the observations, and investigate how an optimal method can be selected given specific conditions and contexts. To this end, we first introduce a definition for misinformation in social media and we examine the difference between misinformation detection and classic supervised learning. Second, we describe the diffusion of misinformation and introduce how spreaders propagate misinformation in social networks. Third, we explain characteristics of individual methods of misinformation detection, and provide commentary on their advantages and pitfalls. 
By reflecting applicability of different methods, we hope to enable the intensive research in this area to be conveniently reused in real-world applications and open up potential directions for future studies.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"20 1","pages":"80-90"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84537320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solve for Good: A Data Science for Social Good Marketplace","authors":"R. Ghani, Lisa Green, Alberto Bengoa, Mohak Shah","doi":"10.1145/3373464.3373468","DOIUrl":"https://doi.org/10.1145/3373464.3373468","url":null,"abstract":"Solve for Good is a platform for social good organizations to pose their problems that need data intensive help, and for volunteers to help solve those problems. Once the projects are submitted by the organization, they go through a scoping process (done by scoping volunteers and guided by our Data Science Scoping Process). Once a project scope is finalized, it becomes available for data science volunteers to start working on. The finished work is reviewed by a QA team consisting of volunteers and staff of the organization that submitted the project. Solve for Good comes out of our experience working with government agencies, non-profits, universities, volunteers, professionals, students, and the private sector over the last several years. We repeatedly get contacted by governments, non-profits, and other social good organizations asking for help with data projects. We also have smart, passionate individuals who contact us offering their help, often in a volunteer capacity, on weekends, evenings, or for a few days or weeks. Solve for Good is our attempt at linking these two. We are just starting out, and looking for help in doing this better, and getting feedback from you.
Join us at http://www.solveforgood.org as a volunteer to help solve problems, as an organization to submit problems, as partners to help us expand the platform and provide resources to run it, and as corporations or foundations to loan volunteers and donate resources used in solving these problems.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"1 1","pages":"3-5"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82817908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey of Multi-Label Topic Models","authors":"Sophie Burkhardt, S. Kramer","doi":"10.1145/3373464.3373474","DOIUrl":"https://doi.org/10.1145/3373464.3373474","url":null,"abstract":"Every day, an enormous amount of text data is produced. Sources of text data include news, social media, emails, text messages, medical reports, scientific publications and fiction. To keep track of this data, there are categories, key words, tags or labels that are assigned to each text. Automatically predicting such labels is the task of multi-label text classification. Often however, we are interested in more than just the pure classification: rather, we would like to understand which parts of a text belong to the label, which words are important for the label or which labels occur together. Because of this, topic models may be used for multi-label classification as an interpretable model that is flexible and easily extensible. This survey demonstrates the manifold possibilities and flexibility of the topic model framework for the complex setting of multi-label text classification by categorizing different variants of models.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"51 1","pages":"61-79"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75692281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Holy Grail of: Teaming humans and machine learning for detecting cyber threats","authors":"Ignacio Arnaldo, K. Veeramachaneni","doi":"10.1145/3373464.3373472","DOIUrl":"https://doi.org/10.1145/3373464.3373472","url":null,"abstract":"Although there is a large corpus of research focused on using machine learning to detect cyber threats, the solutions presented are rarely actually adopted in the real world. In this paper, we discuss the challenges that currently limit the adoption of machine learning in security operations, with a special focus on label acquisition, model deployment, and the integration of model findings into existing investigation workflows. Moreover, we posit that the conventional approach to the development of machine learning models, whereby researchers work offline on representative datasets to develop accurate models, is not valid for many cybersecurity use cases. Instead, a different approach is needed: to integrate the creation and maintenance of machine learning models into security operations themselves.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"6 1","pages":"39-47"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75515703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene Expression and Protein Function: A Survey of Deep Learning Methods","authors":"Saket K. Sathe, Sayani Aggarwal, Jiliang Tang","doi":"10.1145/3373464.3373471","DOIUrl":"https://doi.org/10.1145/3373464.3373471","url":null,"abstract":"Deep learning methods have found increasing interest in recent years because of their wide applicability for prediction and inference in numerous disciplines such as image recognition, natural language processing, and speech recognition. Computational biology is a data-intensive field in which the types of data can be very diverse. These different types of structured data require different neural architectures. The problems of gene expression and protein function prediction are related areas in computational biology (since genes control the production of proteins). This survey provides an overview of the various types of problems in this domain and the neural architectures that work for these data sets. Since deep learning is a new field compared to traditional machine learning, much of the work in this area corresponds to traditional machine learning rather than deep learning. However, as the sizes of protein and gene expression data sets continue to grow, the possibility of using data-hungry deep learning methods continues to increase. Indeed, the previous five years have seen a sudden increase in deep learning models, although some areas of protein analytics and gene expression still remain relatively unexplored. 
Therefore, aside from the survey on the deep learning work directly related to these problems, we also point out existing deep learning work from other domains that has the potential to be applied to these domains.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"4 1","pages":"23-38"},"PeriodicalIF":0.0,"publicationDate":"2019-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87412096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top Challenges from the first Practical Online Controlled Experiments Summit","authors":"Somit Gupta, Ron Kohavi, Diane Tang, Ya Xu","doi":"10.1145/3331651.3331655","DOIUrl":"https://doi.org/10.1145/3331651.3331655","url":null,"abstract":"Online controlled experiments (OCEs), also known as A/B tests, have become ubiquitous in evaluating the impact of changes made to software products and services. While the concept of online controlled experiments is simple, there are many practical challenges in running OCEs at scale. To understand the top practical challenges in running OCEs at scale and encourage further academic and industrial exploration, representatives with experience in large-scale experimentation from thirteen different organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University) were invited to the first Practical Online Controlled Experiments Summit. All thirteen organizations sent representatives. Together these organizations tested more than one hundred thousand experiment treatments last year.
Thirty-four experts from these organizations participated in the summit in Sunnyvale, CA, USA on December 13-14, 2018. While there are papers from individual organizations on some of the challenges and pitfalls in running OCEs at scale, this is the first paper to provide the top challenges faced across the industry for running OCEs at scale and some common solutions.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"43 3 1","pages":"20-35"},"PeriodicalIF":0.0,"publicationDate":"2019-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72987057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}