Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Aidong Zhang
{"title":"Embracing Foundation Models for Advancing Scientific Discovery.","authors":"Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Aidong Zhang","doi":"10.1109/bigdata62323.2024.10825618","DOIUrl":"10.1109/bigdata62323.2024.10825618","url":null,"abstract":"<p><p>Machine learning foundation models, particularly large language models (LLMs) such as GPT-4o, have revolutionized traditional applications in computer vision and natural language processing, marking a significant shift in recent years. Building on these advancements, recent efforts have explored the potential of foundation models in hypothesis generation, highlighting their potential to aid human researchers in scientific discovery. In this paper, we envision a future where academia increasingly integrates foundation models to accelerate and enhance the process of scientific discovery. Motivated by potential application scenarios of foundation models in scientific research, our vision is anchored in a central question: How can we accelerate scientific discovery with the aid of foundation models? To address this overarching question, we raise two key challenges that need to be addressed: (1) how to effectively harness the parametric knowledge embedded in foundation models to propel scientific discovery? and (2) how to develop rigorous yet scalable methods to evaluate the effectiveness of foundation models in supporting scientific research? To tackle these two challenges, we propose our approaches, termed knowledge-grounded Chain-of-Idea (KG-CoI) hypothesis generation and IdeaBench - Benchmarking LLM hypothesis generators in a customizable manner. Through addressing these challenges, we outline our vision in the hope of inspiring new ideas and innovations in harnessing foundation models for advancing scientific discovery, paving the way for a new era of research collaboration between humans and artificial intelligence.</p>","PeriodicalId":520404,"journal":{"name":"Proceedings : ... 
IEEE International Conference on Big Data. IEEE International Conference on Big Data","volume":"2024 ","pages":"1746-1755"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11923747/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143672235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight Transformer exhibits comparable performance to LLMs for Seizure Prediction: A case for light-weight models for EEG data.","authors":"Paras Parani, Umair Mohammad, Fahad Saeed","doi":"10.1109/bigdata62323.2024.10825319","DOIUrl":"10.1109/bigdata62323.2024.10825319","url":null,"abstract":"<p><p>Predicting seizures ahead of time will have a significant positive clinical impact for people with epilepsy. Advances in machine learning/artificial intelligence (ML/AI) have provided us with the tools needed to perform such predictive tasks. To date, advanced deep learning (DL) architectures such as the convolutional neural network (CNN) and long short-term memory (LSTM) have been used with mixed results. However, the highly connected activity exhibited by epileptic seizures necessitates the design of more complex ML techniques which can better capture the complex interconnected neurological processes. Other challenges include the variability of EEG sensor data quality, different epilepsy and seizure profiles, lack of annotated datasets and absence of ML-ready benchmarks. In addition, successful models will need to perform inference in near real time using limited hardware compute capacity. To address these challenges, we propose a lightweight architecture, called <i>ESPFormer</i>, whose novelty lies in its simpler, smaller model size and the lower computational footprint needed for real-time inference compared to other works in the literature. To quantify the performance of this lightweight model, we compared its performance with a custom-designed residual neural network (ResNet), a pre-trained vision transformer (ViT) and a pre-trained large language model (LLM). We tested ESPFormer on MLSPred-Bench, which is the largest patient-independent seizure prediction dataset, comprising 12 benchmarks. 
Our results demonstrate that ESPFormer provides the best performance in terms of prediction accuracy for 4/12 benchmarks with an average improvement of 2.65% compared to the LLM, 3.35% compared to the ViT and 17.65% compared to the ResNet - and comparable results for other benchmarks. Our results indicate that a lightweight transformer architecture may outperform resource-intensive LLM-based models for real-time EEG-based seizure predictions.</p>","PeriodicalId":520404,"journal":{"name":"Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data","volume":"2024 ","pages":"4941-4945"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11877310/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143560598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Suicide Risk Assessment on Social Media with Semi-Supervised Learning.","authors":"Max Lovitt, Haotian Ma, Song Wang, Yifan Peng","doi":"10.1109/bigdata62323.2024.10825422","DOIUrl":"10.1109/bigdata62323.2024.10825422","url":null,"abstract":"<p><p>With social media communities increasingly becoming places where suicidal individuals post and congregate, natural language processing presents an exciting avenue for the development of automated suicide risk assessment systems. However, past efforts suffer from a lack of labeled data and class imbalances within the available labeled data. To accommodate this task's imperfect data landscape, we propose a semi-supervised framework that leverages labeled (n=500) and unlabeled (n=1,500) data and expands upon the self-training algorithm with a novel pseudo-label acquisition process designed to handle imbalanced datasets. To further ensure pseudo-label quality, we manually verify a subset of the pseudo-labeled data that was not predicted unanimously across multiple trials of pseudo-label generation. We test various models to serve as the backbone for this framework, ultimately deciding that RoBERTa performs the best. Ultimately, by leveraging partially validated pseudo-labeled data in addition to ground-truth labeled data, we substantially improve our model's ability to assess suicide risk from social media posts.</p>","PeriodicalId":520404,"journal":{"name":"Proceedings : ... IEEE International Conference on Big Data. 
IEEE International Conference on Big Data","volume":"2024 ","pages":"8541-8549"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11786971/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143083053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature Interaction Detection in Big Data Through a New Choquet Integral based Deep Neural Network.","authors":"Matthew Fried, Honggang Wang, Hua Fang","doi":"10.1109/bigdata62323.2024.10825719","DOIUrl":"https://doi.org/10.1109/bigdata62323.2024.10825719","url":null,"abstract":"<p><p>Learning from massive amounts of domain-specific information requires new algorithms and models for parsing the ever-expanding field of big data. Such algorithms for exploring and identifying key features in vast databases require analysis of complex interactions to uncover critical features under a variety of circumstances. We study a comprehensive collection of health-related data, showing that our novel Choquet Integral activation function for deep neural networks transforms high-dimensional data into simpler sub-feature sets that better model complex interactions. While standard methods account for unitary feature tracking, they do not extend to multiple feature subsets, an impactful and necessary knowledge base. To this end, our novel activation function creates a sub-additive tool that better considers the weighted compilation of features within a robust set of standard benchmarks, advancing the synergistic and antagonistic relationships among features, capturing non-linear dependencies. We present the theoretical underpinnings, highlighting balanced fuzzy measures and sub-additivity for an optimized model based on real-world health data targeting weight loss. We further test different model settings, akin to hyper-parameter optimization. Despite its computational time consumption, which could be reduced with today's more powerful computing units, this novel method can be implemented as a pre-trained model using big data to identify heretofore unknown sub-additive feature interactions in a variety of fields such as biomedicine, fraud detection, cyber-security, and finance.</p>","PeriodicalId":520404,"journal":{"name":"Proceedings : ... 
IEEE International Conference on Big Data. IEEE International Conference on Big Data","volume":"2024 ","pages":"700-708"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12033041/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144057626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}