{"title":"MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain.","authors":"Alaa Marshan, Anwar Nais Almutairi, Athina Ioannou, David Bell, Asmat Monaghan, Mahir Arzoky","doi":"10.3389/fdata.2024.1371680","DOIUrl":"10.3389/fdata.2024.1371680","url":null,"abstract":"<p><strong>Introduction: </strong>In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs.</p><p><strong>Methods: </strong>To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model specifically designed for EMR retrieval. The proposed model utilizes the Text-to-Text Transfer Transformer (T5) model, a Large Language Model (LLM) commonly used in various text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance evaluation involves benchmarking the MedT5SQL model on two optimizers, varying numbers of training epochs, and using two datasets, MIMICSQL and WikiSQL.</p><p><strong>Results: </strong>For MIMICSQL dataset, the model demonstrates considerable effectiveness in generating question-SQL pairs achieving accuracy of 80.63%, 98.937%, and 90% for exact match accuracy matrix, approximate string-matching, and manual evaluation, respectively. When testing the performance of the model on WikiSQL dataset, the model demonstrates efficiency in generating SQL queries, with an accuracy of 44.2% on WikiSQL and 94.26% for approximate string-matching.</p><p><strong>Discussion: </strong>Results indicate improved performance with increased training epochs. This work highlights the potential of fine-tuned T5 model to convert medical-related questions written in natural language to Structured Query Language (SQL) in healthcare domain, providing a foundation for future research in this area.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1371680"},"PeriodicalIF":2.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11233734/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141581493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-06-18eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1359317
Serban Stan, Mohammad Rostami
{"title":"Source-free domain adaptation for semantic image segmentation using internal representations.","authors":"Serban Stan, Mohammad Rostami","doi":"10.3389/fdata.2024.1359317","DOIUrl":"10.3389/fdata.2024.1359317","url":null,"abstract":"<p><p>Semantic segmentation models trained on annotated data fail to generalize well when the input data distribution changes over extended time period, leading to requiring re-training to maintain performance. Classic unsupervised domain adaptation (UDA) attempts to address a similar problem when there is target domain with no annotated data points through transferring knowledge from a source domain with annotated data. We develop an online UDA algorithm for semantic segmentation of images that improves model generalization on unannotated domains in scenarios where source data access is restricted during adaptation. We perform model adaptation by minimizing the distributional distance between the source latent features and the target features in a shared embedding space. Our solution promotes a shared domain-agnostic latent feature space between the two domains, which allows for classifier generalization on the target dataset. To alleviate the need of access to source samples during adaptation, we approximate the source latent feature distribution via an appropriate surrogate distribution, in this case a Gaussian mixture model (GMM).</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1359317"},"PeriodicalIF":2.4,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11217319/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-06-17eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1359906
Grace Ataguba, Rita Orji
{"title":"Toward the design of persuasive systems for a healthy workplace: a real-time posture detection.","authors":"Grace Ataguba, Rita Orji","doi":"10.3389/fdata.2024.1359906","DOIUrl":"10.3389/fdata.2024.1359906","url":null,"abstract":"<p><p>Persuasive technologies, in connection with human factor engineering requirements for healthy workplaces, have played a significant role in ensuring a change in human behavior. Healthy workplaces suggest different best practices applicable to body posture, proximity to the computer system, movement, lighting conditions, computer system layout, and other significant psychological and cognitive aspects. Most importantly, body posture suggests how users should sit or stand in workplaces in line with best and healthy practices. In this study, we developed two study phases (pilot and main) using two deep learning models: convolutional neural networks (CNN) and Yolo-V3. To train the two models, we collected posture datasets from creative common license YouTube videos and Kaggle. We classified the dataset into comfortable and uncomfortable postures. Results show that our YOLO-V3 model outperformed CNN model with a mean average precision of 92%. Based on this finding, we recommend that YOLO-V3 model be integrated in the design of persuasive technologies for a healthy workplace. Additionally, we provide future implications for integrating proximity detection taking into consideration the ideal number of centimeters users should maintain in a healthy workplace.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1359906"},"PeriodicalIF":2.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11215059/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141477886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-06-14eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1371518
Laura Smets, Werner Van Leekwijck, Ing Jyh Tsang, Steven Latré
{"title":"An encoding framework for binarized images using hyperdimensional computing.","authors":"Laura Smets, Werner Van Leekwijck, Ing Jyh Tsang, Steven Latré","doi":"10.3389/fdata.2024.1371518","DOIUrl":"10.3389/fdata.2024.1371518","url":null,"abstract":"<p><strong>Introduction: </strong>Hyperdimensional Computing (HDC) is a brain-inspired and lightweight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable Internet of Things, near-sensor artificial intelligence applications, and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is encoding the input data to the hyperdimensional (HD) space.</p><p><strong>Methods: </strong>This article proposes a novel lightweight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves the similarity of patterns at nearby locations by using point of interest selection and <i>local linear mapping</i>.</p><p><strong>Results: </strong>The method reaches an accuracy of 97.92% on the test set for the MNIST data set and 84.62% for the Fashion-MNIST data set.</p><p><strong>Discussion: </strong>These results outperform other studies using native HDC with different encoding approaches and are on par with more complex hybrid HDC models and lightweight binarized neural networks. The proposed encoding approach also demonstrates higher robustness to noise and blur compared to the baseline encoding.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1371518"},"PeriodicalIF":2.4,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11214273/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141472535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-05-30eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1363978
Elizabeth Newman, Lior Horesh, Haim Avron, Misha E Kilmer
{"title":"Stable tensor neural networks for efficient deep learning.","authors":"Elizabeth Newman, Lior Horesh, Haim Avron, Misha E Kilmer","doi":"10.3389/fdata.2024.1363978","DOIUrl":"10.3389/fdata.2024.1363978","url":null,"abstract":"<p><p>Learning from complex, multidimensional data has become central to computational mathematics, and among the most successful high-dimensional function approximators are deep neural networks (DNNs). Training DNNs is posed as an optimization problem to learn network weights or parameters that well-approximate a mapping from input to target data. Multiway data or tensors arise naturally in myriad ways in deep learning, in particular as input data and as high-dimensional weights and features extracted by the network, with the latter often being a bottleneck in terms of speed and memory. In this work, we leverage tensor representations and processing to efficiently parameterize DNNs when learning from high-dimensional data. We propose tensor neural networks (t-NNs), a natural extension of traditional fully-connected networks, that can be trained efficiently in a reduced, yet more powerful parameter space. Our t-NNs are built upon matrix-mimetic tensor-tensor products, which retain algebraic properties of matrix multiplication while capturing high-dimensional correlations. Mimeticity enables t-NNs to inherit desirable properties of modern DNN architectures. We exemplify this by extending recent work on stable neural networks, which interpret DNNs as discretizations of differential equations, to our multidimensional framework. We provide empirical evidence of the parametric advantages of t-NNs on dimensionality reduction using autoencoders and classification using fully-connected and stable variants on benchmark imaging datasets MNIST and CIFAR-10.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1363978"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11170703/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-05-30eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1379921
Yan Chen, Kate Sherren, Kyung Young Lee, Lori McCay-Peet, Shan Xue, Michael Smit
{"title":"From theory to practice: insights and hurdles in collecting social media data for social science research.","authors":"Yan Chen, Kate Sherren, Kyung Young Lee, Lori McCay-Peet, Shan Xue, Michael Smit","doi":"10.3389/fdata.2024.1379921","DOIUrl":"10.3389/fdata.2024.1379921","url":null,"abstract":"<p><p>Social media has profoundly changed our modes of self-expression, communication, and participation in public discourse, generating volumes of conversations and content that cover every aspect of our social lives. Social media platforms have thus become increasingly important as data sources to identify social trends and phenomena. In recent years, academics have steadily lost ground on access to social media data as technology companies have set more restrictions on Application Programming Interfaces (APIs) or entirely closed public APIs. This circumstance halts the work of many social scientists who have used such data to study issues of public good. We considered the viability of eight approaches for image-based social media data collection: data philanthropy organizations, data repositories, data donation, third-party data companies, homegrown tools, and various web scraping tools and scripts. This paper discusses the advantages and challenges of these approaches from literature and from the authors' experience. We conclude the paper by discussing mechanisms for improving social media data collection that will enable this future frontier of social science research.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1379921"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169574/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141319551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-05-30eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1412837
Marianna Gonçalves Dias Chaves, Adriel Bilharva da Silva, Emílio Graciliano Ferreira Mercuri, Steffen Manfred Noe
{"title":"Particulate matter forecast and prediction in Curitiba using machine learning.","authors":"Marianna Gonçalves Dias Chaves, Adriel Bilharva da Silva, Emílio Graciliano Ferreira Mercuri, Steffen Manfred Noe","doi":"10.3389/fdata.2024.1412837","DOIUrl":"10.3389/fdata.2024.1412837","url":null,"abstract":"<p><strong>Introduction: </strong>Air quality is directly affected by pollutant emission from vehicles, especially in large cities and metropolitan areas or when there is no compliance check for vehicle emission standards. Particulate Matter (PM) is one of the pollutants emitted from fuel burning in internal combustion engines and remains suspended in the atmosphere, causing respiratory and cardiovascular health problems to the population. In this study, we analyzed the interaction between vehicular emissions, meteorological variables, and particulate matter concentrations in the lower atmosphere, presenting methods for predicting and forecasting PM2.5.</p><p><strong>Methods: </strong>Meteorological and vehicle flow data from the city of Curitiba, Brazil, and particulate matter concentration data from optical sensors installed in the city between 2020 and 2022 were organized in hourly and daily averages. Prediction and forecasting were based on two machine learning models: Random Forest (RF) and Long Short-Term Memory (LSTM) neural network. The baseline model for prediction was chosen as the Multiple Linear Regression (MLR) model, and for forecast, we used the naive estimation as baseline.</p><p><strong>Results: </strong>RF showed that on hourly and daily prediction scales, the planetary boundary layer height was the most important variable, followed by wind gust and wind velocity in hourly or daily cases, respectively. The highest PM prediction accuracy (99.37%) was found using the RF model on a daily scale. For forecasting, the highest accuracy was 99.71% using the LSTM model for 1-h forecast horizon with 5 h of previous data used as input variables.</p><p><strong>Discussion: </strong>The RF and LSTM models were able to improve prediction and forecasting compared with MLR and Naive, respectively. The LSTM was trained with data corresponding to the period of the COVID-19 pandemic (2020 and 2021) and was able to forecast the concentration of PM2.5 in 2022, in which the data show that there was greater circulation of vehicles and higher peaks in the concentration of PM2.5. Our results can help the physical understanding of factors influencing pollutant dispersion from vehicle emissions at the lower atmosphere in urban environment. This study supports the formulation of new government policies to mitigate the impact of vehicle emissions in large cities.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1412837"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169811/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-05-30eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1330392
Jinghui Chen, Takayuki Mizuno, Shohei Doi
{"title":"Analyzing political party positions through multi-language twitter text embeddings.","authors":"Jinghui Chen, Takayuki Mizuno, Shohei Doi","doi":"10.3389/fdata.2024.1330392","DOIUrl":"10.3389/fdata.2024.1330392","url":null,"abstract":"<p><p>Traditional monolingual word embedding models transform words into high-dimensional vectors which represent semantics relations between words as relationships between vectors in the high-dimensional space. They serve as productive tools to interpret multifarious aspects of the social world in social science research. Building on the previous research which interprets multifaceted meanings of words by projecting them onto word-level dimensions defined by differences between antonyms, we extend the architecture of establishing word-level cultural dimensions to the sentence level and adopt a Language-agnostic BERT model (LaBSE) to detect position similarities in a multi-language environment. We assess the efficacy of our sentence-level methodology using Twitter data from US politicians, comparing it to the traditional word-level embedding model. We also adopt Latent Dirichlet Allocation (LDA) to investigate detailed topics in these tweets and interpret politicians' positions from different angles. In addition, we adopt Twitter data from Spanish politicians and visualize their positions in a multi-language space to analyze position similarities across countries. The results show that our sentence-level methodology outperform traditional word-level model. We also demonstrate that our methodology is effective dealing with fine-sorted themes from the result that political positions towards different topics vary even within the same politicians. Through verification using American and Spanish political datasets, we find that the positioning of American and Spanish politicians on our defined liberal-conservative axis aligns with social common sense, political news, and previous research. Our architecture improves the standard word-level methodology and can be considered as a useful architecture for sentence-level applications in the future.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1330392"},"PeriodicalIF":3.1,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11169868/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141318949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-05-02eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1366415
Muhammad Aasem, Muhammad Javed Iqbal
{"title":"Toward explainable AI in radiology: Ensemble-CAM for effective thoracic disease localization in chest X-ray images using weak supervised learning.","authors":"Muhammad Aasem, Muhammad Javed Iqbal","doi":"10.3389/fdata.2024.1366415","DOIUrl":"https://doi.org/10.3389/fdata.2024.1366415","url":null,"abstract":"<p><p>Chest X-ray (CXR) imaging is widely employed by radiologists to diagnose thoracic diseases. Recently, many deep learning techniques have been proposed as computer-aided diagnostic (CAD) tools to assist radiologists in minimizing the risk of incorrect diagnosis. From an application perspective, these models have exhibited two major challenges: (1) They require large volumes of annotated data at the training stage and (2) They lack explainable factors to justify their outcomes at the prediction stage. In the present study, we developed a class activation mapping (CAM)-based ensemble model, called Ensemble-CAM, to address both of these challenges via weakly supervised learning by employing explainable AI (XAI) functions. Ensemble-CAM utilizes class labels to predict the location of disease in association with interpretable features. The proposed work leverages ensemble and transfer learning with class activation functions to achieve three objectives: (1) minimizing the dependency on strongly annotated data when locating thoracic diseases, (2) enhancing confidence in predicted outcomes by visualizing their interpretable features, and (3) optimizing cumulative performance via fusion functions. Ensemble-CAM was trained on three CXR image datasets and evaluated through qualitative and quantitative measures via heatmaps and Jaccard indices. The results reflect the enhanced performance and reliability in comparison to existing standalone and ensembled models.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1366415"},"PeriodicalIF":3.1,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11096460/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140960924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frontiers in Big DataPub Date : 2024-04-04eCollection Date: 2024-01-01DOI: 10.3389/fdata.2024.1396638
Joe Germino, Annalisa Szymanski, Heather A Eicher-Miller, Ronald Metoyer, Nitesh V Chawla
{"title":"Corrigendum: A community focused approach toward making healthy and affordable daily diet recommendations.","authors":"Joe Germino, Annalisa Szymanski, Heather A Eicher-Miller, Ronald Metoyer, Nitesh V Chawla","doi":"10.3389/fdata.2024.1396638","DOIUrl":"https://doi.org/10.3389/fdata.2024.1396638","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.3389/fdata.2023.1086212.].</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1396638"},"PeriodicalIF":3.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11024675/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140870346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}