Journal of Big Data最新文献_第4页

An adaptive composite time series forecasting model for short-term traffic flow 短期交通流量的自适应复合时间序列预测模型

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-08-03 DOI: 10.1186/s40537-024-00967-w

Qitan Shao, Xinglin Piao, Xiangyu Yao, Yuqiu Kong, Yongli Hu, Baocai Yin, Yong Zhang

{"title":"An adaptive composite time series forecasting model for short-term traffic flow","authors":"Qitan Shao, Xinglin Piao, Xiangyu Yao, Yuqiu Kong, Yongli Hu, Baocai Yin, Yong Zhang","doi":"10.1186/s40537-024-00967-w","DOIUrl":"https://doi.org/10.1186/s40537-024-00967-w","url":null,"abstract":"<p>Short-term traffic flow forecasting is a hot issue in the field of intelligent transportation. The research field of traffic forecasting has evolved greatly in past decades. With the rapid development of deep learning and neural networks, a series of effective methods have been proposed to address the short-term traffic flow forecasting problem, which makes it possible to examine and forecast traffic situations more accurately than ever. Different from linear based methods, deep learning based methods achieve traffic flow forecasting by exploring the complex nonlinear relationships in traffic flow. Most existing methods always use a single framework for feature extraction and forecasting only. These approaches treat all traffic flow equally and consider them contain same attribute. However, the traffic flow from different time spots or roads may contain distinct attributes information (such as congested and uncongested). A simple single framework usually ignore the different attributes embedded in different distributions of data. This would decrease the accuracy of traffic forecasting. To tackle these issues, we propose an adaptive composite framework, named Long-Short-Combination (LSC). In the proposed method, two data forecasting modules(L and S) are designed for short-term traffic flow with different attributes respectively. Furthermore, we also integrate an attribute forecasting module (C) to forecast the traffic attributes for each time point in future time series. The proposed framework has been assessed on real-world datasets. The experimental results demonstrate that the proposed model has excellent forecasting performance.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"56 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141930321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DiabSense: early diagnosis of non-insulin-dependent diabetes mellitus using smartphone-based human activity recognition and diabetic retinopathy analysis with Graph Neural Network DiabSense：利用基于智能手机的人体活动识别和图神经网络的糖尿病视网膜病变分析，早期诊断非胰岛素依赖型糖尿病

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-08-03 DOI: 10.1186/s40537-024-00959-w

Md Nuho Ul Alam, Ibrahim Hasnine, Erfanul Hoque Bahadur, Abdul Kadar Muhammad Masum, Mercedes Briones Urbano, Manuel Masias Vergara, Jia Uddin, Imran Ashraf, Md. Abdus Samad

{"title":"DiabSense: early diagnosis of non-insulin-dependent diabetes mellitus using smartphone-based human activity recognition and diabetic retinopathy analysis with Graph Neural Network","authors":"Md Nuho Ul Alam, Ibrahim Hasnine, Erfanul Hoque Bahadur, Abdul Kadar Muhammad Masum, Mercedes Briones Urbano, Manuel Masias Vergara, Jia Uddin, Imran Ashraf, Md. Abdus Samad","doi":"10.1186/s40537-024-00959-w","DOIUrl":"https://doi.org/10.1186/s40537-024-00959-w","url":null,"abstract":"<p>Non-Insulin-Dependent Diabetes Mellitus (NIDDM) is a chronic health condition caused by high blood sugar levels, and if not treated early, it can lead to serious complications i.e. blindness. Human Activity Recognition (HAR) offers potential for early NIDDM diagnosis, emerging as a key application for HAR technology. This research introduces DiabSense, a state-of-the-art smartphone-dependent system for early staging of NIDDM. DiabSense incorporates HAR and Diabetic Retinopathy (DR) upon leveraging the power of two different Graph Neural Networks (GNN). HAR uses a comprehensive array of 23 human activities resembling Diabetes symptoms, and DR is a prevalent complication of NIDDM. Graph Attention Network (GAT) in HAR achieved 98.32% accuracy on sensor data, while Graph Convolutional Network (GCN) in the Aptos 2019 dataset scored 84.48%, surpassing other state-of-the-art models. The trained GCN analyzed retinal images of four experimental human subjects for DR report generation, and GAT generated their average duration of daily activities over 30 days. The daily activities in non-diabetic periods of diabetic patients were measured and compared with the daily activities of the experimental subjects, which helped generate risk factors. Fusing risk factors with DR conditions enabled early diagnosis recommendations for the experimental subjects despite the absence of any apparent symptoms. The comparison of DiabSense system outcome with clinical diagnosis reports in the experimental subjects was conducted using the A1C test. The test results confirmed the accurate assessment of early diagnosis requirements for experimental subjects by the system. Overall, DiabSense exhibits significant potential for ensuring early NIDDM treatment, improving millions of lives worldwide.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"75 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141930322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Tc-llama 2: fine-tuning LLM for technology and commercialization applications Tc-llama 2：为技术和商业化应用微调 LLM

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-08-02 DOI: 10.1186/s40537-024-00963-0

Jeyoon Yeom, Hakyung Lee, Hoyoon Byun, Yewon Kim, Jeongeun Byun, Yunjeong Choi, Sungjin Kim, Kyungwoo Song

{"title":"Tc-llama 2: fine-tuning LLM for technology and commercialization applications","authors":"Jeyoon Yeom, Hakyung Lee, Hoyoon Byun, Yewon Kim, Jeongeun Byun, Yunjeong Choi, Sungjin Kim, Kyungwoo Song","doi":"10.1186/s40537-024-00963-0","DOIUrl":"https://doi.org/10.1186/s40537-024-00963-0","url":null,"abstract":"<p>This paper introduces TC-Llama 2, a novel application of large language models (LLMs) in the technology-commercialization field. Traditional methods in this field, reliant on statistical learning and expert knowledge, often face challenges in processing the complex and diverse nature of technology-commercialization data. TC-Llama 2 addresses these limitations by utilizing the advanced generalization capabilities of LLMs, specifically adapting them to this intricate domain. Our model, based on the open-source LLM framework, Llama 2, is customized through instruction tuning using bilingual Korean-English datasets. Our approach involves transforming technology-commercialization data into formats compatible with LLMs, enabling the model to learn detailed technological knowledge and product hierarchies effectively. We introduce a unique model evaluation strategy, leveraging new matching and generation tasks to verify the alignment of the technology-commercialization relationship in TC-Llama 2. Our results, derived from refining task-specific instructions for inference, provide valuable insights into customizing language models for specific sectors, potentially leading to new applications in technology categorization, utilization, and predictive product development.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"51 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141930412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An ensemble machine learning model for predicting one-year mortality in elderly coronary heart disease patients with anemia 预测患有贫血的老年冠心病患者一年死亡率的集合机器学习模型

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-24 DOI: 10.1186/s40537-024-00966-x

Longcan Cheng, Yan Nie, Hongxia Wen, Yan Li, Yali Zhao, Qian Zhang, Mingxing Lei, Shihui Fu

{"title":"An ensemble machine learning model for predicting one-year mortality in elderly coronary heart disease patients with anemia","authors":"Longcan Cheng, Yan Nie, Hongxia Wen, Yan Li, Yali Zhao, Qian Zhang, Mingxing Lei, Shihui Fu","doi":"10.1186/s40537-024-00966-x","DOIUrl":"https://doi.org/10.1186/s40537-024-00966-x","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Objective</h3><p>This study was designed to develop and validate a robust predictive model for one-year mortality in elderly coronary heart disease (CHD) patients with anemia using machine learning methods.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>Demographics, tests, comorbidities, and drugs were collected for a cohort of 974 elderly patients with CHD. A prospective analysis was performed to evaluate predictive performances of the developed models. External validation of models was performed in a series of 112 elderly CHD patients with anemia.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The overall one-year mortality was 43.6%. Risk factors included heart rate, chronic heart failure, tachycardia and β receptor blockers. Protective factors included hemoglobin, albumin, high density lipoprotein cholesterol, estimated glomerular filtration rate (eGFR), left ventricular ejection fraction (LVEF), aspirin, clopidogrel, calcium channel blockers, angiotensin converting enzyme inhibitors (ACEIs)/angiotensin receptor blockers (ARBs), and statins. Compared with other algorithms, an ensemble machine learning model performed the best with area under the curve (95% confidence interval) being 0.828 (0.805–0.870) and Brier score being 0.170. Calibration and density curves further confirmed favorable predicted probability and discriminative ability of an ensemble machine learning model. External validation of Ensemble Model also exhibited good performance with area under the curve (95% confidence interval) being 0.825 (0.734–0.916) and Brier score being 0.185. Patients in the high-risk group had more than six-fold probability of one-year mortality compared with those in the low-risk group (<i>P</i> < 0.001). Shaley Additive exPlanation identified the top five risk factors that associated with one-year mortality were hemoglobin, albumin, eGFR, LVEF, and ACEIs/ARBs.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>This model identifies key risk factors and protective factors, providing valuable insights for improving risk assessment, informing clinical decision-making and performing targeted interventions. It outperforms other algorithms with predictive performance and provides significant opportunities for personalized risk mitigation strategies, with clinical implications for improving patient care.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"17 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Hate speech detection in the Bengali language: a comprehensive survey 孟加拉语中的仇恨言论检测：一项全面调查

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-23 DOI: 10.1186/s40537-024-00956-z

Abdullah Al Maruf, Ahmad Jainul Abidin, Md. Mahmudul Haque, Zakaria Masud Jiyad, Aditi Golder, Raaid Alubady, Zeyar Aung

{"title":"Hate speech detection in the Bengali language: a comprehensive survey","authors":"Abdullah Al Maruf, Ahmad Jainul Abidin, Md. Mahmudul Haque, Zakaria Masud Jiyad, Aditi Golder, Raaid Alubady, Zeyar Aung","doi":"10.1186/s40537-024-00956-z","DOIUrl":"https://doi.org/10.1186/s40537-024-00956-z","url":null,"abstract":"<p>The detection of hate speech (HS) in online platforms has become extremely important for maintaining a safe and inclusive environment. While significant progress has been made in English-language HS detection, methods for detecting HS in other languages, such as Bengali, have not been explored much like English. In this survey, we outlined the key challenges specific to HS detection in Bengali, including the scarcity of labeled datasets, linguistic nuances, and contextual variations. We also examined different approaches and methodologies employed by researchers to address these challenges, including classical machine learning techniques, ensemble approaches, and more recent deep learning advancements. Furthermore, we explored the performance metrics used for evaluation, including the accuracy, precision, recall, receiver operating characteristic (ROC) curve, area under the ROC curve (AUC), sensitivity, specificity, and F1 score, providing insights into the effectiveness of the proposed models. Additionally, we identified the limitations and future directions of research in Bengali HS detection, highlighting the need for larger annotated datasets, cross-lingual transfer learning techniques, and the incorporation of contextual information to improve the detection accuracy. This survey provides a comprehensive overview of the current state-of-the-art HS detection methods used in Bengali text and serves as a valuable resource for researchers and practitioners interested in understanding the advancements, challenges, and opportunities in addressing HS in the Bengali language, ultimately assisting in the creation of reliable and effective online platform detection systems.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"14 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques 使用机器学习技术对云环境中的 MapReduce 作业性能进行预测建模

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-23 DOI: 10.1186/s40537-024-00964-z

Mohammed Bergui, Soufiane Hourri, Said Najah, Nikola S. Nikolov

{"title":"Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques","authors":"Mohammed Bergui, Soufiane Hourri, Said Najah, Nikola S. Nikolov","doi":"10.1186/s40537-024-00964-z","DOIUrl":"https://doi.org/10.1186/s40537-024-00964-z","url":null,"abstract":"<p>Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"48 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Introducing Mplots: scaling time series recurrence plots to massive datasets 介绍 Mplots：根据海量数据集缩放时间序列递推图

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-20 DOI: 10.1186/s40537-024-00954-1

Maryam Shahcheraghi, Ryan Mercer, João Manuel de Almeida Rodrigues, Audrey Der, Hugo Filipe Silveira Gamboa, Zachary Zimmerman, Kerry Mauck, Eamonn Keogh

{"title":"Introducing Mplots: scaling time series recurrence plots to massive datasets","authors":"Maryam Shahcheraghi, Ryan Mercer, João Manuel de Almeida Rodrigues, Audrey Der, Hugo Filipe Silveira Gamboa, Zachary Zimmerman, Kerry Mauck, Eamonn Keogh","doi":"10.1186/s40537-024-00954-1","DOIUrl":"https://doi.org/10.1186/s40537-024-00954-1","url":null,"abstract":"<p>Time series similarity matrices (informally, recurrence plots or dot-plots), are useful tools for time series data mining. They can be used to guide data exploration, and various useful features can be derived from them and then fed into downstream analytics. However, time series similarity matrices suffer from very poor scalability, taxing both time and memory requirements. In this work, we introduce novel ideas that allow us to scale the largest time series similarity matrices that can be examined by several orders of magnitude. The first idea is a novel algorithm to compute the matrices in a way that removes dependency on the subsequence length. This algorithm is so fast that it allows us to now address datasets where the memory limitations begin to dominate. Our second novel contribution is a multiscale algorithm that computes an approximation of the matrix appropriate for the limitations of the user’s memory/screen-resolution, then performs a local, just-in-time recomputation of any region that the user wishes to zoom-in on. Given that this largely removes time and space barriers, human visual attention then becomes the bottleneck. We further introduce algorithms that search massive matrices with quadrillions of cells and then prioritize regions for later examination by either humans or algorithms. We will demonstrate the utility of our ideas for data exploration, segmentation, and classification in domains as diverse as astronomy, bioinformatics, entomology, and wildlife monitoring.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"47 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Emotion AWARE: an artificial intelligence framework for adaptable, robust, explainable, and multi-granular emotion analysis 情感 AWARE：用于适应性强、稳健、可解释和多粒度情感分析的人工智能框架

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-10 DOI: 10.1186/s40537-024-00953-2

Gihan Gamage, Daswin De Silva, Nishan Mills, Damminda Alahakoon, Milos Manic

{"title":"Emotion AWARE: an artificial intelligence framework for adaptable, robust, explainable, and multi-granular emotion analysis","authors":"Gihan Gamage, Daswin De Silva, Nishan Mills, Damminda Alahakoon, Milos Manic","doi":"10.1186/s40537-024-00953-2","DOIUrl":"https://doi.org/10.1186/s40537-024-00953-2","url":null,"abstract":"<p>Emotions are fundamental to human behaviour. How we feel, individually and collectively, determines how humanity evolves and advances into our shared future. The rapid digitalisation of our personal, social and professional lives means we are frequently using digital media to express, understand and respond to emotions. Although recent developments in Artificial Intelligence (AI) are able to analyse sentiment and detect emotions, they are not effective at comprehending the complexity and ambiguity of digital emotion expressions in knowledge-focused activities of customers, people, and organizations. In this paper, we address this challenge by proposing a novel AI framework for the adaptable, robust, and explainable detection of multi-granular assembles of emotions. This framework consolidates lexicon generation and finetuned Large Language Model (LLM) approaches to formulate multi-granular assembles of two, eight and fourteen emotions. The framework is robust to ambiguous emotion expressions that are implied in conversation, adaptable to domain-specific emotion semantics, and the assembles are explainable using constituent terms and intensity. We conducted nine empirical studies using datasets representing diverse human emotion behaviours. The results of these studies comprehensively demonstrate and evaluate the core capabilities of the framework, and consistently outperforms state-of-the-art approaches in adaptable, robust, and explainable multi-granular emotion detection.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"153 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141587939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Examining ALS: reformed PCA and random forest for effective detection of ALS 检查 ALS：改革 PCA 和随机森林，有效检测 ALS

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-10 DOI: 10.1186/s40537-024-00951-4

Abdullah Alqahtani, Shtwai Alsubai, Mohemmed Sha, Ashit Kumar Dutta

{"title":"Examining ALS: reformed PCA and random forest for effective detection of ALS","authors":"Abdullah Alqahtani, Shtwai Alsubai, Mohemmed Sha, Ashit Kumar Dutta","doi":"10.1186/s40537-024-00951-4","DOIUrl":"https://doi.org/10.1186/s40537-024-00951-4","url":null,"abstract":"<p>ALS (Amyotrophic Lateral Sclerosis) is a fatal neurodegenerative disease of the human motor system. It is a group of progressive diseases that affects the nerve cells in the brain and spinal cord that control the muscle movement of the body hence, detection and classification of ALS at the right time is considered to be one of the vital aspects that can save the life of humans. Therefore, in various studies, different AI techniques are used for the detection of ALS, however, these methods are considered to be ineffectual in terms of identifying the disease due to the employment of ineffective algorithms. Hence, the proposed model utilizes Modified Principal Component Analysis (MPCA) and Modified Random Forest (MRF) for performing dimensionality reduction of all the potential features considered for effective classification of the ALS presence and absence of ALS causing mutation in the corresponding gene. The MPCA is adapted for capturing all the Low-Importance Data transformation. Furthermore, The MPCA is objected to performing three various approaches: Covariance Matrix Correlation, Eigen Vector- Eigenvalue decomposition, and selecting the desired principal components. This is done in aspects of implying the LI (Lower-Importance) Data Transformation. By choosing these potential components without any loss of features ensures better viability of selecting the attributes for ALS-causing gene classification. This is followed by the classification of the proposed model by using Modified RF by updating the clump detector technique. The clump detector is proceeded by clustering approach using K-means, and the data reduced by their dimension are grouped accordingly. These clustered data are analyzed either for ALS causing or devoid of causing ALS. Finally, the model’s performance is assessed using different evaluation metrics like accuracy, recall, F1 score, and precision, and the proposed model is further compared with the existing models to assess the efficacy of the proposed model.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"21 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141587940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Exploring AI-driven approaches for unstructured document analysis and future horizons 探索人工智能驱动的非结构化文件分析方法及未来展望

IF 8.1 2区计算机科学

Journal of Big Data Pub Date : 2024-07-05 DOI: 10.1186/s40537-024-00948-z

Supriya V. Mahadevkar, Shruti Patil, Ketan Kotecha, Lim Way Soong, Tanupriya Choudhury

{"title":"Exploring AI-driven approaches for unstructured document analysis and future horizons","authors":"Supriya V. Mahadevkar, Shruti Patil, Ketan Kotecha, Lim Way Soong, Tanupriya Choudhury","doi":"10.1186/s40537-024-00948-z","DOIUrl":"https://doi.org/10.1186/s40537-024-00948-z","url":null,"abstract":"<p>In the current industrial landscape, a significant number of sectors are grappling with the challenges posed by unstructured data, which incurs financial losses amounting to millions annually. If harnessed effectively, this data has the potential to substantially boost operational efficiency. Traditional methods for extracting information have their limitations; however, solutions powered by artificial intelligence (AI) could provide a more fitting alternative. There is an evident gap in scholarly research concerning a comprehensive evaluation of AI-driven techniques for the extraction of information from unstructured content. This systematic literature review aims to identify, assess, and deliberate on prospective research directions within the field of unstructured document information extraction. It has been observed that prevailing extraction methods primarily depend on static patterns or rules, often proving inadequate when faced with complex document structures typically encountered in real-world scenarios, such as medical records. Datasets currently available to the public suffer from low quality and are tailored for specific tasks only. This underscores an urgent need for developing new datasets that accurately reflect complex issues encountered in practical settings. The review reveals that AI-based techniques show promise in autonomously extracting information from diverse unstructured documents, encompassing both printed and handwritten text. Challenges arise, however, when dealing with varied document layouts. Proposing a framework through hybrid AI-based approaches, this review envisions processing a high-quality dataset for automatic information extraction from unstructured documents. Additionally, it emphasizes the importance of collaborative efforts between organizations and researchers to address the diverse challenges associated with unstructured data analysis.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"31 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0