Journal of Big Data: Latest Articles

DiabSense: early diagnosis of non-insulin-dependent diabetes mellitus using smartphone-based human activity recognition and diabetic retinopathy analysis with Graph Neural Network
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-08-03 DOI: 10.1186/s40537-024-00959-w
Md Nuho Ul Alam, Ibrahim Hasnine, Erfanul Hoque Bahadur, Abdul Kadar Muhammad Masum, Mercedes Briones Urbano, Manuel Masias Vergara, Jia Uddin, Imran Ashraf, Md. Abdus Samad
Abstract: Non-Insulin-Dependent Diabetes Mellitus (NIDDM) is a chronic health condition caused by high blood sugar levels, and if not treated early, it can lead to serious complications such as blindness. Human Activity Recognition (HAR) offers potential for early NIDDM diagnosis, emerging as a key application for HAR technology. This research introduces DiabSense, a state-of-the-art smartphone-dependent system for early staging of NIDDM. DiabSense combines HAR and Diabetic Retinopathy (DR) analysis, leveraging two different Graph Neural Networks (GNNs). HAR uses a comprehensive array of 23 human activities resembling diabetes symptoms, and DR is a prevalent complication of NIDDM. The Graph Attention Network (GAT) used for HAR achieved 98.32% accuracy on sensor data, while the Graph Convolutional Network (GCN) scored 84.48% on the APTOS 2019 dataset, surpassing other state-of-the-art models. The trained GCN analyzed retinal images of four experimental human subjects for DR report generation, and the GAT generated their average duration of daily activities over 30 days. The daily activities of diabetic patients during their non-diabetic periods were measured and compared with the daily activities of the experimental subjects, which helped generate risk factors. Fusing risk factors with DR conditions enabled early diagnosis recommendations for the experimental subjects despite the absence of any apparent symptoms. The DiabSense system outcome was compared with clinical diagnosis reports for the experimental subjects using the A1C test. The test results confirmed that the system accurately assessed the early diagnosis requirements of the experimental subjects. Overall, DiabSense exhibits significant potential for ensuring early NIDDM treatment, improving millions of lives worldwide.
Citations: 0
Tc-llama 2: fine-tuning LLM for technology and commercialization applications
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-08-02 DOI: 10.1186/s40537-024-00963-0
Jeyoon Yeom, Hakyung Lee, Hoyoon Byun, Yewon Kim, Jeongeun Byun, Yunjeong Choi, Sungjin Kim, Kyungwoo Song
Abstract: This paper introduces TC-Llama 2, a novel application of large language models (LLMs) in the technology-commercialization field. Traditional methods in this field, reliant on statistical learning and expert knowledge, often face challenges in processing the complex and diverse nature of technology-commercialization data. TC-Llama 2 addresses these limitations by utilizing the advanced generalization capabilities of LLMs, specifically adapting them to this intricate domain. Our model, based on the open-source LLM framework Llama 2, is customized through instruction tuning using bilingual Korean-English datasets. Our approach involves transforming technology-commercialization data into formats compatible with LLMs, enabling the model to learn detailed technological knowledge and product hierarchies effectively. We introduce a unique model evaluation strategy, leveraging new matching and generation tasks to verify the alignment of the technology-commercialization relationship in TC-Llama 2. Our results, derived from refining task-specific instructions for inference, provide valuable insights into customizing language models for specific sectors, potentially leading to new applications in technology categorization, utilization, and predictive product development.
Citations: 0
An ensemble machine learning model for predicting one-year mortality in elderly coronary heart disease patients with anemia
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-24 DOI: 10.1186/s40537-024-00966-x
Longcan Cheng, Yan Nie, Hongxia Wen, Yan Li, Yali Zhao, Qian Zhang, Mingxing Lei, Shihui Fu
Abstract:
Objective: This study was designed to develop and validate a robust predictive model for one-year mortality in elderly coronary heart disease (CHD) patients with anemia using machine learning methods.
Methods: Demographics, tests, comorbidities, and drugs were collected for a cohort of 974 elderly patients with CHD. A prospective analysis was performed to evaluate the predictive performance of the developed models. External validation of the models was performed in a series of 112 elderly CHD patients with anemia.
Results: The overall one-year mortality was 43.6%. Risk factors included heart rate, chronic heart failure, tachycardia, and β receptor blockers. Protective factors included hemoglobin, albumin, high density lipoprotein cholesterol, estimated glomerular filtration rate (eGFR), left ventricular ejection fraction (LVEF), aspirin, clopidogrel, calcium channel blockers, angiotensin converting enzyme inhibitors (ACEIs)/angiotensin receptor blockers (ARBs), and statins. Compared with other algorithms, an ensemble machine learning model performed best, with an area under the curve (95% confidence interval) of 0.828 (0.805–0.870) and a Brier score of 0.170. Calibration and density curves further confirmed the favorable predicted probability and discriminative ability of the ensemble model. External validation of the ensemble model also showed good performance, with an area under the curve (95% confidence interval) of 0.825 (0.734–0.916) and a Brier score of 0.185. Patients in the high-risk group had a more than six-fold higher probability of one-year mortality than those in the low-risk group (P < 0.001). SHapley Additive exPlanations (SHAP) identified hemoglobin, albumin, eGFR, LVEF, and ACEIs/ARBs as the top five factors associated with one-year mortality.
Conclusions: This model identifies key risk factors and protective factors, providing valuable insights for improving risk assessment, informing clinical decision-making, and performing targeted interventions. It outperforms other algorithms in predictive performance and provides significant opportunities for personalized risk mitigation strategies, with clinical implications for improving patient care.
Citations: 0
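The abstract above reports discrimination as area under the curve (AUC) and calibration as a Brier score. As a rough illustration of what those two numbers measure, here is a minimal Python sketch. The function names and the toy probabilities are assumptions made for illustration only and are not taken from the paper.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def auc(probs, labels):
    """Probability that a random positive case is scored above a random negative one.

    Ties count as half a win. This is the naive pairwise version,
    O(n_pos * n_neg), which is fine for small examples.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): predicted one-year mortality risk and observed outcome.
probs = [0.9, 0.1, 0.8, 0.3]
labels = [1, 0, 1, 0]
print("Brier score:", brier_score(probs, labels))
print("AUC:", auc(probs, labels))
```

A lower Brier score means better calibrated probabilities (0 is perfect), while an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.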
Hate speech detection in the Bengali language: a comprehensive survey
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-23 DOI: 10.1186/s40537-024-00956-z
Abdullah Al Maruf, Ahmad Jainul Abidin, Md. Mahmudul Haque, Zakaria Masud Jiyad, Aditi Golder, Raaid Alubady, Zeyar Aung
Abstract: The detection of hate speech (HS) on online platforms has become extremely important for maintaining a safe and inclusive environment. While significant progress has been made in English-language HS detection, methods for detecting HS in other languages, such as Bengali, have not been explored as extensively. In this survey, we outline the key challenges specific to HS detection in Bengali, including the scarcity of labeled datasets, linguistic nuances, and contextual variations. We also examine the approaches and methodologies employed by researchers to address these challenges, including classical machine learning techniques, ensemble approaches, and more recent deep learning advancements. Furthermore, we explore the performance metrics used for evaluation, including accuracy, precision, recall, the receiver operating characteristic (ROC) curve, area under the ROC curve (AUC), sensitivity, specificity, and F1 score, providing insights into the effectiveness of the proposed models. Additionally, we identify the limitations and future directions of research in Bengali HS detection, highlighting the need for larger annotated datasets, cross-lingual transfer learning techniques, and the incorporation of contextual information to improve detection accuracy. This survey provides a comprehensive overview of the current state-of-the-art HS detection methods used for Bengali text and serves as a valuable resource for researchers and practitioners interested in understanding the advancements, challenges, and opportunities in addressing HS in the Bengali language, ultimately assisting in the creation of reliable and effective detection systems for online platforms.
Citations: 0
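The survey abstract above lists accuracy, precision, recall (sensitivity), specificity, and F1 score among the evaluation metrics. The following Python sketch shows how these are commonly derived from a binary confusion matrix. The function name, the label encoding (1 = hate speech, 0 = not hate speech), and the toy labels are assumptions for illustration and do not come from the survey.

```python
def classification_metrics(y_true, y_pred):
    """Compute common binary-classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Toy labels (hypothetical): 1 = hate speech, 0 = not hate speech.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

On imbalanced data, which is typical when labeled hate speech is scarce, accuracy alone can look high even for a weak classifier, which is one reason precision, recall, and F1 are usually reported alongside it.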
Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-23 DOI: 10.1186/s40537-024-00964-z
Mohammed Bergui, Soufiane Hourri, Said Najah, Nikola S. Nikolov
Abstract: Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.
Citations: 0
Introducing Mplots: scaling time series recurrence plots to massive datasets
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-20 DOI: 10.1186/s40537-024-00954-1
Maryam Shahcheraghi, Ryan Mercer, João Manuel de Almeida Rodrigues, Audrey Der, Hugo Filipe Silveira Gamboa, Zachary Zimmerman, Kerry Mauck, Eamonn Keogh
Abstract: Time series similarity matrices (informally, recurrence plots or dot-plots) are useful tools for time series data mining. They can be used to guide data exploration, and various useful features can be derived from them and then fed into downstream analytics. However, time series similarity matrices suffer from very poor scalability, taxing both time and memory requirements. In this work, we introduce novel ideas that increase, by several orders of magnitude, the size of the largest time series similarity matrices that can be examined. The first idea is a novel algorithm to compute the matrices in a way that removes dependency on the subsequence length. This algorithm is so fast that it allows us to address datasets where the memory limitations begin to dominate. Our second novel contribution is a multiscale algorithm that computes an approximation of the matrix appropriate for the limitations of the user’s memory/screen resolution, then performs a local, just-in-time recomputation of any region that the user wishes to zoom in on. Given that this largely removes time and space barriers, human visual attention then becomes the bottleneck. We further introduce algorithms that search massive matrices with quadrillions of cells and then prioritize regions for later examination by either humans or algorithms. We demonstrate the utility of our ideas for data exploration, segmentation, and classification in domains as diverse as astronomy, bioinformatics, entomology, and wildlife monitoring.
Citations: 0
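To make the scalability problem described in the abstract above concrete, the sketch below builds a time series similarity matrix the naive way: every subsequence of length `m` is compared with every other one. This is a baseline illustration only, not the Mplots algorithm. The use of z-normalized Euclidean distance, the function names, and the toy series are assumptions made for this example.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


def similarity_matrix(series, m):
    """Naive all-pairs distance matrix over sliding windows of length m.

    Cost is O(n^2 * m) time and O(n^2) memory for n windows, which is
    the kind of scaling the abstract describes as a barrier.
    """
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    return [[math.dist(a, b) for b in windows] for a in windows]


# Toy series (hypothetical): a short alternating pattern.
matrix = similarity_matrix([0, 1, 0, 1, 0, 1], m=2)
for row in matrix:
    print(["%.2f" % d for d in row])
```

Because both time and memory grow quadratically with the number of windows, this direct approach becomes impractical long before a series reaches millions of points, which is the regime the paper targets.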
Emotion AWARE: an artificial intelligence framework for adaptable, robust, explainable, and multi-granular emotion analysis
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-10 DOI: 10.1186/s40537-024-00953-2
Gihan Gamage, Daswin De Silva, Nishan Mills, Damminda Alahakoon, Milos Manic
Abstract: Emotions are fundamental to human behaviour. How we feel, individually and collectively, determines how humanity evolves and advances into our shared future. The rapid digitalisation of our personal, social and professional lives means we are frequently using digital media to express, understand and respond to emotions. Although recent developments in Artificial Intelligence (AI) are able to analyse sentiment and detect emotions, they are not effective at comprehending the complexity and ambiguity of digital emotion expressions in knowledge-focused activities of customers, people, and organizations. In this paper, we address this challenge by proposing a novel AI framework for the adaptable, robust, and explainable detection of multi-granular assembles of emotions. This framework consolidates lexicon generation and fine-tuned Large Language Model (LLM) approaches to formulate multi-granular assembles of two, eight and fourteen emotions. The framework is robust to ambiguous emotion expressions that are implied in conversation, adaptable to domain-specific emotion semantics, and the assembles are explainable using constituent terms and intensity. We conducted nine empirical studies using datasets representing diverse human emotion behaviours. The results of these studies comprehensively demonstrate and evaluate the core capabilities of the framework and show that it consistently outperforms state-of-the-art approaches in adaptable, robust, and explainable multi-granular emotion detection.
Citations: 0
Examining ALS: reformed PCA and random forest for effective detection of ALS
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-10 DOI: 10.1186/s40537-024-00951-4
Abdullah Alqahtani, Shtwai Alsubai, Mohemmed Sha, Ashit Kumar Dutta
Abstract: ALS (Amyotrophic Lateral Sclerosis) is a fatal neurodegenerative disease of the human motor system. It is a group of progressive diseases that affect the nerve cells in the brain and spinal cord that control the muscle movement of the body. Hence, timely detection and classification of ALS is considered a vital factor in saving lives. Various studies have used different AI techniques to detect ALS; however, these methods are considered ineffectual at identifying the disease because they employ ineffective algorithms. The proposed model therefore uses Modified Principal Component Analysis (MPCA) and Modified Random Forest (MRF): MPCA reduces the dimensionality of all the potential features considered, enabling effective classification of the presence or absence of an ALS-causing mutation in the corresponding gene. MPCA is adapted to capture the Low-Importance (LI) data transformation. It performs three steps: covariance matrix correlation, eigenvector-eigenvalue decomposition, and selection of the desired principal components. These steps carry out the LI data transformation. Choosing these components without any loss of features improves the viability of attribute selection for ALS-causing gene classification. Classification is then performed with the Modified RF, which updates the clump detector technique. The clump detector applies a K-means clustering approach, and the dimension-reduced data are grouped accordingly. The clustered data are then analyzed as either ALS-causing or not ALS-causing. Finally, the model’s performance is assessed using evaluation metrics such as accuracy, recall, F1 score, and precision, and the proposed model is compared with existing models to assess its efficacy.
Citations: 0
Exploring AI-driven approaches for unstructured document analysis and future horizons
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-05 DOI: 10.1186/s40537-024-00948-z
Supriya V. Mahadevkar, Shruti Patil, Ketan Kotecha, Lim Way Soong, Tanupriya Choudhury
Abstract: In the current industrial landscape, a significant number of sectors are grappling with the challenges posed by unstructured data, which incurs financial losses amounting to millions annually. If harnessed effectively, this data has the potential to substantially boost operational efficiency. Traditional methods for extracting information have their limitations; however, solutions powered by artificial intelligence (AI) could provide a more fitting alternative. There is an evident gap in scholarly research concerning a comprehensive evaluation of AI-driven techniques for the extraction of information from unstructured content. This systematic literature review aims to identify, assess, and deliberate on prospective research directions within the field of unstructured document information extraction. It has been observed that prevailing extraction methods primarily depend on static patterns or rules, often proving inadequate when faced with complex document structures typically encountered in real-world scenarios, such as medical records. Datasets currently available to the public suffer from low quality and are tailored for specific tasks only. This underscores an urgent need for developing new datasets that accurately reflect complex issues encountered in practical settings. The review reveals that AI-based techniques show promise in autonomously extracting information from diverse unstructured documents, encompassing both printed and handwritten text. Challenges arise, however, when dealing with varied document layouts. Proposing a framework through hybrid AI-based approaches, this review envisions processing a high-quality dataset for automatic information extraction from unstructured documents. Additionally, it emphasizes the importance of collaborative efforts between organizations and researchers to address the diverse challenges associated with unstructured data analysis.
Citations: 0
New custom rating for improving recommendation system performance
IF 8.1, Zone 2, Computer Science
Journal of Big Data Pub Date : 2024-07-02 DOI: 10.1186/s40537-024-00952-3
Tora Fahrudin, Dedy Rahman Wijaya
Abstract: Recommendation systems are currently attracting the interest of many researchers. Various new businesses have surfaced with the rise of online marketing (e-commerce) in response to the Covid-19 pandemic. This growth has encouraged item recommendation through Collaborative Filtering (CF), which aims to improve the shopping experience of users. Typically, the effectiveness of CF relies on similarity algorithms precisely identifying users with similar profiles. Traditional similarity measures are based on the user-item rating matrix. In this study, four custom ratings (CR) were used along with a new rating formula, termed New Custom Rating (NCR), derived from the popularity of users and items in addition to the original rating. Specifically, NCR optimizes recommendation system performance by using the popularity of users and items to determine new rating values, rather than relying solely on the original rating. The formulas also improve the representativeness of the new rating values and the accuracy of similarity algorithm calculations, which in turn increases the accuracy of the recommendation system. The implementation of NCR across four CR algorithms and recommendation systems was examined using five public datasets. The experimental results showed that NCR significantly increased recommendation system accuracy, as evidenced by reductions in RMSE, MSE, and MAE as well as increases in FCP and Hit Rate. By combining the popularity of users and items into rating calculations, NCR improved the accuracy of various recommendation system algorithms, reducing RMSE, MSE, and MAE by up to 62.10%, 53.62%, and 65.97%, respectively, while increasing FCP and Hit Rate by up to 11.89% and 31.42%, respectively.
Citations: 0