{"title":"CSDNet: cross-sketch with dual gated attention for fine-grained image captioning network","authors":"Md. Shamim Hossain, Shamima Aktar, Md. Bipul Hossen, Mohammad Alamgir Hossain, Naijie Gu, Zhangjin Huang","doi":"10.1007/s11042-024-20220-z","DOIUrl":"https://doi.org/10.1007/s11042-024-20220-z","url":null,"abstract":"<p>In the realm of extracting inter- and intra-modal interactions, contemporary models often face challenges such as reduced computational efficiency, particularly when dealing with lengthy visual sequences. To address these issues, this study introduces an innovative model, the Cross-Sketch with Dual Gated Attention Network (CSDNet), designed to handle second-order intra- and inter-modal interactions by integrating two attention modules. Leveraging bilinear pooling to effectively capture these second-order interactions typically requires substantial computational resources due to the processing of large-dimensional tensors. To reduce these resource demands, the first module, Cross-Sketch Attention (CSA), is proposed, which employs Cross-Tensor Sketch Pooling on attention features to reduce dimensionality while preserving crucial information without sacrificing caption quality. Furthermore, to enhance captions, we integrate a second novel attention module, Dual Gated Attention (DGA), which contributes additional spatial and channel-wise attention distributions to improve caption generation performance. Our method demonstrates significant computational efficiency improvements, reducing computation time per epoch by an average of 13.54% compared to the base model, which leads to expedited convergence and improved performance metrics. Additionally, we observe a 0.07% enhancement in the METEOR score compared to the base model. Through the application of reinforcement learning optimization, our model achieves a remarkable CIDEr-D score of 132.2% on the MS-COCO dataset.
This consistently outperforms baseline performance across a comprehensive range of evaluation metrics.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PVDM-YOLOv8l: a solution for reliable pedestrian and vehicle detection in autonomous vehicles under adverse weather conditions","authors":"Noor Ul Ain Tahir, Zuping Zhang, Muhammad Asim, Sundas Iftikhar, Ahmed A. Abd El-Latif","doi":"10.1007/s11042-024-20219-6","DOIUrl":"https://doi.org/10.1007/s11042-024-20219-6","url":null,"abstract":"<p>Ensuring the safe navigation of autonomous vehicles in intelligent transportation systems depends on their ability to detect pedestrians and vehicles. While transformer-based models for object detection have shown remarkable advancements, accurately identifying pedestrians and vehicles in adverse weather conditions remains a challenging task. Adverse weather introduces image quality degradation, leading to issues such as low contrast, reduced visibility, blurred edges, false detection, misdetection of tiny objects, and other impediments that further reduce detection accuracy. This paper introduces a novel Pedestrian and Vehicle Detection Model under adverse weather conditions, denoted as PVDM-YOLOv8l. In our proposed model, we first incorporate the Swin-Transformer method, which extracts global features so that small objects can be identified in poor visibility, into the YOLOv8l backbone structure. To enhance detection accuracy and address the impact of inaccurate features on recognition performance, CBAM is integrated between the neck and head networks of YOLOv8l, aiming to gather the most crucial information. Finally, we adopt the Wise-IOU v3 loss function to mitigate the adverse effects of low-quality instances by minimizing negative gradients. Additionally, we enhanced and augmented the DAWN dataset and created a custom dataset, named DAWN2024, to cater to the specific requirements of our study.
To verify the superiority of PVDM-YOLOv8l, its performance was compared against several commonly used object detectors, including YOLOv3, YOLOv3-tiny, YOLOv3-spp, YOLOv5, YOLOv6, all versions of YOLOv8 (n, m, s, l, and x), and some traditional models. The experimental results demonstrate that our proposed model achieved improvements of 6.6%, 5.4%, 6%, and 5.1% in precision, recall, F1-score, and mean Average Precision (mAP), respectively, on the custom DAWN2024 dataset. This substantial improvement in accuracy indicates a significant leap in the capability of our model to detect pedestrians and vehicles under adverse weather conditions, which is crucial for the safe navigation of autonomous vehicles.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-dimensional convolution transformer for group activity recognition","authors":"Dongli Wang, Xiaolin Zhu, Jinfu Liu, Zixin Zhang, Yan Zhou","doi":"10.1007/s11042-024-19973-4","DOIUrl":"https://doi.org/10.1007/s11042-024-19973-4","url":null,"abstract":"<p>Group activity recognition, which aims to understand the activity performed by a group of people, has attracted growing attention in the realm of computer vision over the past decade. In this paper, we propose a novel multi-dimensional convolution Transformer network for group activity recognition, which not only models spatial-temporal feature representations, but also combines channel information to analyze the spatial-temporal dependencies of individual actors. Specifically, we first construct a multi-scale feature extraction module in the feature extraction stage, which can exploit discriminative high-level and low-level feature representations. The multi-branching strategy combined with the dilated convolution can further capture multi-scale feature information in complex group scenarios. Then, to construct the inter-dependence among involved actors from different dimensions, we design a multi-dimensional convolution Transformer in the relational reasoning stage, which consists of the following three parts: a channel attention module, a spatial-temporal convolutional Transformer, and a spatial-temporal attention module. Finally, the final activity recognition result is obtained by using a softmax classifier. 
Extensive experiments on two public GAR datasets demonstrate that the recognition accuracy on the Volleyball Dataset and Collective Activity Dataset can reach 92.8% and 96.1%, respectively, which is a significant improvement compared with the mainstream methods in recent years.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced security in lossless audio encryption using zigzag scrambling, DNA coding, SHA-256, and hopfield networks: a practical vlc system implementation","authors":"Sorel Bagio Nono Fotso, William Nodem Atchoffo, Armand C. Nzeukou, Jimmi Hervé Talla Mbé","doi":"10.1007/s11042-024-20196-w","DOIUrl":"https://doi.org/10.1007/s11042-024-20196-w","url":null,"abstract":"<p>This paper presents a novel lossless audio encryption algorithm based on a modified zigzag scrambling technique, SHA-256, DNA coding, cipher block chaining (CBC) mode, and the delayed Hopfield neural network (HNN). The algorithm mainly includes the scrambling and diffusion stages. In the scrambling stage, the audio signal is converted into a square matrix on which the modified zigzag scrambling technique is applied. Then follows the confusion stage in which bit-level permutation, DNA coding, and CBC mode are applied successively. In addition, the delayed HNN serving in the encryption process is controlled by the plain audio signal through the hash function SHA-256 to resist differential attacks. The proposed algorithm has been assessed on ten audio signals using more than fourteen performance measures. Compared with the state of the art, the obtained results show better performance. Indeed, higher resistance to differential attacks is obtained; this is seen through higher values of number of sample change rate (NSCR) and unified average changing intensity (UACI). Also, more disorder is detected in the encrypted audio signal through higher values of the information entropy. Furthermore, the proposed algorithm possesses a larger key space arising from the high number of parameters of the delayed HNN, which results in a higher resistance to brute force attacks.
A real-life implementation of the proposed encryption technique is achieved with a visible light communication (VLC) system; this highlights its feasibility and effectiveness in securing optical wireless communication systems.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Voxel completion and 3D asymmetrical convolution networks for Lidar semantic segmentation","authors":"Yan Zhou, Jingwei Liu, Jianxun Li, Haibin Zhou","doi":"10.1007/s11042-024-19975-2","DOIUrl":"https://doi.org/10.1007/s11042-024-19975-2","url":null,"abstract":"<p>The point cloud data collected by LiDAR is large in scale and contains rich spatial structure detail. Through the collection and labeling of LiDAR data, an autonomous driving system can obtain detailed information about the environment around the vehicle. Due to lack of sufficient laser points, some methods transform the point cloud to dense representations such as multi-view or voxelized grids for processing, ignoring the information loss caused by the LiDAR imaging characteristics as well as the point cloud transformations, which degrades segmentation performance. In this work, we investigate a 3D semantic segmentation scheme with only LiDAR inputs, called voxel completion and 3D asymmetric convolution network. We propose a voxel completion sub-network to improve the feature extraction capability of the network by enlarging the receptive field and using multi-scale feature extraction to reduce the empty units in the voxels and obtain more complete voxel features. In addition, because autonomous driving scenes contain a large number of cubic objects, we propose a 3D asymmetric convolution network that better matches these scenes and includes three components: a 3D residual block, an asymmetric convolution block, and a context module. These components are combined to explore 3D geometric patterns, which can maintain their intrinsic properties and improve the performance of the network. Extensive experiments on the SemanticKITTI and nuScenes benchmark datasets demonstrate the superiority of the approach.
For example, on the nuScenes validation set, our method outperforms the state-of-the-art method by 0.3% in mIoU.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An effective binary dynamic grey wolf optimization algorithm for the 0-1 knapsack problem","authors":"Feyza Erdoğan, Murat Karakoyun, Şaban Gülcü","doi":"10.1007/s11042-024-20121-1","DOIUrl":"https://doi.org/10.1007/s11042-024-20121-1","url":null,"abstract":"<p>Metaheuristic algorithms are recommended and frequently used methods for solving optimization problems. Today, it has been adapted to many challenging problems and its successes have been identified. The grey wolf optimizer (GWO) is one of the most advanced metaheuristics. Because of the advantages it provides, GWO has been applied to solve many different problems. In this study, a new variant of GWO, the Binary Dynamic Grey Wolf Optimizer (BDGWO), is proposed for the solution of binary optimization problems. The main contributions of BDGWO compared to other binary GWO variants are that it uses the XOR bitwise operation to binarize and is based on the dynamic coefficient method developed to determine the effect of the three dominant wolves (alpha, beta, and delta) in the algorithm. BDGWO is a simple, feasible, and successful method that strikes a balance between local search and global search in solving binary optimization problems. To determine the success and accuracy of the proposed BDGWO, it was tested on the 0-1 knapsack problem (0-1 KP), which is classified as an NP-Hard problem. The BDGWO was compared with 17 different binary methods across a total of 55 data sets from three different studies published in the last four years. The Friedman test was applied to interpret the experimental results more easily and to evaluate the algorithm results statistically. 
As a result of the experiments, it has been proven that the BDGWO is an effective and successful method in accordance with its purpose.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DE-DFKD: diversity enhancing data-free knowledge distillation","authors":"Yanni Liu, Ayong Ye, Qiulin Chen, Yuexin Zhang, Jianwei Chen","doi":"10.1007/s11042-024-20193-z","DOIUrl":"https://doi.org/10.1007/s11042-024-20193-z","url":null,"abstract":"<p>Data-Free Knowledge Distillation (DFKD) can be used to train students using synthetic data, when the original dataset of the teacher network is not accessible. However, existing studies mainly focus on how to use the prior knowledge of the teacher network to synthesize data, ignoring the lack of diversity of synthesized data, which leads to the inability of the student network to learn the real data distribution and low robustness. In this paper, we propose a Diversity-Enhanced Data-Free Knowledge Distillation (DE-DFKD) method based on the idea of generative image modelling, which introduces conditional generative networks and metric learning to solve the problem of class imbalance and single intra-class data distribution in synthetic datasets. The experimental results show that DE-DFKD synthesizes better quality data on MNIST, CIFAR-10, and CIFAR-100 datasets with Frechet Inception Distance (FID) values of 51.79, 60.25, and 50.1, respectively, and higher accuracy of student networks compared with existing schemes.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Tasmanian Devil Optimization algorithm based efficient task scheduling for big data application in a cloud computing environment","authors":"Ashis Kumar Mishra, Subasis Mohapatra, Pradip Kumar Sahu","doi":"10.1007/s11042-024-19887-1","DOIUrl":"https://doi.org/10.1007/s11042-024-19887-1","url":null,"abstract":"<p>One of the most difficult issues in cloud computing is scheduling tasks on appropriate resources on the cloud. This is significant because multiple tasks may need to be efficiently scheduled across different virtual machines to maximize resource utilization and minimize makespan. As a result, various efforts have been made to use metaheuristic algorithms to tackle the task scheduling problem. However, these techniques may occasionally converge prematurely and become trapped in local optima. This research proposes a multi-objective-based task scheduling in cloud computing for big data applications to address these issues. To accomplish this goal, the adaptive Tasmanian Devil Optimization (ATDO) method is created in this study, with a focus on resolving challenging optimization issues. Following that, the opposition-based learning technique (OBL) is combined with TDO to maintain the population diversity and improve convergence toward the optimal solution. In addition, cost, makespan, and resource utilization are taken into account when designing the multi-objective function (MOF). The proposed strategy includes efficient solution representation, efficient fitness function derivation, TDO, and OBL operators.
The effectiveness of the strategy is examined using several evaluation metrics, and its efficacy is compared with those of other approaches. The proposed method takes a minimum time of 2134 ms to schedule 1000 tasks, with a degree of imbalance of 20.97.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parkinson's disease diagnosis by voice data using particle swarm optimization-extreme learning machine approach","authors":"Musatafa Abbas Abbood Albadr, Masri Ayob, Sabrina Tiun, Raad Z. Homod, Fahad Taha AL-Dhief, Mohammed Hasan Mutar","doi":"10.1007/s11042-024-20108-y","DOIUrl":"https://doi.org/10.1007/s11042-024-20108-y","url":null,"abstract":"<p>Various speech processing approaches (e.g., acoustic feature extraction techniques) and Machine Learning (ML) algorithms have been applied to diagnosing Parkinson's disease (PD). However, the majority of these studies have used conventional techniques that achieve low accuracy in diagnosing PD and still need further improvement. Particle Swarm Optimization-Extreme Learning Machine (PSO-ELM), one of the most recent and effective ML techniques, could be considered an accurate strategy in the classification process but has not been applied to solve the problem of PD diagnosis. Thus, in order to enhance the precision of PD diagnosis, this study employs the PSO-ELM classifier and examines how well it performs on seven feature extraction techniques (basic features, WT (Wavelet Transform), MFCC (Mel Frequency Cepstral Coefficients), bandwidth + formant, intensity parameters, TQWT (Tunable Q-factor Wavelet Transform), and vocal fold features). The PSO-ELM approach has the capability to <b>a)</b> prevent overfitting, <b>b)</b> solve binary and multi-class classification problems, and <b>c)</b> perform like a kernel-based support vector machine with a neural network structure. Therefore, if the combination of the PSO-ELM classifier and an appropriate feature extraction technique can improve learning performance, this combination can produce an effective method for identifying PD. In this study, the PD voice samples have been taken from the Parkinson’s Disease Classification Benchmark Dataset.
To discover a useful feature extraction technique to couple with the PSO-ELM classifier, we applied PSO-ELM to each extracted feature set using both unbalanced and balanced datasets. According to the experimental results, the MFCC features help the PSO-ELM classifier attain its highest accuracy, up to 97.35% on the unbalanced dataset and 100.00% on the balanced dataset. This shows that combining PSO-ELM with MFCC can improve learning performance, ultimately creating an effective method for identifying PD.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Principal component fusion based unexposed biological feature enhancement of fundus images","authors":"Neha Singh, Ashish Kumar Bhandari","doi":"10.1007/s11042-024-20110-4","DOIUrl":"https://doi.org/10.1007/s11042-024-20110-4","url":null,"abstract":"<p>In the field of ophthalmology, digital images play an important role in the automatic detection of various kinds of eye diseases. Image enhancement is the first stage in assisting ophthalmologists with diagnosis. As a result, various algorithms and methods for the enhancement of retinal images have been developed, which may face obstacles that are common in augmentation processes, such as false edges and weak illumination that obscure image details. To eliminate such issues, this paper proposes a novel framework for unexposed retinal images. The proposed method uses a multiscale Gaussian function to estimate the illumination layer of an unexposed color retinal image, which is then corrected by the gamma method. Next, principal component analysis (PCA) is used to generate a fused enhanced result for unexposed retinal images. Then, a contrast-limited technique is employed to further improve edges and contextual details. When compared to several enhancement-based state-of-the-art procedures, experimental results show that the suggested method produces results with good contrast and brightness.
The significance of the proposed method is that it may help ophthalmologists screen for unexposed retinal illnesses more efficiently and build better automated image analysis for healthcare diagnosis.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}