Recent advancements in deep learning algorithms have significantly expanded the capabilities of systems to handle vision-to-language (V2L) tasks. Visual question answering (VQA) presents challenges that require a deep understanding of visual and language content to perform complex reasoning tasks. The existing VQA models often rely on grid-based or region-based visual features, which capture global context and object-specific details, respectively. However, balancing the complementary strengths of each feature type while minimizing fusion noise remains a significant challenge. This study propose a multi-scale dual-stream visual feature extraction method that combines grid and region features to enhance both global and local visual feature representations. Also, a visual graph relational reasoning (VGRR) approach is proposed to further improve reasoning by constructing a graph that models spatial and semantic relationships between visual objects, using Graph Attention Networks (GATs) for relational reasoning. To enhance the interaction between visual and textual modalities, we further propose a cross-modal self-attention fusion strategy, which enables the model to focus selectively on the most relevant parts of both the image and the question. The proposed model is evaluated on the VQA 2.0 and GQA benchmark datasets, demonstrating competitive performance with significant accuracy improvements compared to state-of-the-art methods. Ablation studies confirm the effectiveness of each module in enhancing visual-textual understanding and answer prediction.