Fall-related disability is prevalent among older adults. This paper introduces a multimodal data fusion detection approach for the early identification of such conditions in everyday settings, enabling prompt intervention. The method uses both video cameras and waist-worn sensors to collect visual and sensor data during human motion. The video-based analysis examines the spatial-temporal characteristics of, and the interrelations among, human skeletal joints; these features are extracted with a spatial-temporal graph convolutional network (ST-GCN) and classified, reaching an accuracy of 73.85\%. The sensor-based analysis examines amplitude and frequency variations in the 3D acceleration and declination data; by combining the Mann-Whitney U test with dynamic time warping (DTW) for finer data differentiation, it reaches an accuracy of 80.77\%. Finally, the paper presents a fusion technique that adopts the shared result for samples on which the two methods agree; for samples on which they disagree, a multi-layer neural network determines the fusion weights of the two data sources, and the weighted combination yields the final assessment. The fusion method raises the accuracy to 91.54\%, markedly surpassing either individual method.
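
A minimal formalization of the decision-level fusion rule sketched above, with notation introduced here purely for illustration (the symbols \(y_v\), \(y_s\), \(p_v\), \(p_s\), \(w_v\), \(w_s\), and \(f_\theta\) are assumptions, not taken from the paper): let \(y_v\) and \(y_s\) be the labels predicted by the video-based and sensor-based analyses, \(p_v(c)\) and \(p_s(c)\) their class scores, and \((w_v, w_s) = f_\theta(\cdot)\) the fusion weights produced by the multi-layer network. The final decision could then be expressed as
\[
\hat{y} =
\begin{cases}
y_v, & \text{if } y_v = y_s,\\[2pt]
\arg\max_{c}\,\bigl( w_v\, p_v(c) + w_s\, p_s(c) \bigr), & \text{if } y_v \neq y_s.
\end{cases}
\]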