Deep learning-based microseismic source localization methods predict the source location directly from recorded microseismic data and have shown high accuracy and efficiency. These methods fall into two main categories: coordinate prediction methods and heatmap prediction methods. Coordinate prediction methods output only a source coordinate and generally provide no measure of confidence in the predicted location. Heatmap prediction methods assume that the microseismic source lies on a grid point; as a result, they tend to provide lower-resolution information, and the localization results may lose precision. This study reviews and compares existing deep learning-based source localization methods. To address their limitations, we devise a network that fuses a convolutional neural network and a Transformer to locate microseismic sources. We first introduce a multi-modal heatmap that combines a Gaussian heatmap with an offset coefficient map to represent the source location. The offset coefficients correct the source location predicted by the Gaussian heatmap, so that the source is no longer confined to grid points. We then propose a fusion network to accurately estimate the source location. A gated multi-scale feature fusion module is developed to efficiently fuse features from the different branches. Experiments on synthetic and field data demonstrate that the proposed method yields highly accurate localization results. A comprehensive comparison with coordinate prediction and heatmap prediction methods shows that the proposed method outperforms the other methods.
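
The idea of correcting a Gaussian heatmap peak with offset coefficients can be illustrated with a minimal sketch. The encoding below is an assumption made for illustration, not the definition used in this study: the offset is taken to be the fractional distance, in grid cells, between the true source and its nearest grid node, and the function names `encode_source` and `decode_source`, the 2D grid, and the value of `sigma` are all hypothetical.

```python
import math


def encode_source(src, spacing, shape, sigma=1.5):
    """Encode a 2D source position as a Gaussian heatmap plus offset maps.

    Assumed (hypothetical) convention: the offset is the fractional distance,
    in grid cells, between the true source and its nearest grid node.
    """
    nx, nz = shape
    # Nearest grid node, clamped to the grid extent.
    ix = min(max(math.floor(src[0] / spacing + 0.5), 0), nx - 1)
    iz = min(max(math.floor(src[1] / spacing + 0.5), 0), nz - 1)
    # Sub-grid remainder in units of grid cells.
    off = (src[0] / spacing - ix, src[1] / spacing - iz)
    # Gaussian heatmap with its single peak (value 1.0) at the nearest node.
    heat = [
        [math.exp(-((i - ix) ** 2 + (k - iz) ** 2) / (2.0 * sigma ** 2))
         for k in range(nz)]
        for i in range(nx)
    ]
    # Offset maps; only the value at the heatmap peak is read when decoding.
    off_x = [[off[0]] * nz for _ in range(nx)]
    off_z = [[off[1]] * nz for _ in range(nx)]
    return heat, off_x, off_z


def decode_source(heat, off_x, off_z, spacing, use_offset=True):
    """Recover the source position from the heatmap peak and, optionally, the offsets."""
    nx, nz = len(heat), len(heat[0])
    # Grid node with the highest heatmap value.
    ix, iz = max(
        ((i, k) for i in range(nx) for k in range(nz)),
        key=lambda p: heat[p[0]][p[1]],
    )
    dx = off_x[ix][iz] if use_offset else 0.0
    dz = off_z[ix][iz] if use_offset else 0.0
    return ((ix + dx) * spacing, (iz + dz) * spacing)


# Example: a source at (123.4 m, 267.8 m) on a 50 x 50 grid with 10 m spacing.
heat, off_x, off_z = encode_source((123.4, 267.8), 10.0, (50, 50))
grid_only = decode_source(heat, off_x, off_z, 10.0, use_offset=False)
corrected = decode_source(heat, off_x, off_z, 10.0)
```

In this example the grid-only decode snaps to the nearest node, (120.0, 270.0), while the offset-corrected decode recovers the original position to within floating-point error. In a trained network the heatmap and offset maps would be predicted rather than computed from a known source, so the correction would only be as good as the predicted offsets.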