Accurate tumor diagnosis is crucial for improving treatment outcomes. To precisely delineate the nuclei of tumor cells in hematoxylin and eosin (H&E) stained tissue images while reducing computational overhead, we propose a novel encoder-decoder architecture named Convolution and focused linear attention fusion UNet (CLA-UNet), which integrates depthwise separable convolution and convolution-focused linear attention into the U-Net network. The study makes three contributions. First, at the skip connections, a Global–Local Feature Fusion and Split-Input Transformer (GLFS Transformer) block extracts global feature information, which is then passed to the corresponding decoder layers. Second, depthwise separable convolution blocks are used to construct the backbone network, thereby deepening the network. Third, a channel attention module is added to the decoder to focus on important channel information. Experimental results on the public MoNuSeg dataset of tumor cells show that CLA-UNet achieves an IoU, Dice score, precision, and recall of 66.18%, 79.57%, 83.23%, and 76.91%, respectively. The proposed model outperforms the compared segmentation methods while maintaining a very low parameter count and computational cost, and its lightweight design facilitates the adoption of this research in practice.
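For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how IoU, Dice score, precision, and recall are commonly computed from a predicted and a ground-truth binary mask. The function name `segmentation_metrics` and the toy masks are hypothetical and chosen only for illustration. The text above does not state how the scores are aggregated (per image or over the whole dataset), so this sketch should not be read as the evaluation code behind the reported numbers.

```python
def segmentation_metrics(pred, target):
    """Compute IoU, Dice, precision, and recall for two binary masks.

    `pred` and `target` are equally sized 2D lists of 0/1 values
    (hypothetical inputs; 1 marks a nucleus pixel).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # predicted nucleus, truly nucleus
            elif p and not t:
                fp += 1  # predicted nucleus, truly background
            elif t and not p:
                fn += 1  # missed nucleus pixel

    def safe_div(num, den):
        # Return 0.0 instead of raising when a mask is empty.
        return num / den if den else 0.0

    return {
        "iou": safe_div(tp, tp + fp + fn),
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
    }


# Toy 2x3 masks, assumed purely for illustration.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
metrics = segmentation_metrics(pred, target)
```

For a single pair of masks, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU), so the two scores always rank one prediction the same way. Averages taken over many images need not satisfy this identity exactly.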