Achieving highly efficient treatment planning in intensity-modulated radiotherapy (IMRT) is challenging due to the complex interactions between radiation beams and the human body. The introduction of artificial intelligence (AI) has enabled automated treatment planning, significantly improving efficiency. However, existing automatic treatment planning agents often rely on supervised or unsupervised AI models that require large volumes of high-quality patient data for training. Additionally, these networks are generally not universally applicable to patient cases from different institutions and can be vulnerable to adversarial attacks. Deep reinforcement learning (DRL), which mimics the trial-and-error process used by human planners, offers a promising approach to address these challenges.
This work aims to develop a stochastic policy-based DRL agent for automatic treatment planning that trains effectively on limited datasets, applies universally across diverse patient datasets, and performs robustly under adversarial attacks.
We employ an actor–critic with experience replay (ACER) architecture to develop the automatic treatment planning agent. The agent operates the treatment planning system (TPS) for inverse treatment planning by automatically tuning treatment planning parameters (TPPs). We use prostate cancer IMRT patient cases as our testbed; each case includes one target and two organs at risk (OARs), and the agent selects among 18 discrete TPP tuning actions. The network takes dose–volume histograms (DVHs) as input and outputs a stochastic policy for effective TPP tuning, accompanied by an evaluation function for that policy. Training uses DVHs from treatment plans generated by an in-house TPS under randomized TPPs for a single patient case, with validation conducted on two other independent cases. Both online asynchronous learning and offline, sample-efficient experience replay are employed to update the network parameters. After training, six groups comprising more than 300 initial treatment plans drawn from three datasets are used for testing; these groups have beam and anatomical configurations distinct from those of the training case. The ProKnow scoring system for prostate cancer IMRT, with a maximum score of 9, is used to evaluate plan quality. The robustness of the network is further assessed through adversarial attacks using the fast gradient sign method (FGSM).
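To make the setup concrete, the sketch below illustrates one way such an actor–critic network could map a DVH observation to a stochastic policy over the 18 discrete TPP tuning actions, alongside a per-action value head of the kind ACER evaluates. The class name, layer sizes, and DVH discretization (num_bins) are illustrative assumptions, not the architecture reported here.

```python
import torch
import torch.nn as nn

class ACERPlanningAgent(nn.Module):
    """Minimal sketch of an actor-critic network for TPP tuning.

    Assumptions: DVHs for 3 structures (1 target + 2 OARs) are each
    sampled at `num_bins` dose points; layer sizes are illustrative.
    """

    def __init__(self, num_structures=3, num_bins=100, num_actions=18):
        super().__init__()
        # Shared encoder over the concatenated DVH curves.
        self.encoder = nn.Sequential(
            nn.Linear(num_structures * num_bins, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
        )
        # Actor head: stochastic policy over discrete TPP tuning actions.
        self.policy_head = nn.Linear(128, num_actions)
        # Critic head: per-action values Q(s, a), as used in ACER.
        self.q_head = nn.Linear(128, num_actions)

    def forward(self, dvh):
        h = self.encoder(dvh.flatten(start_dim=1))
        policy = torch.softmax(self.policy_head(h), dim=-1)
        return policy, self.q_head(h)

# Sampling a TPP tuning action from the stochastic policy:
agent = ACERPlanningAgent()
dvh = torch.rand(1, 3, 100)                        # placeholder DVH batch
policy, q_values = agent(dvh)
action = torch.multinomial(policy, num_samples=1)  # stochastic action choice
```

A stochastic policy of this form is what allows ACER to combine asynchronous on-policy updates with importance-weighted replay of off-policy trajectories.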
Despite being trained on treatment plans from a single patient case, the network converges efficiently when validated on the two independent cases. In testing, the mean ± standard deviation of the plan scores across all test cases before ACER-based treatment planning is . After ACER-based treatment planning, of the cases achieve a perfect score of 9, only score between 8 and 9, and none score below 7; the corresponding mean ± standard deviation is . This performance highlights the ACER agent's generality across patient data from various sources. Further analysis indicates that the agent assigns reasonable TPP tuning actions probabilities several orders of magnitude higher than obviously unsuitable ones, demonstrating its efficacy. Additionally, results from FGSM attacks show that the ACER-based agent remains comparatively robust under various levels of perturbation.
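For reference, FGSM perturbs an input $x$ along the sign of the loss gradient, $x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(x))$. A minimal sketch of such an attack on the policy network above follows; the loss choice (negative log-probability of the currently preferred action) and the clamp of DVH values to [0, 1] are illustrative assumptions, as they are not specified here.

```python
import torch

def fgsm_perturb(agent, dvh, epsilon):
    """Minimal FGSM sketch against the policy network.

    Assumed loss: negative log-probability of the action the policy
    currently prefers, so the perturbation degrades that preference.
    """
    dvh = dvh.clone().detach().requires_grad_(True)
    policy, _ = agent(dvh)
    preferred = policy.argmax(dim=-1, keepdim=True)
    loss = -torch.log(policy.gather(-1, preferred)).mean()
    loss.backward()
    # One step of size epsilon along the sign of the input gradient;
    # clamping assumes DVH values are normalized volume fractions.
    adversarial = dvh + epsilon * dvh.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Probing robustness at increasing perturbation levels:
for eps in (0.01, 0.05, 0.1):
    adv_dvh = fgsm_perturb(agent, dvh, eps)
```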
We successfully trained a DRL agent using the ACER technique for high-quality treatment planning in prostate cancer IMRT. The agent generalizes well across diverse patient datasets and remains robust against adversarial attacks.