With the recent emphasis on multimodal neural networks for complex modeling tasks, predicting the outcome of a course of treatment can be framed as a fundamentally multimodal problem. A patient's response to treatment varies with their specific anatomy and the proposed treatment plan; these factors are spatial and closely related. Other factors may also matter, such as non-spatial descriptive clinical characteristics, which can be structured as tabular data. It is critical to give models as comprehensive a patient representation as possible, but inputs with differing data structures are incompatible in raw form, so traditional models that consider these inputs require feature engineering prior to modeling. In neural networks, feature engineering can be integrated into the model itself, under one governing optimization, rather than performed prescriptively beforehand. The native incompatibility of the different data structures must still be addressed, however. Methods that reconcile structural incompatibilities in multimodal model inputs are called data fusion. We present a novel joint early pre-spatial (JEPS) fusion technique and demonstrate that differences in fusion approach can produce significant differences in model performance even when the data is identical.
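To make the structural incompatibility concrete, the following minimal sketch shows one possible way a tabular vector could be made compatible with volumetric channels before any convolution: each scalar is tiled into a constant-valued volume and appended as an extra channel. This tiling mechanism is an assumption chosen for illustration and is not taken from the description of JEPS; the function names (`tile_to_volume`, `fuse_pre_spatial`, `fuse_after_pooling`) and the toy values are hypothetical.

```python
from typing import List

# A single-channel volume indexed as [depth][height][width].
Volume = List[List[List[float]]]


def tile_to_volume(value: float, depth: int, height: int, width: int) -> Volume:
    """Expand one scalar clinical feature into a constant-valued volume."""
    return [[[value] * width for _ in range(height)] for _ in range(depth)]


def fuse_pre_spatial(channels: List[Volume], tabular: List[float]) -> List[Volume]:
    """Hypothetical pre-spatial fusion (an assumption, not the authors' method):
    append one tiled channel per tabular feature so that subsequent
    convolutions see both modalities together."""
    depth = len(channels[0])
    height = len(channels[0][0])
    width = len(channels[0][0][0])
    tiled = [tile_to_volume(v, depth, height, width) for v in tabular]
    return channels + tiled


def fuse_after_pooling(image_features: List[float], tabular: List[float]) -> List[float]:
    """For contrast: concatenate only after the volume has been reduced to a vector."""
    return image_features + tabular


# Toy inputs: two spatial channels (standing in for CT and dose) on a 2x2x2
# grid, plus three made-up, already-normalized clinical features.
ct = tile_to_volume(0.1, 2, 2, 2)
dose = tile_to_volume(0.7, 2, 2, 2)
clinical = [0.5, 1.0, 0.0]

fused = fuse_pre_spatial([ct, dose], clinical)
print(len(fused))      # 5 channels: 2 spatial + 3 tiled clinical
print(fused[2][0][0])  # [0.5, 0.5], first row of the first tiled channel
```

In this sketch the fusion point is the only thing that changes between the two functions; the inputs are the same, which mirrors the claim that fusion approach alone can affect performance.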
To present a novel pre-spatial fusion technique for volumetric neural networks and demonstrate its impact on model performance for pretreatment prediction of overall survival (OS).
From a retrospective cohort of 531 head and neck patients treated at our clinic, we prepared an OS dataset of 222 data-complete cases at a 2-year post-treatment time threshold. Each patient's data included CT imaging, dose array, approved structure set, and a tabular summary of the patient's demographics and survey data. To establish single-modality baselines, we fit both a Cox proportional hazards (CPH) model and a dense neural network on only the tabular data, and we trained a 3D convolutional neural network (CNN) on only the volume data. We then trained five competing architectures for fusion of both modalities: two early fusion models, a late fusion model, a traditional joint fusion model, and the novel JEPS model, in which clinical data is merged upstream of most convolution operations. We used standardized 10-fold cross-validation to directly compare the performance of all models on identical train/test splits of patients, with area under the receiver operating characteristic curve (AUC) as the primary performance metric. We used a two-tailed Student's t-test to assess the statistical significance (p-value threshold 0.05) of any observed performance differences.
The JEPS design scored the highest, achieving a mean AUC of 0.779 ± 0.080. The late fusion model and the clinical-only CPH model scored second and third highest, with mean AUCs of 0.746 ± 0.066 and 0.720 ± 0.091, respectively. The performance differences among these three models were not statistically significant. All other comparison models scored significantly worse than the top-performing JEPS model.
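As a rough plausibility check on the non-significant difference reported above, a two-tailed Student's t-test can be computed from summary statistics alone. The sketch below rests on assumptions that the text does not confirm: that the ± values are standard deviations across the 10 folds, and that the test is the unpaired, pooled-variance form. It is not a reproduction of the authors' analysis, and the helper names are hypothetical.

```python
import math
from typing import Tuple


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on the t density over [0, |t|].
    `steps` must be even."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def student_t_from_summary(
    m1: float, s1: float, n1: int, m2: float, s2: float, n2: int
) -> Tuple[float, int, float]:
    """Unpaired, pooled-variance Student's t-test from means and standard deviations."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se
    return t, df, two_tailed_p(t, df)


# Reported mean AUCs for JEPS and late fusion; n = 10 folds per model is assumed.
t_stat, df, p_value = student_t_from_summary(0.779, 0.080, 10, 0.746, 0.066, 10)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.3f}")
```

Under these assumptions the p-value stays above the 0.05 threshold, which is consistent with the statement that the JEPS and late fusion results do not differ significantly. A paired test on per-fold AUCs could give a different value.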
In our OS evaluation, the JEPS fusion architecture achieved better integration of inputs and significantly improved predictive performance over most of the common multimodal approaches we compared. The JEPS fusion technique is easily applied to any volumetric CNN.