1 Introduction

One of the long-term goals of cancer research is to identify prognostic factors that affect patients’ survival time, which in turn allows clinicians to make early decisions on treatment [1]. So far, many biomarkers have been shown to be informative for the prognosis of cancers [2]. For example, quite a number of cancer prognosis models are based on histopathological images, since these images reveal morphological attributes of cells that are closely associated with the progression of cancer. Moreover, it is known that mutations in genes can cause cancer by accelerating cell division rates. Accordingly, many researchers also use genomic data, such as gene expression signatures, to inform cancer prognosis.

Fig. 1.
figure 1

The flowchart of the proposed method.

In many medical applications, it is common to acquire multiple biomarkers to predict the disease status more accurately. Recently, some researchers have also integrated histopathological imaging and gene-level data to predict the clinical outcome of cancer patients [3, 4]. For instance, Yuan et al. [3] combined pathological image data with copy number variation data to devise an integrated predictor of survival for breast cancer patients. Yao et al. [4] developed a novel correlational survival prediction framework for the integration of pathological images and genomic data on different cancer cohorts. All these results indicate that using multiple biomarkers can reveal hidden information that is overlooked when using imaging or genomic data alone, and thus can better predict patients’ clinical outcomes.

Although integrating imaging and genomic features can achieve promising results, simply concatenating them may bring the curse of dimensionality. Thus, feature selection, which can be regarded as the identification of key biomarkers, is commonly used to select useful features. Specifically, the studies in [2, 3] first put the features from different modalities together and then applied a traditional single-modality feature selection method (i.e., LASSO) to discover the key components that affect patients’ survival. Shao et al. [5] exploited the correlation across multi-modal data to select important imaging and genomic biomarkers related to cancer prognosis. However, these feature selection algorithms rely only on the collected multi-modal data and the survival information, and neglect the prior diagnosis information. As a matter of fact, the diagnosis information (i.e., TNM stage) could also provide guidance for predicting the clinical outcome of patients. For example, patients in stage III suffer from more aggressive cancers than those in stage I, and thus generally have higher survival risks.

Based on the above considerations, in this paper, we propose a diagnosis-guided multi-modal feature selection method (DGM2FS) to discover relevant imaging and genomic risk factors for survival analysis. Specifically, based on the task relationship learning approach, our method can automatically derive the correlation between the diagnosis and prognosis tasks to help identify survival-associated features. In addition, we also consider the association between multi-modal data by adding a regularization term that is capable of capturing the inter-correlation between the selected imaging and genomic components. The experimental results on a publicly available lung squamous cell carcinoma (LUSC) dataset demonstrate that the proposed method outperforms conventional methods in terms of prognosis prediction.

Table 1. Demographics and clinical characteristics of the LUSC dataset.

2 Method

We summarize our framework in Fig. 1; it consists of three steps, i.e., feature extraction, the proposed diagnosis-guided multi-modal feature selection method (DGM2FS), and prognostic prediction. We first describe the dataset used in this study.

Dataset: The Cancer Genome Atlas (TCGA) project has generated genomic and imaging data for thousands of tumour samples across more than 30 types of cancers. In this study, we test our method on the lung squamous cell carcinoma (LUSC) cohort derived from TCGA. Specifically, the LUSC dataset contains the H&E stained whole-slide images and gene expression data for 446 patients. For each patient, the corresponding survival information (i.e., survival status, survival time) and the TNM stage information are all available. We show the details of the cohort in Table 1, where censored patients are those who did not experience a death event during the follow-up period, while the non-censored category refers to patients whose survival time from diagnosis to death is known exactly.

Feature Extraction: For the whole-slide images of each patient, 2−8 regions of interest (ROIs) of size 3K by 3K are extracted first. Then, we apply an unsupervised method introduced in [6] to segment the nuclei from these extracted patches. Next, for each segmented nucleus, we extract seven cell-level features, including nuclei area (denoted as area), the major and minor axis length of the cell nucleus, the ratio of major axis length to minor axis length (major, minor, and ratio), and the mean, maximum, and minimum distances (distMean, distMax, and distMin) to its neighbouring nuclei. Finally, for each cell-level feature, a 10-bin histogram and five statistics (i.e., mean, SD, skewness, kurtosis, and entropy) are used to aggregate the cell-level features into patient-level features, yielding a 105-dimensional imaging feature vector for each patient. As to the gene expression data, since the large number of genes poses a challenge to statistical analysis, we use co-expression network analysis [7] to cluster genes into co-expressed modules and summarize each module into an eigengene using singular value decomposition; this process yields 58-dimensional eigengene features.
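As an illustration, the aggregation of cell-level measurements into a patient-level vector can be sketched as follows (a minimal sketch with synthetic values; the function name, histogram range, and random data are our own, not part of the original pipeline):

```python
import numpy as np
from scipy import stats

def aggregate_cell_feature(values, n_bins=10):
    """Aggregate one cell-level feature (one value per nucleus) into a
    patient-level descriptor: a 10-bin histogram plus five statistics."""
    values = np.asarray(values, dtype=float)
    # 10-bin histogram over the observed range, normalized to sum to 1
    hist, _ = np.histogram(values, bins=n_bins)
    hist = hist / hist.sum()
    # five summary statistics: mean, SD, skewness, kurtosis, entropy
    summary = [values.mean(), values.std(),
               stats.skew(values), stats.kurtosis(values),
               stats.entropy(hist + 1e-12)]
    return np.concatenate([hist, summary])  # 15 values per cell-level feature

# 7 cell-level features x 15 values each = 105-dimensional patient vector
rng = np.random.default_rng(0)
cell_features = rng.gamma(2.0, 50.0, size=(7, 500))  # 7 features, 500 nuclei
patient_vec = np.concatenate([aggregate_cell_feature(f) for f in cell_features])
print(patient_vec.shape)  # (105,)
```
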

Multi-modal Feature Selection for Prognosis Prediction: For the derived imaging and eigengene features, we first introduce the multi-modal feature selection (M2FS) algorithm for identifying survival-associated features without the guidance of diagnosis information. Specifically, let \(X = {[{x_1},{x_2},...,{x_N}]^T} = [{X^H},{X^G}] \in {R^{N \times (p + q)}}\). Here, \(X^H \in R^{N \times p}\) and \(X^G \in R^{N \times q}\) correspond to the histopathological imaging data and eigengene data. N is the number of patients, and p and q are the numbers of imaging and eigengene features, respectively. We use a triplet \(({x_i},{t_i},{\delta _i})\) to represent each observation in survival analysis, where \(x_i \in R^{p+q}\) is the patient vector, \(t_i\) is the observed survival time, and \(\delta _i\) is the censoring indicator. Here, \(\delta _i=1\) or \(\delta _i=0\) indicates a non-censored or a censored instance, respectively. Then, the objective function of M2FS is:

$$\begin{aligned} {\min _{w_S}}\sum \limits _{i = 1}^N {{l_S}({x_i})} + \alpha \left\| {{X^H}w_S^H - {X^G}w_S^G} \right\| _2^2 + \beta _S^H{\left\| {w_S^H}\right\| _1} + \beta _S^G{\left\| {w_S^G}\right\| _1} \end{aligned}$$
(1)

In the M2FS model, \({l_S}({x_i}) = {c_i} \left\| {x_i}{w_S} - {t_i} \right\| _2^2\) is a weighted regression term, where \(w_S=[w_S^H,w_S^G] \in R^{p+q}\) and \(c_i\) is the weight of the survival regression loss for each patient, defined as follows:

$$\begin{aligned} {c_i} = {\left\{ \begin{array}{ll} 1, &{} \mathrm {if}\; {\delta _i} = 1\\ 0, &{} \mathrm {if}\; {\delta _i} = 0\; \mathrm {and}\; {x_i}{w_S} - {t_i} \ge 0\\ \sigma , &{} \mathrm {if}\; {\delta _i} = 0\; \mathrm {and}\; {x_i}{w_S} - {t_i} < 0 \end{array}\right. } \end{aligned}$$
(2)

where \(c_i=1\) if \(x_i\) is a non-censored patient. Considering that the actual survival time of a censored instance should be no smaller than its observed time, we define \(c_i\) as 0 if \({x_i}{w_S} - {t_i} \ge 0\). Otherwise, we let \(c_i\) equal a constant \(\sigma \) larger than 1, since the difference between the actual survival time and the estimated survival time is indeed greater than \({t_i}-{x_i}{w_S}\). The second term in Eq. (1) minimizes the distance between the projections of the imaging data, i.e., \({X^H}w_S^H\), and the genomic data, i.e., \({X^G}w_S^G\), so that their inter-correlation can be captured. The third and fourth L1-norm terms select a small number of imaging and eigengene features for the subsequent prognosis prediction.
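The weighting rule of Eq. (2) can be sketched in Python as follows (the function name and toy values are our own):

```python
import numpy as np

def censoring_weights(X, w_S, t, delta, sigma=1.5):
    """Per-patient weights c_i for the survival regression loss (Eq. (2)).

    Non-censored patients (delta=1) get weight 1. For a censored patient,
    an over-estimate (x_i w_S >= t_i) is consistent with censoring and gets
    weight 0; an under-estimate is certainly wrong and gets weight sigma > 1.
    """
    pred = X @ w_S
    c = np.ones(len(t))
    censored = (delta == 0)
    c[censored & (pred >= t)] = 0.0
    c[censored & (pred < t)] = sigma
    return c

X = np.array([[1.0, 0.5], [0.2, 1.0], [0.8, 0.3]])
w_S = np.array([1.0, 1.0])
t = np.array([1.0, 2.0, 0.5])
delta = np.array([1, 0, 0])  # patient 1 non-censored, patients 2-3 censored
print(censoring_weights(X, w_S, t, delta))  # [1.  1.5 0. ]
```
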

Diagnosis-Guided Multi-modal Feature Selection: The above M2FS method utilizes only the multi-modal data and the survival information for prognosis prediction, which overlooks the fact that the available diagnosis information (i.e., TNM stage) is also correlated with patients’ survival. To address this problem, we propose to incorporate the diagnosis information and use the task relationship learning term [8] to automatically discover the relationship between the diagnosis and prognosis tasks, so as to boost the prognosis prediction performance. The task relationship learning term is defined as \(tr(W{M^{ - 1}}{W^T})\) with \(M \ge 0\) and \(tr(M)=1\). Here, \(W = [{w_D},{w_S}] \in {R^{(p + q) \times 2}}\), where \({w_D=[w_D^{H},w_D^{G}]^T}\) and \({w_S=[w_S^{H},w_S^{G}]^T}\) correspond to the linear discriminant functions for the diagnosis and prognosis tasks on the multi-modal data, respectively. \(M^{-1}\) denotes the inverse of the matrix \(M \in R^{2 \times 2}\); M is a task covariance matrix that benefits the learning of W by automatically inducing the correct relationship between the tasks. The constraint \(M \ge 0\) restricts M to be a positive semidefinite matrix, and \(tr(M) = 1\) penalizes the complexity of M. By incorporating the above relationship-induced term, the objective function of the proposed diagnosis-guided multi-modal feature selection (i.e., DGM2FS) method can be formulated as:

$$\begin{aligned} \begin{aligned}&{\min _{W,M}}\sum \limits _{k \in \{S,D\}}{\sum \limits _{i = 1}^N {{l_k}({x_i})+\alpha }}\sum \limits _{k \in \{S,D\}} {\left\| {{X^H}w_k^H - {X^G}w_k^G} \right\| _\mathrm{{2}}^\mathrm{{2}}} \\ \quad \quad \quad \quad&+ \sum \limits _{k \in \{S,D\}} {\sum \limits _{j \in \{H,G\}}{\beta _k^j{{\left\| {w_k^j}\right\| }_1}}}+\gamma tr(W{M^{ - 1}}{W^T})\\ \quad \quad \quad \quad&s.t. \qquad M \ge 0 , tr(M)=1\\ \end{aligned} \end{aligned}$$
(3)

In comparison with the above M2FS model (shown in Eq. (1)), the first term of Eq. (3) further incorporates the empirical loss of the diagnosis task, i.e., \({l_D}({x_i}) = \left\| {x_i}{w_D} - {y_i}\right\| _2^2\), where \(y_i\) indicates the categorical TNM stage of patient i. The second term captures the correlation between the imaging and eigengene data for both the diagnosis and prognosis tasks. The third term enforces some elements of \(w_k^{j}\) to be zero, and thus can select important features for different tasks. In what follows, we develop an efficient optimization algorithm to solve the objective function defined in Eq. (3).
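As a numerical illustration of the task relationship term \(tr(W{M^{-1}}{W^T})\), the sketch below (with toy values of our own, not taken from the dataset) shows that when the diagnosis and prognosis weight vectors point in similar directions, a positively correlated M incurs a smaller penalty than a negatively correlated one:

```python
import numpy as np

def task_relation_penalty(W, M):
    """tr(W M^{-1} W^T): couples the diagnosis and prognosis weight vectors
    through the 2x2 task covariance matrix M."""
    return np.trace(W @ np.linalg.inv(M) @ W.T)

W = np.column_stack([np.array([1.0, 0.5, 0.0]),   # w_D
                     np.array([0.9, 0.4, 0.1])])  # w_S, similar to w_D
M_pos = np.array([[0.5, 0.4], [0.4, 0.5]])    # positively correlated tasks
M_neg = np.array([[0.5, -0.4], [-0.4, 0.5]])  # negatively correlated tasks
# similar weight vectors are penalized less under the positive correlation
print(task_relation_penalty(W, M_pos) < task_relation_penalty(W, M_neg))  # True
```
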

Optimization: We adopt an alternating strategy to optimize W and M in the proposed DGM2FS model. Specifically, given a fixed M, we define \(M^{-1}=[M_{DD},M_{DS};M_{DS},M_{SS}] \in R^{2 \times 2}\). Then, the optimization problem for W is:

$$\begin{aligned} \begin{aligned} {\min _W}&\;{(X{w_S} - T)^T}C(X{w_S} - T) + \left\| X{w_D} - Y \right\| _2^2 + \alpha \sum \limits _{k \in \{S,D\}} {\left\| {{X^H}w_k^H - {X^G}w_k^G} \right\| _2^2}\\ &+ \sum \limits _{k \in \{S,D\}} {\sum \limits _{j \in \{H,G\}}{\beta _k^j{{\left\| {w_k^j}\right\| }_1}}} + \gamma \left( {M_{DD}}\left\| {w_D} \right\| _2^2 + 2{M_{DS}}\,w_D^T{w_S} + {M_{SS}}\left\| {w_S} \right\| _2^2 \right) \end{aligned} \end{aligned}$$
(4)

where \(T=[t_1,t_2,...,t_N]^T \in R^N\) and \(Y=[y_1,y_2,...,y_N]^T \in R^N\) denote the recorded survival times and TNM stage information of the patients in the training set, and \(C \in R^{N \times N}\) is a diagonal matrix whose i-th diagonal element is \(c_i\) as defined in Eq. (2). It is worth noting that Eq. (4) is convex with respect to each \(w_k^{j}(k \in \{S,D\}, j \in \{H,G\})\) in W, and thus can be solved alternately to obtain the optimal W. After W is determined, we follow the method in [8] to obtain the closed-form solution \(M={({W^T}W)^{\frac{1}{2}}}/tr({({W^T}W)^{\frac{1}{2}}})\). Finally, we renew the weight \(c_i\) of each instance according to Eq. (2). These steps are repeated until W and M converge.
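The closed-form M-update can be sketched as follows (a minimal sketch with a toy W; the function name is our own):

```python
import numpy as np
from scipy.linalg import sqrtm

def update_M(W):
    """Closed-form task covariance update: M = (W^T W)^{1/2} / tr((W^T W)^{1/2})."""
    S = sqrtm(W.T @ W).real  # matrix square root of the 2x2 Gram matrix
    return S / np.trace(S)

W = np.array([[1.0, 0.8], [0.5, 0.4], [0.2, 0.3]])  # columns: w_D, w_S
M = update_M(W)
print(np.isclose(np.trace(M), 1.0))             # trace constraint satisfied
print(np.all(np.linalg.eigvalsh(M) >= -1e-10))  # positive semidefinite
```

By construction, this update satisfies both constraints of Eq. (3): the normalization enforces \(tr(M)=1\), and the square root of the Gram matrix is positive semidefinite.
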

Fig. 2.
figure 2

Comparison of different feature selection methods on Concordance Index.

Prognostic Prediction: We build the Cox model [4] for prognosis prediction. Specifically, we first divide all patients into 5 folds, with 4 folds used for training and the remaining one for testing. The Cox proportional hazards model is then built on the selected features in the training set, and we evaluate the prognosis prediction performance on the testing set using the Concordance Index (CI) [4], which measures the fraction of all pairs of patients whose survival risks are correctly ordered. The CI value ranges from 0 to 1, with a larger value indicating better prediction performance.
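A minimal pairwise implementation of the CI is sketched below, assuming the common convention that a pair is comparable when the shorter observed time belongs to a non-censored patient, and that ties in predicted risk count as 0.5 (the paper cites [4] for the exact definition):

```python
import numpy as np

def concordance_index(risk, time, delta):
    """Fraction of comparable patient pairs whose predicted risks are
    correctly ordered: the patient with the shorter observed survival time
    (whose event was observed) should have the higher predicted risk."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if delta[i] == 0:          # a censored patient cannot anchor a pair
            continue
        for j in range(n):
            if time[i] < time[j]:  # pair (i, j) is comparable
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

time  = np.array([2.0, 5.0, 3.0, 8.0])
delta = np.array([1, 0, 1, 1])
risk  = np.array([0.9, 0.2, 0.7, 0.1])  # higher risk -> shorter survival
print(concordance_index(risk, time, delta))  # 1.0
```
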

3 Experimental Results

Experimental Settings: The parameters \(\alpha ,\beta _k^{j}(k \in \{S,D\}, j \in \{H,G\})\) and \(\gamma \) in the proposed DGM2FS are tuned from \(\{0.5,1,1.5\}\), \(\{0.1,0.5,1\}\), and \(\{1,1.5,2\}\), respectively. The parameter \(\sigma \) in Eq. (2) is fixed at 1.5.

Results and Discussion: We compare DGM2FS with the following baseline methods in terms of CI. (1) LASSO-Cox [2]: uses LASSO for variable selection in the Cox model. (2) DLasso-Cox: adds the diagnosis information as a feature to the multi-modal data, then uses the LASSO-Cox model for feature selection. (3) OSCCA [5]: a multi-modal feature selection method for survival analysis without the guidance of diagnosis knowledge. (4) M2FS: a variant of DGM2FS (shown in Eq. (1)) that does not take the diagnosis information into consideration. (5) DGFS: a variant of DGM2FS that omits the second term in Eq. (3), which captures the correlation between imaging and genomic data. The results are shown in Fig. 2. As can be seen from Fig. 2, DGM2FS and its variant DGFS achieve higher CI values than the competing methods. These results clearly demonstrate that incorporating the diagnosis information (i.e., TNM stage) under the task relationship learning framework helps improve the prognosis performance. In addition, we observe that the DGM2FS model provides better prognostic prediction (\(0.795\pm 0.043\)) than the DGFS algorithm (\(0.769\pm 0.039\)), which also shows the advantage of taking the correlation among different modalities into account for feature selection.

Fig. 3.
figure 3

The correlation coefficient matrix between the diagnosis and prognosis tasks for each fold of cross-validation learned by DGM2FS.

In addition, we visualize the learned correlation coefficient matrix (calculated from M in Eq. (3)) between the diagnosis and prognosis tasks in Fig. 3. From Fig. 3, we observe that the diagnosis and prognosis tasks are positively correlated in each fold of cross-validation, which again verifies that adding the diagnosis information can improve the performance of prognosis prediction. Since it is of great importance to identify the biomarkers that affect prognosis prediction, we focus on the biological significance of the selected imaging and eigengene features that appear at least four times in the 5-fold cross-validation. Specifically, our method selects 6 image features, i.e., major_bin6, distMean_bin2, distMin_kurtosis, distMin_bin3, distMean_entropy, and distMin_entropy, most of which are related to the distances among segmented cells. Cells with smaller neighboring distances (i.e., distMean_bin2 and distMin_bin3) usually correspond to cancer cells or lymphocytes that cluster together in the tissue images, which are generally considered key factors affecting the survival of lung cancer patients [9]. As to the genomic features, three eigengenes, i.e., eigengene 10, eigengene 25, and eigengene 29, are identified (details are shown in the Supplementary). The enrichment analysis on eigengene 10 shows that its corresponding module contains 30 genes, 14 of which are enriched in the biological process of immune response, which is consistent with the existing study [10] showing that the immune response plays an important role in the development of lung cancer.

Fig. 4.
figure 4

Comparison of different feature selection methods for patient stratification.

Finally, we also compare the prognostic power of the different approaches by stratifying cancer patients into 3 subgroups with different predicted outcomes, with the experimental results shown in Fig. 4. Specifically, the stage method divides all patients into 3 subgroups according to the TNM stage. For the DGM2FS and M2FS approaches, the k-means clustering algorithm is adopted to aggregate the patients into 3 subgroups based on the selected features. Then, we test whether these 3 subgroups have distinct survival outcomes using the log-rank test [2]. As can be seen from Fig. 4, the proposed DGM2FS method achieves superior stratification performance compared with the competitors. These results further show the promise of incorporating diagnosis information for patient stratification from multi-modal data, which opens up the opportunity for building personalized treatment plans in the course of cancer development.
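The clustering step can be sketched with a minimal Lloyd's k-means on the selected features (our own implementation on synthetic data; the subsequent log-rank test would require a survival analysis package, e.g. lifelines, and is omitted here):

```python
import numpy as np

def kmeans_stratify(X, k=3, n_iter=100, seed=0):
    """Minimal Lloyd's k-means: cluster patients on their selected
    features into k subgroups (the paper uses k=3)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each patient to the nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        new_centers = np.array([X[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# toy example: three well-separated groups of "selected feature" vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(20, 4)) for m in (0.0, 3.0, 6.0)])
labels = kmeans_stratify(X, k=3)
print(len(np.unique(labels)))  # 3
```
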

4 Conclusion

In this paper, we developed DGM2FS, an effective multi-modal feature selection method that can identify survival-associated biomarkers from both histopathological image and gene expression data. The main advantage of our approach is its capability of utilizing the diagnosis information to guide the feature selection process, which leads to more accurate prediction of the clinical outcome for lung squamous cell carcinoma patients. DGM2FS is a general framework and can be easily transferred to other types of cancer or to predicting the response to a specific treatment, which opens up new opportunities in personalized treatment.