1 Introduction

Echo imaging is based on the acoustic pulse-echo measurement: an ultrasound pulse is transmitted, and echo signals are subsequently received. Temporal resolution (TR) is the ability to accurately locate moving structures at any point in time and is determined by the imaging frame rate (FR); more images per second improve TR. In high-end cart-based imaging systems, the frame rate is increased by using specialized beamforming and imaging hardware, or by limiting the imaging field of view (FOV). In mobile point-of-care imaging, given constraints on cost, memory, computational power, and data transmission, full-FOV high frame rate imaging technology is currently unavailable.

Traditional two-dimensional (2D) echo imaging typically operates at a TR of less than 100 Hz. Although these frame rates are adequate to assess cardiac morphology and certain functional aspects, they do not resolve all mechanical events, as some of them are very short-lived [1]. High frame rates make it possible to visualize rapidly moving structures (such as valves) without motion artifacts and to perform velocity and deformation analysis (e.g., tissue Doppler).

There are two classes of approaches to increase the frame rate in echocardiography: the first is based on acquisition schemes, while the second is based on post-processing techniques.

In the first class of approaches, several technical advances in cardiac ultrasound allow data to be acquired at very high frame rates. The main drawback of such high frame rate data acquisition is that it typically results in image quality degradation [1] and increased hardware complexity [2]. Retrospective gating [3], plane wave/diverging wave imaging [4], and multi-line transmit systems [5] are among the methods used in ultrafast imaging. In point-of-care imaging, such high frame rate imaging technology is currently unavailable. In the second class of approaches, various post-processing methods have been developed to avoid the computational costs and complex hardware requirements associated with acquisition schemes. The imaging process remains the same as traditional echo imaging with standard clinical echocardiography equipment, and frame rate up-conversion (FRUC) is performed in post-processing. FRUC is a technique that increases the frame rate of a video by inserting newly generated frames into the original sequence.

Several FRUC algorithms have been proposed that use motion estimation and dictionary learning [2, 6, 7]. Deep learning has also been used for future frame prediction in computer vision [8, 9]. The most recent methods use variational autoencoders to reduce image reconstruction artifacts [10] and an adversarial loss [11] to obtain more realistic results.

In this paper, we propose the first deep-learning-based solution for frame rate up-conversion in echocardiography that can be used to augment conventional imaging without the need for specialized beamforming and imaging hardware or for limiting the imaging FOV. Notably, our design is robust to variations in heart rate. The proposed technique takes advantage of both variational autoencoders (VAE) and generative adversarial networks (GAN), and conditions the latent space of the VAE by taking into account not only the immediately preceding frames but also the appearance of the end-diastolic and end-systolic frames. Using data from 3,112 patient studies, we demonstrate that the proposed technique can increase the frame rate by 5 times without compromising the imaging FOV, and generate realistic images that are visually indistinguishable from clinically acquired echo data.

2 Methods

We start by explaining how our model generates new echo cine frames, before detailing the training procedure. The future frame \(\hat{\mathbf{x}}_{t}\) is synthesized based on a latent variable \(\mathbf{z}_{t-1}\) and the previous frame \(\hat{\mathbf{x}}_{t-1}\). This process is shown in the red box in Fig. 1. The latent variable \(\mathbf{z}_{t-1}\) is sampled from a prior distribution \(p(\mathbf{z}_{t-1})\) that is learned during the training procedure. The previous frame \(\hat{\mathbf{x}}_{t-1}\) can be either a ground-truth frame (for the initial frames) or the last predicted frame. The recurrent generator network G predicts a sequence of future frames \(\hat{\mathbf{x}}_{1:T}\) using convolutional Long Short-Term Memory (LSTM) [12]. As shown in Fig. 2, predicted pixel-space transformation kernels between the current frame and the next frame are convolved with the input frame to generate the next frame. The training procedure is illustrated in the black box in Fig. 1 and discussed in detail in the following sections.
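For concreteness, the following is a minimal PyTorch sketch of this inference-time rollout. The interfaces of G (which carries a ConvLSTM state) and of the learned prior are illustrative assumptions, not the exact implementation.

```python
import torch

@torch.no_grad()
def generate_sequence(G, x0, prior, T, context_frames):
    """Roll the recurrent generator forward for T steps (illustrative interface).

    x0             -- initial ground-truth frame, shape (B, 1, H, W)
    prior          -- callable returning a sample z ~ p(z) for a given batch size
    context_frames -- list of ground-truth frames used for the first few steps
    """
    frames, state, x_prev = [], None, x0
    for t in range(1, T + 1):
        z = prior(x_prev.size(0))            # z_{t-1} sampled from the learned prior
        x_hat, state = G(x_prev, z, state)   # predict x_t; the ConvLSTM state is carried over
        frames.append(x_hat)
        # condition on ground truth for the first context steps, then feed back predictions
        x_prev = context_frames[t] if t < len(context_frames) else x_hat
    return torch.stack(frames, dim=1)        # (B, T, 1, H, W)
```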

2.1 Variational Autoencoders

To address the challenge of mapping from a high-dimensional input to a high-dimensional output distribution, it is helpful to learn a low-dimensional latent code that represents aspects of the possible outputs not contained in the input image. Intuitively, the latent codes encapsulate any ambiguous or stochastic events that might affect the future. The predictions are conditioned on a set of c context frames, \(\mathbf{x}_{t-c}, ..., \mathbf{x}_{t-1}\) (\(c=1\) for conditioning on one frame). Our goal is to sample from \(p(\mathbf{x}_{t}| \mathbf{x}_{t-c:t-1}, \mathbf{z}_{t-c:t-1})\), which is intractable as it involves marginalizing over the latent variables. We instead maximize the variational lower bound, as in the variational autoencoder [13]. To encode the transitional information between consecutive frames, the encoder E is conditioned on \(\mathbf{x}_{t-1}\) and \(\mathbf{x}_{t}\). Moreover, to encode the volume changes of the cardiac chambers during a cycle, the encoder is also conditioned on the end-diastolic (ED) and end-systolic (ES) frames. This is a conditional variant of the variational autoencoder, which embeds ground-truth frames in the latent code \(\mathbf{z}_{t-1}\). During training, the latent code is sampled from a Gaussian distribution \(\mathcal{N}(\mu_{\mathbf{z}_{t-1}}, \sigma^2_{\mathbf{z}_{t-1}})\) using the reparameterization approach [13]. The reconstruction loss is as follows:

$$\begin{aligned} \mathcal{L}_{R}(G,E) = \mathbb{E}_{\mathbf{x}_{0:T},\ \mathbf{z}_{t}\sim E(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{ED}, \mathbf{x}_{ES})|_{t=0}^{t=T-1}} \bigg[\sum_{t=1}^{T} \big\| \mathbf{x}_t - G(\mathbf{x}_{0}, \mathbf{z}_{0:t-1})\big\|_1 \bigg]. \end{aligned}$$
(1)
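A minimal PyTorch sketch of this term is given below, assuming the encoder E returns the mean and log-variance of the posterior and the generator G has the recurrent interface sketched above; the conditioning schedule is simplified for clarity.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick [13])."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def reconstruction_loss(G, E, x, x_ed, x_es):
    """L1 reconstruction term of Eq. (1) for one cine (a sketch).

    x    -- ground-truth cine, shape (B, T+1, 1, H, W); x[:, 0] is x_0
    x_ed -- end-diastolic frame, (B, 1, H, W)
    x_es -- end-systolic frame,  (B, 1, H, W)
    """
    T = x.size(1) - 1
    loss, state, x_prev = 0.0, None, x[:, 0]
    for t in range(1, T + 1):
        mu, logvar = E(x[:, t - 1], x[:, t], x_ed, x_es)  # posterior over z_{t-1}
        z = reparameterize(mu, logvar)
        x_hat, state = G(x_prev, z, state)
        loss = loss + F.l1_loss(x_hat, x[:, t])
        x_prev = x_hat                                    # feed back the prediction
    return loss
```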

A regularization term encourages the approximate posterior to be close to the prior distribution:

$$\begin{aligned} \mathcal{L}_{KL}(E) = \mathbb{E}_{\mathbf{x}_{0:T}} \bigg[\sum_{t=1}^{T} \mathcal{D}_{KL}\big( E(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{ED}, \mathbf{x}_{ES}) \,\|\, p(\mathbf{z}_{t-1}) \big) \bigg]. \end{aligned}$$
(2)
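If the prior \(p(\mathbf{z}_{t-1})\) is taken to be a standard normal distribution (an assumption made here for illustration), each summand has the usual closed form, sketched below.

```python
import torch

def kl_loss(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for one time step of Eq. (2),
    summed over latent dimensions and averaged over the batch, assuming a
    standard-normal prior p(z_{t-1})."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl.mean()
```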

2.2 Generative Adversarial Networks

We can encourage our model to generate sharper and more realistic frames with the help of GANs. Given a discriminator network D that is trained to distinguish generated videos \(\hat{\mathbf{x}}_{1:T}\) from real videos \(\mathbf{x}_{1:T}\), the generator can be trained to match the distribution of real echo cines using the binary cross-entropy loss:

$$\begin{aligned} \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{\mathbf{x}_{1:T}} \big[\log D(\mathbf{x}_{0:T-1})\big] + \mathbb{E}_{\mathbf{x}_{1:T},\ \mathbf{z}_{t}\sim p(\mathbf{z}_{t})|_{t=0}^{T-1}} \big[\log\big(1 - D(G(\mathbf{x}_{0}, \mathbf{z}_{0:T-1}))\big)\big]. \end{aligned}$$
(3)
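A hedged sketch of how Eq. (3) can be implemented with binary cross-entropy is shown below; the generator term uses the common non-saturating variant (maximizing \(\log D\) of generated clips) rather than the literal \(\log(1-D)\) form, and the discriminator interface is an assumption.

```python
import torch
import torch.nn.functional as F

def gan_loss(D, x_real, x_fake):
    """BCE losses corresponding to Eq. (3) for a video discriminator D.

    x_real, x_fake -- real and generated cine clips, shape (B, T, 1, H, W).
    Returns (d_loss, g_loss): D is updated to minimise d_loss, G to minimise g_loss.
    """
    real_logits = D(x_real)
    fake_logits = D(x_fake.detach())  # block gradients to G during the D update
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    gen_logits = D(x_fake)            # non-saturating generator objective
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```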
Fig. 1. Frame rate up-conversion network.

2.3 Complementary Effect of VAE and GAN

GAN models are capable of generating natural videos under the guidance of learned discriminator networks. However, GANs suffer from mode collapse [14], which can lead the generator to produce a limited variety of samples by finding the image that is most realistic from the discriminator's perspective; in other words, \(\hat{\mathbf{x}}\) becomes independent of \(\mathbf{z}\). On the other hand, VAEs encourage latent variables to be meaningful so that they can make accurate predictions at training time. However, the latent variables used in VAEs are encodings of the ground-truth images, unlike GANs, which are trained with completely random variables. Moreover, the discriminator D does not see results sampled from the prior during training. To combine both approaches (shown in Fig. 1), another discriminator network \(D_{VAE}\) can be introduced to improve the performance of the generator [11]. Note that the same generator network with shared weights is used at every time step. The latent variables in this approach are sampled from the VAE's latent distribution \(q(\mathbf{z}_{t}|\mathbf{x}_{t}, \mathbf{x}_{t-1}, \mathbf{x}_{ED}, \mathbf{x}_{ES})\):

$$\begin{aligned}&\mathcal{L}_{GAN}^{VAE}(G, D_{VAE}) = \mathbb{E}_{\mathbf{x}_{1:T}} \big[\log D_{VAE}(\mathbf{x}_{0:T-1})\big] \nonumber \\&\qquad\qquad + \mathbb{E}_{\mathbf{x}_{1:T},\ \mathbf{z}_{t}\sim q(\mathbf{z}_{t}|\mathbf{x}_{t}, \mathbf{x}_{t-1}, \mathbf{x}_{ED}, \mathbf{x}_{ES})|_{t=0}^{T-1}} \big[\log\big(1 - D_{VAE}(G(\mathbf{x}_{0}, \mathbf{z}_{0:T-1}))\big)\big]. \end{aligned}$$
(4)

Therefore, the final objective of the echo cine series prediction is:

$$\begin{aligned} G^{*}, E^{*} = \arg\min_{G, E}\ \max_{D, D_{VAE}}\ \lambda_{R}\mathcal{L}_{R}(G,E) + \lambda_{KL}\mathcal{L}_{KL}(E) + \mathcal{L}_{GAN}(G,D) \nonumber \\ + \mathcal{L}_{GAN}^{VAE}(G,E,D_{VAE}), \end{aligned}$$
(5)

where \(\lambda_{R}\) and \(\lambda_{KL}\) control the relative importance of each term.
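From the generator/encoder side, the weighted sum can be assembled as in the sketch below, combining the loss helpers sketched above; the lambda values shown are placeholders rather than the settings used in our experiments, and D and \(D_{VAE}\) are updated separately on their own adversarial losses.

```python
def generator_objective(l_rec, l_kl, l_gan, l_gan_vae, lambda_r=1.0, lambda_kl=0.1):
    """Weighted combination of the terms in Eq. (5) minimised by G and E
    (placeholder weights; the discriminators are trained on their own losses)."""
    return lambda_r * l_rec + lambda_kl * l_kl + l_gan + l_gan_vae
```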

2.4 Network Architecture

Figure 2 depicts our generator network G. The network is inspired by the convolutional dynamic neural advection (CDNA) architecture proposed in [8]. The sequence of future frames is predicted by feeding the latent variable \(\mathbf{z}_{t-1}\) and the previous frame \(\hat{\mathbf{x}}_{t-1}\) (either the ground truth or the previously predicted frame) into the network. Latent codes are concatenated along the channel dimension of all the convolutional layers of the network. Each convolutional layer is followed by instance normalization [15] and rectified linear unit (ReLU) activations [16]. Convolutional LSTMs are used to model motion. For each time-step prediction, the network predicts four convolutional kernels to produce a set of transformed frames. The network also predicts a synthesized frame and a compositing mask by passing the final layer output through two convolutional layers with sigmoid and softmax activation functions, respectively. Finally, the transformed frames, together with the synthesized frame and the previous frame, are merged by the mask.
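The compositing step can be summarised by the following sketch, which convolves the previous frame with each predicted kernel and merges the candidates with the softmax mask; the shapes and the per-sample loop are illustrative rather than the exact implementation of [8].

```python
import torch
import torch.nn.functional as F

def cdna_composite(x_prev, kernels, synth_frame, mask):
    """CDNA-style frame composition (illustrative shapes).

    x_prev      -- previous frame, (B, 1, H, W)
    kernels     -- K predicted transformation kernels, (B, K, k, k), spatially normalised
    synth_frame -- directly synthesised frame, (B, 1, H, W)
    mask        -- compositing mask, (B, K + 2, H, W), softmax over the channel dim
    """
    B, K, k, _ = kernels.shape
    transformed = []
    for b in range(B):
        # convolve the previous frame with the K kernels predicted for this sample
        out = F.conv2d(x_prev[b:b + 1], kernels[b].unsqueeze(1), padding=k // 2)  # (1, K, H, W)
        transformed.append(out)
    transformed = torch.cat(transformed, dim=0)                          # (B, K, H, W)
    candidates = torch.cat([x_prev, synth_frame, transformed], dim=1)    # (B, K+2, H, W)
    return (mask * candidates).sum(dim=1, keepdim=True)                  # (B, 1, H, W)
```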

The encoder E is a standard convolutional network, except that the two input images and the ED and ES frames are concatenated along the channel dimension. The architecture is the same as the one used in [14]. For the discriminator, we use a 3D convolutional neural network that operates on all T frames. Both discriminators, D and \(D_{VAE}\), have the same architecture with separate weights. The architecture is inspired by the one used in [14], except that the 2D convolution filters are inflated to 3D.
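An illustrative 3D-convolutional discriminator of this type is sketched below; the layer widths and depth are assumptions, not the exact configuration used in our experiments.

```python
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    """3D-convolutional patch discriminator over clips of shape (B, C, T, H, W);
    illustrative of inflating 2D filters to 3D as described above."""
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm3d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm3d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base * 4, 1, 4, stride=1, padding=0),  # patch-level real/fake logits
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.net(x)
```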

Fig. 2. The detailed generator network.

3 Experiments and Results

We carried out experiments on a set of 2D apical four-chamber (AP4) echo cine series collected from the Picture Archiving and Communication System at Vancouver General Hospital, with ethics approval of the Clinical Medical Research Ethics Board, in consultation with the Information Privacy Office. The dataset consists of 3,112 individual patient studies. Experiments were run by randomly dividing these cases into mutually exclusive patient sets, such that 75% of the cases were used for training and validation, and 25% for testing. These clinical echo cine series cover a range of heart rates (from 47 to 104 beats per minute). The locations of the ED and ES frames in each cardiac cycle were annotated by an expert sonographer. Each cine is temporally down-sampled by a factor of 5, and the model is trained to reconstruct the original cine series.
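The down-sampling step amounts to keeping every fifth frame as input while the full-rate sequence serves as the reconstruction target; a trivial sketch (assuming the cine is stored as a (T, H, W) array):

```python
def make_training_pair(cine, factor=5):
    """Temporally down-sample a cine by `factor`; the model is trained to
    reconstruct the full-rate sequence from the down-sampled one."""
    return cine[::factor], cine
```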

Evaluating the performance of video prediction is a common challenge. The standard quantitative metrics are mean-squared error (MSE), peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Although these standard metrics provide a way to benchmark the proposed method against its counterparts, often, they do not correlate properly with the human preference [17]. Therefore, we also use the learned perceptual image distance metric (LPIPS) [17] to evaluate our method. The LPIPS is calculated by \(\mathcal {L}_2\) distance between deep features of images. Deep features are extracted by employing pre-trained AlexNet.
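In practice, this metric is available through the publicly released lpips package; a minimal usage sketch for our grayscale frames (assuming pixel values in [0, 1]) is:

```python
import torch
import lpips  # pip install lpips; AlexNet-based perceptual metric of [17]

loss_fn = lpips.LPIPS(net='alex')

def lpips_distance(pred, target):
    """LPIPS between predicted and ground-truth frames.

    pred, target -- grayscale frames in [0, 1], shape (B, 1, H, W); the metric
    expects 3-channel inputs scaled to [-1, 1], so the channel is replicated.
    """
    pred3 = pred.repeat(1, 3, 1, 1) * 2 - 1
    target3 = target.repeat(1, 3, 1, 1) * 2 - 1
    with torch.no_grad():
        return loss_fn(pred3, target3).mean()
```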

Table 1. Performance comparison between the proposed, VAE-only and VAE+GAN techniques. The results show that the proposed method performs the best since it provides the lowest LPIPS and the highest SSIM.

Table 1 benchmarks the performance of the proposed method against the VAE-only and VAE+GAN techniques. First, we compare the three methods in terms of the MSE and PSNR metrics. As reported in Table 1, the VAE-only technique achieves the lowest MSE and the highest PSNR. Although this result may suggest that the VAE-only technique performs better than the others, it produces blurry and unrealistic images; a sample result is shown in Fig. 3. This indicates that MSE and PSNR are not good candidates in this application. The reason that the proposed and VAE+GAN techniques score worse on MSE and PSNR than the VAE-only technique is that the GAN prioritizes matching the joint distribution of pixels rather than per-pixel similarity. Our experiments show that LPIPS corresponds better to human preference, as also discussed in [17]. Therefore, to fairly compare the proposed technique against its counterparts, the LPIPS and SSIM metrics must be taken into consideration. As shown in the table, the proposed technique provides the lowest LPIPS and the highest SSIM, meaning that it outperforms the VAE-only and VAE+GAN techniques.

Fig. 3. Qualitative visualization: the prediction of our proposed model in comparison with the VAE-only model. The VAE-only model generates blurry and unrealistic images (video clip 3b and Fig. 3c). In contrast, the proposed method generates images that are visually indistinguishable from real images (video clip 3e and Fig. 3f); the video clips are provided in the supplementary material.

Fig. 4. Detailed comparison between the proposed and VAE+GAN techniques. In terms of LPIPS, MSE, PSNR and SSIM, the proposed technique outperforms its counterpart.

Figure 4 shows a more detailed comparison between the proposed and VAE+GAN techniques. In this figure, the average LPIPS, MSE, PSNR and SSIM metrics are plotted against the prediction time step. As illustrated, all four metrics improve when the proposed technique is employed. This is because the proposed technique conditions the latent space of the VAE on the appearance of the ED and ES frames. Regardless of which technique is used to predict future frames, we expect performance to degrade as the time step increases. This also applies to the proposed technique; however, its rate of performance degradation is slower than that of its counterpart.

4 Conclusion and Future Work

In this paper, we proposed a new frame rate up-conversion technique for echocardiography. The proposed technique takes advantage of both VAE and GAN, and produces realistic frames at a high frame rate that can be used to augment conventional imaging. The proposed technique is robust to variations in heart rate, since its latent space is conditioned not only on the immediately preceding frames but also on the appearance of the end-diastolic and end-systolic frames. Our results show that the proposed technique can increase the frame rate by at least 5 times without any need to limit the imaging field of view. Our comparison to the state of the art on a large patient dataset shows that the proposed approach can reconstruct rapid events in echo, such as the motion of valves, at high temporal resolution.