2.1. Ice Segmentation
Considerable effort has been devoted to ice segmentation based on remote sensing images. Most existing methods adopt satellite images captured by different sensors, such as the Moderate Resolution Imaging Spectroradiometer (MODIS), the Advanced Very High Resolution Radiometer (AVHRR) and synthetic aperture radar (SAR). These methods fall into three groups: traditional threshold methods, methods based on traditional machine learning and methods based on neural networks.
Traditional threshold methods. Selkowitz and Forster (2016) [13] classified pixels as ice or snow by calculating the normalized difference snow index (NDSI) from Landsat TM and ETM+ images to automatically map persistent ice and snow cover (PISC) across the western U.S. Liu et al. (2016) [14] employed a straightforward threshold method on visible and infrared satellite images to identify sea and freshwater ice and estimate ice concentration. Su et al. (2013) [15] proposed an effective gray level co-occurrence matrix (GLCM) texture analysis approach based on ratio-threshold segmentation for Bohai Sea ice extraction from MODIS 250 m imagery; experiments showed that it is more reliable for sea ice segmentation than the conventional threshold method. In addition, Engram et al. (2018) [16] applied a threshold to log-transformed SAR imagery to discriminate bedfast ice from floating ice across Arctic Alaska. Beaton et al. (2019) [17] presented a calibrated-thresholds approach that classifies pixels as snow/ice, mixed ice/water or open water using MODIS satellite imagery.
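As a minimal sketch of this family of methods, the snippet below computes the NDSI from green and shortwave-infrared reflectance bands and applies a fixed cutoff. The NDSI formula is standard; the 0.4 threshold and the synthetic band arrays are illustrative assumptions, not the calibrated values of the cited studies.

```python
import numpy as np

def ndsi_ice_mask(green: np.ndarray, swir: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """Flag pixels as ice/snow via the normalized difference snow index.

    NDSI = (green - SWIR) / (green + SWIR); pixels above the threshold
    are labeled ice/snow. The 0.4 cutoff is a common illustrative
    default, not a value taken from the cited work.
    """
    green = green.astype(np.float64)
    swir = swir.astype(np.float64)
    ndsi = (green - swir) / (green + swir + 1e-12)  # epsilon avoids division by zero
    return ndsi > threshold

# Toy example on synthetic reflectance data (two 4x4 "bands")
green_band = np.random.rand(4, 4)
swir_band = np.random.rand(4, 4)
mask = ndsi_ice_mask(green_band, swir_band)
print(mask)
```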
Methods based on traditional machine learning. With the development of machine learning, Deng and Clausi (2005) [18] proposed a novel Markov random field (MRF) model to segment SAR sea ice imagery, using a function-based parameter to weight the two MRF components and thereby achieving unsupervised segmentation. Dabboor and Geldsetzer (2013) [19] applied a supervised maximum likelihood (ML) classification approach to SAR imagery of the Canadian Arctic to classify ice cover as first-year ice (FYI), multiyear ice (MYI) and open water (OW). Chu and Lindenschmidt (2016) [7] adopted fuzzy k-means clustering to classify river ice cover as open water, intact sheet ice, smooth rubble ice and rough rubble ice by integrating MODIS and RADARSAT-2 imagery. Romanov (2017) [20] proposed a decision-tree approach to detect ice in AVHRR data.
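To make the clustering branch of this family concrete, the sketch below groups the pixels of a single-band scene into a chosen number of cover classes. Plain k-means stands in for the fuzzy k-means of [7], and the synthetic backscatter image and class count are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_ice_cover(image: np.ndarray, n_classes: int = 4) -> np.ndarray:
    """Unsupervised segmentation of a single-band image into n_classes
    clusters (e.g., open water, sheet ice, smooth/rough rubble ice).

    Plain k-means stands in for the fuzzy k-means used in the cited
    work; cluster labels still need to be mapped to ice types by hand.
    """
    h, w = image.shape
    pixels = image.reshape(-1, 1).astype(np.float64)  # one feature per pixel
    labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(pixels)
    return labels.reshape(h, w)

# Synthetic backscatter image standing in for a RADARSAT-2 scene
scene = np.random.rand(64, 64)
label_map = cluster_ice_cover(scene)
```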
Methods based on neural networks. In traditional machine learning techniques, most of the features used must be identified by a domain expert in order to reduce the complexity of the data and make patterns more visible to the learning algorithm. In contrast, a neural network, especially a deep neural network, has strong mapping and generalization abilities: it can self-organize, self-learn and fit an arbitrary nonlinear relationship between a dependent variable and independent variables without an accurate mathematical model. Given sufficient high-quality labeled data, a deep neural network can extract features from data efficiently and incrementally. Karvonen (2004) [21] presented an approach based on pulse-coupled neural networks (PCNNs) for segmentation and classification of Baltic Sea ice in SAR images. With the wide application of CNNs, Wang et al. (2016) [22] used a basic deep convolutional neural network (CNN) to estimate ice concentration from dual-pol SAR scenes collected during the melt season. Notably, Singh et al. (2019) [23] used CNN-based semantic segmentation models (e.g., UNet [24], SegNet [25], DeepLab [26] and DenseNet [27]) to segment river ice images into water and two distinct types of ice (frazil ice and anchor ice). Their approach produced fairly good results and improved accuracy over previous methods based on support vector machines (SVMs), which indicates the promise of deep convolutional neural networks for ice detection and segmentation.
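A minimal PyTorch sketch of this supervised setup follows: a tiny stand-in network (not the architectures of [23]) is trained with pixel-wise cross-entropy on three classes, water, frazil ice and anchor ice. The model, tensors and hyperparameters are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the segmentation networks of [23] (UNet, SegNet, ...):
# any model mapping (N, 3, H, W) images to (N, 3, H, W) class scores fits.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=1),  # 3 classes: water, frazil ice, anchor ice
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(2, 3, 64, 64)         # placeholder river ice images
labels = torch.randint(0, 3, (2, 64, 64))  # placeholder pixel-wise labels

for step in range(10):                     # toy training loop
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```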
2.2. Semantic Segmentation Based on a Deep Convolutional Neural Network
Semantic segmentation is a fundamental task that has shown great potential in a number of applications, such as scene understanding, autonomous driving and video surveillance. Moreover, the demands of practical tasks (e.g., land classification and change detection) make semantic segmentation essential in remote sensing. The fully convolutional network (FCN) [28] was the pioneering work that replaced the fully connected layers at the end of a classification model with convolutional layers, introducing a new way of thinking and a solution for semantic segmentation. Since then, semantic segmentation models based on FCNs have constantly emerged. They generally fall into four categories: encoder-decoder structures, dilated convolutions, spatial pyramid pooling and recurrent neural networks.
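The sketch below illustrates the FCN idea on a toy backbone (an assumption, not the VGG-based model of [28]): a 1×1 convolution replaces the former fully connected classifier, so the network outputs a coarse spatial score map that is then upsampled to input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Illustrative FCN: conv backbone + 1x1 conv 'classifier' head.

    The 1x1 convolution plays the role of the former fully connected
    layer, so the output is a per-class score map rather than a single
    vector; bilinear upsampling restores the input resolution.
    """
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # downsample by 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # downsample by 4 overall
        )
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        scores = self.classifier(self.backbone(x))
        return F.interpolate(scores, size=x.shape[2:], mode="bilinear",
                             align_corners=False)

out = TinyFCN()(torch.randn(1, 3, 128, 128))  # -> (1, 21, 128, 128)
```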
Encoder-decoder architectures. Encoder-decoder structures based on FCNs were proposed to recover high resolution representations from low or mid resolution representations. SegNet (2017) [25] reused the maximum indices from the encoder pooling layers instead of copying the features directly, introducing more encoding information and improving segmentation resolution. Similar to SegNet, U-Net (2015) [24] had a more structured architecture and obtained better results by concatenating the feature maps of each encoder layer into the decoder. RefineNet (2017) [29] designed a RefineNet module that integrates high resolution features with low resolution features in a stage-wise refinement manner using three components: the residual conv unit (RCU), multiresolution fusion and chained residual pooling. GCN (2017) [30] used large convolution kernels, decomposing a large k×k kernel into 1×k and k×1 kernels to balance the contradictory demands of localization and classification. Gao et al. (2019) [31] proposed a method to extract roads from optical satellite images using an encoder-decoder architecture with a deep residual convolutional neural network. Fuentes-Pacheco et al. (2019) [32] presented a convolutional neural network with an encoder-decoder architecture to address the problem of fig plant segmentation. El Adoui et al. (2019) [33] proposed different encoder-decoder CNN architectures, based on SegNet [25] and U-Net [24], to automate breast tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging.
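The toy fragment below contrasts the two decoder styles mentioned above: SegNet-style unpooling driven by saved max-pooling indices, versus U-Net-style concatenation of encoder features. The tensor shapes are illustrative assumptions, not the layers of either network.

```python
import torch
import torch.nn as nn

# Encoder side: record where each max value came from
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

features = torch.randn(1, 64, 32, 32)
pooled, indices = pool(features)     # (1, 64, 16, 16) plus argmax positions

# Decoder, SegNet-style: place values back at the remembered positions,
# instead of learning the upsampling or copying full encoder features.
upsampled = unpool(pooled, indices)  # (1, 64, 32, 32), sparse

# Decoder, U-Net-style: concatenate encoder features with decoder features.
skip = torch.cat([features, upsampled], dim=1)  # (1, 128, 32, 32)
```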
Dilated convolution. Unlike encoder-decoder approaches, dilated convolution (2015) [34] introduced a new parameter into the convolution kernel that defines the spacing between kernel values. It was designed to enlarge the receptive field without reducing spatial resolution. That work removed the last two pooling layers from a pretrained VGG (2014) [35] classification network and replaced the subsequent convolution layers with dilated convolutions. DRN (2017) [36] studied the gridding artifacts introduced by dilation and developed an approach to remove them. DeepLabV1 (2014) [37] combined dilated convolution with a fully connected conditional random field (CRF) on top of VGG [35]. In the same way, dilated convolution was applied with ResNet (2016) [38] in DeepLabV2 (2017) [26] and DeepLabV3 (2017) [39]. Fu et al. (2017) [40] improved the density of the output class maps by introducing atrous convolution. DDCMN (2019) [41], the dense dilated convolutions merging network, is a semantic mapping network used to recognize multiscale, complex-shaped objects with similar colors and textures.
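A small sketch of the mechanism: with dilation d, a k×k kernel spans an effective extent of k + (k-1)(d-1), so a 3×3 kernel with dilation 2 covers 5×5 while the output resolution stays unchanged. The channel counts and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

# Standard 3x3 convolution: receptive field of 3x3 per layer.
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)

# Dilated 3x3 convolution with dilation 2: kernel taps are spaced
# 2 pixels apart, so the effective extent is 3 + (3-1)*(2-1) = 5,
# yet the output resolution is unchanged (padding matched to dilation).
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

assert conv(x).shape == dilated(x).shape == x.shape
```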
Spatial pyramid pooling. Spatial pyramid pooling was adopted to aggregate multiscale context information for better segmentation. PSPNet (2017) [42] introduced a pyramid pooling module to assemble multiscale information from different sub-regions. DeepLabV2 (2017) [26] and DeepLabV3 (2017) [39] use atrous spatial pyramid pooling (ASPP) to capture multiscale semantic context; specifically, they apply parallel dilated convolutions with different dilation rates, obtaining better segmentation results. However, later work argued that the ASPP module was not dense enough along the scale axis and that its receptive field was not large enough; DenseASPP (2018) [43] was therefore proposed to connect a group of dilated convolutional layers in a dense way, covering a larger scale range. Chen et al. (2018) [44] combined a spatial pyramid pooling module with an encoder-decoder structure to encode multiscale contextual information and capture sharper object boundaries by gradually recovering spatial information. He et al. (2019) [45] improved the performance of a road extraction network by integrating ASPP with an encoder-decoder network.
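The sketch below shows the parallel-branch idea behind ASPP: several dilated convolutions with different rates run over the same feature map and their outputs are concatenated and fused. The rates (1, 6, 12, 18) echo common DeepLab settings, but the module is a simplified assumption, not the published implementation.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Simplified atrous spatial pyramid pooling: parallel dilated
    convolutions with different rates, concatenated and fused by a
    1x1 convolution. The rates echo DeepLab defaults."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees a different receptive field over the same input
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

y = SimpleASPP(256, 64)(torch.randn(1, 256, 32, 32))  # -> (1, 64, 32, 32)
```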
Recurrent neural networks. Recurrent neural networks (RNNs) have been successfully applied to modeling long temporal sequences; they can exploit long-range dependencies to improve semantic segmentation accuracy. Byeon et al. (2015) [46] used two-dimensional long short-term memory networks (2D LSTMs) to address the problem of pixel-level segmentation. Inspired by the ReNet (2015) [47] architecture, Li et al. (2016) [48] proposed a novel long short-term memorized context fusion (LSTM-CF) model for scene labeling. The DAG-RNN (2016) [49] model was proposed to process DAG-structured data and effectively encode long-range contextual information. Shuai et al. (2017) [50] built a recurrent neural network over a directed acyclic graph to model global contexts and improve semantic segmentation by linking pixel-level and local information together. These recurrent methods capture global relationships implicitly; however, their effectiveness relies heavily on the learning outcome of the long-term memorization.
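To make the recurrent approach concrete, here is a toy ReNet-flavored pass: feature-map rows are treated as sequences for a bidirectional LSTM, then columns, so every position aggregates context from its entire row and column. This is a schematic simplification under assumed shapes, not the architecture of [47] or the other cited models.

```python
import torch
import torch.nn as nn

class RowColLSTM(nn.Module):
    """Toy ReNet-style layer: a bidirectional LSTM sweeps each row,
    then each column, so every pixel sees horizontal and vertical
    long-range context. A schematic simplification, not [47] itself."""
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.row_rnn = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.col_rnn = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(n * h, w, c)
        rows, _ = self.row_rnn(rows)             # sweep left-right per row
        cols = rows.reshape(n, h, w, -1).permute(0, 2, 1, 3).reshape(n * w, h, -1)
        cols, _ = self.col_rnn(cols)             # sweep top-bottom per column
        return cols.reshape(n, w, h, -1).permute(0, 3, 2, 1)  # (N, 2*hidden, H, W)

y = RowColLSTM(16, 32)(torch.randn(1, 16, 8, 8))  # -> (1, 64, 8, 8)
```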