4.1. Datasets
We evaluated the proposed method on two open benchmarks provided by ISPRS for the 2D semantic labeling challenge [40]. Both datasets provide matched orthophotos and the corresponding hand-labeled ground truth.
Potsdam Benchmark [41]: Potsdam is a typical historic city with many large buildings, neat roads, and heavy traffic. The airborne image dataset consists of true color (R, G, B) orthophotos and corresponding hand-labeled ground truth. Overall, 38 image blocks of size 6000 × 6000 pixels are clipped from the orthophotos, with a ground sampling distance of 5 cm.
Vaihingen Benchmark [42]: Vaihingen is a small village with many detached multi-story buildings, more vegetation cover, and less traffic. The airborne image dataset consists of false color (NIR, R, G) orthophotos and corresponding hand-labeled ground truth. In total, 33 blocks of varying sizes are clipped, with a ground sampling distance of 9 cm. Some examples of the original orthophotos and ground truth provided by the two datasets are shown in Figure 6.
Since 2018, ISPRS has provided ground truth for all image patches. However, to facilitate comparison with other algorithms, our experiments use the same training and test split as the original competition. Specifically, the Vaihingen benchmark uses 16 patches for training and the remaining 17 patches for evaluation, while the Potsdam benchmark uses 24 training patches and 14 test patches. Some DCNN-based methods have achieved high semantic segmentation accuracy using only spectral data; moreover, since a DSM is not available under general circumstances, our experiments use only spectral data to complete the semantic segmentation task.
4.3. Training Details
The training set was preprocessed with a series of data augmentation transformations. The input aerial images have an initial size of 600 × 600 pixels; they are rescaled by a random factor (0.5–2) and then randomly cropped to 300 × 300. Images are randomly flipped along the vertical direction with a probability of 0.5, and a photometric distortion is added. Finally, images are normalized by subtracting the mean value of each channel.
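The augmentation pipeline above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the paper: the resize uses nearest-neighbour sampling to stay dependency-free, the photometric distortion step is omitted, and the function name `augment` is our own.

```python
import numpy as np

def augment(img, rng, out_size=300):
    """Sketch of the described pipeline: random rescale by a factor in
    [0.5, 2], random out_size crop, random vertical flip (p = 0.5),
    then per-channel mean subtraction.  img: (H, W, C) float array."""
    scale = rng.uniform(0.5, 2.0)
    h, w, _ = img.shape
    nh = max(out_size, int(h * scale))
    nw = max(out_size, int(w * scale))
    # nearest-neighbour resize keeps the sketch dependency-free
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    img = img[ys][:, xs]
    # random crop to out_size x out_size
    top = rng.integers(0, nh - out_size + 1)
    left = rng.integers(0, nw - out_size + 1)
    img = img[top:top + out_size, left:left + out_size]
    if rng.random() < 0.5:
        img = img[::-1]          # vertical flip
    # per-channel mean subtraction
    return img - img.mean(axis=(0, 1), keepdims=True)
```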
In our experiments, all neural networks are trained using stochastic gradient descent (SGD) [43] with a momentum of 0.9 and a weight decay of 0.0005. A poly learning rate policy is employed; that is, the learning rate is the product of the base learning rate and (1 − iter/max_iter)^power. In all our experiments, the base learning rate is set to 0.01 and the power to 0.9. The maximum number of iterations is kept the same across all comparison experiments.
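The poly schedule can be written as a one-line function; the maximum iteration count passed in below is only a placeholder for illustration, not the value used in the experiments.

```python
def poly_lr(iteration, max_iter, base_lr=0.01, power=0.9):
    """Poly policy: lr = base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power
```

The schedule starts at the base learning rate and decays smoothly to zero at `max_iter`.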
In the training phase, an auxiliary semantic segmentation loss with a weight of 0.4 is added on the output of stage 4; this auxiliary loss is not used for inference in the testing phase. In general, the auxiliary head aids network convergence and helps avoid overfitting. We initialize part of the network from a 50-layer ResNeSt model pretrained on the Cityscapes dataset and then fine-tune the model on our experimental datasets.
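The weighted combination of the main and auxiliary losses can be sketched as below. This is an illustrative NumPy version assuming both heads are trained with pixel-wise cross-entropy (the loss function is not named explicitly in the text); `training_loss` and `cross_entropy` are our own helper names.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy.  logits: (N, C), labels: (N,) class indices."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def training_loss(main_logits, aux_logits, labels, aux_weight=0.4):
    """Main segmentation loss plus the stage-4 auxiliary loss.
    The auxiliary head is dropped at inference time."""
    return (cross_entropy(main_logits, labels)
            + aux_weight * cross_entropy(aux_logits, labels))
```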
Our experiments are conducted on an Ubuntu 16.04 platform equipped with an Nvidia GeForce 1080Ti. Due to hardware limitations, our batch size is set to 2. It takes about 12 h to complete the training of a standard ResNeSt-50.
4.4. DNL Block Experimental Analysis
The network implementation of the DNL block is briefly shown in Figure 7. For the whitened pairwise term, the key and query are first computed by separate 1 × 1 convolutions, and each is whitened by subtracting its mean value. The key and query tensors are then reshaped and multiplied, and the attention matrix (of size HW × HW) is obtained through a softmax. For the unary term, the attention weights are computed directly through a 1 × 1 convolution and a softmax, and the resulting values are copied and expanded to a size of HW × HW.
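The two attention maps can be sketched in NumPy as follows. This is a simplified single-image illustration, not the paper's implementation: the 1 × 1 convolutions are represented as channel-mixing matrix multiplications, and the weight names `wq`, `wk`, and `wu` are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dnl_attention(x, wq, wk, wu):
    """Attention maps of a disentangled non-local (DNL) block for one
    C x H x W feature map.  wq, wk: (C', C) weights standing in for the
    1x1 query/key convolutions; wu: (1, C) for the unary branch.
    Returns (pairwise, unary), each of shape (HW, HW)."""
    c, h, w = x.shape
    n = h * w
    flat = x.reshape(c, n)                      # flatten spatial dims
    q, k = wq @ flat, wk @ flat                 # 1x1 conv == channel matmul
    # whitened pairwise term: subtract the mean from query and key first
    qw = q - q.mean(axis=1, keepdims=True)
    kw = k - k.mean(axis=1, keepdims=True)
    pairwise = softmax(qw.T @ kw, axis=1)       # (HW, HW), rows sum to 1
    # unary term: one saliency map, copied to every query position
    unary = softmax(wu @ flat, axis=1)          # (1, HW)
    return pairwise, np.repeat(unary, n, axis=0)
```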
We compared the semantic segmentation results of our architecture when the DNL block was inserted at different positions in the pipeline. Each time, a single DNL block is inserted into one stage of the ResNeSt-50 or into the last layer before semantic segmentation. We use only one DNL block in our architecture in consideration of its large memory cost. To preserve the resolution of the feature maps, we apply atrous convolution in the ResNeSt, so the feature maps of the 1st, 2nd, 3rd, and 4th stages are kept at fixed fractions of the input image resolution. To express the feature similarity between any two pixels in the image, the DNL block computes a large (HW × HW) correlation matrix, which brings a large memory cost; in our experiment, this correlation matrix occupies about 800 Mb.
In this set of experiments, we adopt ResNeSt-50 as the backbone, utilize the depth-wise separable ASPP block to extract and fuse multiscale features, and employ the encoder–decoder architecture shown in Figure 1 to maintain and recover the resolution of the feature maps. Note that to avoid introducing interfering factors, we did not incorporate the edge detection task in the network while testing the DNL block. The same data augmentation and model optimization strategies are applied in all training processes.
Table 2 displays the overall accuracy obtained by embedding a DNL block at different positions of the encoder–decoder structure. Inserting the DNL block into any stage of ResNeSt improves model accuracy, and the results for stages 2 and 3 are slightly better than those for stages 4 and 1. A possible explanation is that stage 1 contains less semantic information, while stage 4 has too wide a receptive field to provide accurate spatial information. Notably, adding the DNL block before the last segmentation layer brings the greatest precision improvement over the baseline, presumably because the multiscale fused features in the last layer already provide rich semantic and location information.
To show the effect of the DNL block more vividly, samples of attention maps obtained by both terms of the DNL block are displayed in Figure 8. From the middle three columns, it can be observed that in the whitened pairwise term, pixels belonging to the same class as the query point are assigned higher weights. This is consistent with the physical meaning of the term, which represents the similarity between pixels.
The visualization of the unary term is also clear. As shown in the last column of Figure 8, the unary term is more sensitive to the edge pixels between targets; the weights of these pixels are significantly higher. In addition, we observe that edges inside objects (such as the edge pixels of a "chimney" on a "building") are not assigned extremely high unary weights. This may help determine class extent in semantic segmentation tasks.
4.5. Architecture Ablation Study
We designed a group of ablation experiments to test the effect of each module in the proposed network. Since our goal is to compare the effects of the modules rather than to obtain the most competitive accuracy, we conducted this set of experiments on a 50-layer ResNeSt baseline to reduce the time cost.
We first trained the basic ResNeSt-50 as the baseline and then tested the segmentation accuracy of the network with the edge detection task or the DNL module added, respectively. The effectiveness of the whitened pairwise term and the unary term of the DNL module was further analyzed. Finally, we tested the segmentation accuracy of the proposed architecture, which combines ResNeSt-50, the edge detection task, and the DNL module.
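The edge detection branch needs edge ground truth to train against. One common way to obtain it, shown below as an assumption rather than the paper's stated procedure, is to derive a binary edge map directly from the semantic labels; `label_edges` is a hypothetical helper name.

```python
import numpy as np

def label_edges(labels):
    """Derive a binary edge map from a semantic label map: a pixel is
    marked as an edge if any 4-neighbour has a different class."""
    e = np.zeros_like(labels, dtype=bool)
    e[:-1] |= labels[:-1] != labels[1:]          # neighbour below
    e[1:] |= labels[1:] != labels[:-1]           # neighbour above
    e[:, :-1] |= labels[:, :-1] != labels[:, 1:] # neighbour right
    e[:, 1:] |= labels[:, 1:] != labels[:, :-1]  # neighbour left
    return e
```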
Quantitative analysis: The results of our experiments on the Potsdam and Vaihingen datasets are shown in Table 3 and Table 4. On the Potsdam dataset, the overall accuracy of the baseline (ResNeSt-50) reached 89.34%, and the average F1 score was 87.33%. Adding DNL to the last layer of the pipeline increased the model's OA by 1.1% and the F1 score by 1.81%. Combining the edge detection task increased the OA of the baseline by 1.02% and the F1 score by 1.7%. Adding the DNL module and the edge detection task to ResNeSt-50 simultaneously obtained the highest overall accuracy of 90.82%, with improvements in OA and F1 score of 1.48% and 1.98%, respectively. Focusing on the DNL module, both the unary and whitened pairwise terms improve model accuracy, and the improvement brought by the unary term exceeds that of the whitened pairwise term.
In terms of details, the addition of the DNL module improved the segmentation precision of all classes, while combining the edge detection task had a significantly more positive effect on the accuracy of the building, car, and impervious surface classes. One possible reason is that the edges of buildings and cars are sharper and easier to determine. On the contrary, even for humans, the boundaries between low vegetation and trees are hard to distinguish. Interestingly, this is consistent with the per-class performance of the DNL's unary term, which tends to highlight edge pixels.
On the Vaihingen dataset, the OA of the baseline (ResNeSt-50) reached 87.32%, and the average F1 score reached 87.57%. Adding DNL to the last layer of the pipeline increases the model's OA by 0.82% and the F1 score by 1.25%. The unary term is more effective on buildings, cars, and low vegetation, while the whitened pairwise term has a positive effect on every class and performs best on low vegetation. Among all classes, edge detection improves the accuracy of buildings, low vegetation, and cars the most, similar to the behavior of the unary term. Adding both the DNL module and the edge detection task to ResNeSt-50 obtained the highest overall accuracy of 89.04%, with improvements in OA and F1 score of 1.52% and 2.47%, respectively.
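The OA and per-class F1 figures reported above follow from the standard confusion-matrix definitions; a minimal sketch (helper names are our own):

```python
import numpy as np

def confusion(pred, gt, n_classes):
    """Confusion matrix: rows are ground-truth classes, columns predictions."""
    m = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(m, (gt, pred), 1)
    return m

def overall_accuracy(m):
    """OA: correctly labeled pixels over all pixels."""
    return np.trace(m) / m.sum()

def per_class_f1(m):
    """F1 per class: harmonic mean of precision and recall."""
    tp = np.diag(m).astype(float)
    prec = tp / np.maximum(m.sum(axis=0), 1)
    rec = tp / np.maximum(m.sum(axis=1), 1)
    return 2 * prec * rec / np.maximum(prec + rec, 1e-12)
```

The average F1 score reported in the tables is the mean of `per_class_f1` over the evaluated classes.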
Visualization analysis: We show some test results for the two datasets in Figure 9 and Figure 10; from left to right are the original image, ground truth, and model segmentation results. The field of view gradually increases from top to bottom, from details to the whole scene.
Model information corresponding to each abbreviation is: (1) ResNeSt-50: baseline architecture; (2) RS+Edge: combine edge detection task in the baseline to provide boundary constraints for semantic segmentation; (3) RS+DNLLAST: insert a DNL module into the last layer before semantic segmentation in the baseline; (4) RS+Edge+DNLLAST: combine edge detection task and DNL block in the baseline simultaneously.
Based on the Potsdam dataset, almost all models perform well in car detection and can cover cars relatively completely. The main challenge of building detection is the completeness of the segmentation results, especially for buildings with vegetation growing on the roof. Determining the edges between trees and low vegetation is the main difficulty in the semantic segmentation task.
Edge constraints further optimize the contour lines of each class: the edges are much sharper and the positioning is more accurate, so the structure of targets is more complete. In terms of per-class details, the edge constraint brings the most obvious improvement for buildings and cars: it recovers some missing structures of buildings and cars, and the edges are closer to the ground truth. The edge positioning between low vegetation and trees is also clearer but still insufficient. The DNL attention module enhances the semantic constraints between classes, improves the labeling consistency within targets, and eliminates most salt-and-pepper noise. Overall, the result of “RS+Edge+DNLLAST” has the best visual effect, with the least misclassification in each class and the sharpest target edges.
The visualization results on the Vaihingen dataset are largely consistent with those on Potsdam. Edge constraints play a positive role in determining the edges of buildings and cars; however, they have no obvious effect on trees and impervious surfaces. The DNL attention module obtains better semantic connectivity through the relationship constraints between classes and eliminates part of the salt-and-pepper misclassification in the image. The result of “RS+Edge+DNLLAST” has the best visual effect, with clearer edges and less noise.
In addition, in our experiments we found some inaccuracies in the ground truth, which appear more frequently in the Vaihingen dataset. We think such errors are inevitable in remote sensing image labeling, but they may affect the accurate evaluation of the model. For the Vaihingen dataset, due to the relatively small number of cars, even a few labeling errors can change the accuracy evaluation results. Another affected class is buildings, which suffers from relatively more labeling errors than the other classes. This may explain the gap in the extraction accuracy of cars and buildings between the two datasets.