Arms and Hands Segmentation for Egocentric Perspective Based on PSPNet and Deeplab

Sarah, Heverton; Clua, Esteban; Vasconcelos, Cristina Nader

doi:10.1007/978-3-030-49695-1_11

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 12190))

Included in the following conference series:

International Conference on Human-Computer Interaction

Abstract

First-person videos and games are the central camera-positioning paradigms when using Head Mounted Displays (HMDs). In these situations, the user’s hands and arms play a fundamental role in the feeling of self-presence and in the interface. While obtaining their visual image is trivial in Augmented Reality devices or when depth cameras are attached to the HMD, rendering them is not trivial with regular HMDs, such as those based on smartphones. This work proposes the use of semantic image segmentation with Fully Convolutional Networks to separate the user’s hands and arms from a raw image captured by a regular camera positioned for a first-person view. We first create a training dataset of 4041 images and a validation dataset of 322 images, both labeled with arm and no-arm pixels and focused on the egocentric view. Then, based on two important semantic segmentation architectures, PSPNet and Deeplab, we propose a specific calibration for the scenario of hands and arms captured from an HMD perspective. Our results show that PSPNet segments finer detail, while Deeplab achieves the best inference time. Training with our egocentric dataset yields better arm segmentation than training with images from different, more general perspectives.

Supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors also wish to thank Eder de Oliveira for giving the Egohand images, making possible our pixel label creation for training our models.

1 Introduction

According to [38], the degree of immersion in a Virtual Environment (VE) is related to the description of the technology, the realism that a display can show, the number of sensory types, the surrounding quality and how panoramic the virtual reality is. In particular, immersion requires a Virtual Body (VB), which represents the user’s location in the virtual environment and is an essential aspect of the self-presence concept. One way to achieve this goal is to render virtual representations of the human body, in which the hands and arms play a fundamental role because they appear constantly in most first-person views.

Virtual Reality (VR) is currently present in many applications where the user experiences a sense of presence in a virtual environment. Owing to the growing attention that different industries are giving to VR, HMD devices have become common and easy to obtain. Additionally, high-quality portable cameras are available at affordable prices and are being used to record egocentric content, which makes Augmented Reality (AR) applications easy to set up. This content can be used to train techniques that automatically detect users’ arms.

Automatic detection and segmentation of human skin is a topic of broad interest because of its many applications, such as detection of human faces [43], hand gestures [31], pedestrian detection [13] and human pose detection for Virtual Reality environments [29], among many other examples. These applications combine distinct techniques, one of which can be skin image segmentation: the process of separating skin pixels from background pixels. In this work, we aim to segment arms and hands in real time from images taken in the egocentric perspective.

A Convolutional Neural Network (CNN) is a Deep Neural Network based approach that shows impressive results on processing visual signals [41]. The outputs of one layer are connected to the inputs of others, and each layer may represent different features of the image. Typically, earlier layers detect simple patterns, while later layers describe more complex features. These feature detectors are connected only to a small region of the previous layer, which reduces the number of parameters used by the network, in contrast with the fully connected layers typically used in conventional neural networks.

To segment an image with a CNN, one can use convolutional layers from input to output, making the network a Fully Convolutional Network (FCN) [24]. In an FCN, upsampling layers enable pixelwise prediction when the network has subsampling layers, and a skip architecture fuses information from layers that identify simple appearance patterns with information from layers that identify complex semantic patterns. Upsampling layers increase the size of a previous layer’s output, while subsampling layers do the reverse.

Since the advent of the FCN, many researchers have converted other network architectures to be fully convolutional by replacing fully connected layers with convolutional ones. However, applying successive convolutions decreases the spatial dimension of the features, which helps deeper layers achieve better semantic features but reduces the quality of the less semantic results produced by earlier layers. This problem led different authors to look for ways to increase the spatial dimension of features or to carry information from earlier layers to later ones.

These networks were originally trained to segment various classes. To use them for a problem on which they were not trained, they must be trained again with another dataset that represents the new problem. Adapting them to segment human arms and hands in the egocentric perspective requires more images of people in this perspective. Although some public datasets contain this type of image for segmentation, the number of images available is still much smaller than what is available for image classification problems. Therefore, building one more dataset can support further research with a greater number of images. Another problem is that these images must be captured by cameras positioned at the user’s eyes, to better represent the images that will be captured in a mixed reality application that visualizes the user’s arms. Another characteristic of this application is that the user will also move their head, generating camera movement, which means the training images should also reproduce this situation.

The contributions of this paper are as follows. We first generate two datasets composed of egocentric images of human arms. We also use a third dataset, composed of images in different perspectives, to check whether training with egocentric images improves egocentric segmentation. We then select two relevant architectures for semantic segmentation: one based on the PSP network [46] and the other based on Deeplabv3 with Mobilenet [8, 20] and Xception [11]. Lastly, we propose configurations and parameter adjustments for hand and arm segmentation, focused on VR games and interactive experiences that use the first-person-view paradigm of camera positioning.

In the next section, we review related work. We then present our modeling for segmenting human skin images based on the selected architectures. Section 4 describes the specific configuration of our networks. Finally, we present our experiments and results, followed by a summary and a description of future work.

2 Related Works

First Person Vision (FPV) presents multiple challenges: (1) non-static cameras, (2) variable illumination and background conditions, (3) real-time processing, and (4) limited video processing capabilities [3]. To segment hand images in such videos, the segmentation technique must address those challenges. In this section, we present relevant works on segmenting images of human hands in the egocentric view.

Based on traditional computer vision and machine learning techniques, [27] addresses the problem of segmenting human hand images in the egocentric view with a method that runs on a Google Cardboard/HMD application and combines histogram equalization, Gaussian blurring, and its proposed Multi Orientation Matched Filtering (MOMF). Although it achieves real-time inference, its segmentation accuracy is 6.97% worse than that of SegNet [2]. [23] combines global appearance models with a sparse 50-dimensional combination of color, texture and gradient histogram features to segment hands under varying illumination and hand poses.

In the field of deep neural networks, [36] applies the YOLO [33] convolutional neural network to detect hands in a car-driving scenario in order to recognize the driver’s grasp. Although training used a large number of hand images, the focus was not the first-person view.

Auto-Context [40] is a segmentation method in which the segmenter is iterated, and the result of each iteration is added to the original input and used in the next iteration. Another approach [44] develops a convolutional neural network focused on segmenting egocentric images of hands in real time. It differs from Auto-Context by first segmenting a downscaled version of the input image and then passing the upscaled result through a second iteration. An image dataset similar to ours, but smaller, was created with 348 images, 90% of which were used for training and the rest for testing.

DeepLabv3 [8] and PSPNet [46] were not created to work with the egocentric view. Nevertheless, in our research, they presented good results in semantic segmentation with deep neural networks. In this work, they are used to segment images of human hands in the egocentric view.

Depth cameras could also be used for the same purpose as our work. Depth data could be combined with the pixel colors of conventional camera images to help pixel classification algorithms better separate the objects of interest [18, 25, 32, 39].

3 Human Arms Segmentation in Egocentric View with PSPNet and DeepLab

In semantic segmentation, PSPNet [46] and DeeplabV3 [8] achieved state-of-the-art results on the PASCAL VOC Challenge 2012 [14], a challenge that evaluates trained models on a dataset of many object classes from realistic scenes, one of which is person. Segmentation methods that achieve state-of-the-art results on this challenge may therefore be good classifiers of pixels that represent humans or parts of them.

PSPNet and DeeplabV3 also introduced concepts that advanced segmentation with deep neural networks. Because of these achievements, we fine-tune these architectures for binary classification, that is, for separating the pixels of an image that represent a human arm from the pixels that represent the background. We present an overview of these architectures and show how we adjusted them to achieve our particular objective.

3.1 PSPNet

Pyramid Scene Parsing Network (PSPNet) [46] aims to provide a global context prior. It is structured as a hierarchy of pooling operations that extract features at different scales, which are then concatenated to form a final feature map.

The overall architecture is shown in Fig. 1, where the input image passes through a pre-trained ResNet [19] with a dilated network strategy [6, 45] to extract the feature map (Fig. 1(b)). It then goes to a pyramid pooling module (c), composed of 4-level pyramid kernels that extract feature maps at different scales. The output of each kernel has its depth reduced by a 1 \(\times \) 1 convolution and its height and width upscaled by bilinear interpolation, so that each feature map matches the dimensions of the pyramid input feature map. These feature maps are then concatenated with the pyramid input feature map. The final step is a convolution layer that generates the prediction map.
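The data flow of the pyramid pooling module can be illustrated with a small NumPy sketch. It is not the authors' Caffe implementation: the bin sizes (1, 2, 3, 6) and the reduction of each branch to 1/4 of the input channels are taken from the original PSPNet paper rather than from this text, random weights stand in for the learned 1 × 1 convolutions, and nearest-neighbour upsampling replaces bilinear interpolation to keep the code short. All function names are hypothetical.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map into a (C, bins, bins) grid."""
    c, h, w = feat.shape
    out = np.zeros((c, bins, bins), dtype=feat.dtype)
    for i in range(bins):
        # Row range of bin i: floor(i*h/bins) .. ceil((i+1)*h/bins)
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, h, w):
    """Resize a (C, h0, w0) map to (C, h, w) by nearest neighbour (simplification)."""
    _, bh, bw = feat.shape
    rows = (np.arange(h) * bh) // h
    cols = (np.arange(w) * bw) // w
    return feat[:, rows[:, None], cols[None, :]]

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6), seed=0):
    """Concatenate the input map with one pooled, reduced, upsampled branch per level."""
    rng = np.random.default_rng(seed)
    c, h, w = feat.shape
    reduced = c // len(bin_sizes)
    branches = [feat]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(feat, bins)
        # Random matrix standing in for a learned 1x1 convolution (assumption).
        weights = rng.standard_normal((reduced, c))
        squeezed = np.einsum("oc,chw->ohw", weights, pooled)
        branches.append(upsample_nearest(squeezed, h, w))
    return np.concatenate(branches, axis=0)

feat = np.random.default_rng(1).standard_normal((8, 12, 12))
out = pyramid_pooling(feat)
print(out.shape)  # (16, 12, 12): 8 input channels + 4 branches of 2 channels
```

The first 8 output channels are the untouched input, so the final convolution sees both local features and the pooled context at every pixel.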

In our implementation, the network layer “conv6” and all subsequent layers were set to be trained with an output size of 2, which is the number of classes our problem needs: arm and no-arm. ResNet101 [19] (ResNet with 101 layers) was used to extract the feature map that is the input to the pyramid pooling. In this ResNet version, the last layers were converted into dilated convolutions by [46] in order to enlarge the final feature maps, because ResNet was originally designed for classification and not segmentation.

3.2 Deeplab

Deeplab version 3 (DeeplabV3) [8] uses Atrous Spatial Pyramid Pooling (ASPP) [7] and dilated convolutions [6, 45] (or atrous convolutions) to transform ResNet [19] into a semantic segmentation neural network. The ASPP module is composed of dilated convolutions, which capture features at different scales in order to better classify regions of various sizes. The filters of these convolutions contain “holes”: zeros inserted between consecutive filter values, with the spacing set by the atrous rate r. This makes it possible to control how densely feature responses are computed without learning too many parameters. The final layers of ResNet are also composed of dilated convolutions, which enlarge the feature maps before the ASPP module.
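A one-dimensional NumPy sketch may help make the effect of the atrous rate concrete. It is only an illustration under simple assumptions (no padding, stride 1, cross-correlation form as used in deep learning frameworks), and `atrous_conv1d` is a hypothetical helper, not part of Deeplab. With rate r, consecutive filter taps are r samples apart, so a filter with k weights covers (k - 1) r + 1 input samples while still having only k parameters.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """Valid 1-D dilated cross-correlation: taps are `rate` samples apart."""
    k = len(kernel)
    span = (k - 1) * rate + 1           # input samples covered by one output
    out_len = len(signal) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        taps = signal[i : i + span : rate]  # exactly k samples
        out[i] = np.dot(taps, kernel)
    return out

signal = np.arange(10.0)
kernel = np.array([1.0, 0.0, -1.0])     # same 3 weights at every rate

for rate in (1, 2, 3):
    span = (len(kernel) - 1) * rate + 1
    result = atrous_conv1d(signal, kernel, rate)
    print(f"rate={rate} span={span} outputs={len(result)} first={result[0]}")
```

On the ramp input, the output is the difference between samples 2 × rate apart, so the same three weights respond to progressively wider context as the rate grows.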

Figure 3 shows ResNet [19] as a backbone network, modified to fit the Deeplab V3 architecture, where the output stride is the ratio of the input spatial size to the feature-map size and rate is the atrous rate mentioned above. The ASPP (a) is composed of one 1 \(\times \) 1 convolution and three 3 \(\times \) 3 dilated convolutions (all with 256 filters and batch normalization), where the three rate values are hyperparameters and can be adjusted for training and inference. The image pooling operation (b) applies global average pooling to the previous feature map, passes the result through a 1 \(\times \) 1 convolution with 256 filters and batch normalization, and then bilinearly upsamples it to the desired dimension.
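As a quick worked example of what the output stride means in practice, the sketch below computes the feature-map side length for the input sizes used later in this paper (513 for Deeplab, 473 for PSPNet). It assumes that every stride-2 step rounds the size up, as happens with "same"-style padding; `feature_map_size` is a hypothetical helper.

```python
def feature_map_size(input_size, output_stride):
    """Side length after downsampling by output_stride, rounding up."""
    return -(-input_size // output_stride)  # integer ceiling division

configs = [("Deeplab", 513, 16), ("Deeplab", 513, 8), ("PSPNet", 473, 8)]
for name, size, stride in configs:
    side = feature_map_size(size, stride)
    print(f"{name}: {size}x{size} input, output stride {stride} -> {side}x{side}")
```

A smaller output stride keeps a denser feature map, at the cost of more computation in the layers that operate on it.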

DeepLabv3+ [9] (Fig. 2) is an Encoder-Decoder [16] version of DeepLabv3, where the encoder consists of DeepLabv3 with an ASPP module composed of atrous separable convolutions (the combination of dilated convolutions with depthwise separable convolutions [37]). The encoder’s output is the last feature map before the logits and is used as input to the decoder module. Before this feature map enters the decoder module, its spatial size is increased by bilinear interpolation by a factor of 4, and it is then merged with low-level encoder features (feature maps taken before the ASPP, after a 1 \(\times \) 1 convolution that decreases their channels). After merging, 3 \(\times \) 3 convolutions are applied to refine the result, and a second bilinear interpolation by a factor of 4 produces the final spatial size. The best encoder for this architecture in [9] is a modified version of Aligned Inception [12] (or Aligned Xception).

The Xception version used with DeepLabv3+ has more layers than the original [11], with the same entry flow structure and every max pooling operation replaced by a depthwise separable convolution. There are also batch normalization and ReLU activations after each 3 \(\times \) 3 depthwise convolution. Our implementation of this Xception, composed of the ASPP module and the decoder, is called Xception X-65 [9]. MobileNetV2 [35] focuses on decreasing the number of parameters and operations so that it can run in environments with limited processing power and memory, such as mobile devices. Its main components are the inverted residuals and the linear bottlenecks, which make it better than the first version.

Unlike the residual modules of ResNet [19], MobileNetV2 modules first expand the number of input channels with a 1 \(\times \) 1 convolution. The result passes through a depthwise convolution, which decreases the number of parameters. Finally, a 1 \(\times \) 1 convolution is applied to bring the number of channels back to that of the input. This structure is called an inverted residual because it is the opposite of a standard residual [19], which first reduces the number of channels and expands it again at the end.

The linear bottleneck omits the non-linear activation before the addition between the block’s output and the input carried by the skip connection.
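The parameter saving of the inverted residual can be checked with simple arithmetic. The sketch below counts convolution weights only (no biases or batch-normalization parameters) and assumes an expansion factor of 6 and a 3 × 3 depthwise kernel, which are the MobileNetV2 defaults rather than values stated in this paper. Both function names are hypothetical.

```python
def inverted_residual_params(channels, expansion=6, kernel=3):
    """Weights in expand (1x1) + depthwise (kxk) + project (1x1) convolutions."""
    hidden = channels * expansion
    expand = channels * hidden            # 1x1 pointwise expansion
    depthwise = kernel * kernel * hidden  # one kxk filter per hidden channel
    project = hidden * channels           # 1x1 linear projection back
    return expand + depthwise + project

def full_conv_params(in_channels, out_channels, kernel=3):
    """Weights of a regular kxk convolution, for comparison."""
    return kernel * kernel * in_channels * out_channels

channels = 32
hidden = channels * 6
print("inverted residual block:", inverted_residual_params(channels))
print("regular 3x3 conv at the expanded width:", full_conv_params(hidden, hidden))
```

For 32 input channels, the whole block needs 14016 weights, while a single regular 3 × 3 convolution at the expanded width of 192 channels would need 331776, which is why the depthwise step keeps the block light.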

The MobileNetV2 implementation with DeepLabV3 was used in this work. [35] reports that the network version without the ASPP module requires considerably fewer operations. For this reason, we did not make use of the decoder.

In our implementation of these two encoders (Xception X-65 and MobileNetV2), the classification layer weights are not reused, and the last layer only has logits. Moreover, the trained model is exported to work with only two classes, unlike the standard 21 from the original implementation of the network.

4 Convolutional Networks Configuration

For every network configuration, five training sessions were performed in order to measure the evaluation metrics. Each PSPNet configuration was trained for 100K iterations (50 epochs of 2000 gradient passes) and each Deeplab configuration for 30K iterations. The configurations differ in hyperparameters and datasets.

4.1 Image Datasets

Four datasets were used in total; example images from each can be seen in Fig. 4. Two of them (ViolaSet [42] and SFA [5]) were merged to create another (Huskin), which is composed of images in various perspectives. The other two image datasets (Egohand and EgohandVal) were created from images captured in the egocentric view:

ViolaSet [42] (Fig. 4A), which contains 2922 images in various sizes (the number of images remaining after deleting GIFs and broken images from the total). It is composed of pictures of people (body and face) in different places and image sizes, with an uncontrolled background.
SFA [5] (Fig. 4B), which contains 1118 images and is composed of pictures of human faces, with a more controlled background.
Egohand (Fig. 4C), a dataset we created with 4041 images from [30], which contains 320 \(\times \) 240 images of people in egocentric view performing a set of gestures in indoor environments. We generated masks for each image by applying segmentation with Grabcut [34].
EgohandVal (Fig. 4D), a dataset also created in this work and composed of 322 images of size 640 \(\times \) 360, taken from 4 individuals with different skin colors and in environments (indoor and outdoor) that differ from those in Egohand. The gestures are similar to those in Egohand, plus other random ones. It is used as the validation set, with its labels also created by applying Grabcut [34].

By merging the ViolaSet and SFA datasets, we composed the Huskin (human skin) dataset of 4040 images. Its purpose is to train with images of people in different perspectives and to verify whether this generates better models for segmenting egocentric images than training with a dataset composed only of egocentric images (the Egohand dataset). Furthermore, because datasets used to train deep neural networks are usually large (on the order of millions of images) and ours is far smaller, we also apply data augmentation techniques to verify whether they improve training results.

Each network configuration was trained with Huskin and Egohand datasets, which results in two separate experiments. The trained models were validated with EgohandVal. That way we verify if training with Huskin is better than with Egohand.

4.2 Deeplab Training Configuration

Publicly available weights pre-trained on PASCAL VOC 2012 [14] were used to initialize our implementations of DeepLabv3+ with Xception as encoder and DeepLabv3 with MobileNetV2.

For training, the learning rate schedule follows the original paper and uses a “poly” policy: the initial learning rate of 0.0001 is multiplied by \( \left( 1 - \frac{iter}{max\_iter} \right) ^{power} \) with \(power = 0.9\), where iter is the current iteration and \(max\_iter\) is the maximum iteration. Weight decay is set to 0.00004.
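The “poly” policy can be written as a short function. This is a sketch of the formula above, not the Tensorflow implementation; the function name is hypothetical, and the 30K maximum is the Deeplab iteration count stated at the start of this section.

```python
def poly_learning_rate(iteration, max_iter, base_lr=1e-4, power=0.9):
    """Learning rate under the "poly" policy: base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

max_iter = 30_000  # Deeplab training length used in this work
for iteration in (0, 7_500, 15_000, 22_500, 30_000):
    lr = poly_learning_rate(iteration, max_iter)
    print(f"iter {iteration:>6}: lr = {lr:.3e}")
```

The rate starts at 0.0001, decays slightly faster than linearly at first because the power is below 1, and reaches exactly zero at the last iteration.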

Images and ground truth are resized to the crop size of 513 \(\times \) 513 during training and evaluation.

The blocks added to the backbone networks were all trained with batch normalization, and output_stride is set to 16 in every Deeplab network configuration. When Xception is used, the decoder output stride is set to 4, and the ASPP module dilation rates are set to 6, 12 and 18 for every training and evaluation run. During training, we changed the batch sizes for the two encoders, leaving the other parameters as described above; parameters not mentioned are left at their defaults:

MobileNetv2 - depth multiplier of 1; batch normalization with fine-tuning; output stride set to 16, since it is the value that gave better results according to [35].

Xception - the encoder output stride is also set to 16 because it is the value that gives better results in processing and accuracy according to [9]; batch normalization fine-tuning is also used, with a decay of 0.9997, an epsilon (the value used to avoid division by zero when normalizing activations by their variances) of 1e-3, and a gamma multiplier for scaling activations in batch normalization.

Data augmentation is applied during training by randomly scaling input images by a factor from 0.5 to 2.0 and by randomly flipping them horizontally and vertically.
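A minimal NumPy sketch of this augmentation is shown below. It is not the Tensorflow pipeline used for training: the 50% flip probability, the uniform sampling of the scale, and the nearest-neighbour resize are assumptions made for brevity, and the function names are hypothetical. The important point it illustrates is that the image and its label mask must receive exactly the same geometric transform.

```python
import numpy as np

def resize_nearest(arr, scale):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = arr.shape[:2]
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return arr[rows[:, None], cols[None, :]]

def augment(image, mask, rng):
    """Random scale in [0.5, 2.0] plus random horizontal and vertical flips."""
    scale = rng.uniform(0.5, 2.0)
    image, mask = resize_nearest(image, scale), resize_nearest(mask, scale)
    if rng.random() < 0.5:               # horizontal flip (assumed probability)
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:               # vertical flip (assumed probability)
        image, mask = image[::-1], mask[::-1]
    return image, mask

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
mask = np.zeros((240, 320), dtype=np.uint8)
mask[100:200, 50:150] = 1                # toy "arm" region

aug_image, aug_mask = augment(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```

Nearest-neighbour resizing keeps the mask strictly binary; an interpolating resize would be a reasonable choice for the image but would blur label values.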

As the original proposal of DeepLab [8] was made on Tensorflow [1], we also adopted this framework for the fine-tuning.

4.3 PSPNet Training Configuration

Fine-tuning started from weights pre-trained on PASCAL VOC 2012 [14]. The learning rate was fixed at 1e-10, with a momentum of 0.99, a batch size of one image and a weight decay of 0.0005.

Images and their labels are resized to 473 \(\times \) 473 during training and evaluation, which is the size of the network input.

Data augmentation is set to be the same as for Deeplab. To that end, the Tensorflow augmentation operations were reimplemented in Python so that they could be used with Caffe [21], the framework in which PSPNet was originally implemented.

5 Results

In this section, we show the results of our experiments using the different architecture configurations of DeepLab and PSPNet described in Sect. 4. Performance is measured with the following metrics: average accuracy, F-Score [10] (the harmonic mean of precision and recall) and Matthews correlation coefficient (MCC) [26], each reported with its standard deviation over the five training sessions of each model.

We focused on comparing segmentation performance with MCC, although F-Score and accuracy were also captured in order to compare with other works. The best models from the two architectures, according to their MCC, will be used to test performance on GTEA public image dataset.

With our experiments, we show that models trained on the egocentric perspective give better inference results on egocentric images than models trained with images in different perspectives. We also show gains in segmentation speed that make real-time inference possible. Together, these results indicate that our models can be used in VR applications that intend to show users’ arms. At the end of the section, we show images segmented by the best models, validated on our validation dataset and on GTEA.

The number of pixels that belong to the background of an image and the number that belong to a human arm are naturally unbalanced, and for that reason segmentation accuracy alone may hide the real performance of a model. This happens because human arms usually do not occupy a significant area of an image captured in the egocentric perspective, so a model may be accurate on background pixels without being accurate on arm pixels. This problem is widespread in binary classification, which makes it necessary to use other metrics to measure classification performance.

The MCC [26] metric was initially created to evaluate predictions of protein secondary structure, where the value 1 represents an excellent performance and -1 a bad one. Later, because it handles the unbalanced class problem, it came to be used to evaluate binary classification in machine learning [4, 17, 22, 28]. Another alternative to accuracy is the F-Score [10] metric, which is also used to evaluate binary classification, where the value 1 represents an excellent performance and 0 the opposite. These two metrics are used to measure the performance of the models trained in this work. We focus on MCC because it is already used in problems with significant differences in class distribution, as occurs in our research. However, we also report the accuracy and F-Score of the models in order to expose our results for future comparisons and validations.
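The sketch below computes the three metrics from the standard confusion-matrix definitions and shows, on a toy mask with 5% arm pixels, why accuracy alone can be misleading. The data are invented for illustration and are not results from this paper; returning 0 when a denominator is zero is a common convention adopted here as an assumption.

```python
import math
import numpy as np

def confusion_counts(pred, truth):
    """Return (tp, tn, fp, fn) for binary masks, with arm = True."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

truth = np.zeros(100, dtype=bool)
truth[:5] = True                          # 5% arm pixels (toy example)

all_background = np.zeros(100, dtype=bool)
decent = truth.copy()
decent[4] = False                         # one missed arm pixel
decent[50] = True                         # one false alarm

for name, pred in [("all background", all_background), ("decent", decent)]:
    counts = confusion_counts(pred, truth)
    print(f"{name}: acc={accuracy(*counts):.3f} "
          f"f={f_score(*counts):.3f} mcc={mcc(*counts):.3f}")
```

A predictor that labels every pixel as background reaches 95% accuracy on this mask while its F-Score and MCC are both 0, which is the behaviour that motivates reporting MCC.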

The training, evaluation and time inference tests were run on an NVIDIA GPU Tesla P100-SXM2-16 GB, Intel Xeon CPU E5-2698 v4 2.20 GHz, 528 GB of memory. The PSPNet was executed through Caffe NVIDIA Docker container 17.11. Deeplab used Tensorflow NVIDIA Docker container 18.04.

All training and inference with PSPNet were executed with only one GPU, using the configuration described above. Table 3 shows that PSPNet training without data augmentation (named psp_huskin_no_aug or psp_egohand_no_aug, according to the dataset used for training) results in better MCC on the validation set than training with data augmentation (psp_huskin_aug or psp_egohand_aug).

Deeplab Mobilenet was trained with a batch size of 14 (we name it huskin_14b_Mobilenet or egohand_14b_Mobilenet) and Xception with a batch size of 4 (huskin_4b_Xception or egohand_4b_Xception) and 14 (huskin_14b_Xception or egohand_14b_Xception). When the training was executed with a batch of size 4, only one GPU was used. For batch size 14, two GPUs were used at the same time. When running inference, only one GPU was used for both cases of batch size.

Table 6 shows that the Huskin Mobilenet version is better than the Egohand one. However, the best Deeplab configuration is egohand_4b_Xception, with huskin_4b_Xception very close to it. Additionally, as can be observed in Table 3, every PSP trained model has a better MCC than the Deeplab models; those trained with Egohand and no data augmentation had the best results, with a mean MCC of 0.969.

Table 6 also shows that all Deeplab configurations have poor MCC, with values very close to 0, which indicates that the models are not good binary classifiers and predict almost at random.

Qualitative results are shown in Fig. 5, in which (A) is the original image, (B) the ground truth, (C) deeplab_egohand_14b_mobilenet, (D) deeplab_huskin_14b_mobilenet, (E) deeplab_egohand_14b_xception, (F) deeplab_huskin_14b_xception, (G) psp_egohand_no_aug and (H) psp_huskin_no_aug. We can see that the PSP results have finer detail than the Deeplab ones, although the Huskin implementation (H) detects parts of a table as hand, while the Egohand implementation (G) separates the table correctly from the hand (Tables 1, 2, 4 and 5).

Table 1. PSPNet Accuracy on Val set

Table 2. PSPNet F-Score on Val set

Table 3. PSPNet MCC on Val set

Table 4. Deeplab accuracy on Val set

Table 5. Deeplab F-Score on Val set

Table 6. Deeplab MCC on Val set

The better quality of the PSP segmentation can be explained by its number of parameters and layers, which is higher than in all the other implementations discussed. The larger number of parameters yields more meaningful features in the last layers. Its size results from combining the 101 ResNet layers with the pooling pyramid at the end. Another factor that can contribute to this better performance is the size of the feature map that enters the pooling pyramid, which is 1/8 of the original network input size (that is, an output stride of 8, which preserves more detail than the output stride of 16 used in the other network configurations). Besides that, even the ResNet version using the DeepLab architecture, shown in [35], has fewer parameters (58.16M parameters and 81.0B multiplication-add operations) than the original PSP, because it uses depthwise separable convolutions and dilated convolutions in the ASPP module, whose filters can cover different scales without increasing the number of parameters.

The use of data augmentation in every DeepLab configuration could have hurt its results, because the images on which the models were tested do not include cases where the user’s arms appear upside down. Such inverted images are produced by one of the data augmentation operations applied to the training dataset. This may be one more reason why the PSPNet results were better, since the best PSPNet models were trained without data augmentation.

The Xception implementation is smaller than PSP (54.17B mult-add operations) because it also uses dilated convolutions in its ASPP module. Besides that, the number of layers and the use of depthwise separable convolutions also decrease the number of parameters, which makes the network lighter but lowers segmentation quality.

Table 7. Inference time test

Table 8. Tests on GTEA dataset

Our MobileNet implementation has substantially fewer parameters than Xception (2.11M parameters and 2.75B operations), being smaller and lighter than all the network configurations discussed, but it also has worse MCC. Its size can be attributed to not using an ASPP module (as mentioned before, using ASPP could increase the number of parameters and the inference time). Besides that, it does not have a decoder module, which in the Xception implementation is responsible for refining the feature map extracted by the encoder. Finally, its residual modules are inverted, which reduces the number of parameters further because of the way the operations are structured.

Another interesting fact is the batch size used in the Xception implementations. A batch size of 4 results in better MCC; that is, updating the weights more often, with each update based on fewer training examples, gives a better model, which is not what one would usually expect.

To test inference time, the 322 frames at 640 \(\times \) 360 resolution from the validation dataset were used as input for the best network configurations. Segmentation time, measured by running the 322 frames, is shown in Table 7, where we can see that Deeplab Mobilenet is faster than the others, achieving 38 frames per second, while PSP cannot be used in a real-time application that segments every frame.
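For readers who want to reproduce this kind of measurement, a timing loop can be sketched as follows. It is a generic harness, not the script used for Table 7: `segment_fn` is a hypothetical stand-in for a model's inference call, and the sketch ignores warm-up runs and GPU synchronization, which matter for careful benchmarks.

```python
import time

def frames_per_second(num_frames, elapsed_seconds):
    """Throughput given a frame count and the total time spent on them."""
    return num_frames / elapsed_seconds

def time_inference(segment_fn, frames):
    """Run segment_fn on every frame and return the measured frames per second."""
    start = time.perf_counter()
    for frame in frames:
        segment_fn(frame)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return frames_per_second(len(frames), elapsed)

def dummy_segment(frame):
    """Placeholder workload standing in for a real network (assumption)."""
    return [value > 0 for value in frame]

frames = [[0, 1, 2, 3] * 64 for _ in range(322)]  # 322 toy "frames"
print(f"measured: {time_inference(dummy_segment, frames):.1f} fps")

# Per-frame time budget implied by the 38 fps reported for Deeplab Mobilenet.
print(f"budget at 38 fps: {1000.0 / 38:.1f} ms per frame")
```

At 38 frames per second the whole pipeline has roughly 26 ms per frame, so the 322 validation frames take about 8.5 s in total.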

Considering the number of parameters and operations of the networks discussed, MobileNet has the best inference time because it is smaller than the other architectures in both parameters and operations. PSP has the worst inference time among the network configurations because it has more operations.

Tests on GTEA [15] public dataset were also performed. The dataset contains seven different types of daily activities made by four different individuals, all images captured in egocentric view. Table 8 shows MCC measures, accuracy, and F-Score for the best-trained network configurations (Deeplab egohand_4b_Xception and PSP egohand_no_aug) according to their MCC.

Figure 6 shows qualitative results for the tests in Table 8, where we can see again that PSP (trained without data augmentation) has superior segmentation quality compared with DeepLab Xception trained with batch size 4 and output stride 16. We also tested an output stride of 8 in MobileNet and Xception. As the original authors observed, an output stride of 16 results in better segmentation quality according to all the metrics we used.

6 Conclusion

In this work, we proposed a strategy for segmenting human arms and hands in first-person-view images from Virtual Reality, Mixed Reality and Augmented Reality applications. When our method is configured for fast execution, segmentation quality drops; when it is configured for quality, inference speed drops. We also made publicly available all the source code and materials used for training.

To train networks on egocentric human arms, we first created a dataset of 4041 pixel masks, based on images from the dataset of [30]. These images form our egocentric perspective dataset, called Egohand, focused on semantic segmentation of human arms. We also created a second dataset for validation (EgohandVal), also in egocentric view. The first one has images captured in indoor environments. The second has images captured in indoor and outdoor environments, with different skin tones and illumination. Although the motivation for creating these datasets was the application of mixed reality with visualization of the user’s arms, they can also be helpful for training networks that segment arms for other purposes.

We also presented the fine-tuning, calibration and configuration of the PSPNet and Deeplab network architectures, using our datasets for training and validation. DeeplabV3+ Xception trained with batch size 4 on Egohand showed the best result when considering inference time. In contrast, judging by its MCC value, it is not a good pixel classifier. PSP presented the best MCC (0.969), although its inference time is not suitable for real-time segmentation, taking more than 20 s to segment one frame. Therefore, DeepLabV3+ Xception is the configuration that can be used in virtual or mixed reality applications.

The DeeplabV3 Mobilenet configurations presented the best inference time results, processing 38 frames per second.

The best configurations were those trained with Egohand; that is, training with Egohand gave better results than training with a dataset whose images are taken from other perspectives (Huskin).

We also verified that the PSPNet results were better without data augmentation. Data augmentation could therefore have harmed the DeepLab results, since it is used in all DeepLab configurations. In addition, one of DeepLab’s data augmentations consists of vertically inverting the images, resulting in inverted arms, which are not represented in the testing dataset. Future work can verify whether training the DeepLab configurations without this inversion improves their results.

The main limitations of this work are the number of GPUs used for training (only two for DeepLab and one for PSPNet) and the capture of images for training. Training PSPNet on more than one GPU was not possible due to limitations of the framework’s optimizations for this scenario. We limited the number of GPUs when training DeepLab to approximately match this limitation, while still allowing DeepLab to use batches of more than one image, which results in better training performance. PSPNet uses only one image per batch, since it requires a significant amount of memory on one GPU.

In future work, the Deeplab configurations should be trained for more iterations to see whether the MCC increases. Increasing the training dataset could also help. An application running on an HMD could also be created to test the whole system. The work in [30] is a natural sequel to our research: after rendering the user’s arm, the detection of different movements can be used for interaction inside a virtual environment.

Notes

1. https://github.com/indexhever/ahsep

References

Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467
Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561 [cs], November 2015
Betancourt, A., Morerio, P., Regazzoni, C.S., Rauterberg, M.: The evolution of first person vision methods: a survey. IEEE Trans. Circuits Syst. Video Technol. 25(5), 744–760 (2015). https://doi.org/10.1109/TCSVT.2015.2409731
Boughorbel, S., Jarray, F., El-Anbari, M.: Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12(6), e0177678 (2017). https://doi.org/10.1371/journal.pone.0177678. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177678
Casati, J.P.B., Moraes, D.R., Rodrigues, E.L.L.: SFA: a human skin image database based on FERET and AR facial images. In: Anais do VIII Workshop de Visão Computacional. Rio de Janeiro (2013). http://www.sel.eesc.usp.br/sfa/
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062 [cs], December 2014
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915 [cs], June 2016
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 [cs], June 2017
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR abs/1802.02611 (2018). http://arxiv.org/abs/1802.02611
Chinchor, N.: MUC-4 evaluation metrics. In: Proceedings of the 4th Conference on Message Understanding, MUC4 1992, pp. 22–29. Association for Computational Linguistics, Stroudsburg (1992). https://doi.org/10.3115/1072064.1072067
Chollet, F.: Xception: deep learning with depthwise separable convolutions. CoRR abs/1610.02357 (2016). http://arxiv.org/abs/1610.02357
Dai, J., et al.: Deformable convolutional networks. CoRR abs/1703.06211 (2017). http://arxiv.org/abs/1703.06211
Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34(4), 743–761 (2012). https://doi.org/10.1109/TPAMI.2011.155
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2012 (VOC2012) results (2012)
Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). https://doi.org/10.1109/CVPR.2011.5995444
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-Rodriguez, J.: A review on deep learning techniques applied to semantic segmentation. arXiv:1704.06857 [cs], April 2017
Gu, Q., Zhu, L., Cai, Z.: Evaluation measures of the classification performance of imbalanced data sets. In: Cai, Z., Li, Z., Kang, Z., Liu, Y. (eds.) ISICA 2009. CCIS, vol. 51, pp. 461–471. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04962-0_53
Gupta, S., Arbeláez, P., Girshick, R., Malik, J.: Indoor scene understanding with RGB-D images: bottom-up segmentation, object detection and semantic segmentation. Int. J. Comput. Vis. 112(2), 133–149 (2015). https://doi.org/10.1007/s11263-014-0777-6
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv:1512.03385 [cs], December 2015
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 [cs], April 2017
Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
Koyejo, O.O., Natarajan, N., Ravikumar, P.K., Dhillon, I.S.: Consistent binary classification with generalized performance metrics. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2744–2752. Curran Associates, Inc. (2014). http://papers.nips.cc/paper/5454-consistent-binary-classification-with-generalized-performance-metrics.pdf
Li, C., Kitani, K.M.: Pixel-level hand detection in ego-centric videos, pp. 3570–3577 (2013)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. arXiv:1411.4038 [cs], November 2014
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. CoRR abs/1411.4038 (2014). http://arxiv.org/abs/1411.4038
Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2), 442–451 (1975). https://doi.org/10.1016/0005-2795(75)90109-9. http://www.sciencedirect.com/science/article/pii/0005279575901099
Maurya, J., Hebbalaguppe, R., Gupta, P.: Real time hand segmentation on frugal headmounted device for gestural interface. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 4023–4027 (2018). https://doi.org/10.1109/ICIP.2018.8451213
Menon, A., Narasimhan, H., Agarwal, S., Chawla, S.: On the statistical consistency of algorithms for binary classification under class imbalance. In: International Conference on Machine Learning, pp. 603–611, February 2013. http://proceedings.mlr.press/v28/menon13a.html
Obdržálek, S., Kurillo, G., Han, J., Abresch, T., Bajcsy, R.: Real-time human pose detection and tracking for tele-rehabilitation in virtual reality. Stud. Health Technol. Inform. 173, 320–324 (2012)
de Oliveira, E., Clua, E.W.G., Vasconcelos, C.N., Marques, B.A.D., Trevisan, D.G., de Castro Salgado, L.C.: FPVRGame: deep learning for hand pose recognition in real-time using low-end HMD. In: van der Spek, E., Göbel, S., Do, E.Y.-L., Clua, E., Baalsrud Hauge, J. (eds.) ICEC-JCSG 2019. LNCS, vol. 11863, pp. 70–84. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34644-7_6
Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 148–154 (2005). https://doi.org/10.1109/TPAMI.2005.17
Qi, X., Liao, R., Jia, J., Fidler, S., Urtasun, R.: 3D graph neural networks for RGBD semantic segmentation. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5209–5218, October 2017. https://doi.org/10.1109/ICCV.2017.556. ISSN 2380-7504
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. arXiv:1506.02640 [cs], June 2015
Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: interactive foreground extraction using iterated graph cuts. In: ACM SIGGRAPH 2004 Papers, SIGGRAPH 2004, pp. 309–314. ACM, New York (2004). https://doi.org/10.1145/1186562.1015720
Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. CoRR abs/1801.04381 (2018). http://arxiv.org/abs/1801.04381
Siddharth, Rangesh, A., Ohn-Bar, E., Trivedi, M.M.: Driver hand localization and grasp analysis: a vision-based real-time approach, February 2018. https://arxiv.org/abs/1802.07854v1
Sifre, L.: Rigid-motion scattering for image classification. Ph.D. thesis, Ecole Polytechnique, CMAP, October 2014
Slater, M., Wilbur, S.: A framework for immersive virtual environments (FIVE): speculations on the role of presence in virtual environments. Presence: Teleoperators and Virtual Environments 6(6), 603–616 (1997). https://doi.org/10.1162/pres.1997.6.6.603
Song, X., Herranz, L., Jiang, S.: Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs. CoRR abs/1801.06797 (2018). http://arxiv.org/abs/1801.06797
Tu, Z., Bai, X.: Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 32(10), 1744–1757 (2010). https://doi.org/10.1109/TPAMI.2009.186
Vasconcelos, C.N., Clua, E.W.G.: Deep learning - Teoria e Prática. In: Jornadas de Atualização em Informática 2017. Sociedade Brasileira de Computação - SBC, Porto Alegre/RS, July 2017
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1, pp. I-511–I-518 (2001). https://doi.org/10.1109/CVPR.2001.990517
Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004). https://doi.org/10.1023/B:VISI.0000013087.49260.fb
Vodopivec, T., Lepetit, V., Peer, P.: Fine hand segmentation using convolutional neural networks. arXiv:1608.07454 [cs], August 2016
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122 [cs], November 2015
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. arXiv:1612.01105 [cs], December 2016

Author information

Authors and Affiliations

Universidade Federal Fluminense, Niterói, Rio de Janeiro, Brazil
Heverton Sarah, Esteban Clua & Cristina Nader Vasconcelos

Authors

Heverton Sarah
Esteban Clua
Cristina Nader Vasconcelos

Corresponding authors

Correspondence to Heverton Sarah, Esteban Clua or Cristina Nader Vasconcelos.

Editor information

Editors and Affiliations

U.S. Army Research Laboratory, Aberdeen Proving Ground, MD, USA
Jessie Y. C. Chen
U.S. Army Combat Capabilities Development Command, Orlando, FL, USA
Gino Fragomeni

About this paper

Cite this paper

Sarah, H., Clua, E., Vasconcelos, C.N. (2020). Arms and Hands Segmentation for Egocentric Perspective Based on PSPNet and Deeplab. In: Chen, J.Y.C., Fragomeni, G. (eds) Virtual, Augmented and Mixed Reality. Design and Interaction. HCII 2020. Lecture Notes in Computer Science, vol 12190. Springer, Cham. https://doi.org/10.1007/978-3-030-49695-1_11

DOI: https://doi.org/10.1007/978-3-030-49695-1_11
Published: 10 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49694-4
Online ISBN: 978-3-030-49695-1
eBook Packages: Computer Science, Computer Science (R0)
