1. Introduction
The basic components in power systems are prone to physical defects due to long-term exposure to extreme weather (all-day direct sunlight, strong winds, snowstorms, storms, etc.), high mechanical tension and high pressure [
1]. Traditional power system component defect detection methods mainly involve professional and experienced inspectors manually looking for defects or using detection methods to check the health of components; these approaches are time-consuming and potentially dangerous, while the detection rate is limited by the inspector’s skills [
2,
3]. With the new opportunities provided by intelligent power system network construction, computer vision technology is gradually being applied in the field of power systems [
4]. According to the history of object detection algorithms, defect detection methods based on computer vision for power system components can be summarized into three types: image processing-based methods, machine learning methods based on manual features and deep learning methods based on convolutional neural networks.
Image processing-based defect detection methods for power system components are mainly based on features such as differences in texture or contrast between defective parts and background information for differentiation [
5]. Wu et al. [
6] segmented antivibration hammers (vibration dampers) from the background using a thresholding method, used the Hessian matrix to improve the contour curvature control of the edge detection method, and detected the antivibration hammers segment by segment according to their contour features. This method performs poorly when the background of the antivibration hammer in the image is complex, and the use of grayscale images often causes information loss. Huang et al. [7] used local differential processing, edge intensity mapping and image fusion to enhance antivibration hammer images, segmented the antivibration hammer by setting a threshold, and, finally, classified the degree of rust according to the Rusty Area Ratio and Color Shade Index. This method achieves good recognition results for rusty hammers against a complex background, though it is less effective for rusty hammers in low-contrast images. Yuan et al. [
8] used IULBP (Improved Uniform Local Binary Patterns) to extract texture features from images of iced insulators in the monitoring system and identified different degrees of icing through the correlation coefficients of texture histograms. Since different degrees of icing have obvious features, IULBP recognizes them well, though it is less satisfactory for insulators in general. Zhang et al. [9] proposed a method for identifying broken conductor strands and surface defects in UAV inspection of transmission lines. The method used adaptive threshold segmentation to extract conductor regions, detected broken strands via a square wave transform of their gray distribution curves, and, finally, identified conductor surface defects using a projection algorithm on gray variance normalization (GVN) images of the conductor regions. It achieved good detection results; however, its general applicability is limited.
Traditional machine learning-based methods for detecting defects in power system components are mainly based on extracting defective features and then using classifiers, such as SVM (Support Vector Machine), to detect defective samples [
10]. Dan et al. [
11] proposed a method to detect glass insulators in aerial images using Haar feature extraction and the AdaBoost iterative algorithm, further segmented the insulators using color features, and, finally, analyzed the pixels of the segmented insulators to detect missing insulators. However, the algorithm performs poorly when the detected image contains occlusion. Ullah et al. [12] proposed a new method for defect analysis in high-voltage equipment, which first extracts rich feature maps from infrared images of the equipment using the AlexNet network and then feeds the feature maps into a random forest for training. The results show that the algorithm can effectively distinguish whether there are defects in the high-voltage equipment, though it is difficult to achieve the expected accuracy when the background of the detection image is complex. Mao et al. [13] proposed a transmission line defect identification method based on the histogram of oriented gradients (HOG) and support vector machine (SVM) algorithms. The HOG algorithm is used for transmission line feature extraction, and the SVM algorithm determines whether a transmission line is defective based on the extracted features; its average accuracy can reach 84.3%. However, this method takes about 539 ms to detect a single image, so its speed cannot meet the requirements of real-time applications. Liu et al. [14] used support vector machines to identify the state of fouled porcelain insulators in the input image, though the recognition was not effective for images with complex backgrounds.
The deep learning-based transmission line component defect detection method mainly uses CNNs with a learning capability to extract component defect features from images autonomously, which effectively compensates for the loss of information during manual feature extraction and improves the defect detection efficiency [
15]. Ni et al. [
16] proposed an improved Faster R-CNN for detecting insulators, antivibration hammers and bird nests on transmission lines in images captured by UAVs, using the Inception-ResNet-v2 network as the basic feature extraction network; after fine-tuning the network parameters, its detection accuracy increased to 98.65%. However, this network takes about 676 ms to detect a single image, so, once again, the speed is insufficient for real-time applications. JLA B et al. [17] proposed a transmission line defect detection algorithm based on an improved RetinaNet network, which uses the K-means++ algorithm to redesign the size and number of anchor boxes and uses a DenseNet-based feature pyramid network as the backbone network. Their experimental results show that the algorithm has good real-time performance, though the accuracy is not high. Zhang et al. [
18] proposed a defective insulator detection method based on the YOLO network and SPP-Net network, which used trained YOLOv5s to identify and locate insulators in the original image, before cropping them according to the locating box. The cropped image was fed into the SPP-Net classification network for defective insulator detection, and the final detection accuracy of defective insulators could reach 89%. However, connecting two networks in series increased the complexity of the model. Zhang et al. [
19] proposed an improved YOLOv5-based bird nest detection method for the problem of poor applicability of transmission line bird nest detection in complex backgrounds, and the experimental results showed that the method has strong generalization capability and applicability. Bao et al. [
20] proposed a BC-YOLO network based on YOLOv5 by fusing coordinate attention and the bidirectional feature pyramid network, which could accurately detect the components of transmission lines in remote sensing images, and its detection mAP reached 89.1%. Zhang et al. [
21] proposed an improved YOLOv5 network for insulator detection by introducing a ghost module to reduce the model parameters and volume, while using a convolutional block attention module (CBAM) to make the network focus on the key regions of the target. Experimental results show that the improved YOLOv5 network has high detection accuracy while maintaining a low model volume.
The substation meter is an important part of the power system that can be used to monitor the use of power equipment. Because substation meters are exposed outdoors for a long time, the housing rusts easily and the dial can easily blur and crack. Computer vision-based defect detection has achieved certain research results in power system inspection. However, substation meter defect images have complex backgrounds, objects of different sizes and large differences in appearance. To address these problems, this study proposes an algorithm for automatic defect detection in substation meters based on the PHAM-YOLO network. With YOLOv5 as the baseline, a PHAM module with two parallel branches is designed to pay more attention to defect features by integrating local and non-local features of meter images. The SPPF module and the EIOU loss are introduced into YOLOv5 to adapt it to meters of different sizes and improve the accuracy of the bounding box regression for defects.
2. Models and Methods
YOLOv5 [
22] is a one-stage object detection algorithm that achieves object localization and classification by directly regressing the relative positions of candidate boxes. YOLOv5 is a further improvement on YOLOv4 [23], with significantly higher detection accuracy and speed. As substation meter defect images have complex backgrounds, objects of different sizes and large differences in shape, and are also influenced by the shooting angle and light intensity, this paper proposes a PHAM-YOLO network model for meter defect detection on the basis of YOLOv5.
2.1. Architecture of the PHAM-YOLO Network
The PHAM-YOLO network model was improved on the basis of YOLOv5s, and the structure of the model is shown in
Figure 1.
The model consisted of a backbone for feature extraction, a neck for feature fusion and a prediction part. At the input, the model used the Mosaic algorithm for data augmentation: four random images were stitched together after random scaling, random cropping and random arrangement, which effectively improved the detection of small targets. The backbone mainly consisted of Focus, CBS and CSP (Cross-Stage Partial network) structures. The Focus structure performed a slicing operation; the CBS structure consisted of convolution, normalization and an activation function; and the CSP structure integrated gradient changes into the feature map to maintain high accuracy while reducing computation. In order to make the network focus on the features of meter defects against a complex background, this paper designed the PHAM module and added it to the backbone, which helped the network to focus on the defect features and reduced the weight given to useless background information. Spatial pyramid pooling fast (SPPF) was also used instead of spatial pyramid pooling (SPP). The SPPF structure pooled the input feature maps with successive pooling kernels of a fixed size, fusing feature maps with different receptive fields and enriching their expressiveness. The neck adopted the FPN and PAN structures, where FPN used up-sampling to transmit and fuse the semantic information of different layers, while PAN addressed the multi-scale problem by concatenating low-level and high-level semantic information. In the prediction part, feature maps of size 80 × 80 × 255, 40 × 40 × 255 and 20 × 20 × 255 were output, where 255 indicated the number of channels. The smaller the feature map, the larger the image area corresponding to each grid cell in the feature map and the more suitable it was for detecting large objects.
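The three output sizes quoted above can be reproduced with a little arithmetic. The sketch below is only an illustration under assumptions that the text does not state: a 640 × 640 input, down-sampling strides of 8, 16 and 32, three anchors per grid cell and 80 classes (the usual YOLOv5 defaults). The function name `head_output_shapes` is hypothetical.

```python
def head_output_shapes(input_size=640, strides=(8, 16, 32),
                       anchors_per_cell=3, num_classes=80):
    """Return (grid_h, grid_w, channels) for each detection scale.

    All default values are assumptions (common YOLOv5 settings),
    not values taken from the paper.
    """
    # Each anchor predicts 4 box offsets, 1 objectness score
    # and one score per class.
    channels = anchors_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


shapes = head_output_shapes()
for stride, shape in zip((8, 16, 32), shapes):
    print(f"stride {stride:2d}: {shape}")
```

Under the same assumptions, a head trained on only three defect classes would have 3 × (4 + 1 + 3) = 24 output channels per scale.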
The CIOU used in YOLOv5 as the boundary loss function had some ambiguity, which made the bounding box regression (BBR) inaccurate. Therefore, we introduced the EIOU (efficient intersection over union) loss function to solve the problems of the CIOU loss function and make the BBR more accurate. The remainder of this section describes these improvements and their implementation in detail.
2.2. Parallel Hybrid Attention Mechanism
In order to make the network focus on the key information of the meter defects within a complex background, as well as to suppress useless information from different channels, this paper presents a PHAM module whose attention mechanism has two parallel branches, as shown in
Figure 2. The upper branch was composed of channel (
Figure 3) and spatial attention modules (
Figure 4), and the lower branch was the coordinate attention module (
Figure 5).
The channel attention of the upper branch performed a two-dimensional global pooling operation on the initial features to obtain two sets of feature vectors, and then passed the two sets of feature vectors to a multilayer perceptron (MLP) network to obtain the channel attention feature map. Next, a global pooling operation was performed along the channel dimension, and a convolutional kernel was used to reduce the channel dimension to 1, which generated the spatial attention feature map. The attention weights were then applied by multiplication to obtain the attention feature map output by the upper branch. The channel and spatial attention of the upper branch were obtained using global pooling, capturing the local correlation of the feature map.
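The channel attention step can be illustrated with a minimal pure-Python sketch. It assumes that the "two sets of feature vectors" come from global average pooling and global max pooling, that one MLP is shared by both vectors, and that the result is squashed with a sigmoid, as in CBAM-style channel attention; the text does not confirm these details. The function name and the tiny weight matrices `w1` and `w2` are hypothetical.

```python
import math


def channel_attention(feature, w1, w2):
    """CBAM-style channel attention (an assumed variant, see lead-in).

    feature: list of C channels, each an H x W list of lists.
    w1: (C // r) x C weights, w2: C x (C // r) weights of a two-layer MLP.
    Returns C attention weights in (0, 1).
    """
    # Global pooling collapses every H x W channel to a single number.
    flat = [[v for row in ch for v in row] for ch in feature]
    avg_vec = [sum(f) / len(f) for f in flat]
    max_vec = [max(f) for f in flat]

    def mlp(vec):
        # Hidden layer with ReLU, followed by a linear output layer.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    # The same MLP processes both pooled vectors; the outputs are summed
    # and passed through a sigmoid.
    summed = [a + m for a, m in zip(mlp(avg_vec), mlp(max_vec))]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]


# Toy input: 2 channels of size 2 x 2, reduction to a single hidden unit.
feature = [[[0.0, 0.0], [0.0, 0.0]],
           [[1.0, 2.0], [3.0, 4.0]]]
w1 = [[1.0, 1.0]]        # hypothetical weights
w2 = [[1.0], [-1.0]]     # hypothetical weights
weights = channel_attention(feature, w1, w2)
print(weights)
```

Each returned weight would scale its whole channel, so channels that the MLP scores highly are emphasized and the others are suppressed.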
In the coordinate attention module of the lower branch, the input feature map was first pooled horizontally and vertically using pooling kernels of size (H, 1) and (1, W), which averaged the input along each direction. In order to make better use of the generated features, the two pooled feature maps $z^h$ and $z^w$ were concatenated [24], and the concatenated features were passed through the transform and the non-linear activation function in turn to give the output $f$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

where $F_1$ is the transform function with a convolution kernel of 1 × 1, $\delta$ is the non-linear activation function, and $f$ is the intermediate feature map that encodes the spatial information in the horizontal and vertical directions. Next, $f$ was divided into two separate tensors along the spatial dimension, and the attention feature map output by the lower branch was obtained via convolution and non-linear processing. Therefore, the lower branch captured the non-local correlation of the feature map.
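The directional pooling, concatenation and later split described for the lower branch can be sketched for a single channel as follows. This is a minimal illustration in plain Python; the function name `directional_pool` is hypothetical, and the 1 × 1 transform and activation between the concatenation and the split are omitted.

```python
def directional_pool(channel):
    """Average one H x W channel along each spatial direction.

    Returns (z_h, z_w): the per-row means (length H) and the
    per-column means (length W).
    """
    height, width = len(channel), len(channel[0])
    # Averaging along the width leaves one value per row.
    z_h = [sum(row) / width for row in channel]
    # Averaging along the height leaves one value per column.
    z_w = [sum(channel[i][j] for i in range(height)) / height
           for j in range(width)]
    return z_h, z_w


z_h, z_w = directional_pool([[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0]])
# Concatenate into one vector of length H + W, which would be fed to the
# shared transform, then split back into the two directions.
concat = z_h + z_w
split_h, split_w = concat[:len(z_h)], concat[len(z_h):]
print(z_h, z_w)
```

Because each pooled value summarizes an entire row or column, every position in the resulting attention map depends on features far away from it, which is the sense in which this branch is non-local.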
Finally, the attention feature maps obtained from the upper branch and the lower branch were fused to give the output feature of the PHAM module.
2.3. Spatial Pyramid Pooling Fast
The SPP structure was used in the YOLOv5 network to convert the feature maps to a fixed size. As shown in Figure 6, the SPP structure first applied a CBS operation to the input feature maps. The CBS output was then concatenated with the results of three parallel max-pooling operations on it, using kernels of size 3 × 3, 5 × 5 and 9 × 9, before being fed into a second CBS module. Although the SPP module converted the feature maps to a specific size, this parallel connection ignored the impact of feature maps with different receptive fields on the model performance, and the pooling added computational overhead to the model.
The SPPF [25] structure pooled the input feature maps with successive pooling kernels of a fixed size, fusing the feature maps of different receptive fields and enriching their expressiveness without increasing the computation. It replaced the parallel max-pooling operations with three different kernel sizes in the original SPP with a serial chain of three max-pooling operations with the same kernel size. As shown in Figure 6, the SPPF structure passed the CBS output through 5 × 5 max-pooling operations connected in series. The intermediate results were then concatenated and passed into a CBS structure, which extracted richer feature information without increasing the algorithm's computation.
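Why a serial chain of small pooling windows can stand in for larger parallel ones can be checked numerically. The sketch below is a plain-Python illustration, not the YOLOv5 implementation: it assumes stride-1 max pooling whose border padding never wins the maximum, and it compares serial 5 × 5 pooling against single 9 × 9 and 13 × 13 windows. Those two larger sizes are chosen for the demonstration and are not taken from the text.

```python
def max_pool_same(grid, k):
    """Stride-1 max pooling with a k x k window that keeps the input size."""
    h, w, r = len(grid), len(grid[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Out-of-bounds cells are skipped, which is equivalent to
            # padding the border with minus infinity.
            window = [grid[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


# A small deterministic test pattern.
grid = [[(7 * i + 13 * j) % 17 for j in range(12)] for i in range(12)]

p1 = max_pool_same(grid, 5)   # receptive field 5 x 5
p2 = max_pool_same(p1, 5)     # receptive field 9 x 9
p3 = max_pool_same(p2, 5)     # receptive field 13 x 13

print(p2 == max_pool_same(grid, 9), p3 == max_pool_same(grid, 13))
```

The serial chain reuses each intermediate result, so it covers the larger receptive fields while only ever scanning 5 × 5 windows.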
2.4. EIOU Loss
In defect detection, the boundary loss function served to determine the positive and negative samples and evaluate the distance between the prediction box and the real box. The IOU (intersection over union) was the ratio of the intersection to the union of the prediction box and the real box; it satisfied non-negativity, identity of indiscernibles, symmetry and the triangle inequality, and took a value between 0 and 1 regardless of the size of the boxes. In practice, we wrote the IOU loss [26] as shown in Equation (4). However, the IOU loss could not optimize the case where the real and prediction boxes did not intersect, nor did it reflect how the two boxes intersected. To solve these problems, Rezatofighi et al. [27] proposed the GIOU (generalized IOU) loss, which introduced the smallest enclosing rectangle of the real box and the prediction box on the basis of the IOU, as shown in Equation (5).
$$L_{IOU} = 1 - IOU = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{4}$$

$$L_{GIOU} = 1 - IOU + \frac{C - |A \cup B|}{C} \tag{5}$$

where $A$ and $B$ are the prediction box and the real box, and $C$ is the area of the smallest enclosing rectangle of the prediction box and the real box. However, the GIOU loss still had some problems. It degenerated to the IOU loss when one of the two boxes contained the other, and convergence was slow in the horizontal and vertical directions when the two boxes intersected. Therefore, the authors of [28] proposed the DIOU (distance IOU) loss and the CIOU (complete IOU) loss, which improve the speed of convergence by directly minimizing the normalized distance between the prediction box and the real box, and make the regression more accurate when the boxes overlap or one contains the other, as shown in Equations (6) and (7).
$$L_{DIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} \tag{6}$$

$$L_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{7}$$

where $\alpha$ is the weight function, $v$ is used to measure the consistency of the aspect ratio, $\rho(b, b^{gt})$ denotes the Euclidean distance between the centroids $b$ and $b^{gt}$ of the prediction box and the real box, and $c$ denotes the diagonal length of the smallest enclosing rectangle of the prediction box and the real box. However, the DIOU loss did not take into account the aspect ratio of the predicted bounding box in the regression process, and there was room for further improvement in accuracy. The CIOU loss considered both the width–height ratio of the regression box and the center distance between the real box and the prediction box, but the width–height ratio was the only shape-related influence factor. If two boxes shared the same center point and the same width–height ratio but differed in their actual width and height values, the CIOU loss could treat both as equally consistent with the regression target.
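The behavior of the IOU, GIOU and DIOU losses discussed above can be checked on a few hand-picked boxes. The sketch below uses the standard forms of these losses from the cited works, with boxes given as (x1, y1, x2, y2); the helper names and the example boxes are hypothetical and are not taken from the paper's code.

```python
def iou_terms(a, b):
    """Return (iou, union_area, enclosing_box) for boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    enclosing = (min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3]))
    return inter / union, union, enclosing


def iou_loss(a, b):
    return 1.0 - iou_terms(a, b)[0]


def giou_loss(a, b):
    iou, union, e = iou_terms(a, b)
    c_area = (e[2] - e[0]) * (e[3] - e[1])  # area of the enclosing rectangle
    return 1.0 - iou + (c_area - union) / c_area


def diou_loss(a, b):
    iou, _, e = iou_terms(a, b)
    # Squared distance between the two box centers.
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    # Squared diagonal of the enclosing rectangle.
    c2 = (e[2] - e[0]) ** 2 + (e[3] - e[1]) ** 2
    return 1.0 - iou + rho2 / c2


truth = (0.0, 0.0, 2.0, 2.0)
near = (3.0, 0.0, 5.0, 2.0)    # disjoint from truth, close to it
far = (8.0, 0.0, 10.0, 2.0)    # disjoint from truth, far from it
outer = (0.0, 0.0, 4.0, 4.0)
inner = (1.0, 1.0, 2.0, 2.0)   # fully contained in outer

for name, fn in (("IOU", iou_loss), ("GIOU", giou_loss), ("DIOU", diou_loss)):
    print(name, fn(truth, near), fn(truth, far), fn(outer, inner))
```

For the two disjoint boxes the IOU loss is identical, so it gives no signal about which prediction is closer, whereas the GIOU and DIOU losses are smaller for the nearer box. For the contained box the GIOU loss equals the IOU loss, while the DIOU loss still adds a center-distance penalty.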
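In this paper, we used the EIOU [29] loss function, instead of the CIOU loss function, as the boundary loss function, which was defined as shown in Equation (10):

$$L_{EIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2} \tag{10}$$

where $w$ and $h$ are the width and height of the prediction box, $w^{gt}$ and $h^{gt}$ are the width and height of the real box, $C_w$ and $C_h$ are the width and height of the smallest enclosing rectangle of the two boxes, and $\rho$, $b$, $b^{gt}$ and $c$ are as in Equations (6) and (7). Building on CIOU, EIOU splits the aspect-ratio influence factor and computes the width and height differences between the prediction box and the real box separately. It took into account the overlapping area, the distance between centroids and the actual width–height difference. The ambiguity problems related to the CIOU loss function were solved, making the model converge faster and the bounding box regression (BBR) more accurate.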
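The difference between the two losses shows up when a prediction has the right center and the right aspect ratio but the wrong size. The sketch below uses the standard CIOU and EIOU formulas from the cited works, with boxes given as (cx, cy, w, h); the function name and the example boxes are hypothetical.

```python
import math


def ciou_eiou_losses(pred, gt):
    """Return (ciou_loss, eiou_loss) for boxes given as (cx, cy, w, h)."""
    px1, py1 = pred[0] - pred[2] / 2, pred[1] - pred[3] / 2
    px2, py2 = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2

    inter = (max(0.0, min(px2, gx2) - max(px1, gx1))
             * max(0.0, min(py2, gy2) - max(py1, gy1)))
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    iou = inter / union

    # Width and height of the smallest enclosing rectangle.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    center_term = (((pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2)
                   / (cw ** 2 + ch ** 2))

    # CIOU: aspect-ratio consistency v and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3])
                              - math.atan(pred[2] / pred[3])) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    ciou = 1 - iou + center_term + alpha * v

    # EIOU: penalize width and height differences directly.
    eiou = (1 - iou + center_term
            + (pred[2] - gt[2]) ** 2 / cw ** 2
            + (pred[3] - gt[3]) ** 2 / ch ** 2)
    return ciou, eiou


# Same center and same 2:1 aspect ratio, but the prediction is half the size.
ciou, eiou = ciou_eiou_losses(pred=(5.0, 5.0, 2.0, 1.0), gt=(5.0, 5.0, 4.0, 2.0))
print(ciou, eiou)
```

In this case the aspect-ratio term of CIOU vanishes, so the only remaining penalty comes from the IOU itself, while EIOU adds explicit width and height penalties that keep pushing the box toward the correct size.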
2.5. Method Framework
The flow of the substation meter defect detection algorithm based on the PHAM-YOLO network and presented in this paper is shown in
Figure 7.
We used meter defect images collected during manual inspection to construct the substation meter defect dataset used in this algorithm. The images in the dataset were randomly divided into training, validation and test sets in the ratio of 6:2:2. In order to enrich the training set and improve the robustness of the model, the images in the training set were horizontally mirrored and rotated by 10°. In order to improve the accuracy of the model, this paper proposes the PHAM-YOLO network, which builds on YOLOv5 by adding the PHAM attention module, replacing the SPP module with the SPPF module and replacing the CIOU loss with the EIOU loss. During training of the PHAM-YOLO network, the original images and label files were first input into the model, which predicted the boxes for each image. The loss value was calculated from the prediction boxes and the real boxes and then propagated back through the model to optimize the weight parameters, so that the loss decreased and the model performance improved over successive training iterations. Finally, the trained model weights were evaluated on the test set to obtain the test results.
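When images are mirrored or rotated, their bounding-box labels have to be transformed as well. The sketch below shows one common way to do this for boxes stored as (xmin, ymin, xmax, ymax). It is an assumption about how the labels could be handled, not a description of the authors' pipeline: the rotation is taken about the image center, its direction is arbitrary, and the rotated box is replaced by its axis-aligned enclosing box. Both function names are hypothetical.

```python
import math


def hflip_box(box, img_w):
    """Mirror a (xmin, ymin, xmax, ymax) box about the vertical image axis."""
    xmin, ymin, xmax, ymax = box
    return (img_w - xmax, ymin, img_w - xmin, ymax)


def rotate_box(box, img_w, img_h, degrees):
    """Rotate the box corners about the image center and return the
    axis-aligned box that encloses them, clipped to the image."""
    xmin, ymin, xmax, ymax = box
    cx, cy = img_w / 2, img_h / 2
    t = math.radians(degrees)
    xs, ys = [], []
    for x, y in ((xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)):
        dx, dy = x - cx, y - cy
        xs.append(cx + dx * math.cos(t) - dy * math.sin(t))
        ys.append(cy + dx * math.sin(t) + dy * math.cos(t))
    return (max(0.0, min(xs)), max(0.0, min(ys)),
            min(float(img_w), max(xs)), min(float(img_h), max(ys)))


flipped = hflip_box((10, 20, 50, 80), img_w=100)
rotated = rotate_box((40, 40, 60, 60), img_w=100, img_h=100, degrees=10)
print(flipped)
print(rotated)
```

The enclosing box after rotation is slightly larger than the original, which is a usual side effect of keeping axis-aligned labels.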
3. Dataset
Since there is no publicly available dataset of substation meter defect images, a dataset of substation meter defects was constructed from images taken during manual inspection in order to verify the performance of the algorithm in this paper. To ensure the richness of the dataset, images with different angles, backgrounds and shooting distances were selected as far as possible, while aiming to keep the target categories balanced. The meter defect images obtained were then divided into three categories: blurred dial (as shown in
Figure 8a), damaged dial (as shown in
Figure 8b) and broken meter housing (as shown in
Figure 8c).
The purpose of data annotation is to mark the position and category of the target in each image. In order to ensure the reasonableness and accuracy of the labelling, the data in this paper were manually labelled using the LabelImg labelling software under the guidance of professionals. For the substation meter defect dataset, if the meter dial was blurred, it was labelled as “bj_bpmu”; if the meter dial was broken, it was labelled as “bj_bpps”; and if the meter housing was broken, it was labelled as “bj_wkps”. Some examples of data annotation are shown in
Figure 9. The annotations strictly follow the format of the public Pascal VOC dataset [30], and each annotation file is in .xml format.
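An annotation file of this kind can be read with Python's standard library. The sketch below parses a small Pascal VOC-style XML string; the three label names come from the text, while the file name, image size and box coordinates in `SAMPLE_XML` are invented for illustration, and `parse_voc` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented example annotation in Pascal VOC layout (values are hypothetical).
SAMPLE_XML = """<annotation>
  <filename>meter_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>bj_bpmu</name>
    <bndbox><xmin>120</xmin><ymin>90</ymin><xmax>300</xmax><ymax>260</ymax></bndbox>
  </object>
  <object>
    <name>bj_wkps</name>
    <bndbox><xmin>80</xmin><ymin>60</ymin><xmax>340</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""

# Label names used in the dataset and what they stand for.
CLASS_NAMES = {
    "bj_bpmu": "blurred dial",
    "bj_bpps": "broken dial",
    "bj_wkps": "broken housing",
}


def parse_voc(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        box = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, box))
    return root.findtext("filename"), objects


filename, objects = parse_voc(SAMPLE_XML)
for label, box in objects:
    print(filename, CLASS_NAMES[label], box)
```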
A total of 1439 images of auxiliary equipment defects were selected from the images acquired during manual inspection, including 598 images of blurred dials, 537 images of broken dials and 304 images of broken housings. These three types of defect images were divided into training, validation and test sets in the ratio of 6:2:2. Owing to practical constraints, the number of equipment defect images obtained during manual inspection is limited, whereas a rich dataset benefits the performance of a deep learning model and improves its generalization ability. Therefore, the training set images were horizontally mirrored and rotated by 10°, which both simulates the differences in target shape and size caused by different shooting angles during manual inspection and enriches the training set to avoid overfitting during network training. The data generated after expansion are shown in
Table 1 and the visualization of its extension is shown in
Figure 10.
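A random 6:2:2 split of this kind can be sketched in a few lines. The file names, the random seed and the rounding convention below are assumptions made for illustration, so the resulting subset sizes need not match those reported in Table 1; `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle items and split them into train, validation and test lists.

    The first two subset sizes are rounded down; the test set takes
    whatever remains (an assumed convention).
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Hypothetical file names standing in for the 1439 images.
images = [f"img_{i:04d}.jpg" for i in range(1439)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, and splitting before augmentation ensures that mirrored or rotated copies of a training image never end up in the validation or test set.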
6. Conclusions
The meter defect images taken during manual inspection have complex backgrounds, different target sizes and large differences in appearance, while the three defect types also resemble one another, making it difficult for existing models to accurately detect and distinguish them. To address these problems, a PHAM-YOLO network model based on YOLOv5 is proposed. The main contributions are as follows:
- (1) The PHAM module focuses the network on the key areas within the complex background of meter defect images and highlights the differences between the features of the various defects.
- (2) The SPPF module pools the input feature maps with successive pooling kernels of a fixed size and fuses the feature maps of different receptive fields without increasing the computation.
- (3) The EIOU loss function solves the ambiguity problem of the CIOU loss and makes the bounding box regression (BBR) more accurate.
The experimental results show that, on the substation meter defect dataset, the recognition accuracy of the PHAM-YOLO network proposed in this paper is higher than those of other mainstream object detection networks, which can greatly reduce the burden of manual inspection on substation staff. In addition to meter defects, other components in power systems may also have defects; thus, the proposed method also provides some ideas for detecting defects in other power system components.