
1 Introduction

We reformulate the histogram of oriented gradients (HoG) method [2] from the viewpoints of gradient-based image pattern recognition and directional statistics. Gradient-based image recognition is robust against bias changes such as illumination changes. Directional statistics compares distributions of the directions of gradients in an image. Gradients are fundamental geometrical features for the extraction of segment boundaries. The segment boundaries are employed for detecting a particular object, such as a pedestrian or a car, in a scene or a sequence of scenes. In the HoG method, the oriented gradients in a windowed region of the plane yield a histogram on the unit circle [6]. To measure differences and similarities among such histogram-based features, we require a well-defined metric for distributions. For this purpose, we redefine a histogram as a cyclic probability density function by using directional statistics [4]. The state-of-the-art methods in pedestrian detection [1, 3, 5, 9] handle occlusion, non-rigid deformation and scale change of pedestrian images with less detection error than the original HoG method. However, almost all of these methods still adopt the HoG feature. Therefore, clarifying the mathematical properties behind the accuracy and robustness of the HoG method leads to new improvements of object detection systems.

By using the Wasserstein distance [8], we introduce a distance between directional distributions. Furthermore, we develop three methods to construct a histogram from a distribution of gradients. Combining these three construction methods with three aggregation methods for the local regions of an image, we define directional distribution-based features. Moreover, we explore the mathematical properties of the definition of the HoG method. Finally, we evaluate the performances of the developed directional distribution-based features and the HoG feature with the \(L_p\)-norm and the Wasserstein distance. Our developed features are presented in Sects. 3.1 and 3.2. Comparing these evaluations, we examine which mathematical properties yield accurate detection.

2 Mathematical Preliminaries

2.1 Directions and Matching

We assume that an image \(f(\varvec{x})\) defined over a finite closed set \(\varOmega \) lies in the intersection of the Sobolev space \(W^{1,p}\) and the space of functions of bounded variation. Therefore, \(\int _{\varOmega } |\nabla f| d\varvec{x} < \infty \) and \(\int _{\varOmega } |\nabla f|^2 d\varvec{x} < \infty \) hold. For gradients of images, we have the following theorem.

Theorem 1

For functions f and g, \(\nabla f = \nabla g\) if and only if \(f = g + \mathrm {constant}\).

From Theorem 1, we derive the distance \( D(f,g) = \sqrt{ \int _{\mathbb {R}^2} | \nabla f - \nabla g |^2 d\varvec{x}}\), which is invariant to a constant bias. Therefore, gradient-based matching is robust against uniform illumination changes. However, since differentiation amplifies noise, we need blur filtering to suppress noise for robust gradient-based matching.

The directional gradient of an image \(f(\varvec{x})\) for \(\varvec{x} = (x,y)^{\top }\) in the direction \(\varvec{\omega } = (\cos \theta , \sin \theta )^{\top }\) is computed as \(D(\theta ) = \frac{\partial f}{\partial \varvec{\omega }} =\varvec{\omega }^\top \nabla f\). The directional gradient \(D(\theta )\) evaluates the steepness, smoothness and flatness of f along the direction of the vector \(\varvec{\omega }\). Gradient-based image recognition [2] uses the pair of the direction of the gradient \(\varvec{n}(\varvec{x}) = \nabla f(\varvec{x}) / \Vert \nabla f(\varvec{x}) \Vert _2\) and its magnitude \(m(\varvec{x})=\Vert \nabla f(\varvec{x}) \Vert _2\) over the region \(\varOmega \) as \( d(f(\varvec{x})) = ( m(\varvec{x}), \, \varvec{n}(\varvec{x}) )\), where \(\Vert \cdot \Vert _2\) is the \(L_2\)-norm. We represent the gradient field of an image as \(\varPhi (f) = \{d(f(\varvec{x}))\ | \ \varvec{x} \in \varOmega \}\). Hereafter, to compute the directional angle at each point, we define the operation \(\angle (\nabla f(\varvec{x}))\) as \(\angle (\nabla f(\varvec{x})) = \theta \) if \(\varvec{n}(\varvec{x}) = \frac{\nabla f}{\Vert \nabla f\Vert _2} = (\cos \theta , \sin \theta )^{\top }\). Figure 1 shows an example of a gradient field and its histogram for a local region of an image.
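As a minimal numerical sketch of this gradient field, the magnitudes \(m(\varvec{x})\) and angles \(\angle (\nabla f(\varvec{x}))\) can be computed with finite differences; the Python/NumPy code below is only an illustration, under the assumption that the image is stored as a 2-D float array.

```python
import numpy as np

def gradient_field(f):
    """Compute the gradient field Phi(f): per-pixel magnitude m(x)
    and direction angle(grad f(x)) in [0, 2*pi).

    Sketch assumptions: `f` is a grayscale image as a 2-D float
    array; derivatives are finite differences (central in the
    interior of the image)."""
    gy, gx = np.gradient(f)                   # df/dy, df/dx
    m = np.hypot(gx, gy)                      # m(x) = ||grad f||_2
    theta = np.arctan2(gy, gx) % (2 * np.pi)  # angle of n(x)
    return m, theta
```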

Fig. 1.

Example of directional statistics. (a) A grayscale image. (b) Magnitudes of gradients in a local region of the image. (c) Directions of gradients in a local region of the image. (d) A distribution of structure tensors in a local region of the image. (e) A circular histogram constructed with the gradient field.

2.2 Aggregating Methods for Local Regions

To measure the difference between two images, the HoG method focuses on the difference of local regions between the two images. The HoG method aggregates the directional distributions of gradients in local regions of an image. To compute the directional distributions of local regions of an image, we define local regions in the same manner as Reference [2]. For a fixed point \(\varvec{c}\in \mathbb {R}^2\) on an image, a positive constant \(\alpha \in \mathbb {R}_{+}\) and a positive integer \(k \in \mathbb {Z}_{+}\), using a set of points \(\varvec{x}\in \mathbb {R}^2\) and the infinity norm \( \Vert \cdot \Vert _{\infty }\), we define a local region and a bounding box as

$$\begin{aligned} C(\varvec{c}) = \{ \varvec{x} \, | \, \Vert \varvec{x} -\varvec{c} \Vert _{\infty } < \alpha \}, \ \ \ B(\varvec{c}) = \{ \varvec{x} \ | \ \Vert \varvec{x} - \varvec{c} \Vert _{\infty }<k\alpha \}, \end{aligned}$$
(1)

which are called cells and blocks, respectively.

For the practical computation of cells and blocks, using a constant \(\alpha \), we select a set of points \(\{ \varvec{c}_{ij} \}_{i,j=1}^{M,N}\) with the conditions \(C(\varvec{c}_{ij}) \cap C(\varvec{c}_{i'j'})=\emptyset \) for \((i,j) \ne (i',j')\) and \(C(\varvec{c}_{11}) \cup C(\varvec{c}_{12}) \cup \dots \cup C(\varvec{c}_{MN}) = \varOmega \). For \(k=2\), we select a set of blocks \(B(\varvec{c'}_l)\) with the condition \(\varvec{c}_l' \in \{ \varvec{\mu } \, | \, \varvec{\mu } = (\varvec{c}_{ij}+\varvec{c}_{ij+1}+\varvec{c}_{i+1j}+\varvec{c}_{i+1j+1})/4 \}\) for \(l=1,2,\dots , (M-1)(N-1)\). The HoG method divides an image into cells, aggregates the cells into blocks and represents the image as a vector by concatenating the histograms of the cells. For these vectors, cells, blocks and histograms, we have to select appropriate metrics.
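The following sketch selects the cell centres \(\varvec{c}_{ij}\) and, for \(k=2\), the block centres \(\varvec{c}_l'\); it assumes, for simplicity, that the image sides are multiples of the cell side \(2\alpha \).

```python
import numpy as np

def cell_centres(height, width, alpha):
    """Centres c_ij of the non-overlapping cells C(c) of side
    2*alpha tiling the image domain (assumes the image sides are
    multiples of 2*alpha)."""
    ys = np.arange(alpha, height, 2 * alpha)
    xs = np.arange(alpha, width, 2 * alpha)
    return np.array([(y, x) for y in ys for x in xs], dtype=float)

def block_centres(centres, M, N):
    """For k = 2, the block centres c'_l are the means of each
    2 x 2 group of neighbouring cell centres, giving (M-1)(N-1)
    blocks."""
    g = centres.reshape(M, N, 2)
    return ((g[:-1, :-1] + g[:-1, 1:] + g[1:, :-1] + g[1:, 1:]) / 4.0
            ).reshape(-1, 2)
```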

2.3 Histogram of Gradients

We define local and global directional distributions for an image. For the local regions \(C(\varvec{c})\) and \(B(\varvec{c})\), using the Dirac delta function \(\delta \), the operation \(\angle (\cdot )\) and the magnitude \(m(\varvec{x})\) of the gradient at a point \(\varvec{x}\), we construct the normalised histograms of the local regions

$$\begin{aligned} H^{\mathrm {C}}_{\mathrm {w}}(\theta ,\varvec{c}) = \frac{\int _{C(\varvec{c})} m(\varvec{x})\delta (\theta - \angle (\nabla f)) d\varvec{x}}{\int _{0}^{2\pi } \int _{C(\varvec{c})} m(\varvec{x})\delta (\theta - \angle (\nabla f)) d\varvec{x} \, d\theta }, \end{aligned}$$
(2)
$$\begin{aligned} H^{\mathrm {B}}_{\mathrm {w}}(\theta ,\varvec{c}) = \frac{\int _{B(\varvec{c})} m(\varvec{x})\delta (\theta - \angle (\nabla f)) d\varvec{x}}{\int _{0}^{2\pi } \int _{B(\varvec{c})} m(\varvec{x})\delta (\theta - \angle (\nabla f)) d\varvec{x} \, d\theta }, \end{aligned}$$
(3)

as probability distributions. We call these histograms directional distributions. Furthermore, over an image region \(\varOmega \subset \mathbb {R}^2\), we define a normalised global histogram of gradients as

$$\begin{aligned} H^{\mathrm {G}}_{\mathrm {w}} (\theta ) = \frac{\int _{\varvec{x} \in \varOmega } m(\varvec{x}) \delta \left( \theta - \angle (\nabla f) \right) d\varvec{x} }{\int _{0}^{2 \pi } \int _{\varvec{x} \in \varOmega } m(\varvec{x}) \delta \left( \theta - \angle (\nabla f) \right) d\varvec{x} \, d\theta }. \end{aligned}$$
(4)
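Discretely, Eqs. (2)-(4) differ only in the region over which the gradients are collected. The sketch below builds such a magnitude-weighted, normalised histogram from the magnitude and angle arrays of a region (cell, block or whole image); it assumes the region contains at least one non-zero gradient.

```python
import numpy as np

def weighted_direction_histogram(m, theta, bins=9):
    """Discrete counterpart of Eqs. (2)-(4): magnitude-weighted,
    normalised histogram of gradient directions over a region.
    `m`, `theta` are arrays from `gradient_field` restricted to the
    region; assumes m.sum() > 0."""
    hist, _ = np.histogram(theta, bins=bins,
                           range=(0.0, 2 * np.pi), weights=m)
    return hist / hist.sum()
```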

2.4 Histogram of Dominant Directions

For gradients, we have the following property.

Proposition 1

For f, setting the directional tensor \(\varvec{S} = \nabla f\nabla f^{\top }\), the gradient \(\nabla f\) and \(| \nabla f|^2\) are an eigenfunction \(u_1\) and the corresponding eigenvalue \(\lambda _1\) of \(\varvec{S}\), respectively.

For a local region \({\varPsi }(\varvec{c}) \in \{C(\varvec{c}),B(\varvec{c}) \} \) defined around a point \(\varvec{c} \in \mathbb {R}^2\), the eigenfunctions \(\{u_i\}_{i=1}^2\) and eigenvalues \(\{\lambda _i\}_{i=1}^2\) of the structure tensor

$$\begin{aligned} \bar{\varvec{S}} = \frac{1}{|\varPsi (\varvec{c})|} \int _{\varPsi (\varvec{c})} \nabla f \nabla f^{\top } d \varvec{x} \end{aligned}$$
(5)

are used as the descriptor of f in the local region \(\varPsi ({\varvec{c}})\). Figure 1(d) illustrates a distribution of structure tensors in a local region of an image. Using the pair \((\lambda _1, u_1)\) of the first eigenvalue and the first eigenvector of \(\bar{\varvec{S}}\), the operation \(\angle (\cdot )\) and the region \(\varOmega \) of an image, we construct the histogram \(h_{\mathrm {D}} (\theta , \bar{s}(\varvec{c})) = \int _{\varvec{c}\in \varOmega } \lambda _1 \delta (\theta -\angle (u_1)) d\varvec{c}\). This histogram expresses the number of dominant directions for a direction \(\theta \) in an image. However, if we obtain \(u_1\) as the solution of the eigenvalue problem \(\bar{\varvec{S}}u_1 = \lambda u_1\), the angle of \(u_1\) is determined only in \([0,\pi )\), since the eigenfunction \(u_1\) loses its sign.
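A sketch of this computation for one local region is given below; note in the code the sign ambiguity of \(u_1\) discussed above.

```python
import numpy as np

def structure_tensor_direction(gx, gy):
    """First eigenpair (lambda_1, u_1) of the averaged structure
    tensor of Eq. (5) for one local region; `gx`, `gy` hold the
    gradient components inside the region."""
    S = np.array([[np.mean(gx * gx), np.mean(gx * gy)],
                  [np.mean(gx * gy), np.mean(gy * gy)]])
    lam, U = np.linalg.eigh(S)   # eigenvalues in ascending order
    u1 = U[:, -1]                # u_1 is recovered only up to sign,
                                 # so its angle is defined modulo pi
    return lam[-1], u1
```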

For the practical computation, we define the maximum of the histogram in a cell as \(M^{\mathrm {C}}(\varvec{c}) = \underset{\theta }{\mathrm {max}} \ H^{\mathrm {C}}_{\mathrm {w}}(\theta ,\varvec{c})\). Using this maximum, for a constant \(\lambda > 0.5\), we define the histogram of dominant directions of a cell as

$$\begin{aligned} h^{\mathrm {C}}_{\mathrm {D}}(\theta , \varvec{c}) = {\left\{ \begin{array}{ll} H^{\mathrm {C}}_{\mathrm {w}}(\theta ,\varvec{c}), &{} H^{\mathrm {C}}_{\mathrm {w}}(\theta ,\varvec{c}) \ge \lambda M^{\mathrm {C}}(\varvec{c}) \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(6)
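In code, this thresholding is a one-line masking of the cell histogram; the sketch below uses the value \(\lambda = 0.9\) adopted later in Sect. 4.

```python
import numpy as np

def dominant_directions(hist, lam=0.9):
    """Eq. (6): keep the bins of a cell histogram whose mass is at
    least lam * max(hist) (lam > 0.5) and zero out the rest."""
    return np.where(hist >= lam * hist.max(), hist, 0.0)
```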

For the region \(\varOmega \) of an image, we have the global dominant directional distribution as

$$\begin{aligned} H^{\mathrm {G}}_{\mathrm {D}}(\theta ) = \frac{\int _{\varvec{c} \in \varOmega } \int _{\varvec{x} \in B(\varvec{c})} h^{\mathrm {C}}_{\mathrm {D}}(\theta , \varvec{x})d\varvec{x}d\varvec{c}}{\int _0^{2\pi } \int _{\varvec{c} \in \varOmega } \int _{\varvec{x} \in B(\varvec{c})} h^{\mathrm {C}}_{\mathrm {D}}(\theta , \varvec{x})d \varvec{x} d\varvec{c} d \theta }. \end{aligned}$$
(7)

2.5 Gradient-Based Object Detection Method

For a reference image f and a template image g, using a moving window W, a gradient-based feature \(F(\cdot )\) and a metric D for features, gradient-based object detection is achieved by minimising

$$\begin{aligned} J(f) = D(F(g_W(\varvec{x})),F(f(\varvec{x}))), \end{aligned}$$
(8)

where

$$\begin{aligned} g_W(\varvec{x}) = \chi _W(\varvec{x})g(\varvec{x}), \ \chi _W(\varvec{x}) = {\left\{ \begin{array}{ll} 1, &{} \mathrm {if} \ \varvec{x} \in W, \\ 0, &{} \mathrm {otherwise}. \end{array}\right. } \end{aligned}$$
(9)

To establish highly accurate detection, we need to construct a discriminative gradient-based feature F(f) and select an appropriate metric for this gradient-based feature.
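A hypothetical sliding-window driver for Eqs. (8)-(9) is sketched below; `feature` stands for any \(F(\cdot )\) of Sect. 3 and `metric` for any of the distances \(D(\cdot ,\cdot )\), and the template g is assumed to have the window size already. This is an illustration, not the authors' exact pipeline.

```python
import numpy as np

def detect(f, g, feature, metric, stride=8):
    """Gradient-based detection of Eqs. (8)-(9): slide a window of
    the template size over the reference image f and return the
    position minimising J = D(F(g_W), F(f_W))."""
    h, w = g.shape
    Fg = feature(g)                        # template feature F(g_W)
    best, best_pos = np.inf, None
    for y in range(0, f.shape[0] - h + 1, stride):
        for x in range(0, f.shape[1] - w + 1, stride):
            J = metric(Fg, feature(f[y:y + h, x:x + w]))
            if J < best:
                best, best_pos = J, (y, x)
    return best_pos, best
```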

2.6 Wasserstein Distance

The p-Wasserstein distance [8] between a pair of probability distributions f(x) and g(y) for \(x \in X\) and \(y \in Y\) is

$$\begin{aligned} W_p(f,g) = \underset{c(x,y)}{\mathrm {min}} \left( \int _X \int _Y |f(x) - g(y)|^p c(x,y)dx dy \right) ^{1/p}, \end{aligned}$$
(10)

where c(x, y) is a flow (transport plan) between f and g, over which the minimisation is taken.

Fig. 2.

Examples of the computation of the Wasserstein distances. (a) Two probability distributions on a circle. (b) Sampled data of the two functions. (c) and (d) illustrate the 1-Wasserstein distance and the binomial-distribution-based 1-Wasserstein distance, respectively. The top rows in (c) and (d) show the residual values after the maximum flows moved from each bin of P to bins of Q. The bottom rows in (c) and (d) show the flows moved from P to Q as the maximum flow. These two figures show that different ground distances give different Wasserstein distances. (e) The state at the end of the computation: all sampled values of P have been moved to bins of Q.

For sampled probability distributions \(P = \{ p_i \}_{i=1}^n\) and \(Q = \{ q_j \}_{j=1}^n\) on a circle, where \(p_{i+n} = p_i\) and \(q_{j+n}=q_j\), we define the ground distance \(d_{ij} = | p_i - q_j |\). The distance between P and Q is computed as

$$\begin{aligned} D_{\text{ W }}(P,Q) = \underset{x_{ij}}{\mathrm {min}} \left( \sum _{i=1}^n \sum _{j=1}^n d_{ij}^p x_{ij} \right) ^{1/p} , \end{aligned}$$
(11)

subject to the conditions \(\sum _{j=1}^n x_{ij} = p_i\), \(\sum _{i=1}^n x_{ij} = q_j\) and \(x_{ij} \ge 0\). The minimisation is achieved by solving this transportation problem with linear programming. As a special case of the Wasserstein distance, the 1-Wasserstein distance is known as the Earth Mover's distance [7].
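A sketch of Eq. (11) as a linear programme with SciPy follows; both histograms are assumed to be normalised to the same total mass so that the transportation problem is feasible, and the ground-distance matrix can be overridden for the variant defined next.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(P, Q, p=1, d=None):
    """p-Wasserstein distance of Eq. (11) via the transportation
    linear programme. `d` optionally overrides the ground-distance
    matrix; the default is d_ij = |p_i - q_j| as defined above.
    Assumes sum(P) == sum(Q) (normalised histograms)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    n = len(P)
    if d is None:
        d = np.abs(np.subtract.outer(P, Q))
    A_eq, b_eq = [], []
    for i in range(n):                 # row sums: sum_j x_ij = p_i
        row = np.zeros((n, n)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(P[i])
    for j in range(n):                 # column sums: sum_i x_ij = q_j
        col = np.zeros((n, n)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(Q[j])
    res = linprog((d ** p).ravel(), A_eq=np.array(A_eq),
                  b_eq=np.array(b_eq), bounds=(0, None),
                  method="highs")
    return res.fun ** (1.0 / p)
```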

Furthermore, we assume a random-walk model for the gap between directional distributions. We define the binomial-distribution-based p-Wasserstein distance by the ground distance \( d_{ij} = | p_i - q_j |{\begin{pmatrix} n \\ d_c(i,j) \end{pmatrix}\frac{1}{2^n}} \), where \(d_c(i,j) = |i-j|\) for \(|i-j|\le \frac{n}{2}\) and \(d_c(i,j) = n - |i-j|\) for \(|i-j| > \frac{n}{2}\) is the cyclic distance between the indices i and j. This ground distance is weighted by a value generated from the binomial distribution of the difference of the indices i and j. Figure 2 illustrates these two kinds of Wasserstein distances.
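The binomial weights can be precomputed as a ground-distance matrix and fed into the linear programme above; the sketch uses the symmetric cyclic index distance \(\min (|i-j|, n-|i-j|)\), which agrees with the definition of \(d_c\) above.

```python
import numpy as np
from math import comb

def binomial_ground_distance(P, Q):
    """Ground-distance matrix of the binomial-distribution-based
    Wasserstein distance: |p_i - q_j| * C(n, d_c(i,j)) / 2^n with
    the cyclic index distance d_c."""
    n = len(P)
    i, j = np.indices((n, n))
    dc = np.minimum(np.abs(i - j), n - np.abs(i - j))
    w = np.vectorize(comb)(n, dc) / 2.0 ** n
    return np.abs(np.subtract.outer(P, Q)) * w

# B1WD between two circular histograms P and Q:
# wasserstein_lp(P, Q, p=1, d=binomial_ground_distance(P, Q))
```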

3 Feature Extractions and Measurements

3.1 Local Directional Distributions Methods

Using the histograms of cells and blocks, we define three features based on local directional distributions; Fig. 3 summarises the extraction flows. For \(H_{\mathrm {w}}(\theta ,\varvec{c}) \in \{H^{\mathrm {C}}_{\mathrm {w}}(\theta ,\varvec{c}), H^{\mathrm {B}}_{\mathrm {w}}(\theta ,\varvec{c}) \}\), we define the directional distribution (DD) feature as

$$\begin{aligned} F_{\mathrm {DD}}(f) = \{ P_{\mathrm {w}}(\varvec{c}) \ | \ \varvec{c} \in \varOmega \}, \ \ P_{\mathrm {w}}(\varvec{c}) = \{H_{\mathrm {w}}(\theta ,\varvec{c}) \ | \ 0 \le \theta < 2\pi \}. \end{aligned}$$
(12)
Fig. 3.

Flow of feature extractions. There are three flows. The top and middle rows show feature extraction with respect to cells and blocks, respectively. In the top and middle rows, a set of histograms is extracted. The bottom row shows feature extraction from the overall region of an image. For these three extractions, we adopt histograms of the directional distribution and the dominant directional distribution. The small box on the left summarises the extraction of the dominant directional distribution in each cell.

For the extracted features, we construct four methods to measure the difference between images. First, for the DD features extracted from cells, we define two discrimination methods. We set \(F(f) = \{ P(\varvec{c}) | \varvec{c} \in \varOmega \}\) and \(F(g) = \{ Q(\varvec{c})| \varvec{c} \in \varOmega \}\) as the features extracted from images f and g. We define the cell-based discrimination method (C method) as

$$\begin{aligned} D_{\mathrm {Cell}} = \int _{\varvec{c} \in \varOmega } \mathrm {D}_{\mathrm {W}} \left( P(\varvec{c}), Q(\varvec{c})\right) d\varvec{c}, \end{aligned}$$
(13)

where \(\mathrm {D}_{\mathrm {W}}(\cdot , \cdot )\) is the Wasserstein distance. This discrimination method sums the Wasserstein distances of the corresponding cells in the two images.
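Discretely, the C method is a plain sum over corresponding cells, as in the sketch below; the BWC and B methods of Eqs. (14) and (15) only change which histograms are paired. Here `metric` is, e.g., the `wasserstein_lp` sketch of Sect. 2.6.

```python
def cell_based_distance(F_f, F_g, metric):
    """C method of Eq. (13): sum of the distances between
    corresponding cell histograms of two images. `F_f`, `F_g` are
    sequences of per-cell histograms in the same cell order."""
    return sum(metric(P, Q) for P, Q in zip(F_f, F_g))
```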

Second, we define the block-wise-cell-based discrimination method (BWC method) as

$$\begin{aligned} D_{\mathrm {BWC}} = \int _{\varvec{c} \in \varOmega } \int _{\varvec{x} \in B(\varvec{c})} \mathrm {D}_{\mathrm {W}} \left( P(\varvec{x}), Q(\varvec{x})\right) d\varvec{x} d\varvec{c}. \end{aligned}$$
(14)

This discrimination method sums the Wasserstein distances of the cells in each pair of corresponding blocks in the two images. In Eqs. (13) and (14), both \(P(\varvec{c})\) and \(Q(\varvec{c})\) are given by the cell histograms \(H^{\mathrm {C}}_{\mathrm {w}}\) defined in Eq. (2).

Third, for the DD features extracted from blocks, we define the block-based discrimination method (B method) as

$$\begin{aligned} D_{\mathrm {Block}} =\int _{\varvec{c} \in \varOmega } \mathrm {D}_{\mathrm {W}}( P(\varvec{c}), Q(\varvec{c}) ) d\varvec{c}, \end{aligned}$$
(15)

where both \(P(\varvec{c})\) and \(Q(\varvec{c})\) are given by the block histograms \(H^{\mathrm {B}}_{\mathrm {w}}\) defined in Eq. (3). This discrimination method sums the Wasserstein distances of the corresponding blocks in the two images.

Fourth, for the DD features extracted from cells and blocks, we define \(L_p\)-norms. We set \(P(\varvec{c}) = \{H^{\mathrm {C}}_P(\theta ,\varvec{c}) \ | \ 0 \le \theta < 2\pi \}\) and \(Q(\varvec{c}) = \{H^{\mathrm {C}}_Q(\theta ,\varvec{c}) \ | \ 0 \le \theta < 2\pi \}\), which are given by the histograms \(H^{\mathrm {C}}_{\mathrm {w}}\) of the two images f and g, respectively. For the two images, we define the \(L_p\)-norm for cells as

$$\begin{aligned} D_{L_p} = \left( \int _{\varOmega } \int _0^{2\pi } |H_P^{\mathrm {C}}(\theta ,\varvec{c}) - H_Q^{\mathrm {C}}(\theta ,\varvec{c}) |^p d\theta d \varvec{c} \right) ^{1/p}. \end{aligned}$$
(16)

This \(L_p\)-norm measures the differences of all the cells dividing the images. Furthermore, we define the \(L_p\)-norm for a collection of block-wise cells as

$$\begin{aligned} D_{L_p} = \left( \int _{\varvec{c} \in \varOmega } \int _{\varvec{x}\in B(\varvec{c})}\int _0^{2\pi } |H_P^{\mathrm {C}}(\theta ,\varvec{x}) - H_Q^{\mathrm {C}}(\theta ,\varvec{x}) |^p d\theta d \varvec{x} d \varvec{c} \right) ^{1/p}. \end{aligned}$$
(17)

This \(L_p\)-norm measures the differences of all the cells obtained by moving a block.

Moreover, we define the \(L_p\)-norm for blocks. To this end, we set \(P(\varvec{c}) = \{H_P^{\mathrm {B}}(\theta ,\varvec{c}) \ | \ 0 \le \theta < 2\pi \}\) and \(Q(\varvec{c}) = \{H_Q^{\mathrm {B}}(\theta ,\varvec{c}) \ | \ 0 \le \theta < 2\pi \}\), which are given by the histograms \(H^{\mathrm {B}}_{\mathrm {w}}\) of the two images f and g, respectively. Then, we define the \(L_p\)-norm for a collection of blocks as

$$\begin{aligned} D_{L_p} = \left( \int _{\varOmega } \int _0^{2\pi } |H_P^{\mathrm {B}}(\theta ,\varvec{c}) - H_Q^{\mathrm {B}}(\theta ,\varvec{c}) |^p d\theta d \varvec{c} \right) ^{1/p}. \end{aligned}$$
(18)
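All three \(L_p\)-norms of Eqs. (16)-(18) reduce, after discretisation, to a single p-norm of the difference of stacked histograms; a sketch assuming the features are arrays of shape (regions, bins):

```python
import numpy as np

def lp_distance(F_f, F_g, p=1):
    """Discrete version of Eqs. (16)-(18): stack the per-region
    histograms of each image and take the L_p-norm of their
    difference."""
    diff = np.asarray(F_f, float) - np.asarray(F_g, float)
    return float((np.abs(diff) ** p).sum() ** (1.0 / p))
```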

3.2 Global Directional Distribution Method

Let an image f be blurred by Gaussian filtering with standard deviation \(\sigma \). For this blurred image f, using the histogram defined in Eq. (4), we construct the global directional distribution (DD) feature

$$\begin{aligned} F^{\mathrm {G}}_{\mathrm {DD}}(f) = P = \{ H^{\mathrm {G}}_{\mathrm {w}}(\theta ) | 0 \le \theta < 2\pi \}. \end{aligned}$$
(19)

This feature represents the directional distribution over the region of an image by one histogram. Furthermore, using the histogram defined in Eq. (7), we construct the dominant directional distribution (DDD) feature as

$$\begin{aligned} F^{\mathrm {G}}_{\mathrm {DDD}}(f) = P = \{ H^{\mathrm {G}}_{\mathrm {D}}(\theta ) | 0 \le \theta < 2\pi \}. \end{aligned}$$
(20)

Both features consist of a histogram of a global directional distribution over an image.

For images f and g, using \(F^{\mathrm {G}} \in \{ F^{\mathrm {G}}_{\mathrm {DD}}, F^{\mathrm {G}}_{\mathrm {DDD}} \}\), we extract the features \( F^{\mathrm {G}}(f) = P\) and \( F^{\mathrm {G}}(g) = Q\), respectively. To measure the difference of the two features, we define the global discrimination method (G method) as

$$\begin{aligned} D_{\mathrm {G}} = D_{\mathrm {W}}(P,Q), \end{aligned}$$
(21)

where \(D_{\mathrm {W}}\) is the Wasserstein distance. Furthermore, we define the \(L_p\)-norm between two features. Setting \(P = \{H_{P}^{\mathrm {G}} (\theta ) | 0 \le \theta < 2\pi \}\) and \(Q = \{H_{Q}^{\mathrm {G}} (\theta ) | 0 \le \theta < 2\pi \}\), we have

$$\begin{aligned} D_{L_p} = \left( \int _0^{2\pi } |H_{P}^{\mathrm {G}} (\theta ) - H_{Q}^{\mathrm {G}}(\theta )|^p d \theta \right) ^{1/p}, \end{aligned}$$
(22)

where both histograms \(H_{P}^{\mathrm {G}}\) and \(H_{Q}^{\mathrm {G}} \) are given by either \(H_{\mathrm {w}}^{\mathrm {G}} \) or \(H_{\mathrm {D}}^{\mathrm {G}}\).

3.3 Histogram of Oriented Gradients Method

We summarise the feature extraction of the HoG method as follows.

  1. Compute \(\varPhi (f)\) as in Sect. 2.1.

  2. Compute a histogram of gradients defined in Eq. (2) in a local region \(C(\varvec{c})\).

  3. By sliding a bounding box, extract the HoG feature of an image.

We normalise the histograms in each block by the \(L_p\)-norm with \(p=2\) as

$$\begin{aligned} H_{\mathrm {H}_p}(\theta ,\varvec{c}) = \frac{H^{\mathrm {C}}_{\mathrm {w}}(\theta , \varvec{c})}{\left( \int _{\varvec{x} \in B(\varvec{c})} \int _{0}^{2\pi } | H^{\mathrm {C}}_{\mathrm {w}}(\theta ,\varvec{x})|^p d\theta d\varvec{x}\right) ^{1/p}}. \end{aligned}$$
(23)

Sliding the centre \(\varvec{c}\) of the block \(B(\varvec{c})\) over the region \(\varOmega \) of an image f, we extract the HoG feature

$$\begin{aligned} F_{\mathrm {H}}(f) = \{H_{\mathrm {H}}(\theta ,\varvec{c}) \ | \ \varvec{c} \in \varOmega , \ 0 \le \theta < 2\pi \}, \end{aligned}$$
(24)

where \(H_{\mathrm {H}} = H_{\mathrm {H}_2}\).
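A sketch of the block normalisation of Eq. (23), where the cell histograms of one block are concatenated and divided by their \(L_p\)-norm; the small constant `eps` is our hypothetical guard against empty blocks, not part of the definition.

```python
import numpy as np

def hog_block_normalise(cell_hists, p=2, eps=1e-12):
    """Eq. (23): L_p-normalise the concatenated cell histograms of
    one block (p = 2 in the HoG method [2]). `cell_hists` has shape
    (cells_per_block, bins)."""
    v = np.asarray(cell_hists, float).ravel()
    return v / ((np.abs(v) ** p).sum() ** (1.0 / p) + eps)
```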

For two HoG features \(F_{\mathrm {H}}(f) = \{H_{\mathrm {H},1}(\theta ,\varvec{c})\}\) and \(F_{\mathrm {H}}(g) = \{H_{\mathrm {H},2}(\theta , \varvec{c})\}\) of images f and g, respectively, we derive the \(L_p\)-norm

$$\begin{aligned} D_{L_p} = \left( \int _{\varOmega } \int _{0}^{2\pi }| H_{\mathrm {H},1}(\theta ,\varvec{c}) - H_{\mathrm {H},2}(\theta ,\varvec{c}) |^p d\theta d\varvec{c} \right) ^{1/p}. \end{aligned}$$
(25)

Measuring the difference of two HoG features by \(D_{L_p}\) is the HoG method of Reference [2]. Furthermore, setting \(H^{'}_1(\varvec{c}) = \{ \int _{\varvec{x}\in B(\varvec{c}) } H_{\mathrm {H},1}(\theta ,\varvec{x}) d\varvec{x} \, | \, 0 \le \theta < 2\pi \}\) and \(H^{'}_2(\varvec{c}) = \{ \int _{\varvec{x}\in B(\varvec{c}) } H_{\mathrm {H},2}(\theta ,\varvec{x}) d\varvec{x} \, | \, 0 \le \theta < 2\pi \}\), we define the block-based discrimination method

$$\begin{aligned} D_{\mathrm {Block}} =\int _{\varvec{c}\in \varOmega } D_{\mathrm {W}}\left( H^{'}_1(\varvec{c}), H^{'}_2(\varvec{c}) \right) d\varvec{c}. \end{aligned}$$
(26)

In the block-based discrimination method, the difference between the DD and HoG features is how each histogram of a block is normalised. The HoG method defines the histogram of a block as the concatenated, vectorised histograms of its cells, normalised by the \(L_2\)-norm.

For the \(L_p\)-normalisation, we have the following theorems.

Theorem 2

Assuming \(f \in {L}_1(\varOmega ) \cap {L}_2(\varOmega )\) for a finite closed set \(\varOmega \), we have the relation

$$\begin{aligned} \Vert f \Vert _1 \le \sqrt{|\varOmega |} \, \Vert f \Vert _2 \end{aligned}$$
(27)

(Proof). For \(f \in {L}_1(\varOmega ) \cap {L}_2(\varOmega )\), using the Cauchy-Schwarz inequality, we have

$$\begin{aligned} \Vert f \Vert _1 = \int _{\varOmega } |f(\varvec{x})| \cdot 1 \, d\varvec{x} \le \left( \int _{\varOmega } |f(\varvec{x})|^2 d\varvec{x} \right) ^{1/2} \left( \int _{\varOmega } 1^2 d\varvec{x} \right) ^{1/2} = \sqrt{|\varOmega |} \, \Vert f \Vert _2. \end{aligned}$$
(28)

(Q.E.D.)

Theorem 3

If we define the mapping \(\phi : \frac{f}{\Vert f \Vert _1} \mapsto \frac{f}{\Vert f \Vert _2} \), then the mapping \(\phi \) is nonlinear.

(Proof). Assume that the mapping \(\phi \) is linear, that is, \( \frac{f}{\Vert f \Vert _2}=\phi (\frac{f}{\Vert f \Vert _1}) =\int _{\varOmega } K(\varvec{x},\varvec{y})\frac{f}{\Vert f \Vert _1}d \varvec{y}\), where the kernel K is independent of f. Then K must satisfy the relation \(K(\varvec{x},\varvec{y})=\frac{\Vert f \Vert _1}{\Vert f \Vert _2}\delta (\varvec{x}-\varvec{y})\) for all f. Since \(\alpha (f)=\frac{\Vert f \Vert _1}{\Vert f \Vert _2}\) is a function of f, K depends on f. This contradicts the assumption that K is independent of f. Therefore, \(\phi \) is a nonlinear transform. (Q.E.D.)

For the normalised histograms \(H_{\mathrm {H}_p}(\theta ,\varvec{c})\) with \(p \in \{ 1, 2 \}\), we define the \({L}_1\)-norm of a block in an image as \(D_{{L}_{1,p}}(\varvec{c}) = \int _0^{2\pi } | H_{\mathrm {H}_p}(\theta ,\varvec{c})| d\theta \). Therefore, we rewrite the \({L}_1\)-norm defined in Eq. (25) for the HoG feature as \(D_{{L}_{1,p}} = \int _{\varvec{c}\in \varOmega } | D_{{L}_{1,p}}(\varvec{c})| d\varvec{c}\) with \(p \in \{ 1,2 \}\). From Theorem 2, the inequality

$$\begin{aligned} D_{{L}_{1,1}} \le \sqrt{n} D_{{L}_{1,2}} \end{aligned}$$
(29)

holds, where n is the number of bins of the histogram in a block. Concatenating the four cell histograms of a block gives a larger upper bound on the \(L_1\)-norm than a single histogram. Furthermore, from Theorem 3, we can infer that \(L_2\)-normalisation makes the separation ratio higher than \(L_1\)-normalisation in the recognition of HoG features.
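The discrete form of this inequality is easy to verify numerically; the snippet below is only a sanity check on random data, not part of the experiments.

```python
import numpy as np

# For any vector x with n entries, ||x||_1 <= sqrt(n) * ||x||_2,
# the discrete counterpart of Theorem 2 and inequality (29).
rng = np.random.default_rng(0)
x = rng.random(36)              # e.g. one block of 4 cells x 9 bins
assert np.abs(x).sum() <= np.sqrt(x.size) * np.sqrt((x ** 2).sum())
```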

4 Numerical Experiments

By comparing the recognition rates of the local and global directional distribution methods, and the HoG method, we examine the mathematical properties that contribute to accurate gradient-based image pattern recognition.

For feature extraction, adopting the same procedure as the original HoG paper [2], we use 9 and 18 orientation directions, \(\{ \frac{\pi }{9}(i-1) \}_{i=1}^9\) and \(\{ \frac{\pi }{9}(i-1) \}_{i=1}^{18}\). For the local and global features, we use nine and 18 directions, respectively. For the HoG feature, we use both nine and 18 directions. The sizes of a cell and a block are \(8\times 8\) pixels and \(2\times 2\) cells, respectively. For the DDD feature, we set \(\lambda = 0.9\).
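For reference, the bin centres of these two quantisations can be written as follows (a sketch of the experimental setting, with spacing \(\pi /9\) in both cases):

```python
import numpy as np

bins_9 = np.pi / 9 * np.arange(9)    # {pi/9 (i-1)}, i = 1..9:  [0, pi)
bins_18 = np.pi / 9 * np.arange(18)  # {pi/9 (i-1)}, i = 1..18: [0, 2*pi)
```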

Throughout this section, as metrics, we use the \(L_1\)- and \(L_2\)-norms, the p-Wasserstein distance with \(p=1\) (1WD) and the binomial-distribution-based p-Wasserstein distance with \(p=1\) (B1WD). For the local directional distribution features, we use the C method, BWC method and B method defined in Eqs. (13), (14) and (15), respectively, and the \(L_p\)-norms defined in Eqs. (16), (17) and (18). For the global directional distribution features, we use the G method and the \(L_p\)-norm defined in Eqs. (21) and (22), respectively. For the HoG feature, we use the \(L_p\)-norm and the B method in Eqs. (25) and (26), respectively.

For the computation of recognition rates, we use the INRIA dataset [2], which benefits from high-quality annotations of pedestrians in diverse settings (city, beach, mountain, etc.). From the INRIA dataset, we select images showing the frontal view of a pedestrian as positive images. These positive images are divided into 38 positive learning images and 115 positive queries. To obtain negative images, we randomly crop 115 background regions from images in the INRIA dataset. We use these negative images as negative queries.

Table 1. Summary of pairs of features and discrimination criteria. For the directional distribution (DD), the histogram of oriented gradients (HoG) and the dominant directional distribution (DDD) features, this table shows whether each discrimination method is available (equation number) or not (\(-\)). Features are divided with respect to whether they are based on a local histogram or a global histogram.

For all the combinations of features and discrimination methods shown in Table 1, we compute recognition rates using Algorithms 1 and 2. In Algorithm 2, the threshold X specifies the percentage of the whole space containing all the queries within which positive queries are accepted. Therefore, if a small threshold gives the highest recognition rate, the combination of feature and discrimination method achieves a discriminative classification.

Figure 4 shows the recognition rates for the local directional distribution method and the HoG method. Figure 5 shows the recognition rates for the global directional distribution method.

In Fig. 4(a), all the discrimination methods except the \(L_1\)-norm for blocks give recognition rates lower than 0.5. Figure 4(b) shows that the \(L_1\)-norm gives the highest recognition rate for both nine and 18 directions for the HoG feature. Figure 5(a)-(d) show that Gaussian filtering with a larger standard deviation gives a more discriminative feature. In Fig. 5(e)-(h), 1WD gives a higher recognition rate than the \(L_p\)-norms and B1WD.

Figure 6 shows the relation between the \(L_1\)- and \(L_2\)-normalisations of the histograms of blocks in the HoG feature. Figure 6(a) shows that \(L_2\)-normalisation yields larger distances between the median and the queries than \(L_1\)-normalisation. This result is consistent with Theorem 2. Figure 6(b) and (c) show that the nonlinear mapping from the \(L_1\)-normalised HoG feature to the \(L_2\)-normalised HoG feature increases the separation ratio of the two categories.

Table 2 summarises the discriminative combinations of features and metrics. In the context of directional statistics, the combination of the WD and the DDD feature gives accurate recognition. Using the \(L_1\)-norm for the DD feature, we can obtain the same performance without the extraction of the dominant directions.

Fig. 4.

Recognition rates for the local directional distribution method and the histogram of oriented gradients method. The vertical and horizontal axes represent the recognition rate and the criterion, respectively. The \(L_1\)-norm gives the highest recognition rate for both methods.

Fig. 5.

The recognition rates for the global directional distribution method. The upper and lower rows show the recognition rates for the directional distribution feature and the dominant directional distribution feature, respectively. The first, second, third and fourth columns show the results for the \(L_1\)-norm, \(L_2\)-norm, 1-Wasserstein distance (1WD) and binomial-distribution-based 1-Wasserstein distance (B1WD), respectively. The vertical and horizontal axes represent the recognition rate and the criterion, respectively. The circle, square, six-rayed star, upward and downward marks represent the results for images blurred by Gaussian filtering with standard deviations 0, 2, 4, 8 and 16, respectively.

Fig. 6.

Distributions of \(L_1\)-norms between the median and the queries. (a) Relation between the \(L_1\)-norms for the \(L_1\)- and \(L_2\)-normalised HoG features. The horizontal and vertical axes represent the \(L_1\)-norms for the \(L_1\)- and \(L_2\)-normalised HoG features, respectively. This relation shows that the mapping \(\phi \) is nonlinear. In (b) and (c), the horizontal and vertical axes represent the distances between the median and the queries and their probability of occurrence, respectively. \(L_2\)-normalisation gives more discriminative distributions for positive and negative queries than \(L_1\)-normalisation.

Table 2. Summary of the discriminative combinations of features and metrics.

5 Discussion

From the results of the experiments in the previous section, we summarise four key observations about the feature extraction methods and discrimination methods.

The first observation is that the low-frequency features of an image are important for image pattern recognition. In general, an image can be represented as a linear combination of orthogonal functions, such as low- and high-frequency sinusoidal functions. The frequency of the extracted features depends on the size of the local regions. If the divided local regions are large, features with low frequencies are extracted. Therefore, blocks extract features with lower frequencies than cells, since the size of a block is larger than that of a cell. In Fig. 4(a), the recognition rates of the block-extracted features with low frequencies are higher than those of the cell- and block-wise-cell-extracted features. The results of Fig. 5 also imply that the discriminative features of distributions of gradients depend on the low-frequency components of an image, since Gaussian filtering extracts the low-frequency components of an image.

The second observation is that the \(L_p\)-norm for aggregated local regions plays a similar role to blur filtering of an image, since the \(L_p\)-norm sums all the differences of local regions. The shapes of the graphs of recognition rates in Fig. 5(a)-(d) are similar to the shapes of the graphs of recognition rates for 1WD and B1WD in Fig. 4(b).

The third observation is that the HoG feature is not a probability distribution of gradients, although the HoG feature is a kind of gradient-based feature. Figure 4(a) and (b) show that the HoG feature gives a higher recognition rate than the local directional distribution method. The only difference between the block-wise local feature and the HoG feature is the normalisation procedure.

The last observation is that if we use the dominant directional distribution, we obtain the same recognition rate as the HoG feature with nine orientation directions and the \(L_2\)-norm.

6 Conclusions

We first introduced metrics for directional distributions of gradients for image pattern recognition. Second, we developed three features based on the directional distribution of an image and their discrimination methods. Finally, in the experiments, we evaluated the performances of all the features in comparison with the HoG method. The results of our experiments clarify the following properties. First, if we use the \(L_1\)-norm for the histogram of oriented gradients feature, we establish a higher recognition rate than with the \(L_2\)-norm and the Wasserstein distances, since the \(L_2\)-normalisation is a nonlinear mapping. Second, the global directional distribution is a more discriminative feature than those of the local directional distribution method. Furthermore, the results of the global directional distribution method imply that the histogram of oriented gradients method extracts low-frequency features of local regions. Third, the dominant directional distribution is a discriminative feature in image pattern recognition. These dominant directional distributions represent edges. Furthermore, using the \(L_1\)-norm for the global directional distribution feature, we can obtain detection as accurate as the combination of the dominant directional distribution and the Wasserstein distance, without the extraction of dominant directions.