Abstract
When convolutional neural networks are used to tackle learning problems based on music or other time series, the raw one-dimensional data are commonly preprocessed to obtain spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this preprocessing step on the network’s performance and ask whether it can be replaced by adaptive or learned filters applied directly to the raw data. The theoretical results show that approximately reproducing mel-spectrogram coefficients by applying adaptive filters and subsequently time-averaging the squared amplitudes is in principle possible. We also conducted extensive experimental work on the task of singing voice detection in music. These experiments show that, for classification based on convolutional neural networks, features obtained from adaptive filter banks followed by time-averaging of the squared modulus of the filters’ output outperform the canonical Fourier-transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.
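The front end compared against the mel-spectrogram above — band-pass filtering of the raw signal, squared modulus, then time-averaging — can be sketched in a few lines of NumPy. The modulated-window filters, filter length, and averaging parameters below are illustrative choices for the sketch, not the configurations used in the experiments:

```python
import numpy as np

def filterbank_features(x, centers, sr=16000, filt_len=129, avg_len=512, hop=256):
    """Adaptive-filter front end: band-pass filtering of the raw signal,
    squared modulus, then time-averaging (illustrative parameters)."""
    t = (np.arange(filt_len) - filt_len // 2) / sr
    win = np.hanning(filt_len)
    feats = []
    for fc in centers:
        # complex band-pass filter (modulated window) centered at fc Hz
        h = win * np.exp(2j * np.pi * fc * t)
        y = np.convolve(x, h, mode="same")
        p = np.abs(y) ** 2                       # squared modulus
        # time-averaging over avg_len samples, subsampled with step hop
        frames = [p[i:i + avg_len].mean()
                  for i in range(0, len(p) - avg_len + 1, hop)]
        feats.append(frames)
    return np.array(feats)                       # shape: (n_filters, n_frames)

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)     # 1-second test tone
F = filterbank_features(x, centers=[220, 440, 880], sr=sr)
```

On this test tone, the filter centered at the tone’s frequency carries the bulk of the averaged energy, mirroring the behavior of a single mel band.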
Notes
This observation seems to have served as one motivation to introduce the so-called scattering transform, which consists of repeated composition of convolution, a nonlinearity in the form of taking the absolute value and time-averaging. In that framework, mel-spectrogram coefficients are interpreted as first-order scattering coefficients.
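The pipeline described in this note — convolution, modulus nonlinearity, time-averaging — can be sketched as a single (first-order) scattering layer. The dyadic modulated-window filter bank below is a stand-in design for illustration, not the canonical wavelets of the scattering literature:

```python
import numpy as np

def scattering_first_order(x, n_filters=8, filt_len=64, avg_len=256):
    """First-order scattering sketch: convolution with band-pass filters,
    modulus nonlinearity, then low-pass time-averaging."""
    t = np.arange(filt_len) - filt_len // 2
    phi = np.hanning(avg_len)
    phi /= phi.sum()                     # normalized low-pass averaging window
    coeffs = []
    for j in range(n_filters):
        xi = 0.25 * 2.0 ** (-j)          # dyadic center frequencies (cycles/sample)
        psi = np.hanning(filt_len) * np.exp(2j * np.pi * xi * t)
        u = np.abs(np.convolve(x, psi, mode="same"))   # modulus layer
        s1 = np.convolve(u, phi, mode="same")          # time-averaging
        coeffs.append(s1)
    return np.array(coeffs)              # shape: (n_filters, len(x))

x = np.random.default_rng(0).standard_normal(2048)
S1 = scattering_first_order(x)
```

Iterating the convolution–modulus step before the final averaging would produce the higher-order scattering coefficients; the first-order ones shown here are the analogue of mel-spectrogram coefficients.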
References
Abreu LD, Romero JL (2017) MSE estimates for multitaper spectral estimation and off-grid compressive sensing. IEEE Trans Inf Theory 63(12):7770–7776
Andén J, Mallat S (2014) Deep scattering spectrum. IEEE Trans Signal Process 62(16):4114–4128
Anselmi F, Leibo JZ, Rosasco L, Mutch J, Tacchetti A, Poggio TA (2013) Unsupervised learning of invariant representations in hierarchical architectures. CoRR arxiv:1311.4158
Balazs P, Dörfler M, Jaillet F, Holighaus N, Velasco G (2011) Theory, implementation and applications of nonstationary Gabor frames. J Comput Appl Math 236(6):1481–1496
Balazs P, Dörfler M, Kowalski M, Torrésani B (2013) Adapted and adaptive linear time-frequency representations: a synthesis point of view. IEEE Signal Process Mag 30(6):20–31
Bammer R, Dörfler M (2017) Invariance and stability of Gabor scattering for music signals. In: Sampling theory and applications (SampTA), 2017 international conference on. IEEE, pp 299–302
Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392
Choi K, Fazekas G, Sandler M, Cho K (2018) The effects of noisy labels on deep convolutional neural networks for music tagging. IEEE Trans Emerg Top Comput Intell 2(2):139–149
Choi K, Fazekas G, Sandler M (2016) Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th international society for music information retrieval conference
Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation. In: Thirty-second AAAI conference on artificial intelligence
Dieleman S, Brakel P, Schrauwen B (2011) Audio-based music classification with a pretrained convolutional network. In: 12th international society for music information retrieval conference (ISMIR-2011). University of Miami, pp 669–674
Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: Acoustics, speech and signal processing (ICASSP), 2014 IEEE international conference on, pp 6964–6968. https://doi.org/10.1109/ICASSP.2014.6854950
Dörfler M (2001) Time-frequency analysis for music signals: a mathematical approach. J New Music Res 30(1):3–12
Dörfler M, Bammer R, Grill T (2017) Inside the spectrogram: convolutional neural networks in audio processing. In: International conference on sampling theory and applications (SampTA). IEEE, pp 152–155
Dörfler M, Torrésani B (2010) Representation of operators in the time-frequency domain and generalized Gabor multipliers. J Fourier Anal Appl 16(2):261–293
Feichtinger HG, Kozek W (1998) Quantization of TF lattice-invariant operators on elementary LCA groups. In: Feichtinger HG, Strohmer T (eds) Gabor analysis and algorithms, applied and numerical harmonic analysis. Birkhäuser, Boston, pp 233–266
Feichtinger HG, Nowak K (2003) A first survey of Gabor multipliers. In: Feichtinger HG, Strohmer T (eds) Advances in Gabor analysis, applied and numerical harmonic analysis. Birkhäuser, Boston, pp 99–128
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
Grill T, Schlüter J (2015) Music boundary detection using neural networks on combined features and two-level annotations. In: Proceedings of the 16th international society for music information retrieval conference (ISMIR 2015). Malaga, Spain, pp 531–537
Grohs P, Wiatowski T, Bölcskei H (2016) Deep convolutional neural networks on cartoon functions. In: Information theory (ISIT), 2016 IEEE international symposium on. IEEE, pp 1163–1167
Holighaus N, Dörfler M, Velasco GA, Grill T (2013) A framework for invertible, real-time constant-Q transforms. IEEE Trans Audio Speech Lang Process 21(4):775–785
Humphrey EJ, Bello JP (2012) Rethinking automatic chord recognition with convolutional neural networks. In: Machine learning and applications (ICMLA), 2012 11th international conference on. IEEE, vol 2, pp 357–362
Humphrey EJ, Montecchio N, Bittner R, Jansson A, Jehan T (2017) Mining labeled data from web-scale collections for vocal activity detection in music. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR), Suzhou, China
Kingma D, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 6th international conference on learning representations (ICLR). San Diego, USA
Korzeniowski F, Widmer G (2016) A fully convolutional deep auditory model for musical chord recognition. In: Machine learning for signal processing (MLSP), 2016 IEEE 26th international workshop on. IEEE, pp 1–6
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in neural information processing systems, pp 1096–1104
Leglaive S, Hennequin R, Badeau R (2015) Singing voice detection with deep recurrent neural networks. In: Acoustics, speech and signal processing (ICASSP), 2015 IEEE international conference on. IEEE, pp 121–125
Lehner B, Schlüter J, Widmer G (2018) Online, loudness-invariant vocal detection in mixed music signals. IEEE/ACM Trans Audio Speech Lang Process 26(8):1369–1380
Malik M, Adavanne S, Drossos K, Virtanen T, Ticha D, Jarina R (2017) Stacked convolutional and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:1706.02292
Mallat S (2012) Group invariant scattering. Commun Pure Appl Math 65(10):1331–1398
Mallat S (2016) Understanding deep convolutional networks. Philos Trans R Soc Lond A Math Phys Eng Sci 374(2065). https://doi.org/10.1098/rsta.2015.0203. URL http://rsta.royalsocietypublishing.org/content/374/2065/20150203
Schlüter J, Böck S (2013) Musical onset detection with convolutional neural networks. In: 6th international workshop on machine learning and music (MML), Prague, Czech Republic
Schlüter J, Böck S (2014) Improved musical onset detection with convolutional neural networks. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2014). Florence, Italy
Schlüter J, Grill T (2015) Exploring data augmentation for improved singing voice detection with neural networks. In: Proceedings of the 16th international society for music information retrieval conference (ISMIR 2015). Malaga, Spain
Ullrich K, Schlüter J, Grill T (2014) Boundary detection in music structure analysis using convolutional neural networks. In: Proceedings of the 15th international society for music information retrieval conference (ISMIR 2014). Taipei, Taiwan
Waldspurger I (2015) Wavelet transform modulus: phase retrieval and scattering. Ph.D. thesis, Ecole normale supérieure-ENS PARIS
Waldspurger I (2017) Exponential decay of scattering coefficients. In: 2017 international conference on sampling theory and applications (SampTA), pp 143–146. https://doi.org/10.1109/SAMPTA.2017.8024473
Wiatowski T, Grohs P, Bölcskei H (2017) Energy propagation in deep convolutional neural networks. arXiv preprint arXiv:1704.03636
Wiatowski T, Tschannen M, Stanic A, Grohs P, Bölcskei H (2016) Discrete deep feature extraction: a theory and new architectures. In: Proceedings of the international conference on machine learning, pp 2149–2158
Acknowledgements
This research has been supported by the Vienna Science and Technology Fund (WWTF) through Project MA14-018.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix A: Proof of Theorem 1
To cover the situation described in Theorem 1, we assume that the original spectrogram is sub-sampled; in other words, we start the computations concerning a signal f from
The proof is based on the observation that the mel-spectrogram can be expressed through the action of so-called STFT or Gabor multipliers, cf. [17], on any given function in the sense of a bilinear form. Before deriving this correspondence, we first introduce this important class of operators.
Given a window function g, time- and frequency-sub-sampling parameters \(\alpha , \beta\), respectively, and a function \(\mathbf{{m}}: {\mathbb {Z}} \times {\mathbb {Z}} \mapsto {\mathbb {C}}\), the corresponding Gabor multiplier \(G^{\alpha ,\beta }_{g, \mathbf{{m}}}\) is defined as
We next derive the expression of a mel-spectrogram by an appropriately chosen Gabor multiplier. Using sub-sampling factors \(\alpha\) in time and \(\beta\) in frequency as before, we start from (4) and reformulate as follows:
with \(\mathbf{m} (k,l) = \delta (\alpha l-b)\varLambda _\nu (\beta k)\). We see that the mel-coefficients can thus be interpreted via a Gabor multiplier: \({{\text {MS}}}_{g}(f) (b,\nu ) = \langle G^{\alpha ,\beta }_{g, \mathbf{{m}}}f, f \rangle\).
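The identity \({{\text {MS}}}_{g}(f)(b,\nu ) = \langle G^{\alpha ,\beta }_{g, \mathbf{{m}}}f, f \rangle\) rests on the general fact that, for any Gabor multiplier, \(\langle G^{\alpha ,\beta }_{g, \mathbf{{m}}}f, f\rangle = \sum _{k,l} \mathbf{m}(k,l)\, |\langle f, M_{\beta k}T_{\alpha l}g\rangle |^2\). This can be checked numerically in a finite, circular model; the Gaussian window and random mask below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, beta = 64, 4, 4             # signal length and lattice steps
g = np.exp(-0.5 * ((np.arange(N) - N // 2) / 6.0) ** 2)   # Gaussian window
g /= np.linalg.norm(g)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

def atom(l, k):
    """Time-frequency shifted atom M_{beta k} T_{alpha l} g (circular model)."""
    tg = np.roll(g, alpha * l)
    return tg * np.exp(2j * np.pi * beta * k * np.arange(N) / N)

m = rng.random((N // beta, N // alpha))   # nonnegative mask m(k, l)

# Gabor multiplier applied to f: Gf = sum_{k,l} m(k,l) <f, g_{l,k}> g_{l,k}
Gf = np.zeros(N, dtype=complex)
for k in range(N // beta):
    for l in range(N // alpha):
        glk = atom(l, k)
        Gf += m[k, l] * np.vdot(glk, f) * glk

# bilinear form <Gf, f> vs mask-weighted sampled spectrogram
lhs = np.vdot(f, Gf)
rhs = sum(m[k, l] * abs(np.vdot(atom(l, k), f)) ** 2
          for k in range(N // beta) for l in range(N // alpha))
```

With the mask \(\mathbf{m}(k,l) = \delta (\alpha l - b)\varLambda _\nu (\beta k)\), the right-hand side reduces to exactly the mel-coefficient sum over sampled spectrogram values.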
The next step is to switch to an alternative operator representation. Indeed, as shown in [16], every operator H can equally be written by means of its spreading function \(\eta _H\) as
We note that two operators \(H_1\), \(H_2\) are equal if and only if their spreading functions coincide, see [15, 16] for details.
As shown in [15], a Gabor multiplier’s spreading function \(\eta ^{\alpha ,\beta }_{{g, \mathbf{m}} }\) is given by
where \({\mathcal {M}} (x,\xi )\) denotes the \((\beta ^{-1}, \alpha ^{-1})\)-periodic symplectic Fourier transform of \(\mathbf{m}\), i.e.,
We now equally rewrite the time-averaging operation applied to a filtered signal, as defined in (6), as a Gabor multiplier. As before, we set \(\check{h}_\nu (t) = \overline{h_\nu (-t)}\) and have
with \(\mathbf{m}_F (k,l) = T_b \varpi _\nu (l) \delta (\beta k)\). To obtain the error estimate in Corollary 1, first note that by straightforward computation using the operators’ representation by their spreading functions as in (12)
and we can estimate the error by the difference of the spreading functions. We write the sampled version of \(\varLambda _\nu\) using the Dirac comb \(Ш_\beta\): \(\varLambda _\nu (\beta k) = (Ш_\beta \varLambda _\nu )(t) = \sum _k \varLambda _\nu (t)\, \delta (t-\beta k)\), and analogously for \(\varpi _\nu\) using \(Ш_\alpha\), to obtain \(\mathbf{m} = T_b \delta (\alpha l) \cdot Ш_\beta \varLambda _\nu\) and \(\mathbf{m}_F = Ш_\alpha T_b \varpi _\nu \cdot \delta (\beta k)\). Applying the symplectic Fourier transform (14) to \(\mathbf{m}\) then gives:
Now it is a well-known fact that the Fourier transform turns sampling with sampling interval \(\beta\) into periodization by \(1/\beta\), in other words, into a convolution with \(Ш_{1/\beta }\):
hence
Completely analogous considerations for \(\varpi _\nu\) and Ш\(_\alpha\) lead to the periodization of \(\mathcal {F}(\varpi _\nu )\) and thus the following expression for the symplectic Fourier transform of \(\mathbf{m}_F\):
Plugging these expressions into (13) gives the bound (8).
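The sampling-to-periodization fact used in this step can be verified in the finite discrete model: multiplying a length-\(N\) signal by a Dirac comb of spacing \(\beta\) in time periodizes its DFT with period \(N/\beta\), up to the factor \(1/\beta\). A minimal NumPy check (signal length and \(\beta\) chosen arbitrarily, with \(\beta\) dividing \(N\)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta = 240, 4                      # beta must divide N
x = rng.standard_normal(N)

# sampling: multiply by a Dirac comb with spacing beta
comb = np.zeros(N)
comb[::beta] = 1.0
X_sampled = np.fft.fft(x * comb)

# periodization: average of beta spectrum copies shifted by N/beta
X = np.fft.fft(x)
X_periodized = sum(np.roll(X, j * N // beta) for j in range(beta)) / beta
```

The two spectra agree to machine precision; the shifted copies are precisely the aliasing terms that enter the error bound (8).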
Remark 5
It is interesting to interpret the action of an operator in terms of its spreading function. In view of (12), the spreading function determines the amount of shift in time and frequency that the operator imposes on a function. For Gabor multipliers with well-concentrated window functions, the amount of shifting is moderate and determined by the window’s eccentricity. At the same time, the aliasing effects introduced by coarse sub-sampling are reflected in the periodic nature of \({\mathcal {M}}\). Since the sub-sampling density in frequency, determined by \(\beta\), governs the aliasing for \(\mathcal {F}^{-1} (\varLambda _\nu )\), and the sub-sampling density in time, determined by \(\alpha\), governs the aliasing for \(\mathcal {F}(\varpi _\nu )\), the overall approximation quality deteriorates with increasing sub-sampling factors.
Cite this article
Dörfler, M., Grill, T., Bammer, R. et al. Basic filters for convolutional neural networks applied to music: Training or design?. Neural Comput & Applic 32, 941–954 (2020). https://doi.org/10.1007/s00521-018-3704-x