Abstract
Motion style transfer, which aims to transfer the style of a source motion to a target motion while preserving the target's content, has recently gained considerable attention. Existing works have shown promising results but typically require labeled data for supervised training, which limits their applicability. In this paper, we present a novel self-supervised learning method for motion style transfer. Specifically, we cast the problem into a contrastive learning framework that disentangles the human motion representation into a content code and a style code; the result is generated by compositing the style code of the source motion with the content code of the target motion. To encourage better code disentanglement and composition, we investigate the InfoNCE loss and the triplet loss in a self-supervised manner. The framework aims to generate plausible motions while guaranteeing the disentanglement of the latent codes. Comprehensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches.
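To make the loss formulation concrete, the sketch below illustrates how InfoNCE and triplet objectives can be applied to disentangled style and content codes. This is a minimal illustration under stated assumptions, not the paper's implementation: the names info_nce_loss, triplet_loss, style_enc, content_enc, and decoder, as well as the tensor shapes and hyperparameters (temperature tau, margin), are hypothetical.

# Minimal PyTorch sketch (hypothetical; not the authors' code) of applying
# InfoNCE and triplet losses to disentangled style/content codes.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, tau=0.07):
    # anchor, positive: (B, D); negatives: (B, K, D). The anchor code is
    # pulled toward its positive code and pushed away from K negative codes.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(dim=-1, keepdim=True) / tau     # (B, 1)
    neg = torch.einsum('bd,bkd->bk', anchor, negatives) / tau     # (B, K)
    logits = torch.cat([pos, neg], dim=1)                         # (B, 1+K)
    # The positive sits at index 0 of each row of logits.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet objective: d(anchor, positive) + margin < d(anchor, negative).
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# Composition step: decode the target's content code with the source's style code.
# style_enc, content_enc, and decoder stand in for the actual networks.
# z_style   = style_enc(source_motion)      # style code of the source
# z_content = content_enc(target_motion)    # content code of the target
# stylized  = decoder(z_content, z_style)   # stylized result

In such a self-supervised setup, positive and negative pairs could, for instance, be formed from clips that share content but differ in style (for the content code) and vice versa (for the style code), so no explicit style labels are required at training time.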
Data Availability
The datasets and source code of Refs. [59], [4], and [44] are available as follows:
Ref. [59]: http://mocap.cs.cmu.edu/
Ref. [4]: https://github.com/DeepMotionEditing/deep-motion-editing
Ref. [44]: https://github.com/tianxintao/Online-Motion-Style-Transfer
References
Tenenbaum JB, Freeman WT (1996) Separating style and content. In: Mozer M, Jordan MI, Petsche T (eds) NIPS, pp 662–668. MIT Press
Holden D, Habibie I, Kusajima I, Komura T (2017) Fast neural style transfer for motion data. IEEE Comput Graph Appl 37(4):42–49
Holden D, Saito J, Komura T, Joyce T (2015) Learning motion manifolds with convolutional autoencoders. In: SIGGRAPH Asia, pp 18:1–18:4. ACM
Aberman K, Weng Y, Lischinski D, Cohen-Or D, Chen B (2020) Unpaired motion style transfer from video to animation. ACM Trans Graph 39(4):64
Pan J, Sun H, Kong Y (2021) Fast human motion transfer based on a meta network. Inf Sci 547:367–383
Wang W, Xu J, Zhang L, Wang Y, Liu J (2020) Consistent video style transfer via compound regularization. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, pp 12233–12240. AAAI Press
Park SS, Jang D-K, Lee S-H (2021) Diverse motion stylization for multiple style domains via spatial-temporal graph-based generative model. Proc ACM Comput Graph Interact Tech 4:1–17
Jang D-K, Park SS, Lee S-H (2022) Motion puzzle: Arbitrary motion style transfer by body part. ACM Trans Graph 41:1–16
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence
Kotovenko D, Sanakoyeu A, Lang S, Ommer B (2019) Content and style disentanglement for artistic style transfer. In: IEEE/CVF International Conference on Computer Vision, ICCV, pp 4421–4430. IEEE
Li Y, Li Y, Lu J, Shechtman E, Lee YJ, Singh KK (2022) Contrastive learning for diverse disentangled foreground generation. In: Computer Vision – ECCV. Lecture Notes in Computer Science, vol 13676, pp 334–351. Springer
Bengio Y, Courville AC, Vincent P (2013) Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Kovar L, Gleicher M, Pighin FH (2002) Motion graphs. ACM Trans Graph 21(3):473–482
Min J, Chai J (2012) Motion graphs++: a compact generative model for semantic motion analysis and synthesis. ACM Trans Graph 31(6):153:1–153:12
Safonova A, Hodgins JK (2007) Construction and optimal search of interpolated motion graphs. ACM Trans Graph 26(3):106
Shapiro A, Cao Y, Faloutsos P (2006) Style components. In: Gutwin C, Mann S (eds) Graphics Interface, pp 33–39
Grochow K, Martin SL, Hertzmann A, Popovic Z (2004) Style-based inverse kinematics. ACM Trans Graph 23(3):522–531
Wang JM, Fleet DJ, Hertzmann A (2008) Gaussian process dynamical models for human motion. IEEE Trans Pattern Anal Mach Intell 30(2):283–298
Ukita N, Kanade T (2012) Gaussian process motion graph models for smooth transitions among multiple actions. Comput Vis Image Underst 116(4):500–509
Zhou L, Shang L, Shum HPH, Leung H (2014) Human motion variation synthesis with multivariate gaussian processes. Comput Animat Virtual Worlds 25(3–4):303–311
Lau M, Bar-Joseph Z, Kuffner J (2009) Modeling spatial and temporal variation in motion data. ACM Trans Graph 28(5):171
Young JE, Igarashi T, Sharlin E (2008) Puppet master: Designing reactive character behavior by demonstration. In: Gross MH, James DL (eds) Eurographics/ACM SIGGRAPH Symposium on Computer Animation, SCA, pp 183–191. Eurographics Association
Levine S, Wang JM, Haraux A, Popovic Z, Koltun V (2012) Continuous character control with low-dimensional embeddings. ACM Trans Graph 31(4):28:1–28:10
Ma W, Xia S, Hodgins JK, Yang X, Li C, Wang Z (2010) Modeling style and variation in human motion. In: Popovic Z, Otaduy MA (eds) Eurographics/ACM SIGGRAPH Symposium on Computer Animation, pp 21–30
Zheng Q, Wu W, Pan H, Mitra NJ, Cohen-Or D, Huang H (2021) Inferring object properties from human interaction and transferring them to new motions. Comput Vis Media 7(3):375–392
Zhou Y, Li Z, Xiao S, He C, Huang Z, Li H (2018) Auto-conditioned recurrent networks for extended complex human motion synthesis. In: International Conference on Learning Representations, ICLR. OpenReview.net
Martinez J, Black MJ, Romero J (2017) On human motion prediction using recurrent neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp 4674–4683. IEEE Computer Society
Jain A, Zamir AR, Savarese S, Saxena A (2016) Structural-RNN: Deep learning on spatio-temporal graphs. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp 5308–5317
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville AC, Bengio Y (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp 2672–2680
Sadoughi N, Busso C (2018) Novel realizations of speech-driven head movements with generative adversarial networks. In: ICASSP, pp 6169–6173. IEEE
Starke S, Zhao Y, Komura T, Zaman KA (2020) Local motion phases for learning multi-contact character movements. ACM Trans Graph 39(4):54
Wang Z, Chai J, Xia S (2021) Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE Trans Vis Comput Graph 27(1):14–28
Rose C, Cohen MF, Bodenheimer B (1998) Verbs and adverbs: Multidimensional motion interpolation. IEEE Comput Graph Appl 18(5):32–40
Hoyet L, Ryall K, Zibrek K, Park H, Lee J, Hodgins JK, O’Sullivan C (2013) Evaluating the distinctiveness and attractiveness of human motions on realistic virtual bodies. ACM Trans Graph 32(6):204:1–204:11
Kiiski H, Hoyet L, Cullen B, O’Sullivan C, Newell FN (2013) Perception and prediction of social intentions from human body motion. In: ACM Symposium on Applied Perception, p 134. ACM
Smith HJ, Neff M (2017) Understanding the impact of animated gesture performance on personality perceptions. ACM Trans Graph 36(4):49:1–49:12
Torresani L, Hackney P, Bregler C (2006) Learning motion style synthesis from perceptual observations. In: Schölkopf B, Platt JC, Hofmann T (eds) Neural Information Processing Systems, pp 1393–1400
Kim HJ, Lee S (2019) Perceptual characteristics by motion style category. In: Cignoni P, Miguel E (eds) Annual Conference of the European Association for Computer Graphics, pp 1–4
Hsu E, Pulli K, Popovic J (2005) Style translation for human motion. ACM Trans Graph 24(3):1082–1089
Ikemoto L, Arikan O, Forsyth DA (2009) Generalizing motion edits with gaussian processes. ACM Trans Graph 28(1):1:1–1:12
Jing Y, Yang Y, Feng Z, Ye J, Yu Y, Song M (2020) Neural style transfer: A review. IEEE Trans Vis Comput Graph 26(11):3365–3385
Holden D, Saito J, Komura T (2016) A deep learning framework for character motion synthesis and editing. ACM Trans Graph 35(4):138:1–138:11
Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp 2414–2423. IEEE Computer Society
Smith HJ, Cao C, Neff M, Wang Y (2019) Efficient neural networks for real-time motion style transfer. Proc ACM Comput Graph Interact Tech 2(2):13:1–13:17
Xu J, Xu H, Ni B, Yang X, Wang X, Darrell T (2020) Hierarchical style-based networks for motion synthesis. In: ECCV. Lecture Notes in Computer Science, vol 12356, pp 178–194. Springer
Tao T, Zhan X, Chen Z, van de Panne M (2022) Style-ERD: Responsive and coherent online motion style transfer. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp 6583–6593
Wen Y-H, Yang Z, Fu H, Gao L, Sun Y, Liu Y-J (2021) Autoregressive stylized motion synthesis with generative flow. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp 13607–13607
Locatello F, Bauer S, Lucic M, Rätsch G, Gelly S, Schölkopf B, Bachem O (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In: Chaudhuri K, Salakhutdinov R (eds) ICML. Proceedings of Machine Learning Research, vol 97, pp 4114–4124. PMLR
Xue Y, Guo Y, Zhang H, Xu T, Zhang S, Huang X (2022) Deep image synthesis from intuitive user input: A review and perspectives. Comput Vis Media 8(1):3–31
Liu Y, Wei F, Shao J, Sheng L, Yan J, Wang X (2018) Exploring disentangled feature representation beyond face identification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp 2080–2089. IEEE Computer Society
Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017) beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations, ICLR. OpenReview.net
Kim H, Mnih A (2018) Disentangling by factorising. In: Dy JG, Krause A (eds) ICML, vol 80, pp 2654–2663
Kumar A, Sattigeri P, Balakrishnan A (2017) Variational inference of disentangled latent concepts from unlabeled observations. CoRR abs/1711.00848
Villegas R, Yang J, Hong S, Lin X, Lee H (2017) Decomposing motion and content for natural video sequence prediction. In: 5th International Conference on Learning Representations, ICLR. OpenReview.net
Denton EL, Birodkar V (2017) Unsupervised learning of disentangled representations from video. In: Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R (eds) Advances in Neural Information Processing Systems, pp 4414–4423
van den Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748
He K, Fan H, Wu Y, Xie S, Girshick RB (2020) Momentum contrast for unsupervised visual representation learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp 9726–9735. IEEE
Wang X, Gupta A (2015) Unsupervised learning of visual representations using videos. In: IEEE International Conference on Computer Vision, ICCV, pp 2794–2802. IEEE Computer Society
Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, pp 539–546
Zhang Y, Tang F, Dong W, Huang H, Ma C, Lee T, Xu C (2022) Domain enhanced arbitrary image style transfer via contrastive learning. In: Nandigjav M, Mitra NJ, Hertzmann A (eds) SIGGRAPH '22, pp 12:1–12:8. ACM
Hénaff OJ (2020) Data-efficient image recognition with contrastive predictive coding. In: ICML, vol 119, pp 4182–4192. PMLR
CMU (2019) CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/
Xia S, Wang C, Chai J, Hodgins JK (2015) Realtime style transfer for unlabeled heterogeneous human motion. ACM Trans Graph 34(4):119:1–119:10
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS
Binkowski M, Sutherland DJ, Arbel M, Gretton A (2018) Demystifying MMD GANs. In: 6th International Conference on Learning Representations, ICLR
Author information
Authors and Affiliations
Contributions
Zizhao Wu conceived the presented idea and developed the theory and algorithm. Siyuan Mao and Cheng Zhang carried out the experiments. Zizhao Wu wrote the manuscript with support from Yigang Wang and Ming Zeng.
Corresponding author
Ethics declarations
Conflicts of interest
All authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, Z., Mao, S., Zhang, C. et al. Contrastive disentanglement for self-supervised motion style transfer. Multimed Tools Appl 83, 70523–70544 (2024). https://doi.org/10.1007/s11042-024-18238-4