Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

TL;DR: A 2D multi-view diffusion model and a 3D diffusion-based generative model can be synchronized during both the diffusion process and reverse sampling, so that each provides complementary information that benefits the other.

Abstract

Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Because the problem is ill-posed, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion.
Our key insight is that 2D multi-view diffusion and 3D reconstruction models provide complementary information for each other, and that coupling them tightly lets us fully leverage the potential of both. We introduce a novel image-conditioned generative 3D Gaussian Splats reconstruction model that leverages the priors from 2D multi-view diffusion models and provides an explicit 3D representation, which in turn guides the 2D reverse sampling process toward better 3D consistency. Our design has two components: (1) multi-view 2D priors enhance generative 3D reconstruction, and (2) the explicit 3D representation refines the consistency of the diffusion sampling trajectory.
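The coupling described above can be illustrated with a toy reverse-sampling loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: `predict_views_2d`, `fit_3d`, and `render` are hypothetical stand-ins for the 2D multi-view diffusion model, the 3D reconstruction model, and the renderer. The "3D representation" is a single shared array rather than Gaussian Splats, and a deterministic DDIM-style update is assumed. The point it shows is structural: at every step the per-view 2D estimate is passed through one shared 3D estimate, and the re-rendered views drive the next sampling step.

```python
import numpy as np


def predict_views_2d(x_t, cond_image):
    """Hypothetical stand-in for a 2D multi-view diffusion model.

    Returns one clean-image estimate per view; views may disagree.
    """
    return 0.5 * (x_t + cond_image[None])


def fit_3d(x0_views):
    """Hypothetical stand-in for the 3D reconstruction model.

    Fuses all per-view estimates into one shared 'scene' array.
    """
    return x0_views.mean(axis=0)


def render(scene, num_views):
    """Hypothetical stand-in for rendering the scene into each view."""
    return np.repeat(scene[None], num_views, axis=0)


def coupled_sampling(cond_image, num_views=4, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 2e-2, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)

    # Start every view from Gaussian noise.
    x_t = rng.standard_normal((num_views,) + cond_image.shape)
    scene = None
    for t in range(num_steps - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0

        # (1) The 2D prior proposes per-view clean images.
        x0_2d = predict_views_2d(x_t, cond_image)
        # (2) An explicit shared 3D estimate is built and re-rendered,
        #     so the clean-image estimate is consistent across views.
        scene = fit_3d(x0_2d)
        x0_3d = render(scene, num_views)

        # Deterministic DDIM-style update using the 3D-consistent estimate.
        eps_hat = (x_t - np.sqrt(ab_t) * x0_3d) / np.sqrt(1.0 - ab_t)
        x_t = np.sqrt(ab_prev) * x0_3d + np.sqrt(1.0 - ab_prev) * eps_hat
    return scene, x_t


scene, views = coupled_sampling(np.ones((8, 8)))
```

In this toy setting the final views are identical by construction, because the last update returns the rendering of the shared scene. In a real system the renderer would produce different images per camera, and consistency would mean agreement with a single underlying 3D shape.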

Reconstruction Results

Reconstruction of challenging unseen subjects with diverse appearance, geometry, and accessories. Surprisingly, our approach even generalizes to multiple persons thanks to the strong prior of 2D foundational models.

Reconstruction of unseen subjects from UBC Fashion dataset:

Reconstruction of unseen subjects from IIIT-3Dhuman dataset:

Reconstruction of unseen subjects from Sizer dataset:

Reconstruction of 'the Rock' Dwayne Johnson (left) and Taylor Swift (right) collected from the internet:

Generative Power in Reconstruction: We formulate single-image reconstruction as a conditional generation problem. In other words, we learn the conditional distribution of 3D representations given a single image. We then sample from this distribution to reconstruct the 3D avatar, which yields diverse yet sharp results in the occluded regions of the subject.
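A toy example may help show why sampling from a conditional distribution can give results that are both diverse and sharp. The sketch below is purely illustrative and assumes a hypothetical two-value "avatar": one visible value fixed by the input image, and one occluded value that is equally likely to be -1 or +1. The function `sample_hidden_region` is an invented stand-in, not part of the method. Each sample commits to one plausible value for the occluded part, whereas averaging over the distribution lands near 0, a value that no individual sample takes.

```python
import numpy as np


def sample_hidden_region(visible, num_samples, seed=0):
    """Hypothetical conditional sampler for a two-value toy 'avatar'.

    Column 0 is the visible value (fixed by the observation);
    column 1 is the occluded value, drawn from two sharp modes.
    """
    rng = np.random.default_rng(seed)
    hidden = rng.choice([-1.0, 1.0], size=num_samples)
    visible_col = np.full(num_samples, visible)
    return np.stack([visible_col, hidden], axis=1)


samples = sample_hidden_region(visible=0.3, num_samples=1000)

# Every sample agrees with the observation on the visible part ...
visible_matches = bool(np.all(samples[:, 0] == 0.3))
# ... while the occluded part varies between sharp modes across samples.
hidden_values = set(np.unique(samples[:, 1]).tolist())
# The average over samples is close to 0, which is not a valid mode.
hidden_mean = float(samples[:, 1].mean())
```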

Acknowledgement

We thank Garvita Tiwari, Zehao Yu, Chuqiao Li, Yuliang Xiu, Zhen Liu, Zeju Qiu, Siyao Li, Weiyang Liu, and other colleagues for their feedback, which improved this work. This work was made possible by funding from the Carl Zeiss Foundation. This work is also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. G. Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Y. Xue. For this project, R. Marin has been supported by the innovation program under the Marie Skłodowska-Curie grant agreement No. 101109330.

BibTeX

@inproceedings{xue2024human3diffusion,
  author    = {Xue, Yuxuan and Xie, Xianghui and Marin, Riccardo and Pons-Moll, Gerard},
  title     = {Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models},
  booktitle = {Advances in Neural Information Processing Systems 38: Annual Conference
    on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver,
    BC, Canada, December 10 - 15, 2024},
  year      = {2024},
}
