Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

1University of Tübingen, 2Tübingen AI Center,
3Max Planck Institute for Informatics, Saarland Informatics Campus

Human 3Diffusion: from a single RGB image, we reconstruct a realistic avatar as 3D Gaussian Splats with high-fidelity geometry and texture. We can perform novel view synthesis and triangle mesh extraction from the generated 3D-GS. Thanks to priors from 2D foundation models, our method can reconstruct challenging scenarios, including loose clothing such as skirts, accessories such as hats and bags, children, and even multiple people.


Abstract

TL;DR: A 2D multi-view diffusion model and a 3D diffusion-based generative model can be synchronized during diffusion and reverse sampling, so that each provides complementary information that benefits the other.

Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion.
Our key insight is that 2D multi-view diffusion and 3D reconstruction models provide complementary information for each other, and by coupling them tightly, we can fully leverage the potential of both. We introduce a novel image-conditioned generative reconstruction model for 3D Gaussian Splats that leverages the priors from 2D multi-view diffusion models and provides an explicit 3D representation, which in turn guides the 2D reverse sampling process toward better 3D consistency. Our design follows two principles: (1) multi-view 2D priors enhance generative 3D reconstruction, and (2) the explicit 3D representation refines the consistency of the diffusion sampling trajectory.


Inference time: Given a single RGB image, we sample a realistic 3D avatar represented as 3D Gaussian Splats from our learned conditional distribution via iterative denoising. At each reverse sampling step, the 2D multi-view diffusion model provides an initial estimate of the multi-view images, and our 3D-GS generative model estimates clean 3D Gaussian Splats from the input condition image and the noisy multi-view images. We then render the 3D-GS back to the same camera views to provide 3D-consistent multi-view images during reverse sampling. In other words, we guide the reverse sampling process of the 2D multi-view diffusion model with the generated 3D representation, as sketched below.
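The coupled reverse sampling loop can be sketched in code. The following is a minimal illustration under our own assumptions: multiview_diffusion, gs_reconstructor, and render_gaussians are hypothetical placeholders for the 2D multi-view diffusion model, the 3D-GS reconstruction model, and a Gaussian Splatting renderer; the released implementation may differ. The key point is step (4): the reverse step continues from the 3D-consistent renderings rather than the raw 2D prediction.

import torch

@torch.no_grad()
def sample_avatar(cond_image, cameras, multiview_diffusion, gs_reconstructor,
                  render_gaussians, alphas_cumprod, timesteps):
    # Start from pure Gaussian noise in multi-view image space.
    x_t = torch.randn(len(cameras), 3, 256, 256)
    gaussians = None

    for i, t in enumerate(timesteps):  # e.g. [999, ..., 0]
        a_t = alphas_cumprod[t]
        if i + 1 < len(timesteps):
            a_prev = alphas_cumprod[timesteps[i + 1]]
        else:
            a_prev = torch.tensor(1.0)

        # (1) 2D prior: predict a clean estimate of all views from the
        #     noisy views and the condition image.
        x0_2d = multiview_diffusion(x_t, t, cond_image)

        # (2) 3D lifting: reconstruct 3D Gaussian Splats from the condition
        #     image and the (still imperfect) multi-view estimates.
        gaussians = gs_reconstructor(cond_image, x0_2d, t)

        # (3) Render the Gaussians back to the same cameras; these views
        #     are 3D consistent by construction.
        x0_3d = render_gaussians(gaussians, cameras)

        # (4) Deterministic DDIM-style step that continues the reverse
        #     process from the 3D-consistent estimate.
        eps = (x_t - a_t.sqrt() * x0_3d) / (1.0 - a_t).sqrt()
        x_t = a_prev.sqrt() * x0_3d + (1.0 - a_prev).sqrt() * eps

    return gaussians  # the final clean 3D-GS avatar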

Reconstruction Results

Reconstruction of challenging unseen subjects with diverse appearance, geometry, and accessories. Surprisingly, our approach even generalizes to multiple people, thanks to the strong priors of 2D foundation models.

Reconstruction of unseen subjects from UBC Fashion dataset:

Reconstruction of unseen subjects from IIIT-3Dhuman dataset:

Reconstruction of unseen subjects from Sizer dataset:

Reconstruction of Dwayne 'The Rock' Johnson (left) and Taylor Swift (right), from images collected from the internet:

Generative Power in Reconstruction: We formulate single-image reconstruction as a conditional generation problem. In other words, we learn the conditional distribution of 3D representations given a single image. Reconstruction then amounts to sampling from this distribution, which yields diverse yet sharp completions of the occluded regions of the subject.
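Written out (in our own notation, chosen for illustration), with \mathcal{G} denoting the 3D Gaussian Splats and I the input image, reconstruction is a draw from the learned conditional distribution:

\mathcal{G} \sim p_\theta(\mathcal{G} \mid I)

Different samples for the same I agree on the observed view and differ only in the unobserved regions, which is why occluded parts stay sharp and varied instead of being averaged out as they would be under a single deterministic regression.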

Acknowledgement

We thank Garvita Tiwari, Zehao Yu, Chuqiao Li, Yuliang Xiu, Zhen Liu, Zeju Qiu, Siyao Li, Weiyang Liu, and other colleagues for their feedback, which helped improve this work. This work was made possible by funding from the Carl Zeiss Foundation. This work is also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. G. Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Y. Xue. For this project, R. Marin has been supported by the innovation programme under the Marie Skłodowska-Curie grant agreement No. 101109330.




BibTeX

@article{xue2024human3diffusion,
  author    = {Xue, Yuxuan and Xie, Xianghui and Marin, Riccardo and Pons-Moll, Gerard},
  title     = {Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models},
  journal   = {arXiv},
  year      = {2024},
}