TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

ECCV 2024 Oral

Yufei Liu¹, Junwei Zhu², Junshu Tang³, Shijie Zhang⁴, Jiangning Zhang², Weijian Cao², Chengjie Wang², Yunsheng Wu², Dongjin Huang¹*

¹Shanghai University, ²Tencent Youtu Lab, ³Shanghai Jiao Tong University, ⁴Fudan University, *Corresponding Author

Abstract

Texturing 3D humans with semantic UV maps remains a challenge due to the difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D advancements in supervising multi-view renderings using large text-to-image (T2I) models, issues persist with generation speed, text consistency, and texture quality, resulting in data scarcity among existing datasets. We present TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model. Utilizing an efficient texture adaptation fine-tuning strategy, we adapt a large T2I model to a semantic UV structure while preserving its original generalization capability. Leveraging a novel feature translator module, the trained model is capable of generating high-fidelity 3D human textures from either text or image within seconds. Furthermore, we introduce ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1024×1024) 3D human texture dataset, which contains 50k high-fidelity textures with text descriptions. Our dataset and model will be made public for research purposes.

Method

We divide TexDreamer training into two stages: T2UV (green) and I2UV (blue). For T2UV, we use the LDM denoising loss L1 to optimize the text encoder and U-Net. For I2UV, we build a feature translator ϕ_{i2t} to map the input image feature encoded by ϕ_{i-enc} to a conditional feature f_{i2t} for T2UV. We train I2UV by optimizing ϕ_{t-enc} and ϕ_{i-enc}.
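
The two training stages described above can be illustrated with a minimal, dependency-free sketch. It assumes the standard latent diffusion objective (squared error between true and predicted noise); the source names the loss only as "LDM denoise loss", so the exact form used by TexDreamer may differ. All function names, the toy list-based latents, and the linear form of the feature translator are hypothetical and chosen for illustration only.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def denoise_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (assumed loss form)."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def training_step(z0, cond, predict_noise, alpha_bar_t, rng):
    """One denoising step: sample noise, corrupt the latent, score the prediction.

    `predict_noise(z_t, alpha_bar_t, cond)` stands in for the U-Net; `cond` is
    the conditioning feature (a text feature in T2UV, a translated image
    feature in I2UV).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = add_noise(z0, eps, alpha_bar_t)
    eps_pred = predict_noise(z_t, alpha_bar_t, cond)
    return denoise_loss(eps_pred, eps), eps


def translate_image_feature(f_img, weights):
    """Hypothetical linear feature translator: image feature -> conditional feature.

    `weights` holds one row per output dimension.
    """
    return [sum(w * x for w, x in zip(row, f_img)) for row in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    z0 = [0.5, -1.0, 2.0]  # toy latent of a UV texture

    # I2UV path: translate a toy image feature into the conditioning space.
    f_i2t = translate_image_feature([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]])

    # Placeholder predictor that always outputs zero noise.
    zero_predictor = lambda z_t, a, cond: [0.0] * len(z_t)
    loss, _ = training_step(z0, f_i2t, zero_predictor, 0.9, rng)
    print("loss with a zero predictor:", loss)
```

In this sketch the only difference between the two stages is where `cond` comes from: T2UV would pass a text-encoder feature directly, while I2UV would pass the translator output in its place.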

Pipeline

Pipeline for generating synthetic data. (a) Sample texture acquisition. We first use a differentiable renderer to optimize UV textures from multi-view images, then further refine them by projection painting. The acquired sample textures, paired with prompts, are used to train T2UV in TexDreamer. (b) Diverse textured human synthesis. With the help of ChatGPT, we utilize T2UV to generate 50k human textures. Human images are rendered with animation sequences, background images, HDR lighting, and a perspective camera. Orange stars mark the data included in our ATLAS dataset.

Application

Texturing dressed avatars. Our human textures can be applied to complex dressed meshes generated by text-to-3D methods. We show some examples generated by TADA with synthetic UV textures generated by TexDreamer.

BibTeX

@misc{liu2024texdreamer,
  title={TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation},
  author={Yufei Liu and Junwei Zhu and Junshu Tang and Shijie Zhang and Jiangning Zhang and Weijian Cao and Chengjie Wang and Yunsheng Wu and Dongjin Huang},
  year={2024},
  eprint={2403.12906},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

ATLAS Dataset

ATLAS is, to date, the largest high-resolution (1024×1024) 3D human texture dataset.

TexDreamer

TexDreamer is the first zero-shot high-fidelity 3D human texture generation model that supports both text and image inputs.

With efficient texture adaptation fine-tuning, TexDreamer generates semantic 3D human UV textures from text that faithfully reflect the described identity and clothing.