Human-Object Interaction (HOI) datasets differ in coordinate frames, file formats, data structures, and directory layouts, forcing researchers to write bespoke loaders and brittle, dataset-specific rendering setups. We therefore present HOI-Blender, a Blender add-on that standardizes HOI data through a unified loader: it normalizes coordinates and scale, harmonizes metadata, and maps motion parameters onto skinned bodies and object meshes within a common scene convention. Once standardized, camera rigs, lighting compositions, and render settings can be reused across datasets, enabling one-click import, animation preview, and image rendering. The add-on further decouples motion from appearance, allowing the same motion to drive different human identities, or the same identity to enact different motions.
Additionally, to streamline annotation, HOI-Blender includes an auto-captioning module that forwards rendered frames to Vision-Language Models to produce action-object descriptions, supporting rapid weak labeling and dataset curation. We demonstrate support for an initial set of 15 public HOI datasets and report a comprehensive qualitative evaluation spanning motion fidelity, mesh integrity, hand-object interactions, and inter-mesh collisions.
HOI-Blender adapts heterogeneous HOI datasets into unified Blender-canonical components across three key stages:
Automated dataset identification and tailored loaders resolve file formats, coordinate frames, rotation encodings, scale ambiguities, and metadata inconsistencies across AIST++, AMASS, Hi4D, Duolando, BEHAVE, InterCap, COUCH, OMOMO, NeuralDome, IMHD2, GRAB, Arctic, HIMO, HOI-M3, and CORE4D.
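All per-dataset loaders must land in one canonical convention before anything downstream can be shared. The sketch below illustrates the kind of normalization involved, assuming a hypothetical source dataset stored in millimetres with a Y-up frame and an axis-angle global orientation; `normalize_sample` and the chosen conventions are illustrative, not the add-on's actual API.

```python
import numpy as np

# Fixed change of basis from a Y-up, right-handed capture frame to
# Blender's Z-up, right-handed world frame: (x, y, z) -> (x, -z, y).
Y_UP_TO_Z_UP = np.array([[1.0, 0.0,  0.0],
                         [0.0, 0.0, -1.0],
                         [0.0, 1.0,  0.0]])

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(aa)
    if theta < 1e-8:
        return np.eye(3)
    k = aa / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def normalize_sample(vertices_mm, global_orient_aa):
    """Map millimetre, Y-up vertices into metre, Z-up Blender space."""
    R = axis_angle_to_matrix(global_orient_aa)
    vertices_m = vertices_mm * 1e-3          # millimetres -> metres
    return (Y_UP_TO_Z_UP @ R @ vertices_m.T).T
```

With this in place, a point one metre "up" in the source frame lands on Blender's +Z axis regardless of which dataset it came from.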
Leverages Blender's native rigging tools (armatures, shape keys, vertex groups) for efficient human animation. Shape and pose blend shapes are exposed as shape keys, and joint transformations are mapped onto a hierarchical skeleton.
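The mixing that shape keys perform is a linear combination of per-vertex offsets over a basis mesh. A minimal numpy sketch of that evaluation (mesh, key names, and values are toy illustrations, not the add-on's data):

```python
import numpy as np

def apply_shape_keys(v_template, key_deltas, key_values):
    """Evaluate relative shape keys the way Blender mixes them: each key
    stores a per-vertex offset from the basis mesh, scaled by its value."""
    v = v_template.copy()
    for name, value in key_values.items():
        v += value * key_deltas[name]
    return v

# Toy basis mesh with 2 vertices and one blend-shape key.
v_template = np.zeros((2, 3))
deltas = {"Shape000": np.array([[0.0, 0.0, 1.0],
                                [0.0, 0.0, 0.0]])}
posed = apply_shape_keys(v_template, deltas, {"Shape000": 0.5})
```

Exposing blend shapes this way means identity and pose corrections stay editable in Blender's UI rather than being baked into the mesh.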
Motion and appearance are separated: the same motion can drive different human identities, or the same identity can enact different motions. Texture and displacement are independently controllable.
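Because motion and appearance live in disjoint parameter sets, any motion record can be paired with any identity record. A small sketch of that recombination, using hypothetical SMPL-style field names and file paths:

```python
from itertools import product

# Hypothetical records: 'poses'/'trans' carry motion,
# 'betas'/'texture' carry appearance.
motions = [
    {"name": "wave",  "poses": "wave_poses.npy",  "trans": "wave_trans.npy"},
    {"name": "squat", "poses": "squat_poses.npy", "trans": "squat_trans.npy"},
]
identities = [
    {"name": "subject_a", "betas": "a_betas.npy", "texture": "a_albedo.png"},
    {"name": "subject_b", "betas": "b_betas.npy", "texture": "b_albedo.png"},
]

def recombine(motion, identity):
    """Pair any motion with any identity: the union of both records."""
    return {**motion, **identity,
            "name": f"{identity['name']}_{motion['name']}"}

# Every motion can drive every identity: 2 x 2 = 4 animated characters.
combos = [recombine(m, i) for m, i in product(motions, identities)]
```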
Dataset-aware camera framing, automated lighting, and configurable rendering. Includes a VLM-based auto-captioning module for generating action-object descriptions from rendered frames.
HOI-Blender separates motion from appearance, enabling flexible combinations. The top row shows the same identity enacting various motion sequences; the bottom row shows the same motion enacted by various identities.
Top row: Same identity, various motions. Bottom row: Same motion, various identities.
HOI-Blender renders image sequences of standardized HOI dataset animations and forwards them to Vision-Language Models for automated captioning, producing concise yet semantically rich motion-text pairs describing motion phases, contact patterns, and interaction semantics at scale.
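Forwarding frames to a VLM typically means packing them into a multimodal chat request. A sketch in the widely used OpenAI-compatible chat format, where the model name and prompt wording are assumptions rather than the module's fixed choices:

```python
import base64

def build_caption_request(frames, model="gpt-4o"):
    """Pack rendered frames (raw PNG bytes) into an OpenAI-style chat
    payload requesting a one-sentence action-object description."""
    content = [{"type": "text",
                "text": "Describe the human action and the manipulated "
                        "object in one sentence."}]
    for frame in frames:
        b64 = base64.b64encode(frame).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}
```

Sending several frames of one clip in a single request lets the model describe motion phases rather than a single static pose.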
HOI-Blender generates multi-view synthetic HOI renders with auxiliary render passes and annotations, including depth, normal, and segmentation maps, human and object mesh exports, and interaction contact points. Additional annotations include 6-DoF camera poses, motion parameters, and temporally consistent dense point tracks.
We qualitatively evaluate 50 homogenized motion clips per dataset across six axes on a four-point scale (4 = strong performance, 1 = weak performance):
Animation Smoothness: motion continuity, free of temporal jitter and abrupt transitions
Motion Realism: perceived naturalness of motion
Mesh Quality: absence of deformation artifacts and topological failures
HOI Contact: physical plausibility of hand-object contact
Hand Grip: plausibility of grip action and finger placement
Inter-Mesh Collision: absence of irregular interpenetration
4 – Accurate
3 – Plausible
2 – Dubious
1 – Erroneous
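The per-dataset scores in the table below are means over the rated clips, with axes that do not apply to a dataset left blank. A sketch of that aggregation over hypothetical ratings, using NaN for non-applicable axes:

```python
import numpy as np

# Hypothetical per-clip ratings: 50 clips x 6 axes on the 1-4 scale.
# Axes that do not apply (e.g. Hand Grip for a dance-only dataset)
# are stored as NaN and excluded from the average.
rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(50, 6)).astype(float)
scores[:, 4] = np.nan  # e.g. no hand-grip axis for this dataset

per_axis_mean = np.nanmean(scores, axis=0)   # one mean per axis
```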
| Dataset | Anim. Smoothness | Motion Realism | Mesh Quality | HOI Contact | Hand Grip | Inter-Mesh Collision |
|---|---|---|---|---|---|---|
| AIST++ | 3.52 | 3.78 | 3.54 | – | – | – |
| AMASS | 3.94 | 3.98 | 3.96 | – | – | – |
| Hi4D | 3.24 | 4.00 | 3.88 | – | – | 3.96 |
| Duolando | 3.34 | 3.98 | 3.92 | – | – | 3.98 |
| BEHAVE | 2.62 | 3.61 | 3.10 | 2.96 | – | 2.92 |
| InterCap | 2.00 | 2.56 | 3.94 | 2.90 | 3.68 | 3.38 |
| COUCH | 2.86 | 2.96 | 3.67 | 2.73 | – | 3.25 |
| OMOMO | 3.94 | 3.94 | 3.92 | 3.98 | – | 3.36 |
| NeuralDome | 3.70 | 3.40 | 2.92 | 3.60 | 3.74 | 3.29 |
| IMHD2 | 3.02 | 3.12 | 3.10 | 3.31 | 3.88 | 3.44 |
| GRAB | 3.91 | 3.52 | 3.98 | 3.43 | 3.94 | 3.69 |
| Arctic | 3.98 | 3.52 | 4.00 | 3.34 | 3.90 | 3.46 |
| HIMO | 3.82 | 2.84 | 4.00 | 2.90 | 3.53 | 3.47 |
| HOI-M3 | 3.04 | 3.26 | 4.00 | 3.25 | – | 2.96 |
| CORE4D | 3.26 | 3.41 | 3.24 | 3.71 | 3.69 | 3.90 |
Key findings: Isolated human-motion datasets excel in kinematic fidelity (AMASS: smoothness 3.94, realism 3.98, mesh quality 3.96). Multi-human interaction datasets show lifelike captures with minimal collisions (Hi4D, Duolando). Hand-object interaction datasets exhibit refined grasping alongside high mesh quality (Arctic: mesh quality 4.00, hand grip 3.90). Full-body HOI datasets show wider variance, often impeded by occlusion-driven artifacts, yet include datasets with exceptional contact performance (OMOMO: HOI contact 3.98).
HOI-Blender renders can be further enhanced toward photorealism using neural upscaling. We demonstrate this by applying DLSS 5 Anything to our rendered outputs, transforming stylized 3D renders into photorealistic imagery while preserving pose, interaction, and scene composition.
@inproceedings{xue2025hoiblend,
title = {HOI-Blender: A Unifying Blender Add-on for Standardization and
Visualization of Diverse Human-Object Interaction Datasets},
author = {Xue, Yuxuan and Kostyrko, Margaret and Nguyen, Hoai An and
YM, Pradyumna and Xie, Xianghui and Pons-Moll, Gerard},
booktitle = {1st Workshop on Interactive Human-centric Foundation Models,
IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}