Human-Object Interaction (HOI) datasets differ in coordinate frames, file formats, data structures, and directory layouts, forcing researchers to write bespoke loaders and brittle, dataset-specific rendering setups. We therefore present HOI-Blender, a Blender add-on that standardizes HOI data through a unified loader: it normalizes coordinates and scale, harmonizes metadata, and maps motion parameters onto skinned bodies and object meshes within a common scene convention. Once standardized, camera rigs, lighting compositions, and render settings can be reused across datasets, enabling one-click import, animation preview, and image rendering. The add-on further decouples motion from appearance, allowing the same motion to drive different human identities, or the same identity to enact different motions.
Additionally, to streamline annotation, HOI-Blender includes an auto-captioning module that forwards rendered frames to Vision-Language Models to produce action-object descriptions, supporting rapid weak labeling and dataset curation. We demonstrate support for an initial set of 15 public HOI datasets and report a comprehensive qualitative evaluation spanning motion fidelity, mesh integrity, hand-object interactions, and inter-mesh collisions.
HOI-Blender adapts heterogeneous HOI datasets into unified Blender-canonical components across three key stages:
Automated dataset identification and tailored loaders resolve file formats, coordinate frames, rotation encodings, scale ambiguities, and metadata inconsistencies across AIST++, AMASS, Hi4D, Duolando, BEHAVE, InterCap, COUCH, OMOMO, NeuralDome, IMHD2, GRAB, Arctic, HIMO, HOI-M3, and CORE4D.
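All per-dataset loaders must land in one canonical convention before anything downstream can be shared. The sketch below illustrates the kind of normalization involved, assuming a hypothetical source dataset stored in millimetres with a Y-up frame and an axis-angle global orientation; `normalize_sample` and the chosen conventions are illustrative, not the add-on's actual API.

```python
import numpy as np

# Fixed change of basis from a Y-up, right-handed capture frame to
# Blender's Z-up, right-handed world frame: (x, y, z) -> (x, -z, y).
Y_UP_TO_Z_UP = np.array([[1.0, 0.0,  0.0],
                         [0.0, 0.0, -1.0],
                         [0.0, 1.0,  0.0]])

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(aa)
    if theta < 1e-8:
        return np.eye(3)
    k = aa / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def normalize_sample(vertices_mm, global_orient_aa):
    """Map millimetre, Y-up vertices into metre, Z-up Blender space."""
    R = axis_angle_to_matrix(global_orient_aa)
    vertices_m = vertices_mm * 1e-3          # millimetres -> metres
    return (Y_UP_TO_Z_UP @ R @ vertices_m.T).T
```

With this in place, a point one metre "up" in the source frame lands on Blender's +Z axis regardless of which dataset it came from.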
Leverages Blender's native rigging tools (armatures, shape keys, vertex groups) for efficient human animation. Shape and pose blend shapes are exposed as shape keys, and joint transformations are mapped onto a hierarchical skeleton.
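The mixing that shape keys perform is a linear combination of per-vertex offsets over a basis mesh. A minimal numpy sketch of that evaluation (mesh, key names, and values are toy illustrations, not the add-on's data):

```python
import numpy as np

def apply_shape_keys(v_template, key_deltas, key_values):
    """Evaluate relative shape keys the way Blender mixes them: each key
    stores a per-vertex offset from the basis mesh, scaled by its value."""
    v = v_template.copy()
    for name, value in key_values.items():
        v += value * key_deltas[name]
    return v

# Toy basis mesh with 2 vertices and one blend-shape key.
v_template = np.zeros((2, 3))
deltas = {"Shape000": np.array([[0.0, 0.0, 1.0],
                                [0.0, 0.0, 0.0]])}
posed = apply_shape_keys(v_template, deltas, {"Shape000": 0.5})
```

Exposing blend shapes this way means identity and pose corrections stay editable in Blender's UI rather than being baked into the mesh.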
Motion and appearance are separated: the same motion can drive different human identities, or the same identity can enact different motions. Texture and displacement are independently controllable.
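Because motion and appearance live in disjoint parameter sets, any motion record can be paired with any identity record. A small sketch of that recombination, using hypothetical SMPL-style field names and file paths:

```python
from itertools import product

# Hypothetical records: 'poses'/'trans' carry motion,
# 'betas'/'texture' carry appearance.
motions = [
    {"name": "wave",  "poses": "wave_poses.npy",  "trans": "wave_trans.npy"},
    {"name": "squat", "poses": "squat_poses.npy", "trans": "squat_trans.npy"},
]
identities = [
    {"name": "subject_a", "betas": "a_betas.npy", "texture": "a_albedo.png"},
    {"name": "subject_b", "betas": "b_betas.npy", "texture": "b_albedo.png"},
]

def recombine(motion, identity):
    """Pair any motion with any identity: the union of both records."""
    return {**motion, **identity,
            "name": f"{identity['name']}_{motion['name']}"}

# Every motion can drive every identity: 2 x 2 = 4 animated characters.
combos = [recombine(m, i) for m, i in product(motions, identities)]
```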
Dataset-aware camera framing, automated lighting, and configurable rendering. Includes a VLM-based auto-captioning module for generating action-object descriptions from rendered frames.
HOI-Blender separates motion from appearance, enabling flexible combinations. The top row shows the same identity enacting various motion sequences; the bottom row shows the same motion enacted by various identities.
Top row: Same identity, various motions. Bottom row: Same motion, various identities.
HOI-Blender renders image sequences of standardized HOI dataset animations and forwards them to Vision-Language Models for automated captioning, producing concise yet semantically rich motion-text pairs describing motion phases, contact patterns, and interaction semantics at scale.
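Forwarding frames to a VLM typically means packing them into a multimodal chat request. A sketch in the widely used OpenAI-compatible chat format, where the model name and prompt wording are assumptions rather than the module's fixed choices:

```python
import base64

def build_caption_request(frames, model="gpt-4o"):
    """Pack rendered frames (raw PNG bytes) into an OpenAI-style chat
    payload requesting a one-sentence action-object description."""
    content = [{"type": "text",
                "text": "Describe the human action and the manipulated "
                        "object in one sentence."}]
    for frame in frames:
        b64 = base64.b64encode(frame).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}
```

Sending several frames of one clip in a single request lets the model describe motion phases rather than a single static pose.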
HOI-Blender generates multi-view synthetic HOI renders with auxiliary render passes and annotations, including depth, normal, and segmentation maps, human and object mesh exports, and interaction contact points. Additional annotations include 6-DoF camera poses, motion parameters, and temporally consistent dense point tracks.
We qualitatively evaluate 50 homogenized motion clips per dataset across six axes on a four-point scale (4 = strong performance, 1 = weak performance):
Animation Smoothness: motion continuity, free of temporal jitter and abrupt transitions
Motion Realism: perceived naturalness of motion
Mesh Quality: absence of deformation artifacts and topological failures
HOI Contact: physical plausibility of hand-object contact
Hand Grip: plausibility of grip action and finger placement
Inter-Mesh Collision: absence of irregular interpenetration
4 – Accurate
3 – Plausible
2 – Dubious
1 – Erroneous
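The per-dataset scores in the table below are means over the rated clips, with axes that do not apply to a dataset left blank. A sketch of that aggregation over hypothetical ratings, using NaN for non-applicable axes:

```python
import numpy as np

# Hypothetical per-clip ratings: 50 clips x 6 axes on the 1-4 scale.
# Axes that do not apply (e.g. Hand Grip for a dance-only dataset)
# are stored as NaN and excluded from the average.
rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(50, 6)).astype(float)
scores[:, 4] = np.nan  # e.g. no hand-grip axis for this dataset

per_axis_mean = np.nanmean(scores, axis=0)   # one mean per axis
```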
| Dataset | Anim. Smoothness | Motion Realism | Mesh Quality | HOI Contact | Hand Grip | Inter-Mesh Collision |
|---|---|---|---|---|---|---|
| AIST++ | 3.52 | 3.78 | 3.54 | – | – | – |
| AMASS | 3.94 | 3.98 | 3.96 | – | – | – |
| Hi4D | 3.24 | 4.00 | 3.88 | – | – | 3.96 |
| Duolando | 3.34 | 3.98 | 3.92 | – | – | 3.98 |
| BEHAVE | 2.62 | 3.61 | 3.10 | 2.96 | – | 2.92 |
| InterCap | 2.00 | 2.56 | 3.94 | 2.90 | 3.68 | 3.38 |
| COUCH | 2.86 | 2.96 | 3.67 | 2.73 | – | 3.25 |
| OMOMO | 3.94 | 3.94 | 3.92 | 3.98 | – | 3.36 |
| NeuralDome | 3.70 | 3.40 | 2.92 | 3.60 | 3.74 | 3.29 |
| IMHD2 | 3.02 | 3.12 | 3.10 | 3.31 | 3.88 | 3.44 |
| GRAB | 3.91 | 3.52 | 3.98 | 3.43 | 3.94 | 3.69 |
| Arctic | 3.98 | 3.52 | 4.00 | 3.34 | 3.90 | 3.46 |
| HIMO | 3.82 | 2.84 | 4.00 | 2.90 | 3.53 | 3.47 |
| HOI-M3 | 3.04 | 3.26 | 4.00 | 3.25 | – | 2.96 |
| CORE4D | 3.26 | 3.41 | 3.24 | 3.71 | 3.69 | 3.90 |
Key findings: Isolated human-motion datasets excel in kinematic fidelity (AMASS: smoothness 3.94, realism 3.98, mesh quality 3.96). Multi-human interaction datasets show lifelike captures with minimal collisions (Hi4D, Duolando). Hand-object interaction datasets exhibit refined grasping alongside high mesh quality (Arctic: mesh quality 4.00, hand grip 3.90). Full-body HOI datasets show wider variance, often impeded by occlusion-driven artifacts, yet include datasets with exceptional contact performance (OMOMO: HOI contact 3.98).
HOI-Blender renders can be further enhanced toward photorealism using neural upscaling. We demonstrate this by applying DLSS 5 Anything to our rendered outputs, transforming stylized 3D renders into photorealistic imagery while preserving pose, interaction, and scene composition.
@inproceedings{xue2025hoiblend,
title = {HOI-Blender: A Unifying Blender Add-on for Standardization and
Visualization of Diverse Human-Object Interaction Datasets},
author = {Xue, Yuxuan and Kostyrko, Margaret and Nguyen, Hoai An and
YM, Pradyumna and Xie, Xianghui and Pons-Moll, Gerard},
booktitle = {1st Workshop on Interactive Human-centric Foundation Models,
IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}