HOI-Blender: A Unifying Blender Add-on for Standardization and Visualization of Diverse Human-Object Interaction Datasets

University of Tübingen, Tübingen AI Center
* Equal contribution    † Corresponding author
1st Workshop on Interactive Human-centric Foundation Models @ ICCV 2025, Honolulu, Hawaii
HOI-Blender Teaser

HOI-Blender standardizes heterogeneous HOI datasets into unified Blender scenes, enabling one-click import, animation preview, and rendering with decoupled motion and appearance across 15 public datasets.

Abstract

Human-Object Interaction (HOI) datasets differ in coordinate frames, file formats, data structures, and directory layouts, forcing bespoke loaders and brittle dataset-specific rendering setups. We therefore present HOI-Blender, a Blender add-on that acts as a unified loader: it normalizes coordinates and scale, harmonizes metadata, and maps motion parameters to skinned bodies and object meshes within consistent scenes. Once standardized, camera rigs, lighting setups, and render settings can be reused across datasets, enabling one-click import, animation preview, and image rendering. The add-on further decouples motion from appearance, allowing the same motion to drive different human identities, or the same identity to enact different motions.

Additionally, to streamline annotation, HOI-Blender includes an auto-captioning module that forwards rendered frames to Vision-Language Models to produce action-object descriptions, supporting rapid weak labeling and dataset curation. We demonstrate support for an initial set of 15 public HOI datasets and report a comprehensive qualitative evaluation spanning motion fidelity, mesh integrity, hand-object interactions, and inter-mesh collisions.

Method Overview

HOI-Blender adapts heterogeneous HOI datasets into unified, Blender-canonical components in three key stages:

HOI-Blender Pipeline
HOI-Blender Overview. (1) Data Standardization: ingests diverse HOI datasets and normalizes them into a consistent representation; (2) Scene Assembly: maps SMPL-Family parameters to rigged human bodies, with decoupled motion and appearance; (3) Rendering & Applications: enables scalable rendering with auxiliary render passes (depth, normals, segmentation masks), supporting downstream applications.

Key Features

Standardization of 15 Datasets

Automated dataset identification and tailored loaders resolve file formats, coordinate frames, rotation encodings, scale ambiguities, and metadata inconsistencies across AIST++, AMASS, Hi4D, Duolando, BEHAVE, InterCap, COUCH, OMOMO, NeuralDome, IMHD2, GRAB, Arctic, HIMO, HOI-M3, and CORE4D.
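The core of this standardization is converting each dataset's coordinate convention and unit scale into Blender's Z-up, metric world frame. The following is a minimal sketch of that kind of normalization; the rotation matrix and the example scale factor are illustrative assumptions, not the add-on's actual per-dataset tables:

```python
import numpy as np

# Rotation that maps a Y-up, right-handed frame to Blender's Z-up frame:
# (x, y, z) -> (x, -z, y). This is a proper rotation (determinant +1).
Y_UP_TO_Z_UP = np.array([
    [1.0, 0.0,  0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0,  0.0],
])

def standardize_vertices(verts, up_axis="Y", scale_to_meters=1.0):
    """Bring an (N, 3) vertex array into a Z-up, metric world frame."""
    verts = np.asarray(verts, dtype=np.float64) * scale_to_meters
    if up_axis == "Y":
        verts = verts @ Y_UP_TO_Z_UP.T
    return verts

# A point one unit above the ground in a hypothetical Y-up, millimeter
# dataset lands at (0, 0, 0.001) in Blender's Z-up, meter frame.
p = standardize_vertices([[0.0, 1.0, 0.0]], up_axis="Y", scale_to_meters=1e-3)
```

Rotation encodings (axis-angle, quaternions, matrices) are handled analogously, by converting each dataset's representation into a single canonical one before scene assembly.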

4D HOI Representation

Leverages Blender's native rigging tools—armatures, shape keys, vertex groups—for efficient human animation. Shape and pose blend shapes are exposed as shape keys, with joint transformations mapped to a hierarchical skeleton.
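Blender shape keys realize exactly the linear blend-shape model used by SMPL-family bodies: each key stores a per-vertex offset, and the key's value acts as the blend coefficient. A minimal numeric sketch (array sizes are illustrative; real SMPL templates have ~6890 vertices and 10+ shape components):

```python
import numpy as np

rng = np.random.default_rng(0)
template = rng.standard_normal((4, 3))       # rest-pose vertices (V, 3)
shape_dirs = rng.standard_normal((2, 4, 3))  # per-key vertex offsets (K, V, 3)

def evaluate_shape_keys(template, shape_dirs, key_values):
    """V = template + sum_k key_values[k] * shape_dirs[k]."""
    key_values = np.asarray(key_values).reshape(-1, 1, 1)
    return template + (key_values * shape_dirs).sum(axis=0)

# With all key values at zero, the rest shape is reproduced exactly;
# nonzero values blend the stored offsets onto the template.
rest = evaluate_shape_keys(template, shape_dirs, [0.0, 0.0])
```

Pose-dependent corrective offsets are exposed the same way, while the rigid joint transformations are driven through the armature's bone hierarchy rather than through shape keys.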

Decoupled Motion & Appearance

Motion and appearance are separated: the same motion can drive different human identities, or the same identity can enact different motions. Texture and displacement are independently controllable.

Automated Rendering & Captioning

Dataset-aware camera framing, automated lighting, and configurable rendering. Includes a VLM-based auto-captioning module for generating action-object descriptions from rendered frames.

User Interface

HOI-Blender User Interface
HOI-Blender User Interface. After providing an asset path to SMPL-Family models (1), users can either import a specified model type (2) or load individual sequences directly (3). Both texture and displacement maps may optionally be provided (4). The selected sequence can be rendered directly at a specified step size (5), with output frames optionally passed to the automatic captioning module (6).

Motion–Appearance Decoupling

HOI-Blender separates motion from appearance, enabling flexible combinations. The top row shows the same identity enacting various motion sequences; the bottom row shows the same motion enacted by various identities.


Top row: Same identity, various motions. Bottom row: Same motion, various identities.

Downstream Applications

VLM-Based Motion Captioning

HOI-Blender renders image sequences of standardized HOI dataset animations and forwards them to Vision-Language Models for automated captioning, producing concise yet semantically rich motion-text pairs describing motion phases, contact patterns, and interaction semantics at scale.

VLM-Based Motion Captioning
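The hand-off to the VLM can be pictured as sampling rendered frames at a fixed stride and packing them, together with an instruction, into a chat-style request. The sketch below is a hedged illustration: the message schema mirrors common VLM APIs, but the exact endpoint, model, and prompt used by HOI-Blender are not specified here.

```python
import base64

# Hypothetical prompt; the add-on's actual instruction text may differ.
PROMPT = ("Describe the human action and the object being interacted with "
          "in one concise sentence.")

def build_caption_request(frames, stride=4, model="vlm-model"):
    """frames: list of raw PNG bytes; returns a chat-style request dict."""
    content = [{"type": "text", "text": PROMPT}]
    for png in frames[::stride]:  # subsample to keep the request small
        b64 = base64.b64encode(png).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}

# 8 rendered frames at stride 4 -> 2 images plus the text prompt.
req = build_caption_request([b"frame%d" % i for i in range(8)], stride=4)
```

Batching one request per clip yields the motion-text pairs at scale; the stride trades captioning cost against temporal coverage of the motion phases.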

Generated Data Modalities

HOI-Blender generates multi-view synthetic HOI renders with auxiliary render passes and annotations—including depth, normal and segmentation maps, human and object mesh exports, as well as interaction contact points. Additional annotations include 6-DoF camera poses, motion parameters, and temporally consistent dense point tracks.

Generated Data Modalities
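The exported 6-DoF camera poses tie these modalities together: a world point, the pose (R, t), and the pinhole intrinsics K determine both its pixel location and its value in the depth pass. A minimal sketch with illustrative intrinsics and pose (not values from any dataset):

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])  # pinhole intrinsics (illustrative)

def project(points_world, R, t, K):
    """Project (N, 3) world points; returns (N, 2) pixels and (N,) depths."""
    cam = points_world @ R.T + t         # world -> camera frame
    depth = cam[:, 2]                    # value stored in the depth pass
    pix = cam @ K.T
    pix = pix[:, :2] / pix[:, 2:3]       # perspective divide
    return pix, depth

R = np.eye(3)                            # camera axes aligned with world
t = np.array([0.0, 0.0, 2.0])            # scene 2 m in front of the camera
pix, depth = project(np.array([[0.0, 0.0, 0.0]]), R, t, K)
# The world origin projects to the principal point (320, 240) at depth 2 m.
```

The same relation, applied per frame, is what makes the dense point tracks temporally consistent: a tracked 3D point is simply reprojected under each frame's camera pose.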

Dataset Evaluation

We qualitatively evaluate 50 homogenized motion clips per dataset across six axes on a four-point scale (4 = strong performance, 1 = weak performance):

Animation Smoothness

Motion continuity, free of temporal jitter and abrupt transitions

Motion Realism

Perceived naturalness of motion

Mesh Quality

Absence of deformation artifacts and topological failures

HOI Contact Quality

Physical plausibility of hand-object contact

Hand Grip Quality

Plausibility of grip-action and finger placement

Inter-Mesh Collision

Lack of irregular interpenetration

Contact Quality Scale: 4 – Accurate, 3 – Plausible, 2 – Dubious, 1 – Erroneous
Grip Quality Scale: 4 – Accurate, 3 – Plausible, 2 – Dubious, 1 – Erroneous

Dataset     | Anim. Smoothness | Motion Realism | Mesh Quality | HOI Contact | Hand Grip | Inter-Mesh Collision
AIST++      | 3.52             | 3.78           | 3.54         | —           | —         | —
AMASS       | 3.94             | 3.98           | 3.96         | —           | —         | —
Hi4D        | 3.24             | 4.00           | 3.88         | —           | —         | 3.96
Duolando    | 3.34             | 3.98           | 3.92         | —           | —         | 3.98
BEHAVE      | 2.62             | 3.61           | 3.10         | 2.96        | —         | 2.92
InterCap    | 2.00             | 2.56           | 3.94         | 2.90        | 3.68      | 3.38
COUCH       | 2.86             | 2.96           | 3.67         | 2.73        | —         | 3.25
OMOMO       | 3.94             | 3.94           | 3.92         | 3.98        | —         | 3.36
NeuralDome  | 3.70             | 3.40           | 2.92         | 3.60        | 3.74      | 3.29
IMHD2       | 3.02             | 3.12           | 3.10         | 3.31        | 3.88      | 3.44
GRAB        | 3.91             | 3.52           | 3.98         | 3.43        | 3.94      | 3.69
Arctic      | 3.98             | 3.52           | 4.00         | 3.34        | 3.90      | 3.46
HIMO        | 3.82             | 2.84           | 4.00         | 2.90        | 3.53      | 3.47
HOI-M3      | 3.04             | 3.26           | 4.00         | 3.25        | —         | 2.96
CORE4D      | 3.26             | 3.41           | 3.24         | 3.71        | 3.69      | 3.90

Dashes mark axes not applicable to a dataset (e.g., no object interaction or no articulated hand capture).

Key findings: Isolated human-motion datasets excel in kinematic fidelity (AMASS: 3.94/3.98/3.96 on the first three axes). Multi-human interaction datasets show lifelike captures with minimal collisions (Hi4D, Duolando). Hand-object interaction datasets exhibit refined grasping with high mesh quality (Arctic: mesh quality 4.00, hand grip 3.90). Full-body HOI datasets show wider variance, often impeded by occlusion-driven artifacts, but include datasets with exceptional contact performance (OMOMO: HOI contact 3.98).

Photorealistic Enhancement

HOI-Blender renders can be further enhanced toward photorealism using neural upscaling. We demonstrate this by applying DLSS 5 Anything to our rendered outputs, transforming stylized 3D renders into photorealistic imagery while preserving pose, interaction, and scene composition.

DLSS 5 Enhancement

Video

BibTeX

@inproceedings{xue2025hoiblend,
  title     = {HOI-Blender: A Unifying Blender Add-on for Standardization and
               Visualization of Diverse Human-Object Interaction Datasets},
  author    = {Xue, Yuxuan and Kostyrko, Margaret and Nguyen, Hoai An and
               YM, Pradyumna and Xie, Xianghui and Pons-Moll, Gerard},
  booktitle = {1st Workshop on Interactive Human-centric Foundation Models,
               IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}