ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models

1Technical University of Munich, 2University of Tübingen, 3Tübingen AI Center,
4Max Planck Institute for Informatics, Saarland Informatics Campus
5Munich Center for Machine Learning (MCML)
*co-first author, corresponding author

ControlEvents synthesizes realistic event camera data conditioned on diverse control signals, such as class text labels, 2D human skeletons, and 3D body poses. It can generate large-scale, realistic event data with pseudo ground-truth labels, which enhances the performance of deep learning models.


Abstract

In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses.
Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models. Our models and generated datasets will be publicly available for future research.




Overview of ControlEvents. For text-conditioned event data synthesis, we fine-tune Stable Diffusion. For 2D and 3D pose-conditioned event data synthesis, we fine-tune ControlNet using skeleton maps or normal maps, respectively. ControlEvents can synthesize large-scale datasets for various tasks.
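The sketch below illustrates how the two conditioning paths described above could be run at inference time with the Hugging Face diffusers API: a fine-tuned Stable Diffusion checkpoint for text-conditioned synthesis, and a fine-tuned ControlNet fed with a rendered skeleton map (or a normal map for 3D poses) for pose-conditioned synthesis. The checkpoint names ("controlevents/...") and the file path are hypothetical placeholders, not the authors' released weights, and the exact event-image encoding may differ.

```python
# Minimal sketch, assuming fine-tuned checkpoints in the diffusers format.
import torch
from diffusers import (
    StableDiffusionPipeline,
    ControlNetModel,
    StableDiffusionControlNetPipeline,
)
from diffusers.utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Text-conditioned synthesis: Stable Diffusion fine-tuned on event frames.
text_pipe = StableDiffusionPipeline.from_pretrained(
    "controlevents/sd-event-finetuned"        # hypothetical checkpoint name
).to(device)
event_image = text_pipe(prompt="event image of a golden retriever").images[0]

# 2) Pose-conditioned synthesis: ControlNet fine-tuned on skeleton / normal maps.
controlnet = ControlNetModel.from_pretrained(
    "controlevents/controlnet-skeleton"       # hypothetical checkpoint name
)
pose_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "controlevents/sd-event-finetuned", controlnet=controlnet
).to(device)

skeleton_map = load_image("skeleton_map.png")  # rendered 2D skeleton (or 3D-pose normal map)
event_image_pose = pose_pipe(
    prompt="event image of a person", image=skeleton_map
).images[0]
```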

Generation Results

Generation of event images from class text labels in N-ImageNet.

Zero-shot generation from unseen text labels in the N-Caltech101 dataset. For each unseen label, we determine the closest text label seen during training based on CLIP cosine similarity.
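A minimal sketch of this label-matching step: embed the class names with the CLIP text encoder and pick the training label with the highest cosine similarity to the unseen label. The label lists below are illustrative examples, not the actual N-ImageNet / N-Caltech101 class sets.

```python
# Find the closest seen text label to an unseen label via CLIP cosine similarity.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

seen_labels = ["golden retriever", "sports car", "acoustic guitar"]  # example training labels
unseen_label = "grand piano"                                         # example unseen label

with torch.no_grad():
    inputs = tokenizer(seen_labels + [unseen_label], padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity

# Cosine similarity between the unseen label and every seen label.
sims = feats[:-1] @ feats[-1]
closest = seen_labels[sims.argmax().item()]
print(f"closest seen label to '{unseen_label}': {closest}")
```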


