From Gallery to Wrist:
Realistic 3D Bracelet Insertion in Videos

The Chinese University of Hong Kong · SenseTime Research · Shanghai AI Laboratory

🎥 Watch Our Demo

Our hybrid pipeline combines 3D Gaussian Splatting and 2D diffusion models for realistic bracelet insertion with temporal consistency and photorealistic lighting.

Abstract

Inserting 3D objects into videos is a longstanding challenge in computer graphics, with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle to maintain temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short of photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results with a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing.

Motivation


Traditional video editing methods struggle to insert objects with both temporal consistency and photorealistic appearance. We propose a hybrid pipeline for inserting 3D objects into videos, combining 3D Gaussian Splatting rendering for temporal consistency with a 2D diffusion-based enhancement for photorealistic lighting. In this example, a virtual bracelet is inserted onto a wrist in a dynamic scene. The 3D representation ensures temporal consistency and correct handling of occlusions as the wrist moves, while the 2D image priors enhance realism by synthesizing realistic shading. Our approach bridges the gap between 3D rendering and 2D diffusion models, achieving both temporal coherence and realism.

Method Overview


Our pipeline inserts a 3D bracelet into a video while maintaining temporal consistency and realistic lighting:

1. We first compute motion and occlusion using 3D Gaussian Splatting (3DGS), leveraging 2D tracking points to align the bracelet with the wrist's motion and monocular depth maps to handle occlusions.
2. Next, we enhance realism through a shading-driven approach, decomposing the image into albedo and shading components. The shading is refined using a diffusion-based model to adapt the bracelet's lighting to the scene, while the albedo ensures color consistency.
3. Finally, we apply temporal smoothing to the bracelet and shadows, optimizing the 3DGS model and interpolating frames to ensure smooth transitions across the video.
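To make the shading-driven step concrete, here is a minimal NumPy sketch of the albedo/shading decomposition and a multi-frame weighted average for temporal smoothing. It is an illustration only: the mean-luminance shading proxy, the function names, and the uniform weighting scheme are our assumptions for this sketch, not the paper's actual intrinsic decomposition or diffusion-based refiner.

```python
import numpy as np

def decompose(frame, eps=1e-6):
    """Split an image (H x W x 3, values in [0, 1]) into shading and albedo.

    Here shading is approximated by per-pixel mean luminance; the actual
    pipeline uses a learned intrinsic decomposition.
    """
    shading = frame.mean(axis=-1, keepdims=True)          # H x W x 1
    albedo = frame / np.maximum(shading, eps)             # chromaticity-like term
    return albedo, shading

def recompose(albedo, shading):
    """Recombine albedo with (possibly refined) shading into an sRGB frame."""
    return np.clip(albedo * shading, 0.0, 1.0)

def temporal_smooth(shadings, weights):
    """Weighted multi-frame average of refined shading maps,
    standing in for the multi-frame weighted adjustment."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                       # normalize weights
    stacked = np.stack(shadings, axis=0)                  # T x H x W x 1
    return np.tensordot(w, stacked, axes=1)               # H x W x 1
```

In this toy form, a diffusion-refined shading map would replace `shading` before `recompose`, and `temporal_smooth` would blend it with neighboring frames to suppress flicker.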

Results

Our method achieves realistic bracelet insertion with proper lighting, shadows, and temporal consistency.

BibTeX

@inproceedings{gao2025gallery,
  title={From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos},
  author={Gao, Chenjian and Ding, Lihe and Han, Rui and Huang, Zhanpeng and Wang, Zibin and Xue, Tianfan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025},
}