LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue
The Chinese University of Hong Kong, SenseTime Research

🎥 Watch Our Demo (YouTube)

LoRA-Edit enables controllable video editing through mask-aware LoRA fine-tuning.

Abstract

Video editing using diffusion models has achieved remarkable results in generating high-quality edits. However, current methods often rely on large-scale pretraining, which limits their flexibility for specific edits. First-frame-guided editing provides control over the first frame but offers no flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable propagation of edits. This solution offers efficient and adaptable video editing without altering the model architecture.

To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pretrained image-to-video model to the editing context.

The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
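To make this region-specific learning concrete, here is a minimal PyTorch sketch of a mask-weighted diffusion loss, assuming a binary mask that marks which region a given training pair should supervise. The function name, tensor shapes, and weighting scheme are our own assumptions for illustration, not the exact objective used in the paper.

import torch
import torch.nn.functional as F

def mask_weighted_diffusion_loss(noise_pred: torch.Tensor,
                                 noise_target: torch.Tensor,
                                 mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical region-weighted diffusion objective.

    noise_pred, noise_target: (B, C, T, H, W) latent noise tensors.
    mask: (B, 1, T, H, W); 1 where this training pair supervises the
    LoRA (e.g. the edited region when training on reference frames,
    the background when training on the source video), 0 elsewhere.
    """
    per_elem = F.mse_loss(noise_pred, noise_target, reduction="none")
    # Average only over the supervised region so masked-out pixels
    # contribute no gradient.
    return (per_elem * mask).sum() / mask.sum().clamp(min=1.0)

In practice each training pair would use its own mask: the source video with the background mask, and the reference frames with the edit-region mask, so that each area of the video draws supervision from the appropriate source.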

First-Frame-Guided Video Editing

We demonstrate our first-frame-guided video editing results using LoRA-Edit. Given an input video and an edited first frame, our method generates consistent video edits that propagate the first-frame changes throughout the entire sequence.


Additional Reference-Guided Editing

Beyond the edited first frame alone, our method can incorporate additional reference frames to provide more flexible editing guidance. Here we show how adding a second edited frame, taken from a different time point, improves controllability across the entire video sequence.

Comparisons

Comparison with Reference-Guided Video Editing

We compare LoRA-Edit with reference-guided video editing methods, demonstrating our method's advantages in maintaining reference fidelity.

Comparison with First-Frame-Guided Video Editing

We also compare with first-frame-guided video editing methods, showcasing LoRA-Edit's superior performance in propagating first-frame edits with high quality while preserving the background.

How does it work?

Exploring Mask Configurations

We explore different mask configurations as input conditions to the image-to-video model. Left: input conditions, consisting of the mask and the pseudo-video. Right: generation results under each configuration. From top to bottom, we examine four cases: the default case uses the standard I2V mask, preserving only the first frame; Case 1 uses no input condition at all (pure text-to-video generation); Case 2 feeds in the entire video without masking, which produces artifacts; and Case 3 masks out the foreground, which also fails.

Mask Configuration Exploration
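For reference, the sketch below shows one way the four conditioning masks above could be constructed. The function name and shape conventions are hypothetical, not taken from the released code.

import torch

def make_condition_mask(case: str, T: int, H: int, W: int,
                        fg_mask: torch.Tensor = None) -> torch.Tensor:
    """Build a (1, T, H, W) keep-mask for the I2V conditioning video:
    1 = pixel visible to the model as a condition, 0 = masked out.
    fg_mask (T, H, W) marks the foreground; only needed for Case 3.
    """
    m = torch.zeros(1, T, H, W)
    if case == "default":      # keep only the first frame (standard I2V)
        m[:, 0] = 1.0
    elif case == "case1":      # no condition at all (text-to-video)
        pass
    elif case == "case2":      # keep the entire video, no masking
        m.fill_(1.0)
    elif case == "case3":      # hide the foreground in every frame
        m = 1.0 - fg_mask.unsqueeze(0).float()
    return m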

Our Approach

Building on this exploration, we modify the spatiotemporal mask to enable more flexible video edits. Combined with LoRA fine-tuning, the mask serves two roles: it improves the I2V model's alignment with mask constraints, allowing flexible control over which regions are edited or preserved; and it acts as a signal guiding LoRA to learn specific patterns from training data. By configuring the condition video, mask, and target video in different ways, we enable flexible video editing through LoRA.

Our Method
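As an illustration of how the condition video, mask, and target can be configured, the sketch below assembles one fine-tuning sample from a source clip, a set of edited reference frames, and an edit mask. All names and the exact layout are assumptions made for this example; see the paper for the precise configuration.

import torch

def build_training_sample(src_video: torch.Tensor,
                          edited_frames: dict,
                          edit_mask: torch.Tensor):
    """Hypothetical sample layout for mask-aware LoRA fine-tuning.

    src_video:     (T, C, H, W) source clip (motion and structure).
    edited_frames: {frame_index: (C, H, W)} edited references,
                   e.g. {0: edited_first_frame}.
    edit_mask:     (T, 1, H, W); 1 = region to re-synthesize.
    """
    cond = src_video.clone()
    keep = 1.0 - edit_mask                 # background stays visible
    for t, frame in edited_frames.items():
        cond[t] = frame                    # references anchor appearance
        keep[t] = 1.0                      # and are fully unmasked
    cond = cond * keep                     # hide regions the model must edit
    return cond, keep, src_video           # condition, mask, target

Adding a second reference, e.g. {0: frame_0, 40: frame_40}, simply unmasks one more time point, which is how the additional reference-guided editing shown above extends the first-frame setup.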

BibTeX

@article{loraedit2025,
      author    = {Chenjian Gao and Lihe Ding and Xin Cai and Zhanpeng Huang and Zibin Wang and Tianfan Xue},
      title     = {LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning},
      journal   = {arXiv preprint},
      year      = {2025},
}