Wan-Move: Open-Source Alternative to Kling 1.5 Pro for Motion-Controllable Video Generation


A team of researchers from Tongyi Lab (Alibaba Group), Tsinghua University, and the University of Hong Kong has introduced Wan-Move, a new approach to precise motion control in generative video models. Unlike existing methods that require additional motion encoders, Wan-Move directly edits the condition features, embedding motion information without modifying the base model architecture. The method generates 5-second 480p videos with motion control accuracy comparable to the commercial Motion Brush in Kling 1.5 Pro. The Wan-Move-14B-480P model is available for download on Hugging Face, GitHub, and ModelScope under the permissive Apache 2.0 license.

Six examples of Wan-Move usage: motion transfer between videos, multi-object motion control, combined object and camera control, complex motion, 3D rotation, and primitive-level control
Wan-Move supports diverse motion control scenarios: from motion transfer to 3D object rotation.

How Wan-Move Works

Wan-Move builds on top of the existing image-to-video model Wan-I2V-14B without adding auxiliary modules. The core idea is to embed motion information by directly editing the condition features of the first frame. These updated features become a latent guidance signal that contains information about both the visual content of the first frame (objects, textures, colors) and how these objects should move in subsequent frames.

Diagram shows latent feature replication process and overall Wan-Move training pipeline with VAE encoder, DiT blocks, and decoder
Wan-Move uses latent feature replication along trajectories to embed motion information without additional modules

Motion is represented by point trajectories. Unlike previous work, the researchers map each trajectory from pixel space into latent coordinates. For each trajectory, the model takes the feature at its starting point on the first frame and copies it to the corresponding positions in all subsequent frames, following that trajectory. Because each copied feature preserves rich first-frame context, the resulting local motion looks more natural, as the sketch below illustrates.
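A minimal sketch of what this feature replication could look like, assuming the condition features form a (frames × channels × height × width) latent grid and the trajectories are already expressed in latent coordinates. The tensor shapes, grid sizes, and function name are illustrative assumptions, not the released implementation:

```python
import torch

def replicate_features_along_trajectories(cond_feats, trajectories):
    """Copy first-frame condition features along point trajectories.

    cond_feats:   (T, C, H, W) latent condition features; frame 0 holds the
                  encoded first frame, later frames start out zeroed.
    trajectories: (N, T, 2) integer latent-grid coordinates (y, x) of N
                  tracked points across T latent frames.
    Returns the edited condition features -- no extra motion encoder involved.
    """
    T, C, H, W = cond_feats.shape
    edited = cond_feats.clone()
    for traj in trajectories:                    # one tracked point at a time
        y0, x0 = traj[0]                         # position on the first frame
        source_feat = cond_feats[0, :, y0, x0]   # rich first-frame feature
        for t in range(1, T):                    # paste it along the track
            y, x = traj[t]
            if 0 <= y < H and 0 <= x < W:        # skip points that leave the frame
                edited[t, :, y, x] = source_feat
    return edited

# Illustrative usage with random data (shapes are placeholders)
feats = torch.zeros(21, 16, 60, 104)             # 21 latent frames, 16 channels
feats[0] = torch.randn(16, 60, 104)              # encoded first frame
ys = torch.randint(0, 60, (200, 21, 1))
xs = torch.randint(0, 104, (200, 21, 1))
edited = replicate_features_along_trajectories(feats, torch.cat([ys, xs], dim=-1))
```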

MoveBench: A New Benchmark

The researchers also created MoveBench, a new benchmark of 1,018 videos (832×480 resolution, 5 seconds each) with detailed motion annotations. It is designed to test how accurately models control object movement over long time intervals and across diverse scenarios.

Videos go through a four-stage preparation pipeline: quality assessment with an expert model, cropping to 480p and sampling to 81 frames, clustering into 54 content categories, and manual selection of 15-25 representative examples per category.
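The cropping and frame-sampling step can be illustrated with a short sketch. The use of OpenCV and the helper name load_clip are assumptions made for this example, based only on the 832×480 / 81-frame specs above:

```python
import cv2
import numpy as np

TARGET_W, TARGET_H, TARGET_FRAMES = 832, 480, 81

def load_clip(path):
    """Read a video, center-crop to the 832:480 aspect ratio, resize, and sample 81 frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        target_aspect = TARGET_W / TARGET_H
        if w / h > target_aspect:                 # too wide: crop left/right
            new_w = int(h * target_aspect)
            x0 = (w - new_w) // 2
            frame = frame[:, x0:x0 + new_w]
        else:                                     # too tall: crop top/bottom
            new_h = int(w / target_aspect)
            y0 = (h - new_h) // 2
            frame = frame[y0:y0 + new_h, :]
        frames.append(cv2.resize(frame, (TARGET_W, TARGET_H)))
    cap.release()
    # uniformly sample 81 frames across the clip
    idx = np.linspace(0, len(frames) - 1, TARGET_FRAMES).round().astype(int)
    return np.stack([frames[i] for i in idx])
```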

MoveBench pipeline includes automatic curation, interactive annotation with SAM, and detailed text descriptions

Two types of annotations are provided for each video: point trajectories and segmentation masks. This allows testing methods with different control approaches. Annotation was performed using an interactive interface: annotators indicated the target area in the first frame, and SAM automatically generated a segmentation mask. As a result, each video contains at least one annotated trajectory, and 192 videos include simultaneous motion of multiple objects.
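The click-to-mask step can be reproduced with the public segment_anything package. The checkpoint path, model variant, and click coordinates below are placeholders; beyond its use of SAM, the paper's exact annotation tooling is not specified here:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (variant and path are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def mask_from_click(first_frame_rgb, click_xy):
    """Turn a single annotator click on the first frame into a segmentation mask."""
    predictor.set_image(first_frame_rgb)             # HxWx3 uint8 RGB image
    point = np.array([click_xy])                     # (1, 2) pixel coordinates
    label = np.array([1])                            # 1 = foreground click
    masks, scores, _ = predictor.predict(
        point_coords=point, point_labels=label, multimask_output=True
    )
    return masks[scores.argmax()]                    # keep the best-scoring mask
```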

Experimental Results

Wan-Move demonstrated the best results among all academic methods on the MoveBench and DAVIS benchmarks, significantly outperforming competitors in multi-object scenarios.

Generation examples from different methods: ImageConductor, LeviTor, Kling 1.5 Pro, and Wan-Move. Red marks indicate motion errors and visual artifacts
Wan-Move demonstrates more accurate trajectory following and fewer visual artifacts compared to competitors

To gauge performance against a professional-grade tool, a 2AFC (two-alternative forced-choice) user study was conducted comparing Wan-Move with the commercial Kling 1.5 Pro. Participants judged motion accuracy, motion quality, and visual quality. Wan-Move proved competitive, winning on motion accuracy (52.2%) and motion quality (53.4%) while matching visual quality (50.2%). Against academic methods, Wan-Move achieved win rates above 96% in all categories.
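For clarity, the win rates above are simply the fraction of 2AFC comparisons decided in Wan-Move's favor; a tiny, purely hypothetical tally shows the computation:

```python
def win_rate(votes):
    """votes: list of 'ours' / 'theirs' choices from 2AFC comparisons."""
    return sum(v == "ours" for v in votes) / len(votes)

# Hypothetical tally: 522 of 1000 motion-accuracy comparisons preferred Wan-Move.
print(win_rate(["ours"] * 522 + ["theirs"] * 478))  # 0.522
```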

Results of the 2AFC user study comparing Wan-Move with Kling 1.5 Pro and academic methods.

Ablation Studies

A comparison between a ControlNet-style control branch and direct concatenation showed comparable quality (FID 12.2 vs. 12.4), but ControlNet increases inference time by 225 seconds, whereas Wan-Move's approach adds only 3 seconds.

Training works best with up to 200 trajectories per video. At inference, the model generalizes well beyond that limit, reaching its lowest EPE of 1.1 with 1,024 trajectories despite never seeing more than 200 during training.
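EPE (endpoint error) is the average Euclidean distance between generated and target trajectory points; a minimal sketch of the metric, assuming trajectories are stored as (N, T, 2) coordinate tensors:

```python
import torch

def endpoint_error(pred_tracks, gt_tracks):
    """Mean Euclidean distance between predicted and ground-truth point tracks.

    pred_tracks, gt_tracks: tensors of shape (N, T, 2) holding (x, y)
    coordinates for N trajectories over T frames.
    """
    return torch.linalg.norm(pred_tracks - gt_tracks, dim=-1).mean()

# Illustrative usage with random tracks
pred = torch.rand(1024, 81, 2) * torch.tensor([832.0, 480.0])
gt = pred + torch.randn_like(pred)   # small perturbation -> small EPE
print(endpoint_error(pred, gt).item())
```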

Conclusions

Wan-Move represents a simple and scalable framework for precise motion control in video generation without architectural changes to the base model. Extensive experiments show that the method generates high-quality videos with motion controllability comparable to commercial tools. The project is distributed under the Apache 2.0 license, allowing free use of the model for both commercial and non-commercial purposes.

