Wan S2V: Getting Started
Wan S2V turns an image and an audio file into a synchronized video. This quick start covers inputs, framing choices, and your first run.
Inputs
- Reference image: a portrait or a full-body photo for identity and framing.
- Audio: speech or singing; its length determines the output video's duration.
- Optional prompt: scene intent, mood, camera notes.
- Optional pose video: drives body motion while keeping audio sync.
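Since the audio length drives the video duration, it can help to check the clip before a long run. A minimal sketch using Python's standard `wave` module; the 16 fps default is an illustrative assumption, not a documented model constant:

```python
import wave

def audio_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def estimated_frames(path: str, fps: int = 16) -> int:
    """Estimate how many video frames the audio implies.

    The fps value here is an assumption for illustration only;
    the model's actual frame rate may differ.
    """
    return round(audio_duration_seconds(path) * fps)
```

This only reads the WAV header, so it is cheap to run on every input before launching generation.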
Framing
Choose half-body for talk segments and interviews; choose full-body for performances with gestures or choreography.
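The framing choice above could be encoded as a small lookup when scripting batches. The half-body size comes from the quick-start command below; the full-body size is a placeholder assumption, so check your model's supported resolutions:

```python
# Map framing choice to a --size argument.
# "1024*704" matches the quick-start command; "704*1024" is an
# assumed taller frame for full-body shots, not a confirmed value.
FRAMING_SIZES = {
    "half-body": "1024*704",
    "full-body": "704*1024",
}

def size_for(framing: str) -> str:
    """Look up the --size string for a framing choice."""
    try:
        return FRAMING_SIZES[framing]
    except KeyError:
        raise ValueError(f"unknown framing: {framing!r}")
```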
First run
```sh
python generate.py --task s2v-14B --size 1024*704 \
  --ckpt_dir ./Wan2.2-S2V-14B/ \
  --offload_model True --convert_model_dtype \
  --prompt "a speaker explains a topic calmly" \
  --image "examples/i2v_input.JPG" \
  --audio "examples/talk.wav"
```
If memory is tight, keep --offload_model True and --convert_model_dtype enabled. To guide body motion with choreography, add --pose_video with a reference clip.
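When wrapping this in a script (e.g., for subprocess), building the invocation as an argument list avoids shell-quoting issues. A sketch whose flag names mirror the quick-start command above; the helper name and defaults are this example's own:

```python
def build_s2v_command(image, audio, prompt, pose_video=None,
                      ckpt_dir="./Wan2.2-S2V-14B/", size="1024*704"):
    """Assemble the generate.py argument list from the quick start.

    Flags mirror the command shown earlier; --pose_video is appended
    only when a reference clip is supplied.
    """
    cmd = [
        "python", "generate.py",
        "--task", "s2v-14B",
        "--size", size,
        "--ckpt_dir", ckpt_dir,
        "--offload_model", "True",
        "--convert_model_dtype",
        "--prompt", prompt,
        "--image", image,
        "--audio", audio,
    ]
    if pose_video:
        cmd += ["--pose_video", pose_video]
    return cmd
```

Pass the returned list to subprocess.run directly rather than joining it into one string.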