Wan S2V: Getting Started
Wan S2V turns an image and an audio file into a synchronized video. This quick start covers inputs, framing choices, and your first run.
Inputs
- Reference image: a portrait or a full-body photo for identity and framing.
- Audio: speech or singing; its length determines the output video's duration.
- Optional prompt: scene intent, mood, camera notes.
- Optional pose video: drives body motion while keeping audio sync.
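Since the audio length drives the video duration, it can help to check the clip before a long run. A minimal sketch using Python's standard `wave` module; the 16 fps default is an illustrative assumption, not a documented model constant:

```python
import wave

def audio_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def estimated_frames(path: str, fps: int = 16) -> int:
    """Estimate how many video frames the audio implies.

    The fps value here is an assumption for illustration only;
    the model's actual frame rate may differ.
    """
    return round(audio_duration_seconds(path) * fps)
```

This only reads the WAV header, so it is cheap to run on every input before launching generation.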
Framing
Choose half-body for talk segments and interviews; choose full-body for performances with gestures or choreography.
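The framing choice above could be encoded as a small lookup when scripting batches. The half-body size comes from the quick-start command below; the full-body size is a placeholder assumption, so check your model's supported resolutions:

```python
# Map framing choice to a --size argument.
# "1024*704" matches the quick-start command; "704*1024" is an
# assumed taller frame for full-body shots, not a confirmed value.
FRAMING_SIZES = {
    "half-body": "1024*704",
    "full-body": "704*1024",
}

def size_for(framing: str) -> str:
    """Look up the --size string for a framing choice."""
    try:
        return FRAMING_SIZES[framing]
    except KeyError:
        raise ValueError(f"unknown framing: {framing!r}")
```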
First run
```sh
python generate.py --task s2v-14B --size 1024*704 \
  --ckpt_dir ./Wan2.2-S2V-14B/ \
  --offload_model True --convert_model_dtype \
  --prompt "a speaker explains a topic calmly" \
  --image "examples/i2v_input.JPG" \
  --audio "examples/talk.wav"
```
If memory is tight, keep --offload_model True and --convert_model_dtype enabled. To guide body motion with choreography, add --pose_video with a reference clip.
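When wrapping this in a script (e.g., for subprocess), building the invocation as an argument list avoids shell-quoting issues. A sketch whose flag names mirror the quick-start command above; the helper name and defaults are this example's own:

```python
def build_s2v_command(image, audio, prompt, pose_video=None,
                      ckpt_dir="./Wan2.2-S2V-14B/", size="1024*704"):
    """Assemble the generate.py argument list from the quick start.

    Flags mirror the command shown earlier; --pose_video is appended
    only when a reference clip is supplied.
    """
    cmd = [
        "python", "generate.py",
        "--task", "s2v-14B",
        "--size", size,
        "--ckpt_dir", ckpt_dir,
        "--offload_model", "True",
        "--convert_model_dtype",
        "--prompt", prompt,
        "--image", image,
        "--audio", audio,
    ]
    if pose_video:
        cmd += ["--pose_video", pose_video]
    return cmd
```

Pass the returned list to subprocess.run directly rather than joining it into one string.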