Wan S2V
Wan S2V is an audio-driven video generation approach in the Wan 2.2 family. It takes a single image and audio, and produces synchronized video with expressive face and body motion, timing aligned to speech or music, and camera behavior that suits the scene. The model supports half-body and full-body setups and is suitable for dialogue, singing, and performance.
Last updated: 2025-08-27
What is Wan S2V?
Wan S2V stands for Speech-to-Video in the Wan 2.2 series. From one reference image and an audio track, the model generates a coherent video. Mouth shapes match phonemes in speech, expressions follow the emotional tone, and body movement and framing are consistent with the prompt. This focus on timing and expression makes Wan S2V practical for film and television use, narration, and content creation.
The pipeline aligns audio features with the diffusion process, injecting signals that guide motion and expression across frames. It supports pose control when a pose video is provided, so creators can match choreography or specific movement patterns. The result is a controlled, synchronized output suitable for talk segments, songs, and staged scenes.
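The idea of aligning audio features with frames can be sketched as follows. This is an illustration only, not the model's actual conditioning code: `align_audio_to_frames` is a hypothetical helper showing how a feature sequence might be partitioned so that each video frame receives its own slice of audio context.

```python
import numpy as np

def align_audio_to_frames(audio_features: np.ndarray, num_frames: int) -> np.ndarray:
    """Evenly partition a (T, D) audio-feature sequence into per-frame windows.

    Illustrative only: the real Wan S2V conditioning is internal to the model;
    this just shows the idea of giving each frame its own slice of audio.
    """
    # Indices that split T feature steps across num_frames frames
    bounds = np.linspace(0, len(audio_features), num_frames + 1, dtype=int)
    # Mean-pool each window so every frame gets one (D,) conditioning vector
    return np.stack([
        audio_features[a:b].mean(axis=0) if b > a else np.zeros(audio_features.shape[1])
        for a, b in zip(bounds[:-1], bounds[1:])
    ])

# Example: 160 audio feature steps mapped onto 16 video frames
feats = np.random.rand(160, 32)
per_frame = align_audio_to_frames(feats, 16)
print(per_frame.shape)  # (16, 32)
```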
In the broader Wan 2.2 lineup, S2V complements text-to-video and image-to-video options. Together they cover prompt-only generation, reference-image motion, and audio-synchronized production. This page focuses on Wan S2V usage and practical guidance.
Key Capabilities
Image + Audio to Video
Provide a single reference image and an audio file. The model produces a video where speech and visuals are synchronized. Facial motion, head turns, and shoulder movement follow the sound while keeping identity consistent.
Full-body or Half-body
Configure framing to match your scene. For dialogue, half-body framing keeps attention on facial expression. For performance and music, full-body framing shows posture, gestures, and choreography.
Pose + Audio Control
Optionally add a pose video to guide body movement while keeping audio alignment. This helps reproduce motion patterns from reference clips, which is useful for music videos and choreographed scenes.
Instruction Following
Prompts can guide scene intent: camera speed, mood, and high-level action. The model translates brief instructions into lighting and motion tendencies that fit the audio and the image.
Run Wan 2.2 — Installation
The Wan 2.2 repository includes S2V along with T2V, I2V, and TI2V. Follow these steps to set up a local environment. Notes mirror publicly shared guidance. For details on GPUs and memory, see the project documentation.
Clone the repo
```shell
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
```
Install dependencies
```shell
# Ensure torch >= 2.4.0
# If the installation of `flash_attn` fails, try installing the other packages first and install `flash_attn` last
pip install -r requirements.txt
```
Model download
| Models | Description |
|---|---|
| T2V-A14B | Text-to-Video MoE model, supports 480P & 720P |
| I2V-A14B | Image-to-Video MoE model, supports 480P & 720P |
| TI2V-5B | High-compression VAE, T2V+I2V, supports 720P |
| S2V-14B | Speech-to-Video model, supports 480P & 720P |
Note: The TI2V-5B model supports 720P video generation at 24 FPS.
Download with huggingface-cli
```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
```
Download with modelscope-cli
```shell
pip install modelscope
modelscope download Wan-AI/Wan2.2-T2V-A14B --local_dir ./Wan2.2-T2V-A14B
```
Run Text-to-Video
Single-GPU inference:
```shell
python generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B \
  --offload_model True --convert_model_dtype \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```
This runs on a GPU with at least 80GB of VRAM. If you hit out-of-memory errors, add --offload_model True, --convert_model_dtype, and --t5_cpu.
Multi-GPU (FSDP + DeepSpeed Ulysses):
```shell
torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```
Prompt extension (optional)
Using Dashscope API:
```shell
DASH_API_KEY=your_key torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" \
  --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'
```
Using a local Qwen model:
```shell
torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" \
  --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'
```
Run Image-to-Video
Single-GPU inference:
```shell
python generate.py --task i2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-I2V-A14B \
  --offload_model True --convert_model_dtype \
  --image examples/i2v_input.JPG \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard..."
```
--size specifies the total pixel area; the aspect ratio follows the input image.
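The area-based size convention can be sketched in a few lines. `fit_size` below is a hypothetical helper, not code from the repo: it picks a width and height whose product is close to the requested area while matching the reference image's aspect ratio, rounded to a multiple of 16 (diffusion models typically need dimensions divisible by 8 or 16).

```python
import math

def fit_size(target_area: int, ref_width: int, ref_height: int, multiple: int = 16) -> tuple[int, int]:
    """Pick an output (width, height) whose product approximates target_area
    while preserving the reference image's aspect ratio.

    Hypothetical helper for illustration; the repo's own size handling may differ.
    """
    aspect = ref_width / ref_height
    height = math.sqrt(target_area / aspect)
    width = height * aspect

    def round_to(v: float) -> int:
        # Snap to the nearest multiple, never below one multiple
        return max(multiple, round(v / multiple) * multiple)

    return round_to(width), round_to(height)

# A 1280*720 area request with a portrait 9:16 reference image
print(fit_size(1280 * 720, 720, 1280))  # (720, 1280)
```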
Run Text-Image-to-Video (TI2V-5B)
Single-GPU T2V:
```shell
python generate.py --task ti2v-5B --size 1280*704 \
  --ckpt_dir ./Wan2.2-TI2V-5B \
  --offload_model True --convert_model_dtype --t5_cpu \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage"
```
For TI2V at 720P, use 1280*704 or 704*1280. On a 24GB GPU, keep --offload_model True, --convert_model_dtype, and --t5_cpu enabled.
Run Speech-to-Video (S2V-14B)
Single-GPU:
```shell
python generate.py --task s2v-14B --size 1024*704 \
  --ckpt_dir ./Wan2.2-S2V-14B/ \
  --offload_model True --convert_model_dtype \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard." \
  --image "examples/i2v_input.JPG" \
  --audio "examples/talk.wav"
```
Without --num_clip, the output length follows the audio duration. For multi-GPU inference with FSDP and DeepSpeed Ulysses:
```shell
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 \
  --ckpt_dir ./Wan2.2-S2V-14B/ \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard." \
  --image "examples/i2v_input.JPG" \
  --audio "examples/talk.wav"
```
Pose + Audio:
```shell
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 \
  --ckpt_dir ./Wan2.2-S2V-14B/ \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "a person is singing" \
  --image "examples/pose.png" \
  --audio "examples/sing.MP3" \
  --pose_video "./examples/pose.mp4"
```
For S2V, --size specifies the total pixel area; the aspect ratio follows the reference image. --num_clip limits the number of generated clips, which is useful for quick previews.
Use Cases
Dialogue and Narration
Create speaking segments from a portrait and audio track. Useful for guide videos, explainer clips, and scripted scenes.
Singing and Performance
Synchronize vocals with expressions and body movement for music pieces, with optional pose guidance for choreography.
Character Continuity
Maintain identity across shots using the same reference image. Keep framing consistent for half-body or full-body sequences.
FAQs
Does Wan S2V require text prompts?
Prompts are optional. The model follows audio and the reference image. Prompts help set scene intent and camera behavior.
How long can the output be?
By default, duration follows audio length unless you set --num_clip for shorter previews.
Can I control pose?
Yes. Provide a pose video for pose-driven output. The model follows the pose sequence while keeping audio sync.
What resolutions are supported?
Common settings are 480P and 720P. Size is given as area; aspect ratio follows the reference image.
Is this the official site?
This is a community page with simple guidance.