# Installation
Set up Wan 2.2 locally and run S2V.
## Clone

```shell
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
```
## Install

```shell
# Ensure torch >= 2.4.0.
# If the installation of `flash_attn` fails, try installing the other packages first
# and install `flash_attn` last.
pip install -r requirements.txt
```
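Since `flash_attn` builds against the installed torch, it can help to confirm the torch version before installing it. A minimal sketch (assumes `python` is on `PATH`; `version_ge` is a hypothetical helper, not part of the repo):

```shell
# version_ge A B: succeeds if dotted version A >= B (relies on sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Query the installed torch version; fall back to 0 if torch is absent.
TORCH_VERSION="$(python -c 'import torch; print(torch.__version__)' 2>/dev/null || echo 0)"
# Strip any local build suffix such as "+cu121" before comparing.
if version_ge "${TORCH_VERSION%%+*}" "2.4.0"; then
  echo "torch ${TORCH_VERSION} is new enough"
else
  echo "torch >= 2.4.0 required (found: ${TORCH_VERSION})"
fi
```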
## Models
- T2V-A14B — Text-to-Video MoE, 480P & 720P
- I2V-A14B — Image-to-Video MoE, 480P & 720P
- TI2V-5B — High-compression VAE, T2V+I2V, 720P
- S2V-14B — Speech-to-Video, 480P & 720P
Note: TI2V-5B supports 720P generation at 24 FPS.
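For reference, a sketch of how the variants above pair with `--task` and `--ckpt_dir` values in `generate.py`. The task spellings other than `s2v-14B` are assumptions based on the repo's naming convention, and the directory names assume each model was downloaded with a matching `--local-dir` as in the commands below; `ckpt_dir_for` is a hypothetical helper:

```shell
# ckpt_dir_for TASK: print the checkpoint directory assumed for a given --task
# value, mirroring the ./Wan2.2-<MODEL> download layout used in this guide.
ckpt_dir_for() {
  case "$1" in
    t2v-A14B) echo ./Wan2.2-T2V-A14B ;;  # assumed task spelling
    i2v-A14B) echo ./Wan2.2-I2V-A14B ;;  # assumed task spelling
    ti2v-5B)  echo ./Wan2.2-TI2V-5B ;;   # assumed task spelling
    s2v-14B)  echo ./Wan2.2-S2V-14B ;;   # used by the commands below
    *) return 1 ;;
  esac
}
```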
### huggingface-cli

```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
```
### modelscope-cli

```shell
pip install modelscope
modelscope download Wan-AI/Wan2.2-T2V-A14B --local_dir ./Wan2.2-T2V-A14B
```
## Run S2V

Single-GPU inference (`--offload_model True` and `--convert_model_dtype` reduce GPU memory usage):

```shell
python generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ \
  --offload_model True --convert_model_dtype \
  --prompt "a person is speaking" --image "examples/i2v_input.JPG" --audio "examples/talk.wav"
```

Multi-GPU inference using FSDP and Ulysses sequence parallelism across 8 GPUs:

```shell
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ \
  --dit_fsdp --t5_fsdp --ulysses_size 8 \
  --prompt "a person is speaking" --image "examples/i2v_input.JPG" --audio "examples/talk.wav"
```
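A typo in any of these paths only surfaces after model loading begins, so a quick pre-flight check can save time. A minimal sketch (`check_inputs` is a hypothetical helper; the paths are the ones used in the commands above):

```shell
# check_inputs PATH...: report any missing paths; exit status is non-zero if any are absent.
check_inputs() {
  missing=0
  for p in "$@"; do
    [ -e "$p" ] || { echo "missing: $p"; missing=1; }
  done
  return "$missing"
}

# Paths assumed by the generate.py commands above.
check_inputs ./Wan2.2-S2V-14B examples/i2v_input.JPG examples/talk.wav \
  || echo "fix the paths above before running generate.py"
```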