Method and Capabilities

Wan S2V injects audio features into the diffusion process to guide expression and motion. Given one reference image and an audio track, it outputs a video whose timing and mood follow the audio.

Audio Injection

Speech features and rhythm inform mouth shapes, head movement, and timing across frames. This produces synchronized lip motion and expressions that match the phonemes and prosody of the audio.
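
As a rough illustration of the timing side of this, the sketch below maps each video frame to the span of audio samples it covers, which is the kind of alignment an audio-conditioned generator needs before audio features can guide a given frame. It is a generic example, not Wan S2V's actual feature extractor; the function name and the 16 kHz / 16 fps values in the usage lines are assumptions chosen for illustration.

```python
def frame_audio_windows(num_samples, sample_rate, fps):
    """Return one (start, end) sample range per video frame.

    Generic illustration only: sample_rate and fps are caller-supplied
    integers, not values taken from the Wan S2V implementation.
    """
    if num_samples < 0 or sample_rate <= 0 or fps <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and fps must be > 0")
    if num_samples == 0:
        return []
    # Whole frames covered by the audio, with at least one frame for short audio.
    num_frames = max(1, (num_samples * fps) // sample_rate)
    windows = []
    for i in range(num_frames):
        # Integer arithmetic keeps the boundaries exact and contiguous.
        start = (i * sample_rate) // fps
        end = min(num_samples, ((i + 1) * sample_rate) // fps)
        windows.append((start, end))
    return windows


# Two seconds of 16 kHz audio at an assumed 16 frames per second.
windows = frame_audio_windows(num_samples=32000, sample_rate=16000, fps=16)
print(len(windows), windows[0], windows[-1])
```

Each window could then be fed to whatever speech encoder is in use, so that every frame is conditioned on the audio that plays while it is on screen.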

Pose + Audio

Add a pose video to make the subject follow a given movement sequence while keeping audio alignment. This is useful for songs, choreography, and specific body motion patterns.
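
The optional nature of the pose input can be sketched as a small helper that assembles a command line and only adds the pose argument when a pose video is supplied. Only --num_clip appears in this text; the script name generate.py, the task name s2v-14B, and the remaining flags are assumptions modeled on the public Wan2.2 command-line script and should be checked against the version you run. All file paths are hypothetical.

```python
import shlex


def build_s2v_command(image, audio, prompt, ckpt_dir,
                      pose_video=None, size="1024*704", num_clip=None):
    """Assemble an argument list for an assumed generate.py invocation.

    Flag names other than --num_clip are assumptions; verify them
    against the script you actually use.
    """
    cmd = [
        "python", "generate.py",
        "--task", "s2v-14B",
        "--size", size,
        "--ckpt_dir", ckpt_dir,
        "--image", image,
        "--audio", audio,
        "--prompt", prompt,
    ]
    # Pose guidance is optional: without it, motion is driven by audio and prompt.
    if pose_video is not None:
        cmd += ["--pose_video", pose_video]
    # Leave --num_clip out unless the caller wants to set it explicitly.
    if num_clip is not None:
        cmd += ["--num_clip", str(num_clip)]
    return cmd


cmd = build_s2v_command(
    image="inputs/singer.jpg",        # hypothetical path
    audio="inputs/song.wav",          # hypothetical path
    prompt="A singer performs on a small stage.",
    ckpt_dir="./Wan2.2-S2V-14B",      # hypothetical path
    pose_video="inputs/dance.mp4",    # hypothetical path
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to subprocess.run without shell quoting issues; shlex.join is used only to print a copy-pasteable form.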

Instruction Following

Prompts provide high-level intent for pacing, camera behavior, and mood. The model translates brief text into overall scene behavior that fits the audio and reference image.

Framing

Use half-body for talking heads and interviews. Use full-body for performance, gestures, and action. Keep the reference image consistent across shots to maintain identity.

Resolution

Typical outputs use 480P or 720P. Size is expressed as a target pixel area (e.g., 1024*704) rather than as fixed dimensions, and the aspect ratio follows the reference image. For longer videos, adjust --num_clip or split the audio into shorter segments.
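
To make the area-based sizing concrete, the sketch below derives a width and height that keep the reference image's aspect ratio while landing near a target area, and plans how a long audio track could be cut into shorter segments. Both helper names are hypothetical, and rounding to a multiple of 16 is an assumption for illustration; the rounding the model actually applies may differ.

```python
import math


def size_from_area(ref_width, ref_height, area=1024 * 704, multiple=16):
    """Pick (width, height) near `area` pixels with the reference aspect ratio.

    Snapping to `multiple` is an illustrative assumption, not a documented rule.
    """
    if min(ref_width, ref_height, area, multiple) <= 0:
        raise ValueError("all arguments must be positive")
    # Solve width * height = area with width / height = ref_width / ref_height.
    width = math.sqrt(area * ref_width / ref_height)
    height = math.sqrt(area * ref_height / ref_width)

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


def plan_audio_segments(total_seconds, max_seconds):
    """Split [0, total_seconds] into consecutive segments of at most max_seconds."""
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds > 0")
    count = math.ceil(total_seconds / max_seconds)
    return [
        (i * max_seconds, min(total_seconds, (i + 1) * max_seconds))
        for i in range(count)
    ]


# A portrait reference image keeps its orientation under the same area budget.
print(size_from_area(1080, 1920))
# A 25-second track cut into segments of at most 10 seconds.
print(plan_audio_segments(25, 10))
```

A reference image that already has the 1024:704 aspect ratio maps back to 1024 by 704 under these assumptions, while other aspect ratios trade width for height at roughly the same pixel count.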