Method and Capabilities
Wan S2V injects audio features into the diffusion process to guide expression and motion. Given one reference image and an audio track, it generates a video whose timing and mood follow the audio.
Audio Injection
Speech features and rhythm inform mouth shapes, head movement, and timing across frames. This produces synchronized lip motion and expressions that match the phonemes and prosody of the audio.
Pose + Audio
Add a pose video to follow a movement sequence while keeping audio alignment. This is useful for songs, choreography, and specific body motion patterns.
Instruction Following
Prompts provide high-level intent for pacing, camera behavior, and mood. The model translates brief text into scene tendencies that fit the audio and reference image.
Framing
Use half-body for talking heads and interviews. Use full-body for performance, gestures, and action. Keep the reference image consistent across shots to maintain identity.
Resolution
Typical outputs use 480P or 720P. Size is expressed as a target pixel area (e.g., 1024*704) rather than fixed dimensions, and the aspect ratio follows the reference image. For longer clips, adjust --num_clip or split the audio.
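The relationship between target area, aspect ratio, and clip count can be sketched in a few lines of Python. This is an illustrative calculation, not the model's actual resizing code: the function names, the rounding to a multiple of 16, and the frames-per-clip and fps values are all assumptions chosen for the example.

```python
import math


def size_from_area(target_area, ref_width, ref_height, multiple=16):
    """Return (width, height) near target_area that keeps the reference aspect ratio.

    Hypothetical helper: snapping to `multiple` is an assumption, not a
    documented rule of the model.
    """
    if min(target_area, ref_width, ref_height, multiple) <= 0:
        raise ValueError("all arguments must be positive")
    # Solve width * height = target_area with width / height = ref_width / ref_height.
    width = math.sqrt(target_area * ref_width / ref_height)
    height = math.sqrt(target_area * ref_height / ref_width)

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


def clips_needed(audio_seconds, frames_per_clip, fps):
    """Estimate how many clips cover the audio (assumed: fixed frames per clip)."""
    if audio_seconds <= 0 or frames_per_clip <= 0 or fps <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(audio_seconds * fps / frames_per_clip)


# A landscape reference keeps its shape at the 1024*704 area budget.
landscape = size_from_area(1024 * 704, 1024, 704)
# A portrait reference gets the same area with the sides swapped.
portrait = size_from_area(1024 * 704, 704, 1024)
# Illustrative values only: 20 s of audio, 80 frames per clip, 16 fps.
clips = clips_needed(20.0, 80, 16)
```

Here `landscape` is `(1024, 704)`, `portrait` is `(704, 1024)`, and `clips` is `4`. An estimate like this can help decide whether to raise --num_clip or to split the audio into shorter segments before generation.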