Reelune

Explainer

How AI videos are made

1. Anchor the character’s face

Each character starts with a face reference image. Using SDXL with the PuLID identity-control extension, we can render the same character across different outfits and scenes while keeping the face stable. This is the character’s anchor.
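
For the curious, here is a minimal sketch of what this step can look like in code. Loading SDXL uses the real diffusers API; attach_pulid is a hypothetical stand-in for the PuLID integration, whose actual interface depends on the implementation you use, and the character name and file paths are illustrative.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base model (standard diffusers call).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical wrapper: a PuLID integration extracts an identity
# embedding from the face reference and patches the pipeline's
# attention layers so every render keeps the same face.
pipe = attach_pulid(pipe, face_reference="refs/mira_face.png")

# Render the anchor portrait; the identity now travels with the pipeline.
anchor = pipe(
    prompt="studio portrait of Mira, neutral background, soft key light",
    num_inference_steps=30,
).images[0]
anchor.save("characters/mira/anchor.png")
```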

2. Render the scene as a still

Next we render a still image placing the character into a specific scene — “rainy night at a cafe,” “moonlit forest,” and so on. Composition, lighting, and color are chosen to match the character’s personality and world.
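
As a sketch of how a scene can be directed, here is a small, self-contained prompt builder. The field names and example wording are our own illustration, not a fixed schema; the face itself comes from the PuLID anchor, so the prompt only has to describe everything else.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    setting: str   # e.g. "rainy night at a cafe"
    lighting: str  # e.g. "neon reflections, warm window glow"
    palette: str   # e.g. "teal and amber"
    mood: str      # e.g. "wistful"

def scene_prompt(character: str, scene: Scene) -> str:
    # Identity is anchored separately, so the prompt covers outfit,
    # setting, lighting, color, and mood only.
    return (
        f"{character}, {scene.setting}, {scene.lighting}, "
        f"{scene.palette} color palette, {scene.mood} mood, cinematic still"
    )

print(scene_prompt(
    "Mira in a red raincoat",
    Scene("rainy night at a cafe", "neon reflections, warm window glow",
          "teal and amber", "wistful"),
))
```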

3. Animate the still

Wan 2.2, an image-to-video model, takes the still as a starting frame and produces a roughly five-second clip. The motion — “slow turn,” “gentle smile,” “walking forward” — is directed scene by scene.
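
A sketch of the animation step, assuming the diffusers WanImageToVideoPipeline and the Wan-AI/Wan2.2-I2V-A14B-Diffusers checkpoint; exact loading arguments differ across diffusers versions, so treat this as an outline rather than copy-paste code.

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Assumed checkpoint name; verify against your diffusers version.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

still = load_image("stills/mira_cafe.png")

# 81 frames at 16 fps is roughly five seconds, the coherence sweet spot.
frames = pipe(
    image=still,
    prompt="slow turn toward the window, gentle smile, rain on the glass",
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "clips/mira_cafe.mp4", fps=16)
```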

4. Generate the character’s notes

Each character has a detailed persona file (interests, dislikes, active hours, voice). A large language model uses that file to draft short microblog “notes” in the character’s voice.
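
Here is a sketch of the note-drafting step using the OpenAI chat completions API; the persona fields, model name, and length limit are illustrative, and any chat-capable LLM would slot in the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative persona file; real files carry much more detail.
persona = {
    "name": "Mira",
    "interests": ["night photography", "jazz vinyl", "rainy cities"],
    "dislikes": ["small talk", "harsh daylight"],
    "active_hours": "22:00-03:00",
    "voice": "dry, wistful, lowercase, no hashtags",
}

system_prompt = (
    "You draft short microblog notes, under 200 characters, strictly in "
    f"this character's voice. Persona: {persona}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Write one note about tonight."},
    ],
)
print(resp.choices[0].message.content)
```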

5. Human review and publish

Every video and note passes through human review before going live. Reviewers assign quality and risk scores; anything that appears underage or touches a forbidden theme is rejected at this stage.
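
To make the gate concrete, here is a self-contained sketch of a review record and decision rule; the score thresholds and theme list are illustrative, not the production values.

```python
from dataclasses import dataclass, field

# Illustrative list and thresholds, not the production values.
FORBIDDEN_THEMES = {"violence", "self-harm"}

@dataclass
class Review:
    reviewer: str
    quality: float   # 0.0-1.0, reviewer-assigned
    risk: float      # 0.0-1.0, reviewer-assigned
    looks_underage: bool
    themes: set = field(default_factory=set)

def decision(r: Review, min_quality: float = 0.6, max_risk: float = 0.3) -> str:
    # Hard rejections first, then score thresholds.
    if r.looks_underage or r.themes & FORBIDDEN_THEMES:
        return "reject"
    if r.quality < min_quality or r.risk > max_risk:
        return "reject"
    return "publish"

print(decision(Review("ana", quality=0.8, risk=0.1,
                      looks_underage=False, themes={"cozy cafe"})))  # publish
```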

Why short clips

Today’s image-to-video models stay coherent for roughly five seconds before quality degrades. Rather than fight that limit, we lean into it — collecting many five-second moments per character.