Every new video moves through the same three-step wizard. The screen chrome stays consistent: heading, content, primary CTA at the bottom. Step 2 is where most decisions live; this article breaks it into its constituent UI sections.
TIP
You can jump between steps
At any point, tap a row in the Step 3 review summary to jump back and edit. Your inputs persist – the wizard doesn't reset.
Step 1: What's the video about?
Step 1 captures the topic plus optional script notes. The AI uses both to generate hooks, scripts, and captions later in the flow.
What's the topic?
Keep it short – "vendor tips", not a full sentence.
💡 Script tips for venue tours
Open with a hook: "Wait till you see the bride's suite." Use specific details. Call out unexpected features.
Heading: 'What's the video about?' Sub: 'A topic, a question, or a keyword.' Hairline-underline input + Trending chips + collapsible script notes.
What you see
Heading – "What's the video about?" Sub: "A topic, a question, or a keyword."
Topic input – bare typographic line with hairline underline that lights up on focus. Placeholder: "e.g. why our espresso tastes different".
Trending chips – appear once Promoat finishes analyzing trends for your audience (eyebrow label "Trending"; spinner while loading). Tap to select; a checkmark badge appears on the active chip.
"Or use what you typed" shortcut – appears below the chips when you've typed something that doesn't match a trend. Tap to override the trend selection with your typed text.
Add script or notes (optional) – chevron toggle. Expanded, it reveals a multiline input with the placeholder "Paste a script or describe how it should sound…" Limit 2,000 characters.
Topic best practices
Narrow but not niche. "Wedding venue tips" beats "weddings" but not "where the bride sits during the vows."
Searchable language. Use words people actually search for on TikTok or Instagram.
Action or value. "How to..." or "Why..." framing outperforms "My thoughts on..."
TIP
Notes are for tone, not exact wording
The AI rewrites whatever you paste. Use script notes to nudge tone or structure ("punchy, no jargon"; "open with a question") more than to dictate the exact text – generation will rephrase regardless.
Type a topic – or tap a Trending chip
Optionally tap 'Add script or notes' and paste guidance
Step 2: Your scene
Step 2 is the scene composer. Four UI sections stack vertically: the description input (always shown), reference images (collapsed by default), additional media (collapsed by default), and advanced settings (collapsed by default). A description plus a couple of reference photos is more powerful than any one alone.
TIP
References vs. additional media
Reference images shape the look of your scene – what's around you, on you, behind you – feeding into one composed shot. Additional media are separate clips Promoat cuts into the timeline as their own beats: product close-ups, screen recordings, before/afters. Different jobs.
Description (always shown)
The header pairs your portrait (uploaded during onboarding) with the title "Your scene." and the sub "Describe your video." Below: a multiline textarea with a hairline underline plus a horizontal row of audience-tailored suggestion chips.
Walking through the venue, pointing out where the ceremony will be.
Header: portrait avatar + 'Your scene.' Description textarea with hairline underline + chip row.
What to write
Good: "Walking through the venue, pointing out where the ceremony and reception will be."
Good: "Holding the new candle in both hands, bringing it close to the camera and smelling it."
Too vague: "A video about my product" or "Me talking."
Suggestion chips
The chips below the textarea are tailored to your audience profile. Tap one to drop the chip's full prompt into the input – chips have short labels but expand to longer starter sentences when tapped. Edit from there. The active chip is filled dark to confirm selection.
Describe what someone would see, not what your video is "about." "Showing the new candle next to last year's bestseller" reads better than "Comparison of products."
Reference images (optional, collapsed by default)
A collapsible section. Header reads Reference images with an optional tag and a hint that updates as you add photos: "helps the AI match your style" when empty, "3 images added" once filled.
HOW THIS WORKS
"Me in this dress, sitting on a chair, cat on my lap, this building behind me."
YOUR TURN – ADD YOUR IMAGES
Expanded: 'HOW THIS WORKS' example block at top, 'YOUR TURN – Add your images' caption, then your slots.
The HOW THIS WORKS block
At the top of the expanded section, an inline teaching block headed HOW THIS WORKS shows four labeled thumbnails (you, background, dress, cat), the example prompt "Me in this dress, sitting on a chair, cat on my lap, this building behind me.", a down-arrow, and a single composed output image showing all four ingredients combined into one scene. The block is static – it never uses your photos.
Your slots
Below the example, a small caption YOUR TURN – Add your images introduces the input row. Three named slots reveal one at a time as you fill them – the placeholders are tailored to your audience profile (e.g. for a wedding-planner audience: "your dress" / "your venue" / "your bouquet"). Internally these are holding, setting, and interacting; the labels you see are looser and audience-specific.
Once the three named slots are filled, an "add more" tile appears. You can stack up to 9 reference images total (the portrait you uploaded in onboarding counts as one of the nine). Past three or four, returns diminish – the image model starts cramming rather than composing.
Capabilities
Up to 9 reference images per scene.
Each slot has a small text input below it. Type a label (e.g. "silk dress") so the AI knows what's in the image when reading your description.
Filled slots show the cool-gradient border – they're treated like the next-to-add slot for visual emphasis.
Tap the × on any filled slot to remove it.
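The reveal-one-at-a-time behavior and the 9-image cap described above can be sketched as a small piece of UI state logic. This is an illustrative sketch only – the names (`ReferenceSlot`, `visibleSlotCount`, `MAX_REFERENCE_IMAGES`) are hypothetical, not Promoat's actual code.

```typescript
// Hypothetical sketch of the slot-reveal rules described above.
const MAX_REFERENCE_IMAGES = 9; // includes the onboarding portrait

interface ReferenceSlot {
  image: string | null; // picked photo, or null if empty
  label: string;        // short text label, e.g. "silk dress"
}

// Render the filled slots plus one empty "next-to-add" tile,
// capped at the 9-image maximum.
function visibleSlotCount(slots: ReferenceSlot[]): number {
  const filled = slots.filter(s => s.image !== null).length;
  return Math.min(filled + 1, MAX_REFERENCE_IMAGES);
}
```

With no images added, only one tile is shown; each added photo reveals the next tile until the cap is reached.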
TIP
You can skip references entirely
References help the AI match your real-world setting. If a generated backdrop is fine, leave the section closed and rely on the description.
Tap Reference images to expand
Read the example block to learn the pattern
Tap the gradient-bordered slot to pick a photo
Type a short label below the photo
Repeat for the next slots; tap 'add more' for a 4th-9th
Additional media (optional, collapsed by default)
Same collapsible pattern as Reference images. This section is for clips or screenshots Promoat should insert into the finished video as their own beats – separate from you talking. Think product close-ups, app screen recordings, before/after shots, explainer graphics.
will be inserted separately in the video
Expanded: each cut-in slot has its own placeholder hint and a label input below.
What it's for
Reference images shape the look of your talking-head shot. Additional media plays as separate beats cut into the timeline alongside that shot. Both can stack with the description.
Capabilities
Accepts both images and videos.
Each item gets a small label (e.g. "product close-up", "before / after") that tells the AI when to drop it in.
Slots reveal progressively – fill the first to see the second, and so on.
HEADS-UP
Description, references, and media stack
A strong description + two reference photos + one cut-in clip is more powerful than any one alone. Promoat uses all of it together.
Tap Additional media to expand
Tap a slot to pick a clip or image from your phone
Type a label below the slot
Advanced settings (optional)
A small Advanced settings chevron near the bottom of Step 2. Most people leave both controls on Auto – Promoat picks based on your description. Open it only when you want manual control over scene count or movement style.
Scenes
Auto picks based on your description.
Movement
You speak straight to camera. Fast and reliable.
Two controls: Scenes (Auto / 1 / 2 / 3 / 4) and Movement (Auto / Just talking / Acting it out).
Scenes
How many distinct shots/cuts the video contains. Chip row: Auto (sparkles icon, default) and four numeric chips 1 / 2 / 3 / 4. A small helper line below the chips explains the active choice – e.g. "One continuous shot – fastest, no cuts" for 1, or "3 cuts – montage feel, more visual variety" for 3.
Movement
How you appear on screen. Chip row: Auto plus two text chips:
Just talking – sub "Eye contact with camera." The synthetic likeness from your portrait is animated lip-syncing to the script. Fast, reliable, classic talking-head look. Internally this is the fabric pipeline.
Acting it out – sub "On-camera action." Generates you moving and interacting in the scene (holding the product, walking through the venue) while your voice plays as voiceover. More cinematic. Internally this is the grok pipeline.
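The chip-to-pipeline mapping above can be sketched as a simple selection function. Only the "Just talking" → fabric and "Acting it out" → grok mappings come from the text; the Auto heuristic shown here (scanning the description for action verbs) is purely hypothetical, and all names are illustrative.

```typescript
// Sketch of the movement-to-pipeline mapping described above.
type Movement = "auto" | "just_talking" | "acting_it_out";
type Pipeline = "fabric" | "grok";

function pickPipeline(movement: Movement, description: string): Pipeline {
  if (movement === "just_talking") return "fabric"; // lip-synced talking head
  if (movement === "acting_it_out") return "grok";  // on-camera action + voiceover
  // Auto: HYPOTHETICAL heuristic – look for action verbs in the description.
  const actionHints = /\b(walking|holding|pointing|showing|opening)\b/i;
  return actionHints.test(description) ? "grok" : "fabric";
}
```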
TIP
Leave it on Auto unless you have a reason
Auto reads your description and picks the right combination. Manual overrides are useful for specific looks (e.g. one clean shot for a product reveal, four cuts for a fast-paced list).
Tap Advanced settings to expand
Pick a Scenes count (or leave on Auto)
Pick a Movement style (or leave on Auto)
Tap Continue at the bottom of the screen
Step 3: Your idea
A one-screen summary with hairline rows and the primary Generate CTA at the bottom.
Ready to create?
Review your idea before generating.
Scene
Walking through the venue, pointing out ceremony and reception spaces
Heading: 'Your idea.' Sub: 'Tap any line to edit.' Hairline rows for Topic / Scene / References / Cut-ins. Generate CTA with cost badge.
What you see
Heading – "Your idea." Sub: "Tap any line to edit."
Topic row – your topic. Tap to jump back to Step 1.
Scene row – your scene description (or "No scene details – we'll improvise." if blank). Tap to jump back to Step 2.
References row – count of reference images, if any.
Cut-ins row – count of additional media, if any.
Style summary – concatenates your movement choice ("Just talking" or "Acting in scene · Studio lipsync") with the scene count if you set one manually.
Generate CTA – gradient button with a credit-cost badge. Disabled if your balance is below the cost; a hint appears in red below the button to top up.
Credit cost
Credits are Promoat's generation currency. The cost shown on the button is calculated from your scene composition (image generation runs nano-banana / seedream depending on inputs) plus the downstream video pipeline. Acting it out costs more than Just talking because it runs a per-scene motion model. The exact number for this video is shown on the button before you commit.
HEADS-UP
You need enough credits
If your balance is below the cost shown, the Generate button is disabled and a "top up" hint appears below it. Tap your balance in the header or open Profile / Settings to buy more.
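The gating rule above is simple enough to state as code: the button enables exactly when the balance covers the cost, and the top-up hint shows otherwise. A minimal sketch, with illustrative names that are not Promoat's API:

```typescript
// Hypothetical sketch of the Generate-button gating described above.
interface GenerateGate {
  enabled: boolean;      // Generate button tappable?
  showTopUpHint: boolean; // red "top up" hint below the button
}

function gateGenerate(balance: number, cost: number): GenerateGate {
  const enough = balance >= cost;
  return { enabled: enough, showTopUpHint: !enough };
}
```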
Tap any row to jump back and edit
Check the credit cost on the Generate button
What happens after Generate
Generation runs server-side, so it keeps going whether you stay on the progress screen, switch tabs, or close the app. Image generation finishes first (usually within a minute); the video render follows. Failed steps automatically refund the credits charged for that step.
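The charge-and-refund behavior above amounts to: each step is charged up front, and a failed step's credits come back. A hedged sketch of that accounting, with entirely hypothetical names:

```typescript
// Illustrative sketch of the per-step refund rule described above.
interface StepResult {
  step: string;      // e.g. "image generation", "video render"
  cost: number;      // credits charged for this step
  succeeded: boolean;
}

// Net credits charged after refunding every failed step in full.
function netCharged(steps: StepResult[]): number {
  return steps
    .filter(s => s.succeeded) // failed steps are refunded
    .reduce((sum, s) => sum + s.cost, 0);
}
```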
When the render lands, the new entry shows up in the Watch tab feed as a Ready overlay. From there you can save, share, re-edit, schedule, or post directly – see Feed actions for the full set. To publish to TikTok / Instagram / YouTube Shorts / LinkedIn, link your accounts under Connect social accounts first.