Steps Reference
All built-in steps — inspect, smart cuts, edit, enrich, VFX, audio, lyrics, and generation.
Steps are the building blocks of Montaj's editing pipeline. Each step is an executable that follows the output convention (result on stdout, errors on stderr). Steps are composable at the shell level and discoverable via CLI, HTTP, and MCP.
Steps are organized into subdirectories by category: audio/, edit/, generate/, lyrics/, media/, speech/, transform/. The CLI resolves step names automatically — you never need to specify the subdirectory.
Inspect Steps
Inspect steps give the agent (or human) an understanding of the source material before editing begins.
probe
Extract metadata from a video file: duration, resolution, fps, codec, audio channels.
montaj step probe --input clip.mp4
# → JSON: duration, resolution, fps, codec, audio channels
The output is JSON on stdout. The agent uses it to understand the source material and make informed editing decisions.
snapshot
Generate a contact sheet frame grid — the agent's visual understanding of the clip.
montaj step snapshot --input clip.mp4
# → /path/to/snapshot.png (frame grid contact sheet)
montaj step snapshot --input clip.mp4 --cols 3 --rows 3
# Custom grid size
montaj step snapshot --input clip.mp4 --at 5.0
# Extract a single frame at a specific timestamp
| Param | Default | Description |
|---|---|---|
| --cols <n> | auto | Number of columns in the grid |
| --rows <n> | auto | Number of rows in the grid |
| --frames <n> | auto | Total number of frames to extract |
| --at <t> | — | Extract a single frame at a specific timestamp |
virtual_to_original
Map virtual-timeline timestamps to original-file timestamps (or the reverse). Useful for inspecting and debugging trim specs.
montaj step virtual_to_original --input spec.json 47.32
# → 95.483 (virtual timestamp → original-file timestamp)
montaj step virtual_to_original --input spec.json 47.32 53.23 66.89
# → one result per line
montaj step virtual_to_original --input spec.json --inverse 95.483
# → 47.320 (original-file timestamp → virtual timestamp)
| Param | Description |
|---|---|
| --input <spec.json> | Trim spec file |
| --inverse | Reverse direction: original → virtual |
| Positional args | One or more timestamps to map |
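The mapping itself can be sketched in a few lines of Python. This is an illustration only, not Montaj's implementation: it assumes a trim spec's keeps are ordered `[start, end]` pairs in original-file seconds, and the function names are hypothetical.

```python
from typing import List, Sequence

# Assumed layout: each keep is [start, end] in original-file seconds, in order.
Keep = Sequence[float]


def virtual_to_original(keeps: List[Keep], t: float) -> float:
    """Map a virtual-timeline timestamp to an original-file timestamp."""
    if t < 0:
        raise ValueError("timestamp must be non-negative")
    elapsed = 0.0  # virtual time covered by the keeps already walked
    for start, end in keeps:
        length = end - start
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    raise ValueError(f"{t} is past the end of the virtual timeline ({elapsed}s)")


def original_to_virtual(keeps: List[Keep], t: float) -> float:
    """Inverse mapping; fails when t lies in a removed range."""
    elapsed = 0.0
    for start, end in keeps:
        if start <= t <= end:
            return elapsed + (t - start)
        elapsed += end - start
    raise ValueError(f"{t} is not inside any kept range")


keeps = [[10.0, 20.0], [30.0, 40.0]]
print(virtual_to_original(keeps, 12.0))  # 32.0
print(original_to_virtual(keeps, 32.0))  # 12.0
```

The virtual timeline is simply the kept ranges laid end to end, so each direction is a single walk over the keeps.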
Smart Cut Steps
Smart cut steps analyze audio and produce trim specs — JSON files describing which ranges of the original source to keep. No video encoding happens at this stage.
waveform_trim
Waveform silence analysis — detects silent sections and outputs a trim spec. Near-instant, no encode.
montaj step waveform_trim --input clip.mp4
# → /path/to/clip_spec.json
montaj step waveform_trim --input clip.mp4 --threshold -30 --min-silence 0.3
# Custom sensitivity
| Param | Default | Description |
|---|---|---|
| --threshold <dB> | auto | Volume threshold for silence detection (negative dB value) |
| --min-silence <s> | auto | Minimum silence duration to trigger a cut (seconds) |
Output: a trim spec JSON with input (original file) and keeps (time ranges to keep).
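How the two parameters interact can be illustrated with a generic threshold-based detector. The sketch below is not waveform_trim's actual analysis, which this page does not describe. It assumes the audio has already been reduced to one dB level per fixed-length window, and `detect_silences` is a hypothetical name.

```python
def detect_silences(levels_db, hop_s, threshold_db, min_silence_s):
    """Return [start, end] ranges (seconds) where the level stays below
    threshold_db for at least min_silence_s.

    levels_db: one loudness value per analysis window (assumed input).
    hop_s: duration of each window in seconds.
    """
    silences = []
    run_start = None  # index of the first window in the current quiet run
    for i, level in enumerate(levels_db):
        if level < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * hop_s >= min_silence_s:
                silences.append([run_start * hop_s, i * hop_s])
            run_start = None
    # Close a quiet run that extends to the end of the clip.
    if run_start is not None and (len(levels_db) - run_start) * hop_s >= min_silence_s:
        silences.append([run_start * hop_s, len(levels_db) * hop_s])
    return silences


levels = [-10, -50, -50, -10, -50, -10]
print(detect_silences(levels, hop_s=0.25, threshold_db=-30, min_silence_s=0.5))
# [[0.25, 0.75]]
```

The second quiet window lasts only 0.25 s, so it is shorter than the minimum and produces no cut.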
rm_fillers
Remove filler words (um, uh, hmm) from speech. Accepts a video file or a trim spec as input and outputs a trim spec.
montaj step rm_fillers --input clip.mp4
# → /path/to/clip_spec.json
montaj step rm_fillers --input clip.mp4 --model medium.en
# Higher accuracy, slower
| Param | Default | Description |
|---|---|---|
| --model <name> | base.en | Whisper model: tiny.en, base.en, medium.en, large |
Important: rm_fillers can accept either a video file or a trim spec as input. When given a trim spec, it refines the existing keeps list.
rm_nonspeech
Remove all non-speech audio (noisy ambient audio). Takes a trim spec as input and outputs a refined trim spec.
montaj step rm_nonspeech --input clip_spec.json
montaj step rm_nonspeech --input clip_spec.json --model base --max-word-gap 0.18 --sentence-edge 0.10
| Param | Default | Description |
|---|---|---|
| --model <name> | base | Whisper model: base, small, medium |
| --max-word-gap <s> | 0.18 | Maximum gap between words before splitting |
| --sentence-edge <s> | 0.10 | Padding at sentence boundaries |
Important: Input should be a trim spec, not a raw video file.
crop_spec
Crop a trim spec to virtual-timeline windows. Outputs a refined trim spec — no encode.
montaj step crop_spec --input spec.json --keep 8.5:14.8
# Crop to a single window
montaj step crop_spec --input spec.json --keep 0:2.4 --keep 13.84:18.33
# Multiple windows — keeps are concatenated in order
montaj step crop_spec --input spec.json --keep 40.28:end
# Open-ended: keep from virtual 40.28s to the end of the clip
| Param | Description |
|---|---|
| --keep <start:end> | Virtual-timeline window to keep (repeatable). Use the end sentinel for an open-ended window. |
| --out <path> | Output path (default: <input>_cropped.json) |
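Cropping in virtual time amounts to intersecting each window with the virtual span of every keep and translating the overlap back to original-file time. The following Python sketch shows that idea under the assumption that keeps are ordered `[start, end]` pairs. `parse_window` and `crop_keeps` are hypothetical helpers, not part of Montaj.

```python
def parse_window(text):
    """Parse a 'start:end' window; the literal 'end' means open-ended."""
    start, end = text.split(":")
    return float(start), (None if end == "end" else float(end))


def crop_keeps(keeps, windows):
    """Crop keeps (original-file seconds) to virtual-timeline windows.

    Windows are processed in the order given, so the results are
    concatenated in that order.
    """
    total = sum(end - start for start, end in keeps)
    cropped = []
    for v_start, v_end in windows:
        if v_end is None:
            v_end = total
        elapsed = 0.0  # virtual time at which the current keep begins
        for start, end in keeps:
            length = end - start
            lo = max(v_start, elapsed)
            hi = min(v_end, elapsed + length)
            if lo < hi:  # the window overlaps this keep
                cropped.append([start + (lo - elapsed), start + (hi - elapsed)])
            elapsed += length
    return cropped


keeps = [[10.0, 20.0], [30.0, 40.0]]
print(crop_keeps(keeps, [parse_window("8:15")]))    # [[18.0, 20.0], [30.0, 35.0]]
print(crop_keeps(keeps, [parse_window("15:end")]))  # [[35.0, 40.0]]
```

A window that straddles two keeps yields two output ranges, which is why no re-encode is needed to express the crop.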
How Smart Cuts Chain
Smart cut steps form a pipeline where each step refines the trim spec:
waveform_trim(clip.MOV)
→ trim spec (silence removed)
rm_fillers(spec.json)
→ refined spec (fillers removed)
rm_nonspeech(spec.json)
→ refined spec (non-speech removed)
crop_spec(spec.json)
→ cropped spec (virtual-timeline windows)
The original source file path is preserved through the entire chain. No video is encoded until materialize_cut runs.
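The chain above could be driven from Python roughly as follows. This is a sketch under one assumption: each step prints only the path of its output spec on stdout, as the `# →` comments in the examples suggest. `run_step` and `smart_cut_chain` are hypothetical helper names.

```python
import subprocess
from typing import Callable, List


def run_step(args: List[str]) -> str:
    """Run `montaj step ...` and return its stdout (assumed: the output path)."""
    done = subprocess.run(
        ["montaj", "step", *args],
        check=True,           # a non-zero exit raises CalledProcessError
        capture_output=True,  # errors stay on stderr, the result on stdout
        text=True,
    )
    return done.stdout.strip()


def smart_cut_chain(source: str, keep: str,
                    run: Callable[[List[str]], str] = run_step) -> str:
    """Feed each step's spec into the next; nothing is encoded here."""
    spec = run(["waveform_trim", "--input", source])
    spec = run(["rm_fillers", "--input", spec])
    spec = run(["rm_nonspeech", "--input", spec])
    return run(["crop_spec", "--input", spec, "--keep", keep])
```

Passing the runner as a parameter keeps the chain testable without Montaj installed, for example by substituting a function that records its arguments.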
Edit Steps
Core video editing operations. Most of these are ffmpeg wrappers with careful codec handling.
trim
Cut by in/out point — extract a segment from a video.
montaj step trim --input clip.mp4 --start 2.5 --end 8.3
# Extract from 2.5s to 8.3s
montaj step trim --input clip.mp4 --start 00:00:02 --end 00:01:30
# HH:MM:SS format also accepted
| Param | Description |
|---|---|
| --start <t> | Start time (seconds or HH:MM:SS) |
| --end <t> | End time (seconds or HH:MM:SS) |
| --duration <t> | Duration instead of end time |
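Both time formats reduce to seconds. The following Python sketch shows one way to normalize them and to derive an end point from a duration. It illustrates the accepted formats only, not the step's own parser, and the function names are hypothetical.

```python
def parse_time(value: str) -> float:
    """Convert 'SS[.fff]', 'MM:SS' or 'HH:MM:SS' to seconds."""
    parts = value.split(":")
    if len(parts) > 3:
        raise ValueError(f"unrecognized time format: {value!r}")
    seconds = 0.0
    for part in parts:
        seconds = seconds * 60 + float(part)
    return seconds


def resolve_end(start: float, end=None, duration=None) -> float:
    """Return the out point from either an end time or a duration."""
    if (end is None) == (duration is None):
        raise ValueError("give exactly one of end or duration")
    return end if end is not None else start + duration


print(parse_time("2.5"))       # 2.5
print(parse_time("00:01:30"))  # 90.0
print(resolve_end(parse_time("00:00:02"), duration=8.0))  # 10.0
```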
cut
Remove one or more sections from a video and rejoin — the opposite of trim.
montaj step cut --input clip.mp4 --start 3.0 --end 7.5
# Remove a single section and rejoin
montaj step cut --input clip.mp4 --cuts '[[0,1.2],[5.3,7.8]]'
# Remove multiple sections in one ffmpeg pass
montaj step cut --input clip.mp4 --cuts '[[3.0,7.5]]' --spec
# Write a trim spec JSON instead of encoding; use with materialize_cut for a deferred encode
| Param | Description |
|---|---|
| --start <t>, --end <t> | Remove a single section |
| --cuts <json> | Remove multiple sections: [[start, end], ...] |
| --spec | Output a trim spec instead of encoding video |
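Because a trim spec lists what to keep, writing one from a list of cuts means inverting the cut ranges against the clip duration. Here is a minimal Python sketch of that inversion. The `input` and `keeps` field names follow the trim spec description earlier on this page, while the list-of-pairs layout and the `cuts_to_keeps` helper are assumptions.

```python
import json


def cuts_to_keeps(cuts, duration):
    """Invert removed ranges into kept ranges over [0, duration]."""
    keeps = []
    cursor = 0.0  # end of the last removed range seen so far
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append([cursor, min(start, duration)])
        cursor = min(max(cursor, end), duration)
    if cursor < duration:
        keeps.append([cursor, duration])
    return keeps


keeps = cuts_to_keeps([[0, 1.2], [5.3, 7.8]], duration=10.0)
print(keeps)  # [[1.2, 5.3], [7.8, 10.0]]

# Assumed spec shape: the original file plus the ranges to keep.
print(json.dumps({"input": "clip.mp4", "keeps": keeps}))
```

Sorting the cuts and tracking a cursor also handles overlapping or out-of-order cut ranges.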
jump_cut
Remove time ranges from video — similar to cut, but oriented toward jump-cut editing with explicit keeps or cuts.
montaj step jump_cut --input clip.mp4 --cuts '[[2.0, 3.5], [7.0, 8.0]]'
# Remove specified ranges
montaj step jump_cut --input clip.mp4 --keeps '[[0, 2.0], [3.5, 7.0], [8.0, 12.0]]'
# Keep only specified ranges
| Param | Description |
|---|---|
| --cuts <json> | Time ranges to remove: [[start, end], ...] |
| --keeps <json> | Time ranges to keep: [[start, end], ...] |
cross_cut
Interleave A-roll and B-roll segments — creates a cross-cut edit between two video sources.
montaj step cross_cut --input a-roll.mp4 --b-roll b-roll.mp4 --segment-duration 3
| Param | Default | Description |
|---|---|---|
| --b-roll <file> | required | B-roll video file |
| --segment-duration <s> | auto | Duration of each interleaved segment |
montage
Create a rapid montage from multiple clips — cuts them into short segments and assembles.
montaj step montage --inputs clip1.mp4 clip2.mp4 clip3.mp4 --beat-duration 2
| Param | Default | Description |
|---|---|---|
| --inputs <files> | required | Multiple source clips |
| --beat-duration <s> | auto | Duration of each montage beat |
| --offset <s> | 0 | Offset into each source clip |
materialize_cut
Encode a trim spec or raw video segment to H.264. Used only when a subsequent step (e.g., remove_bg) requires an actual video file rather than a trim spec.
montaj step materialize_cut spec.json
# Encode a trim spec to video
montaj step materialize_cut clip.mp4 --inpoint 2.0 --outpoint 8.0
# Encode a raw segment
montaj step materialize_cut --inputs clip0.json clip1.json
# Multiple clips (capped at 2 concurrent encodes by default)
Uses input-level seeking (-ss/-t before -i) so only the requested segment is decoded.
| Param | Description |
|---|---|
| --inpoint <t> | Start time (for raw video input) |
| --outpoint <t> | End time (for raw video input) |
| --inputs <files> | Multiple trim specs or clips |
| --workers <n> | Max concurrent encodes (default: 2) |
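As an illustration of input-level seeking and the concurrency cap, the sketch below builds an ffmpeg command with `-ss` and `-t` placed before `-i`, then runs several commands through a small thread pool. The exact arguments Montaj passes are not documented here, so the codec flags and helper names are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def segment_command(src, inpoint, outpoint, dst):
    """Build an ffmpeg command that seeks on the input side."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{inpoint:.3f}",            # seek before decoding starts
        "-t", f"{outpoint - inpoint:.3f}",  # read only this much of the input
        "-i", src,
        "-c:v", "libx264",                  # H.264 video (assumed encoder flags)
        "-c:a", "aac",
        dst,
    ]


def encode_all(commands, workers=2, run=subprocess.run):
    """Run the commands with at most `workers` encodes in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


print(segment_command("clip.mp4", 2.0, 8.0, "out.mp4"))
```

Options placed before `-i` apply to that input, which is what lets ffmpeg skip decoding the frames ahead of the in point.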
resize
Reframe video to a target aspect ratio.
montaj step resize --input clip.mp4 --ratio 9:16 # TikTok / Reels / Shorts
montaj step resize --input clip.mp4 --ratio 1:1 # Instagram
montaj step resize --input clip.mp4 --ratio 16:9 # YouTube
| Param | Description |
|---|---|
| --ratio <ratio> | Target aspect ratio: 9:16, 1:1, 16:9 |
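This page does not say whether resize crops or pads. Assuming a centered crop, the geometry could be computed as in the Python sketch below. `center_crop` is a hypothetical helper, and rounding to even dimensions is a common encoder requirement rather than documented Montaj behavior.

```python
def center_crop(width, height, ratio):
    """Return (crop_w, crop_h, x, y) for a centered crop to `ratio` ('W:H')."""
    rw, rh = (int(part) for part in ratio.split(":"))
    if width * rh > height * rw:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * rw // rh
    else:
        # Source is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * rh // rw
    crop_w -= crop_w % 2  # round down to even dimensions
    crop_h -= crop_h % 2
    return crop_w, crop_h, (width - crop_w) // 2, (height - crop_h) // 2


w, h, x, y = center_crop(1920, 1080, "9:16")
print(w, h, x, y)                  # 606 1080 657 0
print(f"crop={w}:{h}:{x}:{y}")     # ffmpeg crop filter syntax
```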
extract_audio
Extract the audio track from a video file.
montaj step extract_audio --input clip.mp4 # default: wav
montaj step extract_audio --input clip.mp4 --format mp3
| Param | Default | Description |
|---|---|---|
| --format <fmt> | wav | Output format: wav, mp3, aac |
Audio Steps
Audio processing steps for stem separation and waveform analysis.
stem_separation
Separate audio into stems (vocals, drums, bass, other) using Demucs.
montaj step stem_separation --input song.mp3 --stems vocals
# Extract vocals only
montaj step stem_separation --input song.mp3 --out-dir ./stems/
# All stems to a directory
| Param | Default | Description |
|---|---|---|
| --stems <name> | all | Which stem to extract: vocals, drums, bass, other |
| --out-dir <path> | auto | Output directory for stems |
| --model <name> | htdemucs | Demucs model: htdemucs, htdemucs_ft, mdx_extra |
Requires montaj install demucs.
waveform_image
Generate a PNG waveform visualization grid from an audio or video file.
montaj step waveform_image --input clip.mp4
# → /path/to/waveform.png
montaj step waveform_image --input clip.mp4 --chunk-duration 30
# One row per 30s chunk
| Param | Default | Description |
|---|---|---|
| --chunk-duration <s> | auto | Duration per waveform row in the grid |
Enrich Steps
Enrich steps add information (transcripts, captions) or polish (loudness normalization) to clips.
transcribe
Generate SRT and JSON transcripts with word-level timestamps using local whisper.cpp.
montaj step transcribe --input clip.mp4
# → transcript.json + clip.srt
montaj step transcribe --input clip.mp4 --model medium.en
# Higher accuracy, slower
montaj step transcribe --input clip.mp4 --language es
# Non-English
| Param | Default | Description |
|---|---|---|
| --model <name> | base.en | Whisper model: base.en, medium.en |
| --language <code> | en | Language code for non-English |
When given a trim spec as input, transcribe extracts audio only at the keep ranges, runs whisper on the joined audio, and remaps word timestamps back to the original timeline.
Output includes:
- JSON transcript with word-level timestamps (start, end, word)
- SRT file for standard subtitle format
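To show how the two outputs relate, here is a small Python sketch that turns word-level timestamps into SRT cues. The word fields (`word`, `start`, `end`) follow the list above. Grouping a fixed number of words per cue is an assumption made for illustration, and transcribe may segment differently.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp format)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group words into numbered cues of at most max_words each."""
    blocks = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in group)
        span = f"{srt_timestamp(group[0]['start'])} --> {srt_timestamp(group[-1]['end'])}"
        blocks.append(f"{n}\n{span}\n{text}")
    return "\n\n".join(blocks) + "\n"


words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 1.2},
]
print(words_to_srt(words))
```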
caption
Convert a transcript into an animated caption track. Produces data (not pixels) — rendered at review/final render time by the UI and render engine.
montaj step caption --input transcript.json
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style subtitle
| Param | Default | Description |
|---|---|---|
| --style <name> | word-by-word | Caption style: word-by-word, pop, karaoke, subtitle |
Caption Styles
| Style | Description |
|---|---|
| word-by-word | One word at a time, spring pop-in animation |
| pop | Segment-at-a-time with scale entry |
| karaoke | Words highlight progressively as they're spoken |
| subtitle | Static line at bottom, segments replace sequentially |
Caption data is stored in project.json as a track with type: "caption" and inline segments with word timestamps:
{
"id": "captions",
"type": "caption",
"style": "word-by-word",
"segments": [
{
"text": "Hello world",
"start": 0.0,
"end": 1.2,
"words": [
{ "word": "Hello", "start": 0.0, "end": 0.5 },
{ "word": "world", "start": 0.5, "end": 1.2 }
]
}
]
}
normalize
Loudness normalization — adjust audio levels to meet platform standards.
montaj step normalize --input clip.mp4 # youtube = -14 LUFS
montaj step normalize --input clip.mp4 --target podcast # -16 LUFS
montaj step normalize --input clip.mp4 --target broadcast # -23 LUFS
montaj step normalize --input clip.mp4 --target custom --lufs -18
| Param | Default | Description |
|---|---|---|
| --target <name> | youtube | Target preset: youtube (-14 LUFS), podcast (-16 LUFS), broadcast (-23 LUFS), custom |
| --lufs <n> | — | Custom LUFS value (only with --target custom) |
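The preset table can be read as a simple lookup. The Python sketch below resolves a target to a LUFS value and formats it for ffmpeg's `loudnorm` filter. Whether normalize actually uses `loudnorm`, and whether it rejects `--lufs` alongside a preset, are assumptions; the helper names are hypothetical.

```python
# Preset values as listed in the table above.
PRESETS = {"youtube": -14.0, "podcast": -16.0, "broadcast": -23.0}


def target_lufs(target="youtube", lufs=None):
    """Resolve a --target / --lufs pair to an integrated loudness in LUFS."""
    if target == "custom":
        if lufs is None:
            raise ValueError("--lufs is required with --target custom")
        return float(lufs)
    if lufs is not None:
        # Assumed behavior: the custom value only makes sense with 'custom'.
        raise ValueError("--lufs is only valid with --target custom")
    return PRESETS[target]


def loudnorm_filter(value):
    """Format the target for ffmpeg's loudnorm filter (I = integrated loudness)."""
    return f"loudnorm=I={value:g}"


print(loudnorm_filter(target_lufs()))              # loudnorm=I=-14
print(loudnorm_filter(target_lufs("custom", -18)))  # loudnorm=I=-18
```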
Lyrics Steps
Steps for syncing lyrics to audio and rendering lyrics videos.
lyrics_sync
Sync lyrics text to an audio file using Whisper — produces word-level aligned captions.
montaj step lyrics_sync --lyrics song.txt --input song.mp3
# → /path/to/captions.json
montaj step lyrics_sync --lyrics song.txt --input song.mp3 --model medium.en
# Higher accuracy
| Param | Default | Description |
|---|---|---|
| --lyrics <txt> | required | Plain text lyrics file |
| --model <name> | base.en | Whisper model |
| --start <t> | — | Override start time |
| --end <t> | — | Override end time |
lyrics_render
Render synced captions onto video via ffmpeg drawtext filters. Produces a complete lyrics video.
montaj step lyrics_render --captions captions.json --audio song.mp3
# Render with solid color background
montaj step lyrics_render --captions captions.json --audio song.mp3 --input background.mp4
# Render with background video
| Param | Default | Description |
|---|---|---|
| --captions <json> | required | Word-synced captions from lyrics_sync |
| --audio <file> | required | Audio file |
| --input <video> | — | Optional background video |
| --position <pos> | center | Text position |
| --color <hex> | white | Text color |
| --fontsize <n> | auto | Font size |
| --preview-duration <s> | — | Render only first N seconds (for quick preview) |
VFX Steps
remove_bg
Remove video background using RVM (Robust Video Matting). Outputs a ProRes 4444 .mov with alpha channel for final render and a VP9 WebM for browser preview.
montaj step remove_bg --input clip.mp4
montaj step remove_bg --inputs clip0.mp4 clip1.mp4
# Multiple clips
montaj step remove_bg --input clip.mp4 --model rvm_resnet50
# Higher quality model
montaj step remove_bg --input clip.mp4 --downsample 0.5
# Downsample for faster processing
| Param | Default | Description |
|---|---|---|
| --inputs <files> | — | Multiple clips to process |
| --model <name> | rvm_mobilenetv3 | Model: rvm_mobilenetv3 (faster) or rvm_resnet50 (higher quality) |
| --downsample <factor> | — | Downsample factor for faster processing |
| --progress | — | Show progress (recommended for long-running operations) |
| --workers <n> | 2 | Max concurrent encodes |
Requirements
Requires montaj install rvm — installs torch, torchvision, av (pip) + RVM model weights.
Output
The step produces two files and updates the project item:
- nobg_src: ProRes 4444 .mov with alpha channel (for final render)
- nobg_preview_src: VP9 WebM (for browser preview; Chrome supports VP9 alpha, and ProRes does not play in browsers)
- Sets remove_bg: true on the item
Usage in Workflows
Used in the floating_head workflow:
- materialize_cut: encode the trim spec to an actual video file
- remove_bg: remove the background from the materialized file
- Render engine composites the alpha-channel .mov over the layers beneath
Important: remove_bg requires an actual video file — pass the output of materialize_cut, not a trim spec. This step is long-running (minutes per clip) — always run in the background with --progress so you can monitor status.
Render Behavior
At render time, when a tracks[1+] item has remove_bg: true and nobg_src is set, the render engine uses the ProRes 4444 .mov (with alpha) in place of the original src. The alpha channel is preserved through the ffmpeg filter graph and composited over the layers beneath.
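The selection rule can be summarized in a few lines. The Python sketch below is a hypothetical restatement of the paragraph above, using the item fields it names (`src`, `remove_bg`, `nobg_src`). It is not the render engine's code.

```python
def render_source(item: dict, track_index: int) -> str:
    """Pick the file to composite for a project item.

    The alpha-channel file is used only on overlay tracks (index 1 and up)
    when background removal has been applied and its output is recorded.
    """
    if track_index >= 1 and item.get("remove_bg") and item.get("nobg_src"):
        return item["nobg_src"]
    return item["src"]


item = {"src": "clip.mp4", "remove_bg": True, "nobg_src": "clip_nobg.mov"}
print(render_source(item, track_index=1))  # clip_nobg.mov
print(render_source(item, track_index=0))  # clip.mp4
```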
Generation Steps
These steps call external APIs (Kling, Gemini, OpenAI) and require montaj install connectors and montaj install credentials.
kling_generate
Generate video via Kling — text-to-video, image-to-video, or reference-guided generation. Supports two models.
montaj kling-generate --prompt "a calico cat walking through a sunlit kitchen, cinematic" --out /tmp/cat.mp4
montaj kling-generate --prompt "slow zoom in" --first-frame frame.png --out /tmp/zoom.mp4
montaj kling-generate --prompt "character walks left" --first-frame start.png --last-frame end.png --out /tmp/walk.mp4
montaj kling-generate --prompt "same style" --ref-image style1.png --ref-image style2.png --out /tmp/styled.mp4
montaj kling-generate --prompt "..." --out /tmp/pro.mp4 --mode pro --duration 10 --aspect-ratio 9:16
montaj kling-generate --multi-shot --shot-type customize --multi-prompt '<json>' --out /tmp/batch.mp4
| Param | Default | Description |
|---|---|---|
| --prompt <text> | required | Generation prompt |
| --out <path> | required | Output file path |
| --first-frame <img> | — | Starting image for image-to-video |
| --last-frame <img> | — | Ending image (requires --first-frame) |
| --ref-image <img> | — | Reference image (repeatable, max 7) |
| --duration <3-15> | — | Video duration in seconds |
| --negative-prompt <text> | — | What to avoid |
| --sound <on\|off> | — | Enable/disable sound |
| --aspect-ratio <ratio> | — | 16:9, 9:16, 1:1 |
| --mode <std\|pro> | std | Standard (cheaper/faster) or Pro (higher quality) |
| --model <name> | auto | kling-v3-omni (3-15s, audio) or kling-video-o1 (5/10s, best quality) |
| --multi-shot | — | Enable multi-shot batch mode (up to 6 scenes) |
| --shot-type <type> | — | customize or intelligence (multi-shot only) |
| --multi-prompt <json> | — | Per-shot prompts as JSON array (multi-shot only) |
Models
| Model | Duration | Audio | Multi-shot | Notes |
|---|---|---|---|---|
| kling-v3-omni | 3-15s | Yes (sound: "on") | Yes | Flexible durations, audio generation |
| kling-video-o1 | 5s or 10s only | No | No | Highest visual quality |
When --model is left on auto, the step upgrades to kling-video-o1 if the duration is 5 or 10 seconds and sound is off.
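The model rule and the parameter constraints from the tables above can be restated as a short Python sketch. It is illustrative only. Treating multi-shot as a reason to stay on kling-v3-omni is inferred from the Models table, and the function names are hypothetical.

```python
def choose_model(duration, sound_on, multi_shot=False):
    """Mirror the documented auto-selection rule."""
    if duration in (5, 10) and not sound_on and not multi_shot:
        return "kling-video-o1"
    return "kling-v3-omni"


def validate(duration=None, first_frame=None, last_frame=None, ref_images=()):
    """Return a list of constraint violations (empty when the request is valid)."""
    errors = []
    if duration is not None and not 3 <= duration <= 15:
        errors.append("--duration must be between 3 and 15 seconds")
    if last_frame and not first_frame:
        errors.append("--last-frame requires --first-frame")
    if len(ref_images) > 7:
        errors.append("at most 7 --ref-image values are allowed")
    return errors


print(choose_model(10, sound_on=False))  # kling-video-o1
print(choose_model(10, sound_on=True))   # kling-v3-omni
print(validate(duration=20, last_frame="end.png"))
```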
analyze_media
Analyze a media file (video, audio, or image) with Gemini Flash. Supports description, timestamps, and structured output.
montaj analyze-media clip.mp4 --prompt "Describe the scene in 2 sentences."
montaj analyze-media song.mp3 --prompt "Transcribe with timestamps."
montaj analyze-media photo.jpg --prompt "Return JSON: {subject, mood, dominant_colors}" --json-output
montaj analyze-media clip.mp4 --prompt "..." --model gemini-2.5-pro
# Override model
montaj analyze-media clip.mp4 --prompt "..." --out analysis.txt
# Write to file
| Param | Default | Description |
|---|---|---|
| <input> | required | Video, audio, or image file (positional) |
| --prompt <text> | required | Analysis prompt |
| --model <id> | gemini-2.5-flash | Model override |
| --json-output | — | Request structured JSON response from the model |
| --out <path> | — | Write output to file |
Note: Images under approximately 18 MB take a fast inline path (no Files API round-trip).
generate_image
Generate an image via Gemini or OpenAI — text-to-image or reference-conditioned.
montaj generate-image --prompt "portrait, studio lighting" --out /tmp/portrait.png
montaj generate-image --prompt "same character, profile view" --ref-image /tmp/portrait.png --out /tmp/profile.png
montaj generate-image --prompt "red apple on white table" --provider openai --out /tmp/apple.png
montaj generate-image --prompt "..." --provider gemini --aspect-ratio 9:16 --out /tmp/tall.png
| Param | Default | Description |
|---|---|---|
| --prompt <text> | required | Generation prompt |
| --out <path> | required | Output file path |
| --provider <name> | gemini | Provider: gemini or openai |
| --ref-image <img> | — | Reference image (repeatable) |
| --size <WxH> | — | Image dimensions |
| --aspect-ratio <ratio> | — | Aspect ratio (Gemini only) |
generate_music
Generate music via Gemini Lyria 3 Clip — produces approximately 30 seconds of audio from a text prompt.
montaj generate-music --prompt "upbeat electronic, 120 bpm" --out /tmp/music.wav
montaj generate-music --prompt "acoustic guitar, mellow" --with-vocals --out /tmp/song.wav
| Param | Default | Description |
|---|---|---|
| --prompt <text> | required | Music description |
| --out <path> | required | Output file path (.wav) |
| --with-vocals | false | Include vocals in generation |
| --seed <n> | — | Random seed for reproducibility |
generate_voiceover
Generate speech audio via Kling TTS or Gemini TTS.
montaj generate-voiceover --text "Welcome to our farm" --out /tmp/vo.wav
montaj generate-voiceover --text-file script.txt --voice Kore --vendor gemini --out /tmp/vo.wav
montaj generate-voiceover --text "..." --vendor kling --out /tmp/vo.mp3
| Param | Default | Description |
|---|---|---|
| --text <str> | — | Text to speak (mutually exclusive with --text-file) |
| --text-file <path> | — | Text file to read |
| --voice <name> | auto | Voice name (vendor-specific) |
| --vendor <name> | kling | TTS vendor: kling or gemini |
| --speed <n> | — | Speech speed multiplier |
| --language <code> | — | Language code |
| --model <name> | — | Override default model |
fetch
Download videos from URLs via yt-dlp.
montaj fetch --url "https://www.tiktok.com/@handle/video/123"
montaj fetch --url "https://www.tiktok.com/@handle" --limit 15 --out ./clips/
| Param | Default | Description |
|---|---|---|
| --url <str> | required | Video or profile URL |
| --limit <n> | — | Max number of videos to download |
| --format <str> | auto | yt-dlp format selector |
Models Used
| Provider | Use Case | Model |
|---|---|---|
| Kling | Video generation | kling-v3-omni, kling-video-o1 |
| Kling | Text-to-speech | kling-tts-v1 |
| Gemini | Media analysis | gemini-2.5-flash |
| Gemini | Image generation | gemini-3-pro-image-preview |
| Gemini | Text-to-speech | gemini-2.5-flash-preview-tts |
| Gemini | Music generation | lyria-3-clip-preview |
| OpenAI | Image generation | gpt-image-1 |