Steps Reference
All built-in steps — inspect, smart cuts, edit, enrich, VFX, and generation.
Steps are the building blocks of Montaj's editing pipeline. Each step is an executable that follows the output convention (result on stdout, errors on stderr). Steps are composable at the shell level and discoverable via CLI, HTTP, and MCP.
Inspect Steps
Inspect steps give the agent (or human) an understanding of the source material before editing begins.
probe
Extract metadata from a video file: duration, resolution, fps, codec, audio channels.
montaj step probe --input clip.mp4
# → JSON: duration, resolution, fps, codec, audio channels

The output is JSON on stdout — the agent uses this to understand what it's working with and make informed editing decisions.
snapshot
Generate a contact sheet frame grid — the agent's visual understanding of the clip.
montaj step snapshot --input clip.mp4
# → /path/to/snapshot.png (frame grid contact sheet)
montaj step snapshot --input clip.mp4 --cols 3 --rows 3
# Custom grid size

| Param | Default | Description |
|---|---|---|
| `--cols <n>` | auto | Number of columns in the grid |
| `--rows <n>` | auto | Number of rows in the grid |
virtual_to_original
Map virtual-timeline timestamps to original-file timestamps (or the reverse). Useful for inspecting and debugging trim specs.
montaj step virtual_to_original --input spec.json 47.32
# → 95.483 (virtual timestamp → original-file timestamp)
montaj step virtual_to_original --input spec.json 47.32 53.23 66.89
# → one result per line
montaj step virtual_to_original --input spec.json --inverse 95.483
# → 47.320 (original-file timestamp → virtual timestamp)

| Param | Description |
|---|---|
| `--input <spec.json>` | Trim spec file |
| `--inverse` | Reverse direction: original → virtual |
| Positional args | One or more timestamps to map |
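The mapping can be sketched in a few lines of Python. This is an illustrative model, not Montaj's implementation, and it assumes the spec's keeps are ordered `[start, end]` pairs in original-file seconds:

```python
def virtual_to_original(keeps, t):
    """Map a virtual-timeline timestamp to original-file time.

    keeps: list of [start, end] ranges in original-file seconds,
    in playback order. The virtual timeline is these ranges
    butted together with the gaps removed.
    """
    elapsed = 0.0  # virtual time consumed by earlier keeps
    for start, end in keeps:
        length = end - start
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    raise ValueError("timestamp past end of virtual timeline")

# Two keeps: originals 10-20s and 30-40s become virtual 0-10s and 10-20s.
keeps = [[10.0, 20.0], [30.0, 40.0]]
print(virtual_to_original(keeps, 5.0))   # → 15.0
print(virtual_to_original(keeps, 12.0))  # → 32.0
```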
Smart Cut Steps
Smart cut steps analyze audio and produce trim specs — JSON files describing which ranges of the original source to keep. No video encoding happens at this stage.
waveform_trim
Waveform silence analysis — detects silent sections and outputs a trim spec. Near-instant, no encode.
montaj step waveform_trim --input clip.mp4
# → /path/to/clip_spec.json
montaj step waveform_trim --input clip.mp4 --threshold -30 --min-silence 0.3
# Custom sensitivity

| Param | Default | Description |
|---|---|---|
| `--threshold <dB>` | auto | Volume threshold for silence detection (negative dB value) |
| `--min-silence <s>` | auto | Minimum silence duration to trigger a cut (seconds) |
Output: a trim spec JSON with `input` (the original file) and `keeps` (time ranges to keep).
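For illustration, a minimal spec might look like the following. Only the `input` and `keeps` fields are documented; the exact encoding of each range shown here is an assumption:

```json
{
  "input": "/path/to/clip.mp4",
  "keeps": [[0.0, 3.2], [5.1, 9.8]]
}
```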
rm_fillers
Remove filler words (um, uh, hmm) from speech. Takes a trim spec as input and outputs a refined trim spec.
montaj step rm_fillers --input clip.mp4
# → /path/to/clip_spec.json
montaj step rm_fillers --input clip.mp4 --model medium.en
# Higher accuracy, slower

| Param | Default | Description |
|---|---|---|
| `--model <name>` | base.en | Whisper model: tiny.en, base.en, medium.en, large |
Important: rm_fillers can accept either a video file or a trim spec as input. When given a trim spec, it refines the existing keeps list.
rm_nonspeech
Remove all non-speech audio (noisy ambient audio). Takes a trim spec as input and outputs a refined trim spec.
montaj step rm_nonspeech --input clip.mp4
montaj step rm_nonspeech --input clip.mp4 --model base --max-word-gap 0.18 --sentence-edge 0.10

| Param | Default | Description |
|---|---|---|
| `--model <name>` | base | Whisper model: base, small, medium |
| `--max-word-gap <s>` | 0.18 | Maximum gap between words before splitting |
| `--sentence-edge <s>` | 0.10 | Padding at sentence boundaries |
Important: Input should be a trim spec, not a raw video file.
crop_spec
Crop a trim spec to virtual-timeline windows. Outputs a refined trim spec — no encode.
montaj step crop_spec --input spec.json --keep 8.5:14.8
# Crop to a single window
montaj step crop_spec --input spec.json --keep 0:2.4 --keep 13.84:18.33
# Multiple windows — keeps are concatenated in order
montaj step crop_spec --input spec.json --keep 40.28:end
# Open-ended — keep from virtual 40.28s to end of clip

| Param | Description |
|---|---|
| `--keep <start:end>` | Virtual-timeline window to keep (repeatable). Use the `end` sentinel for open-ended windows. |
| `--out <path>` | Output path (default: <input>_cropped.json) |
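The windowing logic can be modeled as intersecting each virtual window with the keeps. A hypothetical Python sketch (not Montaj's code), assuming keeps are ordered `[start, end]` pairs in original-file seconds:

```python
def crop_keeps(keeps, vstart, vend):
    """Crop keeps ([start, end] pairs in original-file seconds)
    to a single virtual-timeline window [vstart, vend]."""
    out, elapsed = [], 0.0
    for start, end in keeps:
        length = end - start
        # Overlap of this keep's virtual span with the window.
        lo = max(vstart, elapsed)
        hi = min(vend, elapsed + length)
        if lo < hi:
            out.append([start + (lo - elapsed), start + (hi - elapsed)])
        elapsed += length
    return out

keeps = [[10.0, 20.0], [30.0, 40.0]]
print(crop_keeps(keeps, 5.0, 15.0))  # → [[15.0, 20.0], [30.0, 35.0]]
```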
How Smart Cuts Chain
Smart cut steps form a pipeline where each step refines the trim spec:
waveform_trim(clip.MOV)
→ trim spec (silence removed)
rm_fillers(spec.json)
→ refined spec (fillers removed)
rm_nonspeech(spec.json)
→ refined spec (non-speech removed)
crop_spec(spec.json)
→ cropped spec (virtual-timeline windows)

The original source file path is preserved through the entire chain. Only concat or materialize_cut actually encodes video.
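Conceptually, each refinement step removes newly detected ranges (fillers, non-speech) from the keeps list. A minimal Python sketch of that subtraction, assuming keeps and removals are sorted, non-overlapping `[start, end]` pairs in original-file seconds (the spec's internal layout is an assumption):

```python
def subtract(keeps, removals):
    """Remove [start, end] ranges (original-file time) from keeps.
    Both lists are assumed sorted and non-overlapping."""
    out = []
    for start, end in keeps:
        cur = start
        for rs, re in removals:
            if cur < rs < end:
                out.append([cur, min(rs, end)])
            cur = max(cur, min(re, end))
            if cur >= end:
                break
        if cur < end:
            out.append([cur, end])
    return out

# A filler at 4.0-4.6s splits the first keep.
print(subtract([[0.0, 10.0], [12.0, 15.0]], [[4.0, 4.6]]))
# → [[0.0, 4.0], [4.6, 10.0], [12.0, 15.0]]
```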
Edit Steps
Core video editing operations. Most of these are ffmpeg wrappers with careful codec handling.
trim
Cut by in/out point — extract a segment from a video.
montaj step trim --input clip.mp4 --start 2.5 --end 8.3
# Extract from 2.5s to 8.3s
montaj step trim --input clip.mp4 --start 00:00:02 --end 00:01:30
# HH:MM:SS format also accepted

| Param | Description |
|---|---|
| `--start <t>` | Start time (seconds or HH:MM:SS) |
| `--end <t>` | End time (seconds or HH:MM:SS) |
| `--duration <t>` | Duration instead of end time |
cut
Remove one or more sections from a video and rejoin — the opposite of trim.
montaj step cut --input clip.mp4 --start 3.0 --end 7.5
# Remove a single section and rejoin
montaj step cut --input clip.mp4 --cuts '[[0,1.2],[5.3,7.8]]'
# Remove multiple sections in one ffmpeg pass
montaj step cut --input clip.mp4 --cuts '[[3.0,7.5]]' --spec
# Write a trim spec JSON instead of encoding — use with concat for deferred encode

| Param | Description |
|---|---|
| `--start <t>`, `--end <t>` | Remove a single section |
| `--cuts <json>` | Remove multiple sections: [[start, end], ...] |
| `--spec` | Output a trim spec instead of encoding video |
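The `--spec` conversion amounts to complementing the cut list against the clip duration. A hypothetical sketch of that logic (not Montaj's code):

```python
def cuts_to_keeps(duration, cuts):
    """Convert removed [start, end] sections into the kept ranges
    a trim spec would describe."""
    keeps, cur = [], 0.0
    for start, end in sorted(cuts):
        if start > cur:
            keeps.append([cur, start])
        cur = max(cur, end)
    if cur < duration:
        keeps.append([cur, duration])
    return keeps

print(cuts_to_keeps(10.0, [[0.0, 1.2], [5.3, 7.8]]))
# → [[1.2, 5.3], [7.8, 10.0]]
```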
concat
Join clips and apply all trim specs in a single encode pass. This is the only step in the normal pipeline that writes video.
montaj step concat --inputs spec1.json spec2.json

concat reads trim specs (or raw video files), builds a single ffmpeg filter_complex, and produces one output file. All accumulated cuts from the editing pipeline are applied here.
HEVC source files are handled automatically — no pre-conversion needed.
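The filter graph concat builds can be pictured roughly like this: a simplified, video-only sketch using ffmpeg's `trim`/`setpts`/`concat` filters (the real step also handles audio and multiple inputs):

```python
def build_filter(keeps):
    """Build a single-input ffmpeg filter_complex that trims each
    keep range and concatenates the pieces (video only, for brevity)."""
    parts, labels = [], []
    for i, (start, end) in enumerate(keeps):
        parts.append(
            f"[0:v]trim=start={start}:end={end},setpts=PTS-STARTPTS[v{i}]"
        )
        labels.append(f"[v{i}]")
    parts.append(f"{''.join(labels)}concat=n={len(keeps)}:v=1:a=0[out]")
    return ";".join(parts)

print(build_filter([[1.2, 5.3], [7.8, 10.0]]))
# → [0:v]trim=start=1.2:end=5.3,setpts=PTS-STARTPTS[v0];...
```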
materialize_cut
Encode a trim spec or raw video segment to H.264. Used only when a subsequent step (e.g., remove_bg) requires an actual video file rather than a trim spec.
montaj step materialize_cut spec.json
# Encode a trim spec to video
montaj step materialize_cut clip.mp4 --inpoint 2.0 --outpoint 8.0
# Encode a raw segment
montaj step materialize_cut --inputs clip0.json clip1.json
# Multiple clips — caps at 2 concurrent encodes by default

Uses input-level seeking (-ss/-t before -i) so only the requested segment is decoded.
| Param | Description |
|---|---|
| `--inpoint <t>` | Start time (for raw video input) |
| `--outpoint <t>` | End time (for raw video input) |
| `--inputs <files>` | Multiple trim specs or clips |
| `--workers <n>` | Max concurrent encodes (default: 2) |
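Input-level seeking just means placing `-ss` and `-t` ahead of `-i` on the ffmpeg command line. A hypothetical sketch of the argv this produces (the actual flags Montaj adds around it are not documented here):

```python
def seek_args(src, inpoint, outpoint, dst):
    """ffmpeg argv with input-level seeking: -ss/-t placed before -i,
    so ffmpeg seeks in the input and decodes only the segment."""
    return [
        "ffmpeg",
        "-ss", str(inpoint),
        "-t", str(outpoint - inpoint),
        "-i", src,
        "-c:v", "libx264",  # H.264 output, per materialize_cut's contract
        dst,
    ]

print(seek_args("clip.mp4", 2.0, 8.0, "out.mp4"))
```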
resize
Reframe video to a target aspect ratio.
montaj step resize --input clip.mp4 --ratio 9:16 # TikTok / Reels / Shorts
montaj step resize --input clip.mp4 --ratio 1:1 # Instagram
montaj step resize --input clip.mp4 --ratio 16:9  # YouTube

| Param | Description |
|---|---|
| `--ratio <ratio>` | Target aspect ratio: 9:16, 1:1, 16:9 |
extract_audio
Extract the audio track from a video file.
montaj step extract_audio --input clip.mp4 # default: wav
montaj step extract_audio --input clip.mp4 --format mp3

| Param | Default | Description |
|---|---|---|
| `--format <fmt>` | wav | Output format: wav, mp3, aac |
Enrich Steps
Enrich steps add information (transcripts, captions) or polish (loudness normalization) to clips.
transcribe
Generate SRT and JSON transcripts with word-level timestamps using local whisper.cpp.
montaj step transcribe --input clip.mp4
# → transcript.json + clip.srt
montaj step transcribe --input clip.mp4 --model medium.en
# Higher accuracy, slower
montaj step transcribe --input clip.mp4 --language es
# Non-English

| Param | Default | Description |
|---|---|---|
| `--model <name>` | base.en | Whisper model: base.en, medium.en |
| `--language <code>` | en | Language code for non-English |
When given a trim spec as input, transcribe extracts audio only at the keep ranges, runs whisper on the joined audio, and remaps word timestamps back to the original timeline.
Output includes:
- JSON transcript with word-level timestamps (start, end, word)
- SRT file for standard subtitle format
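The remap described above is a walk over the keeps: each word's timestamp on the joined-audio (virtual) timeline is shifted into the keep range it falls in. An illustrative sketch, not Montaj's code, assuming keeps are ordered `[start, end]` pairs in original-file seconds:

```python
def remap_words(words, keeps):
    """Shift word timestamps from the joined-audio (virtual) timeline
    back onto the original-file timeline."""
    def to_original(t):
        elapsed = 0.0
        for start, end in keeps:
            if t <= elapsed + (end - start):
                return start + (t - elapsed)
            elapsed += end - start
        return t  # past the last keep; leave unchanged
    return [
        {**w, "start": to_original(w["start"]), "end": to_original(w["end"])}
        for w in words
    ]

words = [{"word": "Hello", "start": 0.0, "end": 0.5}]
print(remap_words(words, [[10.0, 20.0]]))
# → [{'word': 'Hello', 'start': 10.0, 'end': 10.5}]
```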
caption
Convert a transcript into an animated caption track. Produces data (not pixels) — rendered at review/final render time by the UI and render engine.
montaj step caption --input transcript.json
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style subtitle

| Param | Default | Description |
|---|---|---|
| `--style <name>` | word-by-word | Caption style: word-by-word, pop, karaoke, subtitle |
Caption Styles
| Style | Description |
|---|---|
| `word-by-word` | One word at a time, spring pop-in animation |
| `pop` | Segment-at-a-time with scale entry |
| `karaoke` | Words highlight progressively as they're spoken |
| `subtitle` | Static line at bottom, segments replace sequentially |
Caption data is stored in project.json as a track with type: "caption" and inline segments with word timestamps:
{
"id": "captions",
"type": "caption",
"style": "word-by-word",
"segments": [
{
"text": "Hello world",
"start": 0.0,
"end": 1.2,
"words": [
{ "word": "Hello", "start": 0.0, "end": 0.5 },
{ "word": "world", "start": 0.5, "end": 1.2 }
]
}
]
}

normalize
Loudness normalization — adjust audio levels to meet platform standards.
montaj step normalize --input clip.mp4 # youtube = -14 LUFS
montaj step normalize --input clip.mp4 --target podcast # -16 LUFS
montaj step normalize --input clip.mp4 --target broadcast # -23 LUFS
montaj step normalize --input clip.mp4 --target custom --lufs -18

| Param | Default | Description |
|---|---|---|
| `--target <name>` | youtube | Target preset: youtube (-14 LUFS), podcast (-16 LUFS), broadcast (-23 LUFS), custom |
| `--lufs <n>` | — | Custom LUFS value (only with --target custom) |
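Loudness normalization of this kind is typically implemented with ffmpeg's `loudnorm` filter. A hypothetical sketch mapping the presets above to a filter string; the `TP` (true peak) and `LRA` (loudness range) values are illustrative defaults, not documented Montaj behavior:

```python
PRESETS = {"youtube": -14, "podcast": -16, "broadcast": -23}

def loudnorm_filter(target="youtube", lufs=None):
    """Build an ffmpeg loudnorm filter string for a target preset.
    TP and LRA here are illustrative, commonly used defaults."""
    i = lufs if target == "custom" else PRESETS[target]
    return f"loudnorm=I={i}:TP=-1.5:LRA=11"

print(loudnorm_filter())                    # → loudnorm=I=-14:TP=-1.5:LRA=11
print(loudnorm_filter("custom", lufs=-18))  # → loudnorm=I=-18:TP=-1.5:LRA=11
```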
VFX Steps
remove_bg
Remove video background using RVM (Robust Video Matting). Outputs a ProRes 4444 .mov with alpha channel for final render and a VP9 WebM for browser preview.
montaj step remove_bg --input clip.mp4
montaj step remove_bg --inputs clip0.mp4 clip1.mp4
# Multiple clips
montaj step remove_bg --input clip.mp4 --model rvm_resnet50
# Higher quality model
montaj step remove_bg --input clip.mp4 --downsample 0.5
# Downsample for faster processing

| Param | Default | Description |
|---|---|---|
| `--inputs <files>` | — | Multiple clips to process |
| `--model <name>` | rvm_mobilenetv3 | Model: rvm_mobilenetv3 (faster) or rvm_resnet50 (higher quality) |
| `--downsample <factor>` | — | Downsample factor for faster processing |
| `--progress` | — | Show progress (recommended for long-running operations) |
| `--workers <n>` | 2 | Max concurrent encodes |
Requirements
Requires montaj install rvm — installs torch, torchvision, av (pip) + RVM model weights.
Output
The step produces two files and updates the project item:
- `nobg_src` — ProRes 4444 `.mov` with alpha channel (for final render)
- `nobg_preview_src` — VP9 WebM (for browser preview — Chrome supports VP9 alpha; ProRes does not play in browsers)
- Sets `remove_bg: true` on the item
Usage in Workflows
Used in the floating_head workflow:
- `materialize_cut` — encode the trim spec to an actual video file
- `remove_bg` — remove background from the materialized file
- Render engine composites the alpha-channel `.mov` over the layers beneath
Important: remove_bg requires an actual video file — pass the output of materialize_cut, not a trim spec. This step is long-running (minutes per clip) — always run in the background with --progress so you can monitor status.
Render Behavior
At render time, when a tracks[1+] item has remove_bg: true and nobg_src is set, the render engine uses the ProRes 4444 .mov (with alpha) in place of the original src. The alpha channel is preserved through the ffmpeg filter graph and composited over the layers beneath.
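The compositing can be pictured as a chain of ffmpeg `overlay` filters, which respect the alpha channel of the upper input. A simplified sketch of such a graph (the real render engine's graph is more involved and is an internal detail):

```python
def overlay_filter(n_layers):
    """Chain n_layers alpha-channel inputs over a base video with
    ffmpeg's overlay filter; overlay honors the upper input's alpha."""
    graph, prev = [], "[0:v]"
    for i in range(1, n_layers + 1):
        out = "[out]" if i == n_layers else f"[b{i}]"
        graph.append(f"{prev}[{i}:v]overlay{out}")
        prev = out
    return ";".join(graph)

print(overlay_filter(2))
# → [0:v][1:v]overlay[b1];[b1][2:v]overlay[out]
```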
Generation Steps
These steps call external APIs (Kling, Gemini, OpenAI) and require montaj install connectors and montaj install credentials.
kling_generate
Generate video via Kling v3 Omni — text-to-video, image-to-video, or reference-guided generation.
montaj kling-generate --prompt "a calico cat walking through a sunlit kitchen, cinematic" --out /tmp/cat.mp4
montaj kling-generate --prompt "slow zoom in" --first-frame frame.png --out /tmp/zoom.mp4
montaj kling-generate --prompt "character walks left" --first-frame start.png --last-frame end.png --out /tmp/walk.mp4
montaj kling-generate --prompt "same style" --ref-image style1.png --ref-image style2.png --out /tmp/styled.mp4
montaj kling-generate --prompt "..." --out /tmp/pro.mp4 --mode pro --duration 10 --aspect-ratio 9:16

| Param | Default | Description |
|---|---|---|
| `--prompt <text>` | required | Generation prompt |
| `--out <path>` | required | Output file path |
| `--first-frame <img>` | — | Starting image for image-to-video |
| `--last-frame <img>` | — | Ending image (requires --first-frame) |
| `--ref-image <img>` | — | Style reference image (repeatable, max 3) |
| `--duration <3-15>` | — | Video duration in seconds |
| `--negative-prompt <text>` | — | What to avoid |
| `--sound <on\|off>` | — | Enable/disable sound |
| `--aspect-ratio <ratio>` | — | 16:9, 9:16, 1:1 |
| `--mode <std\|pro>` | std | Standard (cheaper/faster) or Pro (higher quality) |
analyze_media
Analyze a media file (video, audio, or image) with Gemini Flash. Supports description, timestamps, and structured output.
montaj analyze-media clip.mp4 --prompt "Describe the scene in 2 sentences."
montaj analyze-media song.mp3 --prompt "Transcribe with timestamps."
montaj analyze-media photo.jpg --prompt "Return JSON: {subject, mood, dominant_colors}" --json-output
montaj analyze-media clip.mp4 --prompt "..." --model gemini-2.5-pro
# Override model
montaj analyze-media clip.mp4 --prompt "..." --out analysis.txt
# Write to file

| Param | Default | Description |
|---|---|---|
| `<input>` | required | Video, audio, or image file (positional) |
| `--prompt <text>` | required | Analysis prompt |
| `--model <id>` | gemini-2.5-flash | Model override |
| `--json-output` | — | Request structured JSON response from the model |
| `--out <path>` | — | Write output to file |
Note: Images under approximately 18 MB take a fast inline path (no Files API round-trip).
generate_image
Generate an image via Gemini or OpenAI — text-to-image or reference-conditioned.
montaj generate-image --prompt "portrait, studio lighting" --out /tmp/portrait.png
montaj generate-image --prompt "same character, profile view" --ref-image /tmp/portrait.png --out /tmp/profile.png
montaj generate-image --prompt "red apple on white table" --provider openai --out /tmp/apple.png
montaj generate-image --prompt "..." --provider gemini --aspect-ratio 9:16 --out /tmp/tall.png

| Param | Default | Description |
|---|---|---|
| `--prompt <text>` | required | Generation prompt |
| `--out <path>` | required | Output file path |
| `--provider <name>` | gemini | Provider: gemini or openai |
| `--ref-image <img>` | — | Reference image (repeatable) |
| `--size <WxH>` | — | Image dimensions |
| `--aspect-ratio <ratio>` | — | Aspect ratio (Gemini only) |
Models Used
| Provider | Use Case | Model |
|---|---|---|
| Kling | Video generation | kling-v3-omni |
| Gemini | Media analysis | gemini-2.5-flash |
| Gemini | Image generation | gemini-3-pro-image-preview |
| OpenAI | Image generation | gpt-image-1 |