All built-in steps — inspect, smart cuts, edit, enrich, VFX, audio, lyrics, and generation.

Steps Reference

Steps are the building blocks of Montaj's editing pipeline. Each step is an executable that follows the output convention (result on stdout, errors on stderr). Steps are composable at the shell level and discoverable via CLI, HTTP, and MCP.

Steps are organized into subdirectories by category: audio/, edit/, generate/, lyrics/, media/, speech/, transform/. The CLI resolves step names automatically — you never need to specify the subdirectory.

Inspect Steps

Inspect steps give the agent (or human) an understanding of the source material before editing begins.

probe

Extract metadata from a video file: duration, resolution, fps, codec, audio channels.

montaj step probe --input clip.mp4
# → JSON: duration, resolution, fps, codec, audio channels

The output is JSON on stdout — the agent uses this to understand what it's working with and make informed editing decisions.

snapshot

Generate a contact sheet (a grid of frames) that gives the agent a visual overview of the clip.

montaj step snapshot --input clip.mp4
# → /path/to/snapshot.png (frame grid contact sheet)

montaj step snapshot --input clip.mp4 --cols 3 --rows 3
# Custom grid size

montaj step snapshot --input clip.mp4 --at 5.0
# Extract a single frame at a specific timestamp

Param	Default	Description
`--cols <n>`	auto	Number of columns in the grid
`--rows <n>`	auto	Number of rows in the grid
`--frames <n>`	auto	Total number of frames to extract
`--at <t>`	—	Extract a single frame at a specific timestamp

virtual_to_original

Map virtual-timeline timestamps to original-file timestamps (or the reverse). Useful for inspecting and debugging trim specs.

montaj step virtual_to_original --input spec.json 47.32
# → 95.483  (virtual timestamp → original-file timestamp)

montaj step virtual_to_original --input spec.json 47.32 53.23 66.89
# → one result per line

montaj step virtual_to_original --input spec.json --inverse 95.483
# → 47.320  (original-file timestamp → virtual timestamp)

Param	Description
`--input <spec.json>`	Trim spec file
`--inverse`	Reverse direction: original → virtual
Positional args	One or more timestamps to map
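
The mapping itself is simple arithmetic over the keep ranges. The sketch below is illustrative only: it assumes a trim spec stores `keeps` as a list of `[start, end]` pairs in original-file seconds, and the function names are hypothetical rather than Montaj's own.

```python
from typing import Sequence


def virtual_to_original(keeps: Sequence[Sequence[float]], t: float) -> float:
    """Map a virtual-timeline timestamp to the original file's timeline."""
    elapsed = 0.0  # virtual time consumed by the keeps already walked
    for start, end in keeps:
        length = end - start
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    raise ValueError(f"{t} is past the end of the virtual timeline ({elapsed})")


def original_to_virtual(keeps: Sequence[Sequence[float]], t: float) -> float:
    """Inverse mapping: original-file timestamp to virtual timestamp."""
    elapsed = 0.0
    for start, end in keeps:
        if start <= t <= end:
            return elapsed + (t - start)
        elapsed += end - start
    raise ValueError(f"{t} falls inside a removed range")


# Assumed spec shape: keep 0-2s and 5-9s of the original file.
keeps = [[0.0, 2.0], [5.0, 9.0]]
print(virtual_to_original(keeps, 3.0))  # 6.0
print(original_to_virtual(keeps, 6.0))  # 3.0
```

A timestamp that lands exactly on a boundary between two keeps resolves to the end of the earlier keep in this sketch; the real step may choose differently.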

Smart Cut Steps

Smart cut steps analyze audio and produce trim specs — JSON files describing which ranges of the original source to keep. No video encoding happens at this stage.

waveform_trim

Waveform silence analysis — detects silent sections and outputs a trim spec. Near-instant, no encode.

montaj step waveform_trim --input clip.mp4
# → /path/to/clip_spec.json

montaj step waveform_trim --input clip.mp4 --threshold -30 --min-silence 0.3
# Custom sensitivity

Param	Default	Description
`--threshold <dB>`	auto	Volume threshold for silence detection (negative dB value)
`--min-silence <s>`	auto	Minimum silence duration to trigger a cut (seconds)

Output: a trim spec JSON with input (original file) and keeps (time ranges to keep).

rm_fillers

Remove filler words (um, uh, hmm) from speech. Accepts a video file or a trim spec as input and outputs a trim spec.

montaj step rm_fillers --input clip.mp4
# → /path/to/clip_spec.json

montaj step rm_fillers --input clip.mp4 --model medium.en
# Higher accuracy, slower

Param	Default	Description
`--model <name>`	`base.en`	Whisper model: `tiny.en`, `base.en`, `medium.en`, `large`

Important: rm_fillers can accept either a video file or a trim spec as input. When given a trim spec, it refines the existing keeps list.

rm_nonspeech

Remove everything that is not speech, such as noisy ambient audio. Takes a trim spec as input and outputs a refined trim spec.

montaj step rm_nonspeech --input spec.json

montaj step rm_nonspeech --input spec.json --model base --max-word-gap 0.18 --sentence-edge 0.10

Param	Default	Description
`--model <name>`	`base`	Whisper model: `base`, `small`, `medium`
`--max-word-gap <s>`	`0.18`	Maximum gap between words before splitting
`--sentence-edge <s>`	`0.10`	Padding at sentence boundaries

Important: Input should be a trim spec, not a raw video file.

crop_spec

Crop a trim spec to virtual-timeline windows. Outputs a refined trim spec — no encode.

montaj step crop_spec --input spec.json --keep 8.5:14.8
# Crop to a single window

montaj step crop_spec --input spec.json --keep 0:2.4 --keep 13.84:18.33
# Multiple windows — keeps are concatenated in order

montaj step crop_spec --input spec.json --keep 40.28:end
# Open-ended — keep from virtual 40.28s to end of clip

Param	Description
`--keep <start:end>`	Virtual-timeline window to keep (repeatable). Use `end` sentinel for open-ended.
`--out <path>`	Output path (default: `<input>_cropped.json`)
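
To make the window semantics concrete, here is a rough sketch of how cropping could work on the keep ranges alone. It assumes `keeps` is a list of `[start, end]` pairs in original-file seconds and uses `None` to stand in for the `end` sentinel; `crop_keeps` is a hypothetical helper, not part of Montaj.

```python
from typing import List, Optional, Sequence, Tuple


def crop_keeps(
    keeps: Sequence[Sequence[float]],
    windows: Sequence[Tuple[float, Optional[float]]],
) -> List[List[float]]:
    """Crop original-timeline keeps to virtual-timeline windows.

    Each window is (v_start, v_end); v_end=None means "to the end of the clip".
    Windows are processed in the order given, so their results are concatenated.
    """
    out: List[List[float]] = []
    for v_start, v_end in windows:
        elapsed = 0.0  # virtual time at the start of the current keep
        for start, end in keeps:
            seg_v0, seg_v1 = elapsed, elapsed + (end - start)
            lo = max(v_start, seg_v0)
            hi = seg_v1 if v_end is None else min(v_end, seg_v1)
            if lo < hi:  # the window overlaps this keep
                out.append([start + (lo - seg_v0), start + (hi - seg_v0)])
            elapsed = seg_v1
    return out


keeps = [[0.0, 2.0], [5.0, 9.0]]  # 6 seconds of virtual timeline
print(crop_keeps(keeps, [(1.0, 4.0)]))   # [[1.0, 2.0], [5.0, 7.0]]
print(crop_keeps(keeps, [(4.0, None)]))  # [[7.0, 9.0]]
```

Because only the list of ranges changes, this kind of crop needs no decode or encode, which matches the "no encode" note above.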

How Smart Cuts Chain

Smart cut steps form a pipeline where each step refines the trim spec:

waveform_trim(clip.MOV)
  → trim spec (silence removed)

rm_fillers(spec.json)
  → refined spec (fillers removed)

rm_nonspeech(spec.json)
  → refined spec (non-speech removed)

crop_spec(spec.json)
  → cropped spec (virtual-timeline windows)

The original source file path is preserved through the entire chain. None of these steps encodes video; encoding happens only when materialize_cut runs.
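
The chain can be driven from a script. The following is a minimal sketch, assuming each step prints only the path of its output spec on stdout (as the `# →` comments in the examples suggest); `run_step` and `smart_cut_chain` are hypothetical helper names.

```python
import subprocess
from typing import Callable, List


def run_step(argv: List[str]) -> str:
    """Run one step and return its stdout, assumed to be the output path."""
    done = subprocess.run(argv, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def smart_cut_chain(clip: str, run: Callable[[List[str]], str] = run_step) -> str:
    """Feed each step's spec into the next and return the final spec path."""
    spec = run(["montaj", "step", "waveform_trim", "--input", clip])
    for step in ("rm_fillers", "rm_nonspeech"):
        spec = run(["montaj", "step", step, "--input", spec])
    return spec
```

Calling `smart_cut_chain("clip.MOV")` would run the three analysis steps in order. The `run` parameter exists so the chain can be exercised with a stand-in runner when the CLI is not installed.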

Edit Steps

Core video editing operations. Most of these are ffmpeg wrappers with careful codec handling.

trim

Cut by in/out point — extract a segment from a video.

montaj step trim --input clip.mp4 --start 2.5 --end 8.3
# Extract from 2.5s to 8.3s

montaj step trim --input clip.mp4 --start 00:00:02 --end 00:01:30
# HH:MM:SS format also accepted

Param	Description
`--start <t>`	Start time (seconds or HH:MM:SS)
`--end <t>`	End time (seconds or HH:MM:SS)
`--duration <t>`	Duration instead of end time
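
As an illustration of the two accepted time formats, the sketch below parses either form into seconds and derives an end time from a duration. The helper names are hypothetical, and treating `--end` and `--duration` as mutually exclusive is an assumption based on the table above.

```python
from typing import Optional


def to_seconds(value: str) -> float:
    """Parse plain seconds ("2.5") or HH:MM:SS ("00:01:30") into seconds."""
    parts = value.split(":")
    if len(parts) == 1:
        return float(parts[0])
    if len(parts) != 3:
        raise ValueError(f"expected seconds or HH:MM:SS, got {value!r}")
    hours, minutes, seconds = parts
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def resolve_end(start: str, end: Optional[str] = None,
                duration: Optional[str] = None) -> float:
    """Return the end time in seconds from either an end or a duration."""
    if (end is None) == (duration is None):
        raise ValueError("pass exactly one of end or duration")
    if end is not None:
        return to_seconds(end)
    return to_seconds(start) + to_seconds(duration)


print(to_seconds("00:01:30"))               # 90.0
print(resolve_end("2.5", duration="5.5"))   # 8.0
```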

cut

Remove one or more sections from a video and rejoin — the opposite of trim.

montaj step cut --input clip.mp4 --start 3.0 --end 7.5
# Remove a single section and rejoin

montaj step cut --input clip.mp4 --cuts '[[0,1.2],[5.3,7.8]]'
# Remove multiple sections in one ffmpeg pass

montaj step cut --input clip.mp4 --cuts '[[3.0,7.5]]' --spec
# Write a trim spec JSON instead of encoding — use with materialize_cut for deferred encode

Param	Description
`--start <t>`, `--end <t>`	Remove a single section
`--cuts <json>`	Remove multiple sections: `[[start, end], ...]`
`--spec`	Output a trim spec instead of encoding video

jump_cut

Remove time ranges from video — similar to cut, but oriented toward jump-cut editing with explicit keeps or cuts.

montaj step jump_cut --input clip.mp4 --cuts '[[2.0, 3.5], [7.0, 8.0]]'
# Remove specified ranges

montaj step jump_cut --input clip.mp4 --keeps '[[0, 2.0], [3.5, 7.0], [8.0, 12.0]]'
# Keep only specified ranges

Param	Description
`--cuts <json>`	Time ranges to remove: `[[start, end], ...]`
`--keeps <json>`	Time ranges to keep: `[[start, end], ...]`
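
The two flags describe the same edit from opposite sides: the keeps are whatever remains once the cuts are removed. A small sketch of that relationship follows, using the ranges from the examples above and assuming a 12-second clip; `cuts_to_keeps` is a hypothetical helper.

```python
from typing import List, Sequence


def cuts_to_keeps(cuts: Sequence[Sequence[float]], duration: float) -> List[List[float]]:
    """Return the ranges that remain after removing the cuts from [0, duration]."""
    keeps: List[List[float]] = []
    cursor = 0.0  # start of the next range that might be kept
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append([cursor, start])
        cursor = max(cursor, end)  # tolerate overlapping cuts
    if cursor < duration:
        keeps.append([cursor, duration])
    return keeps


print(cuts_to_keeps([[2.0, 3.5], [7.0, 8.0]], 12.0))
# [[0.0, 2.0], [3.5, 7.0], [8.0, 12.0]]
```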

cross_cut

Interleave A-roll and B-roll segments — creates a cross-cut edit between two video sources.

montaj step cross_cut --input a-roll.mp4 --b-roll b-roll.mp4 --segment-duration 3

Param	Default	Description
`--b-roll <file>`	required	B-roll video file
`--segment-duration <s>`	auto	Duration of each interleaved segment

montage

Create a rapid montage from multiple clips: each clip is cut into short segments, and the segments are assembled into one video.

montaj step montage --inputs clip1.mp4 clip2.mp4 clip3.mp4 --beat-duration 2

Param	Default	Description
`--inputs <files>`	required	Multiple source clips
`--beat-duration <s>`	auto	Duration of each montage beat
`--offset <s>`	`0`	Offset into each source clip

materialize_cut

Encode a trim spec or raw video segment to H.264. Used only when a subsequent step (e.g., remove_bg) requires an actual video file rather than a trim spec.

montaj step materialize_cut spec.json
# Encode a trim spec to video

montaj step materialize_cut clip.mp4 --inpoint 2.0 --outpoint 8.0
# Encode a raw segment

montaj step materialize_cut --inputs clip0.json clip1.json
# Multiple clips — caps at 2 concurrent encodes by default

Uses input-level seeking (-ss/-t before -i) so only the requested segment is decoded.

Param	Description
`--inpoint <t>`	Start time (for raw video input)
`--outpoint <t>`	End time (for raw video input)
`--inputs <files>`	Multiple trim specs or clips
`--workers <n>`	Max concurrent encodes (default: 2)
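
Input-level seeking means the seek options come before the input they apply to. The sketch below builds such an ffmpeg command line for a raw segment. It is only an illustration: the choice of `libx264` as the H.264 encoder and the overall argument list are assumptions, not Montaj's actual invocation.

```python
from typing import List


def materialize_argv(src: str, inpoint: float, outpoint: float, out: str) -> List[str]:
    """Build an ffmpeg command that seeks on the input side, then encodes H.264."""
    duration = outpoint - inpoint
    return [
        "ffmpeg", "-y",
        "-ss", f"{inpoint:.3f}",   # placed before -i: seek applies to the input
        "-t", f"{duration:.3f}",   # placed before -i: limits how much input is read
        "-i", src,
        "-c:v", "libx264",         # assumed encoder for H.264 output
        out,
    ]


print(" ".join(materialize_argv("clip.mp4", 2.0, 8.0, "out.mp4")))
# ffmpeg -y -ss 2.000 -t 6.000 -i clip.mp4 -c:v libx264 out.mp4
```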

resize

Reframe video to a target aspect ratio.

montaj step resize --input clip.mp4 --ratio 9:16    # TikTok / Reels / Shorts
montaj step resize --input clip.mp4 --ratio 1:1     # Instagram
montaj step resize --input clip.mp4 --ratio 16:9    # YouTube

Param	Description
`--ratio <ratio>`	Target aspect ratio: `9:16`, `1:1`, `16:9`
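
One common way to reframe is a centered crop. The sketch below computes such a crop rectangle for a target ratio. Whether Montaj crops, pads, or does something smarter is not stated here, so treat this purely as an assumed illustration with a hypothetical helper name.

```python
from typing import Tuple


def center_crop(width: int, height: int, ratio: str) -> Tuple[int, int, int, int]:
    """Return (crop_w, crop_h, x, y) for a centered crop to the target ratio."""
    rw, rh = (int(part) for part in ratio.split(":"))
    # Compare aspect ratios by cross-multiplying to stay in integers.
    if width * rh > height * rw:
        # Source is wider than the target: keep full height, trim the sides.
        crop_w, crop_h = height * rw // rh, height
    else:
        # Source is taller than (or equal to) the target: keep full width.
        crop_w, crop_h = width, width * rh // rw
    # Round down to even dimensions, which 4:2:0 H.264 encoding requires.
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2
    return crop_w, crop_h, (width - crop_w) // 2, (height - crop_h) // 2


print(center_crop(1920, 1080, "9:16"))  # (606, 1080, 657, 0)
print(center_crop(1920, 1080, "1:1"))   # (1080, 1080, 420, 0)
print(center_crop(1920, 1080, "16:9"))  # (1920, 1080, 0, 0)
```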

extract_audio

Extract the audio track from a video file.

montaj step extract_audio --input clip.mp4              # default: wav
montaj step extract_audio --input clip.mp4 --format mp3

Param	Default	Description
`--format <fmt>`	`wav`	Output format: `wav`, `mp3`, `aac`

Audio Steps

Audio processing steps for stem separation and waveform analysis.

stem_separation

Separate audio into stems (vocals, drums, bass, other) using Demucs.

montaj step stem_separation --input song.mp3 --stems vocals
# Extract vocals only

montaj step stem_separation --input song.mp3 --out-dir ./stems/
# All stems to a directory

Param	Default	Description
`--stems <name>`	all	Which stem to extract: `vocals`, `drums`, `bass`, `other`
`--out-dir <path>`	auto	Output directory for stems
`--model <name>`	`htdemucs`	Demucs model: `htdemucs`, `htdemucs_ft`, `mdx_extra`

Requires montaj install demucs.

waveform_image

Generate a PNG waveform visualization grid from an audio or video file.

montaj step waveform_image --input clip.mp4
# → /path/to/waveform.png

montaj step waveform_image --input clip.mp4 --chunk-duration 30
# One row per 30s chunk

Param	Default	Description
`--chunk-duration <s>`	auto	Duration per waveform row in the grid

Enrich Steps

Enrich steps add information (transcripts, captions) or polish (loudness normalization) to clips.

transcribe

Generate SRT and JSON transcripts with word-level timestamps using local whisper.cpp.

montaj step transcribe --input clip.mp4
# → transcript.json + clip.srt

montaj step transcribe --input clip.mp4 --model medium.en
# Higher accuracy, slower

montaj step transcribe --input clip.mp4 --language es
# Non-English

Param	Default	Description
`--model <name>`	`base.en`	Whisper model: `base.en`, `medium.en`
`--language <code>`	`en`	Language code of the spoken audio; set it for non-English input

When given a trim spec as input, transcribe extracts audio only at the keep ranges, runs whisper on the joined audio, and remaps word timestamps back to the original timeline.

Output includes:

JSON transcript with word-level timestamps (start, end, word)
SRT file for standard subtitle format
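
To show how the two outputs relate, here is a simplified sketch that turns word-level entries (with the `start`, `end`, and `word` fields listed above) into SRT cues. It emits one cue per word for brevity, which is an assumption; a real subtitle file would normally group words into lines.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Dict]) -> str:
    """Build SRT text with one numbered cue per word."""
    cues = []
    for index, w in enumerate(words, start=1):
        cues.append(
            f"{index}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{w['word']}\n"
        )
    return "\n".join(cues)


print(srt_time(1.2))     # 00:00:01,200
print(srt_time(3661.5))  # 01:01:01,500
```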

caption

Convert a transcript into an animated caption track. Produces data (not pixels) — rendered at review/final render time by the UI and render engine.

montaj step caption --input transcript.json
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style subtitle

Param	Default	Description
`--style <name>`	`word-by-word`	Caption style: `word-by-word`, `pop`, `karaoke`, `subtitle`

Caption Styles

Style	Description
`word-by-word`	One word at a time, spring pop-in animation
`pop`	Segment-at-a-time with scale entry
`karaoke`	Words highlight progressively as they're spoken
`subtitle`	Static line at bottom, segments replace sequentially

Caption data is stored in project.json as a track with type: "caption" and inline segments with word timestamps:

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}
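
Because the track is data rather than pixels, a renderer only has to look up what is active at a given time. The sketch below shows one way a word-by-word style might find the current word in a track shaped like the example above; `active_word` is a hypothetical helper, not the render engine's code.

```python
from typing import Dict, Optional


def active_word(track: Dict, t: float) -> Optional[str]:
    """Return the word being spoken at time t, or None if nothing is active."""
    for segment in track["segments"]:
        if segment["start"] <= t < segment["end"]:
            for word in segment["words"]:
                if word["start"] <= t < word["end"]:
                    return word["word"]
    return None


track = {
    "id": "captions",
    "type": "caption",
    "style": "word-by-word",
    "segments": [
        {
            "text": "Hello world",
            "start": 0.0,
            "end": 1.2,
            "words": [
                {"word": "Hello", "start": 0.0, "end": 0.5},
                {"word": "world", "start": 0.5, "end": 1.2},
            ],
        }
    ],
}
print(active_word(track, 0.2))  # Hello
print(active_word(track, 0.7))  # world
print(active_word(track, 1.5))  # None
```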

normalize

Loudness normalization — adjust audio levels to meet platform standards.

montaj step normalize --input clip.mp4                       # youtube = -14 LUFS
montaj step normalize --input clip.mp4 --target podcast      # -16 LUFS
montaj step normalize --input clip.mp4 --target broadcast    # -23 LUFS
montaj step normalize --input clip.mp4 --target custom --lufs -18

Param	Default	Description
`--target <name>`	`youtube`	Target preset: `youtube` (-14 LUFS), `podcast` (-16 LUFS), `broadcast` (-23 LUFS), `custom`
`--lufs <n>`	—	Custom LUFS value (only with `--target custom`)
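
The preset table reduces to a small lookup. The sketch below resolves a target to a LUFS value and formats it as an ffmpeg `loudnorm` filter string. That Montaj uses `loudnorm` internally is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Optional

# Preset values taken from the table above.
PRESETS = {"youtube": -14.0, "podcast": -16.0, "broadcast": -23.0}


def resolve_lufs(target: str = "youtube", lufs: Optional[float] = None) -> float:
    """Resolve a target preset (or a custom value) to a LUFS number."""
    if target == "custom":
        if lufs is None:
            raise ValueError("--target custom requires --lufs")
        return float(lufs)
    if lufs is not None:
        raise ValueError("--lufs is only valid with --target custom")
    return PRESETS[target]


def loudnorm_filter(target: str = "youtube", lufs: Optional[float] = None) -> str:
    """Format the integrated-loudness target as an ffmpeg loudnorm filter."""
    return f"loudnorm=I={resolve_lufs(target, lufs):g}"


print(loudnorm_filter())               # loudnorm=I=-14
print(loudnorm_filter("custom", -18))  # loudnorm=I=-18
```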

Lyrics Steps

Steps for syncing lyrics to audio and rendering lyrics videos.

lyrics_sync

Sync lyrics text to an audio file using Whisper — produces word-level aligned captions.

montaj step lyrics_sync --lyrics song.txt --input song.mp3
# → /path/to/captions.json

montaj step lyrics_sync --lyrics song.txt --input song.mp3 --model medium.en
# Higher accuracy

Param	Default	Description
`--lyrics <txt>`	required	Plain text lyrics file
`--model <name>`	`base.en`	Whisper model
`--start <t>`	—	Override start time
`--end <t>`	—	Override end time

lyrics_render

Render synced captions onto video via ffmpeg drawtext filters. Produces a complete lyrics video.

montaj step lyrics_render --captions captions.json --audio song.mp3
# Render with solid color background

montaj step lyrics_render --captions captions.json --audio song.mp3 --input background.mp4
# Render with background video

Param	Default	Description
`--captions <json>`	required	Word-synced captions from `lyrics_sync`
`--audio <file>`	required	Audio file
`--input <video>`	—	Optional background video
`--position <pos>`	`center`	Text position
`--color <hex>`	`white`	Text color
`--fontsize <n>`	auto	Font size
`--preview-duration <s>`	—	Render only first N seconds (for quick preview)

VFX Steps

remove_bg

Remove video background using RVM (Robust Video Matting). Outputs a ProRes 4444 .mov with alpha channel for final render and a VP9 WebM for browser preview.

montaj step remove_bg --input clip.mp4

montaj step remove_bg --inputs clip0.mp4 clip1.mp4
# Multiple clips

montaj step remove_bg --input clip.mp4 --model rvm_resnet50
# Higher quality model

montaj step remove_bg --input clip.mp4 --downsample 0.5
# Downsample for faster processing

Param	Default	Description
`--inputs <files>`	—	Multiple clips to process
`--model <name>`	`rvm_mobilenetv3`	Model: `rvm_mobilenetv3` (faster) or `rvm_resnet50` (higher quality)
`--downsample <factor>`	—	Downsample factor for faster processing
`--progress`	—	Show progress (recommended for long-running operations)
`--workers <n>`	`2`	Max concurrent encodes

Requirements

Requires montaj install rvm — installs torch, torchvision, av (pip) + RVM model weights.

Output

The step produces two files and updates the project item:

nobg_src — ProRes 4444 .mov with alpha channel (for final render)
nobg_preview_src — VP9 WebM (for browser preview — Chrome supports VP9 alpha; ProRes does not play in browsers)
Sets remove_bg: true on the item

Usage in Workflows

Used in the floating_head workflow:

materialize_cut — encode the trim spec to an actual video file
remove_bg — remove background from the materialized file
Render engine composites the alpha-channel .mov over the layers beneath

Important: remove_bg requires an actual video file — pass the output of materialize_cut, not a trim spec. This step is long-running (minutes per clip) — always run in the background with --progress so you can monitor status.

Render Behavior

At render time, when a tracks[1+] item has remove_bg: true and nobg_src is set, the render engine uses the ProRes 4444 .mov (with alpha) in place of the original src. The alpha channel is preserved through the ffmpeg filter graph and composited over the layers beneath.
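
The selection rule described above can be sketched as a single conditional. This assumes each item is a dictionary with `src`, `remove_bg`, and `nobg_src` keys, mirroring the field names in this section; `render_source` is a hypothetical name.

```python
from typing import Dict


def render_source(track_index: int, item: Dict) -> str:
    """Pick the file the renderer should read for an item."""
    # Only overlay tracks (index 1 and above) swap in the alpha-channel file.
    if track_index >= 1 and item.get("remove_bg") and item.get("nobg_src"):
        return item["nobg_src"]
    return item["src"]


item = {"src": "clip.mp4", "remove_bg": True, "nobg_src": "clip_nobg.mov"}
print(render_source(1, item))  # clip_nobg.mov
print(render_source(0, item))  # clip.mp4
```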

Generation Steps

These steps call external APIs (Kling, Gemini, OpenAI) and require montaj install connectors and montaj install credentials.

kling_generate

Generate video via Kling — text-to-video, image-to-video, or reference-guided generation. Supports two models.

montaj kling-generate --prompt "a calico cat walking through a sunlit kitchen, cinematic" --out /tmp/cat.mp4

montaj kling-generate --prompt "slow zoom in" --first-frame frame.png --out /tmp/zoom.mp4

montaj kling-generate --prompt "character walks left" --first-frame start.png --last-frame end.png --out /tmp/walk.mp4

montaj kling-generate --prompt "same style" --ref-image style1.png --ref-image style2.png --out /tmp/styled.mp4

montaj kling-generate --prompt "..." --out /tmp/pro.mp4 --mode pro --duration 10 --aspect-ratio 9:16

montaj kling-generate --multi-shot --shot-type customize --multi-prompt '<json>' --out /tmp/batch.mp4

Param	Default	Description
`--prompt <text>`	required	Generation prompt
`--out <path>`	required	Output file path
`--first-frame <img>`	—	Starting image for image-to-video
`--last-frame <img>`	—	Ending image (requires `--first-frame`)
`--ref-image <img>`	—	Reference image (repeatable, max 7)
`--duration <3-15>`	—	Video duration in seconds
`--negative-prompt <text>`	—	What to avoid
`--sound <on\|off>`	—	Enable/disable sound
`--aspect-ratio <ratio>`	—	`16:9`, `9:16`, `1:1`
`--mode <std\|pro>`	`std`	Standard (cheaper/faster) or Pro (higher quality)
`--model <name>`	auto	`kling-v3-omni` (3-15s, audio) or `kling-video-o1` (5/10s, best quality)
`--multi-shot`	—	Enable multi-shot batch mode (up to 6 scenes)
`--shot-type <type>`	—	`customize` or `intelligence` (multi-shot only)
`--multi-prompt <json>`	—	Per-shot prompts as JSON array (multi-shot only)

Models

Model	Duration	Audio	Multi-shot	Notes
`kling-v3-omni`	3-15s	Yes (`sound: "on"`)	Yes	Flexible durations, audio generation
`kling-video-o1`	5s or 10s only	No	No	Highest visual quality

The step auto-upgrades to kling-video-o1 when the duration is 5 or 10 seconds and sound is off.
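
The auto-upgrade rule can be written out as follows. This is a sketch under assumptions: that an explicit `--model` always wins, and that `kling-v3-omni` is the fallback when the rule does not apply. Neither is confirmed here, and `pick_model` is a hypothetical name.

```python
from typing import Optional


def pick_model(duration: int, sound: str, requested: Optional[str] = None) -> str:
    """Choose a Kling model from the duration and sound setting."""
    if requested is not None:
        return requested  # assumption: an explicit choice is never overridden
    if duration in (5, 10) and sound == "off":
        return "kling-video-o1"
    return "kling-v3-omni"  # assumed fallback for other durations or sound on


print(pick_model(5, "off"))   # kling-video-o1
print(pick_model(10, "on"))   # kling-v3-omni
print(pick_model(7, "off"))   # kling-v3-omni
```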

analyze_media

Analyze a media file (video, audio, or image) with Gemini Flash. Supports description, timestamps, and structured output.

montaj analyze-media clip.mp4 --prompt "Describe the scene in 2 sentences."

montaj analyze-media song.mp3 --prompt "Transcribe with timestamps."

montaj analyze-media photo.jpg --prompt "Return JSON: {subject, mood, dominant_colors}" --json-output

montaj analyze-media clip.mp4 --prompt "..." --model gemini-2.5-pro
# Override model

montaj analyze-media clip.mp4 --prompt "..." --out analysis.txt
# Write to file

Param	Default	Description
`<input>`	required	Video, audio, or image file (positional)
`--prompt <text>`	required	Analysis prompt
`--model <id>`	`gemini-2.5-flash`	Model override
`--json-output`	—	Request structured JSON response from the model
`--out <path>`	—	Write output to file

Note: Images under approximately 18 MB take a fast inline path (no Files API round-trip).

generate_image

Generate an image via Gemini or OpenAI — text-to-image or reference-conditioned.

montaj generate-image --prompt "portrait, studio lighting" --out /tmp/portrait.png

montaj generate-image --prompt "same character, profile view" --ref-image /tmp/portrait.png --out /tmp/profile.png

montaj generate-image --prompt "red apple on white table" --provider openai --out /tmp/apple.png

montaj generate-image --prompt "..." --provider gemini --aspect-ratio 9:16 --out /tmp/tall.png

Param	Default	Description
`--prompt <text>`	required	Generation prompt
`--out <path>`	required	Output file path
`--provider <name>`	`gemini`	Provider: `gemini` or `openai`
`--ref-image <img>`	—	Reference image (repeatable)
`--size <WxH>`	—	Image dimensions
`--aspect-ratio <ratio>`	—	Aspect ratio (Gemini only)

generate_music

Generate music via Gemini Lyria 3 Clip — produces approximately 30 seconds of audio from a text prompt.

montaj generate-music --prompt "upbeat electronic, 120 bpm" --out /tmp/music.wav

montaj generate-music --prompt "acoustic guitar, mellow" --with-vocals --out /tmp/song.wav

Param	Default	Description
`--prompt <text>`	required	Music description
`--out <path>`	required	Output file path (.wav)
`--with-vocals`	false	Include vocals in generation
`--seed <n>`	—	Random seed for reproducibility

generate_voiceover

Generate speech audio via Kling TTS or Gemini TTS.

montaj generate-voiceover --text "Welcome to our farm" --out /tmp/vo.wav

montaj generate-voiceover --text-file script.txt --voice Kore --vendor gemini --out /tmp/vo.wav

montaj generate-voiceover --text "..." --vendor kling --out /tmp/vo.mp3

Param	Default	Description
`--text <str>`	—	Text to speak (mutually exclusive with `--text-file`)
`--text-file <path>`	—	Text file to read
`--voice <name>`	auto	Voice name (vendor-specific)
`--vendor <name>`	`kling`	TTS vendor: `kling` or `gemini`
`--speed <n>`	—	Speech speed multiplier
`--language <code>`	—	Language code
`--model <name>`	—	Override default model

fetch

Download videos from URLs via yt-dlp.

montaj fetch --url "https://www.tiktok.com/@handle/video/123"

montaj fetch --url "https://www.tiktok.com/@handle" --limit 15 --out ./clips/

Param	Default	Description
`--url <str>`	required	Video or profile URL
`--limit <n>`	—	Max number of videos to download
`--format <str>`	auto	yt-dlp format selector

Models Used

Provider	Use Case	Model
Kling	Video generation	`kling-v3-omni`, `kling-video-o1`
Kling	Text-to-speech	`kling-tts-v1`
Gemini	Media analysis	`gemini-2.5-flash`
Gemini	Image generation	`gemini-3-pro-image-preview`
Gemini	Text-to-speech	`gemini-2.5-flash-preview-tts`
Gemini	Music generation	`lyria-3-clip-preview`
OpenAI	Image generation	`gpt-image-1`
