Steps Reference

All built-in steps — inspect, smart cuts, edit, enrich, VFX, audio, lyrics, and generation.

Steps are the building blocks of Montaj's editing pipeline. Each step is an executable that follows the output convention (result on stdout, errors on stderr). Steps are composable at the shell level and discoverable via CLI, HTTP, and MCP.

Steps are organized into subdirectories by category: audio/, edit/, generate/, lyrics/, media/, speech/, transform/. The CLI resolves step names automatically — you never need to specify the subdirectory.


Inspect Steps

Inspect steps give the agent (or human) an understanding of the source material before editing begins.

probe

Extract metadata from a video file: duration, resolution, fps, codec, audio channels.

montaj step probe --input clip.mp4
# → JSON: duration, resolution, fps, codec, audio channels

The output is JSON on stdout — the agent uses this to understand what it's working with and make informed editing decisions.

snapshot

Generate a contact sheet frame grid — the agent's visual understanding of the clip.

montaj step snapshot --input clip.mp4
# → /path/to/snapshot.png (frame grid contact sheet)

montaj step snapshot --input clip.mp4 --cols 3 --rows 3
# Custom grid size

montaj step snapshot --input clip.mp4 --at 5.0
# Extract a single frame at a specific timestamp
Param         Default  Description
--cols <n>    auto     Number of columns in the grid
--rows <n>    auto     Number of rows in the grid
--frames <n>  auto     Total number of frames to extract
--at <t>      -        Extract a single frame at a specific timestamp

virtual_to_original

Map virtual-timeline timestamps to original-file timestamps (or the reverse). Useful for inspecting and debugging trim specs.

montaj step virtual_to_original --input spec.json 47.32
# → 95.483  (virtual timestamp → original-file timestamp)

montaj step virtual_to_original --input spec.json 47.32 53.23 66.89
# → one result per line

montaj step virtual_to_original --input spec.json --inverse 95.483
# → 47.320  (original-file timestamp → virtual timestamp)
Param                Description
--input <spec.json>  Trim spec file
--inverse            Reverse direction: original → virtual
Positional args      One or more timestamps to map
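The mapping itself can be illustrated with a short Python sketch. It assumes the trim spec's keeps are [start, end] ranges in original-file seconds and that the virtual timeline is those ranges played back to back. The function names are hypothetical, and this is not Montaj's own implementation.

```python
from typing import List, Tuple

Keep = Tuple[float, float]  # (start, end) in original-file seconds


def virtual_to_original(keeps: List[Keep], t: float) -> float:
    """Map a virtual-timeline timestamp to the original file."""
    if t < 0:
        raise ValueError("timestamp must be non-negative")
    elapsed = 0.0  # virtual time consumed by earlier keeps
    for start, end in keeps:
        length = end - start
        # A timestamp exactly on a boundary maps to the end of the earlier keep.
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    raise ValueError(f"{t} is past the end of the virtual timeline ({elapsed})")


def original_to_virtual(keeps: List[Keep], t: float) -> float:
    """Inverse mapping; fails if t falls inside a removed range."""
    elapsed = 0.0
    for start, end in keeps:
        if start <= t <= end:
            return elapsed + (t - start)
        elapsed += end - start
    raise ValueError(f"{t} is not inside any kept range")


keeps = [(10.0, 20.0), (30.0, 40.0)]
print(virtual_to_original(keeps, 12.0))  # 32.0
print(original_to_virtual(keeps, 32.0))  # 12.0
```

The inverse direction raises for timestamps that were cut, since those have no position on the virtual timeline.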

Smart Cut Steps

Smart cut steps analyze audio and produce trim specs — JSON files describing which ranges of the original source to keep. No video encoding happens at this stage.

waveform_trim

Waveform silence analysis — detects silent sections and outputs a trim spec. Near-instant, no encode.

montaj step waveform_trim --input clip.mp4
# → /path/to/clip_spec.json

montaj step waveform_trim --input clip.mp4 --threshold -30 --min-silence 0.3
# Custom sensitivity
Param              Default  Description
--threshold <dB>   auto     Volume threshold for silence detection (negative dB value)
--min-silence <s>  auto     Minimum silence duration to trigger a cut (seconds)

Output: a trim spec JSON with input (original file) and keeps (time ranges to keep).
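As a rough illustration of how a threshold and a minimum silence duration turn into a keeps list, here is a minimal Python sketch. It assumes the audio has already been reduced to one loudness value (in dB) per fixed-length frame. The function name and the frame representation are hypothetical, and Montaj's actual analysis may differ.

```python
from typing import List


def keeps_from_levels(levels_db: List[float], frame_s: float,
                      threshold_db: float, min_silence_s: float) -> List[List[float]]:
    """Return [start, end] ranges to keep, dropping silences >= min_silence_s."""
    n = len(levels_db)
    # Collect silent runs (as frame index ranges) that are long enough to cut.
    cuts = []
    i = 0
    while i < n:
        if levels_db[i] < threshold_db:
            j = i
            while j < n and levels_db[j] < threshold_db:
                j += 1
            if (j - i) * frame_s >= min_silence_s:
                cuts.append((i, j))
            i = j
        else:
            i += 1
    # The keeps are whatever lies between the cuts.
    keeps, cursor = [], 0
    for start, end in cuts:
        if start > cursor:
            keeps.append([cursor * frame_s, start * frame_s])
        cursor = end
    if cursor < n:
        keeps.append([cursor * frame_s, n * frame_s])
    return keeps


levels = [-10, -10, -50, -50, -50, -10, -50, -10]  # one value per 0.5 s frame
spec = {"input": "clip.mp4", "keeps": keeps_from_levels(levels, 0.5, -30, 1.0)}
print(spec["keeps"])  # [[0.0, 1.0], [2.5, 4.0]]
```

The single quiet frame near the end is shorter than the minimum silence, so it stays inside the second keep.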

rm_fillers

Remove filler words (um, uh, hmm) from speech. Accepts a video file or a trim spec as input and outputs a trim spec.

montaj step rm_fillers --input clip.mp4
# → /path/to/clip_spec.json

montaj step rm_fillers --input clip.mp4 --model medium.en
# Higher accuracy, slower
Param           Default  Description
--model <name>  base.en  Whisper model: tiny.en, base.en, medium.en, large

Important: rm_fillers can accept either a video file or a trim spec as input. When given a trim spec, it refines the existing keeps list.
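Refining an existing keeps list amounts to subtracting the filler-word time ranges from it. The sketch below shows one way to do that in Python, assuming word timestamps shaped like the transcript output (word, start, end) in original-file seconds. The helper names and the filler list are illustrative, not Montaj's code.

```python
FILLERS = {"um", "uh", "hmm"}


def subtract_ranges(keeps, cuts):
    """Remove every cut range from the keep ranges."""
    result = []
    for k_start, k_end in keeps:
        cursor = k_start
        for c_start, c_end in sorted(cuts):
            if c_end <= cursor or c_start >= k_end:
                continue  # cut does not overlap what is left of this keep
            if c_start > cursor:
                result.append([cursor, c_start])
            cursor = max(cursor, c_end)
        if cursor < k_end:
            result.append([cursor, k_end])
    return result


def remove_fillers(spec, words):
    """Return a new trim spec with filler words cut out of the keeps."""
    cuts = [[w["start"], w["end"]] for w in words
            if w["word"].strip(" ,.").lower() in FILLERS]
    return {"input": spec["input"], "keeps": subtract_ranges(spec["keeps"], cuts)}


spec = {"input": "clip.mp4", "keeps": [[0.0, 4.0]]}
words = [
    {"word": "So", "start": 0.0, "end": 0.5},
    {"word": "um,", "start": 0.5, "end": 1.0},
    {"word": "hello", "start": 1.0, "end": 2.0},
]
print(remove_fillers(spec, words)["keeps"])  # [[0.0, 0.5], [1.0, 4.0]]
```

The original input path is carried over unchanged, which is what lets later steps keep refining the same spec.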

rm_nonspeech

Remove all non-speech audio (noisy ambient audio). Takes a trim spec as input and outputs a refined trim spec.

montaj step rm_nonspeech --input spec.json

montaj step rm_nonspeech --input spec.json --model base --max-word-gap 0.18 --sentence-edge 0.10
Param                Default  Description
--model <name>       base     Whisper model: base, small, medium
--max-word-gap <s>   0.18     Maximum gap between words before splitting
--sentence-edge <s>  0.10     Padding at sentence boundaries

Important: Input should be a trim spec, not a raw video file.

crop_spec

Crop a trim spec to virtual-timeline windows. Outputs a refined trim spec — no encode.

montaj step crop_spec --input spec.json --keep 8.5:14.8
# Crop to a single window

montaj step crop_spec --input spec.json --keep 0:2.4 --keep 13.84:18.33
# Multiple windows — keeps are concatenated in order

montaj step crop_spec --input spec.json --keep 40.28:end
# Open-ended — keep from virtual 40.28s to end of clip
Param               Description
--keep <start:end>  Virtual-timeline window to keep (repeatable). Use end as the end value (e.g., 40.28:end) for an open-ended window.
--out <path>        Output path (default: <input>_cropped.json)
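To make the virtual-timeline cropping concrete, here is a hedged Python sketch. It assumes keeps are [start, end] ranges in original-file seconds and that windows arrive as the same start:end strings the CLI takes. The function names are hypothetical and the real step may handle edge cases differently.

```python
def parse_window(text, total):
    """Parse 'start:end'; the literal 'end' means the end of the virtual timeline."""
    start, end = text.split(":")
    return float(start), total if end == "end" else float(end)


def crop_keeps(keeps, windows):
    """Crop keeps to virtual-timeline windows, concatenated in the order given."""
    total = sum(end - start for start, end in keeps)
    out = []
    for window in windows:
        w_start, w_end = parse_window(window, total)
        elapsed = 0.0  # virtual time covered by earlier keeps
        for start, end in keeps:
            length = end - start
            lo = max(w_start, elapsed)
            hi = min(w_end, elapsed + length)
            if lo < hi:  # the window overlaps this keep
                out.append([start + (lo - elapsed), start + (hi - elapsed)])
            elapsed += length
    return out


keeps = [[10.0, 20.0], [30.0, 40.0]]
print(crop_keeps(keeps, ["5:15"]))    # [[15.0, 20.0], [30.0, 35.0]]
print(crop_keeps(keeps, ["15:end"]))  # [[35.0, 40.0]]
```

A window that spans a removed gap comes back as two original-file ranges, which is why the output is still a plain keeps list.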

How Smart Cuts Chain

Smart cut steps form a pipeline where each step refines the trim spec:

waveform_trim(clip.MOV)
  → trim spec (silence removed)

rm_fillers(spec.json)
  → refined spec (fillers removed)

rm_nonspeech(spec.json)
  → refined spec (non-speech removed)

crop_spec(spec.json)
  → cropped spec (virtual-timeline windows)

The original source file path is preserved through the entire chain, and no video is encoded until materialize_cut runs.
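The chain can also be driven from a script. The Python sketch below rests on two assumptions that this page does not state outright: every step prints the path of its output spec on stdout, and a failing step exits with a non-zero status. The run_step and smart_cut names are hypothetical; only flags documented above are used.

```python
import subprocess
from typing import Callable, List


def run_step(argv: List[str]) -> str:
    """Run one step and return its stdout (assumed to be the output path)."""
    done = subprocess.run(argv, capture_output=True, text=True)
    if done.returncode != 0:  # assumption: failures exit non-zero
        raise RuntimeError(done.stderr.strip())
    return done.stdout.strip()


def smart_cut(clip: str, windows: List[str],
              run: Callable[[List[str]], str] = run_step) -> str:
    """Feed each step's output spec into the next step."""
    spec = run(["montaj", "step", "waveform_trim", "--input", clip])
    spec = run(["montaj", "step", "rm_fillers", "--input", spec])
    spec = run(["montaj", "step", "rm_nonspeech", "--input", spec])
    if windows:
        argv = ["montaj", "step", "crop_spec", "--input", spec]
        for window in windows:
            argv += ["--keep", window]
        spec = run(argv)
    return spec
```

Passing the runner as a parameter keeps the chaining logic testable without Montaj installed: a stand-in function can record the commands and return fake paths.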


Edit Steps

Core video editing operations. Most of these are ffmpeg wrappers with careful codec handling.

trim

Cut by in/out point — extract a segment from a video.

montaj step trim --input clip.mp4 --start 2.5 --end 8.3
# Extract from 2.5s to 8.3s

montaj step trim --input clip.mp4 --start 00:00:02 --end 00:01:30
# HH:MM:SS format also accepted
Param           Description
--start <t>     Start time (seconds or HH:MM:SS)
--end <t>       End time (seconds or HH:MM:SS)
--duration <t>  Duration instead of end time
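Both time formats reduce to a number of seconds. A minimal Python sketch of such a parser follows; the function name is hypothetical, and accepting MM:SS as well is an assumption that goes beyond what the table states.

```python
def parse_time(value: str) -> float:
    """Convert plain seconds or HH:MM:SS into seconds."""
    parts = value.strip().split(":")
    if len(parts) > 3:
        raise ValueError(f"unrecognized time format: {value!r}")
    seconds = 0.0
    for part in parts:
        # Each colon shifts the running total up by a factor of 60.
        seconds = seconds * 60 + float(part)
    return seconds


print(parse_time("2.5"))       # 2.5
print(parse_time("00:01:30"))  # 90.0
```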

cut

Remove one or more sections from a video and rejoin — the opposite of trim.

montaj step cut --input clip.mp4 --start 3.0 --end 7.5
# Remove a single section and rejoin

montaj step cut --input clip.mp4 --cuts '[[0,1.2],[5.3,7.8]]'
# Remove multiple sections in one ffmpeg pass

montaj step cut --input clip.mp4 --cuts '[[3.0,7.5]]' --spec
# Write a trim spec JSON instead of encoding — use with materialize_cut for deferred encode
Param                   Description
--start <t>, --end <t>  Remove a single section
--cuts <json>           Remove multiple sections: [[start, end], ...]
--spec                  Output a trim spec instead of encoding video

jump_cut

Remove time ranges from video — similar to cut, but oriented toward jump-cut editing with explicit keeps or cuts.

montaj step jump_cut --input clip.mp4 --cuts '[[2.0, 3.5], [7.0, 8.0]]'
# Remove specified ranges

montaj step jump_cut --input clip.mp4 --keeps '[[0, 2.0], [3.5, 7.0], [8.0, 12.0]]'
# Keep only specified ranges
Param           Description
--cuts <json>   Time ranges to remove: [[start, end], ...]
--keeps <json>  Time ranges to keep: [[start, end], ...]
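Cuts and keeps describe the same edit from opposite sides: for a clip of known duration, the keeps are the complement of the cuts. The Python sketch below shows the conversion with a hypothetical helper name. The 12-second duration is an assumption chosen so the result matches the --keeps example above.

```python
import json


def cuts_to_keeps(cuts, duration):
    """Return the ranges left over after removing the cuts from [0, duration]."""
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append([cursor, start])
        cursor = max(cursor, end)  # also merges overlapping cuts
    if cursor < duration:
        keeps.append([cursor, duration])
    return keeps


cuts = json.loads("[[2.0, 3.5], [7.0, 8.0]]")
print(cuts_to_keeps(cuts, 12.0))  # [[0.0, 2.0], [3.5, 7.0], [8.0, 12.0]]
```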

cross_cut

Interleave A-roll and B-roll segments — creates a cross-cut edit between two video sources.

montaj step cross_cut --input a-roll.mp4 --b-roll b-roll.mp4 --segment-duration 3
Param                   Default   Description
--b-roll <file>         required  B-roll video file
--segment-duration <s>  auto      Duration of each interleaved segment

montage

Create a rapid montage from multiple clips — cuts them into short segments and assembles.

montaj step montage --inputs clip1.mp4 clip2.mp4 clip3.mp4 --beat-duration 2
Param                Default   Description
--inputs <files>     required  Multiple source clips
--beat-duration <s>  auto      Duration of each montage beat
--offset <s>         0         Offset into each source clip

materialize_cut

Encode a trim spec or raw video segment to H.264. Used only when a subsequent step (e.g., remove_bg) requires an actual video file rather than a trim spec.

montaj step materialize_cut spec.json
# Encode a trim spec to video

montaj step materialize_cut clip.mp4 --inpoint 2.0 --outpoint 8.0
# Encode a raw segment

montaj step materialize_cut --inputs clip0.json clip1.json
# Multiple clips — caps at 2 concurrent encodes by default

Uses input-level seeking (-ss/-t before -i) so only the requested segment is decoded.

Param             Description
--inpoint <t>     Start time (for raw video input)
--outpoint <t>    End time (for raw video input)
--inputs <files>  Multiple trim specs or clips
--workers <n>     Max concurrent encodes (default: 2)
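Input-level seeking means the -ss and -t options appear before -i on the ffmpeg command line. The Python sketch below builds one such command per keep range. It is an illustration only: the function name, output naming, and the libx264 encoder choice are assumptions, and it does not show how Montaj joins the resulting segments.

```python
def encode_commands(spec, out_prefix):
    """Build one ffmpeg argv per keep, seeking on the input side."""
    commands = []
    for i, (start, end) in enumerate(spec["keeps"]):
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{start:.3f}",       # seek before decoding starts
            "-t", f"{end - start:.3f}",  # limit how much input is read
            "-i", spec["input"],
            "-c:v", "libx264",           # H.264 encode
            f"{out_prefix}_{i:03d}.mp4",
        ])
    return commands


spec = {"input": "clip.mp4", "keeps": [[2.0, 8.0]]}
print(" ".join(encode_commands(spec, "out")[0]))
# ffmpeg -y -ss 2.000 -t 6.000 -i clip.mp4 -c:v libx264 out_000.mp4
```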

resize

Reframe video to a target aspect ratio.

montaj step resize --input clip.mp4 --ratio 9:16    # TikTok / Reels / Shorts
montaj step resize --input clip.mp4 --ratio 1:1     # Instagram
montaj step resize --input clip.mp4 --ratio 16:9    # YouTube
Param            Description
--ratio <ratio>  Target aspect ratio: 9:16, 1:1, 16:9

extract_audio

Extract the audio track from a video file.

montaj step extract_audio --input clip.mp4              # default: wav
montaj step extract_audio --input clip.mp4 --format mp3
Param           Default  Description
--format <fmt>  wav      Output format: wav, mp3, aac

Audio Steps

Audio processing steps for stem separation and waveform analysis.

stem_separation

Separate audio into stems (vocals, drums, bass, other) using Demucs.

montaj step stem_separation --input song.mp3 --stems vocals
# Extract vocals only

montaj step stem_separation --input song.mp3 --out-dir ./stems/
# All stems to a directory
Param             Default   Description
--stems <name>    all       Which stem to extract: vocals, drums, bass, other
--out-dir <path>  auto      Output directory for stems
--model <name>    htdemucs  Demucs model: htdemucs, htdemucs_ft, mdx_extra

Requires montaj install demucs.

waveform_image

Generate a PNG waveform visualization grid from an audio or video file.

montaj step waveform_image --input clip.mp4
# → /path/to/waveform.png

montaj step waveform_image --input clip.mp4 --chunk-duration 30
# One row per 30s chunk
Param                 Default  Description
--chunk-duration <s>  auto     Duration per waveform row in the grid

Enrich Steps

Enrich steps add information (transcripts, captions) or polish (loudness normalization) to clips.

transcribe

Generate SRT and JSON transcripts with word-level timestamps using local whisper.cpp.

montaj step transcribe --input clip.mp4
# → transcript.json + clip.srt

montaj step transcribe --input clip.mp4 --model medium.en
# Higher accuracy, slower

montaj step transcribe --input clip.mp4 --language es
# Non-English
Param              Default  Description
--model <name>     base.en  Whisper model: base.en, medium.en
--language <code>  en       Language code for non-English

When given a trim spec as input, transcribe extracts audio only at the keep ranges, runs whisper on the joined audio, and remaps word timestamps back to the original timeline.

Output includes:

  • JSON transcript with word-level timestamps (start, end, word)
  • SRT file for standard subtitle format

caption

Convert a transcript into an animated caption track. Produces data (not pixels) — rendered at review/final render time by the UI and render engine.

montaj step caption --input transcript.json
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style subtitle
Param           Default       Description
--style <name>  word-by-word  Caption style: word-by-word, pop, karaoke, subtitle

Caption Styles

Style         Description
word-by-word  One word at a time, spring pop-in animation
pop           Segment-at-a-time with scale entry
karaoke       Words highlight progressively as they're spoken
subtitle      Static line at bottom, segments replace sequentially

Caption data is stored in project.json as a track with type: "caption" and inline segments with word timestamps:

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}
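A track of this shape can be assembled from word-level timestamps by grouping words into segments. The Python sketch below shows one possible grouping rule (split on long pauses or after a fixed number of words). The rule, its thresholds, and the function names are assumptions for illustration, not the caption step's actual logic.

```python
def close_segment(words):
    """Build one segment dict from a run of words."""
    return {
        "text": " ".join(w["word"] for w in words),
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "words": list(words),
    }


def words_to_segments(words, max_gap=0.6, max_words=6):
    """Start a new segment after a long pause or when the segment is full."""
    segments, current = [], []
    for word in words:
        gap = word["start"] - current[-1]["end"] if current else 0.0
        if current and (gap > max_gap or len(current) >= max_words):
            segments.append(close_segment(current))
            current = []
        current.append(word)
    if current:
        segments.append(close_segment(current))
    return segments


def caption_track(words, style="word-by-word"):
    """Wrap the segments in the track structure shown above."""
    return {"id": "captions", "type": "caption", "style": style,
            "segments": words_to_segments(words)}


words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 1.2},
    {"word": "Bye", "start": 3.0, "end": 3.4},
]
track = caption_track(words)
print([s["text"] for s in track["segments"]])  # ['Hello world', 'Bye']
```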

normalize

Loudness normalization — adjust audio levels to meet platform standards.

montaj step normalize --input clip.mp4                       # youtube = -14 LUFS
montaj step normalize --input clip.mp4 --target podcast      # -16 LUFS
montaj step normalize --input clip.mp4 --target broadcast    # -23 LUFS
montaj step normalize --input clip.mp4 --target custom --lufs -18
Param            Default  Description
--target <name>  youtube  Target preset: youtube (-14 LUFS), podcast (-16 LUFS), broadcast (-23 LUFS), custom
--lufs <n>       -        Custom LUFS value (only with --target custom)
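The preset logic is a small lookup with one special case for custom. The Python sketch below resolves a target to a LUFS value and formats it as an ffmpeg loudnorm filter string. Whether Montaj uses loudnorm internally is an assumption; the preset values come from the table above, and the function names are hypothetical.

```python
PRESETS = {"youtube": -14.0, "podcast": -16.0, "broadcast": -23.0}


def resolve_lufs(target="youtube", lufs=None):
    """Return the integrated loudness target for a preset or a custom value."""
    if target == "custom":
        if lufs is None:
            raise ValueError("--target custom requires --lufs")
        return float(lufs)
    if lufs is not None:
        raise ValueError("--lufs is only valid with --target custom")
    return PRESETS[target]


def loudnorm_filter(target="youtube", lufs=None):
    """Format the target as an ffmpeg loudnorm filter (I = integrated loudness)."""
    return f"loudnorm=I={resolve_lufs(target, lufs):g}"


print(loudnorm_filter())              # loudnorm=I=-14
print(loudnorm_filter("custom", -18)) # loudnorm=I=-18
```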

Lyrics Steps

Steps for syncing lyrics to audio and rendering lyrics videos.

lyrics_sync

Sync lyrics text to an audio file using Whisper — produces word-level aligned captions.

montaj step lyrics_sync --lyrics song.txt --input song.mp3
# → /path/to/captions.json

montaj step lyrics_sync --lyrics song.txt --input song.mp3 --model medium.en
# Higher accuracy
Param           Default   Description
--lyrics <txt>  required  Plain text lyrics file
--model <name>  base.en   Whisper model
--start <t>     -         Override start time
--end <t>       -         Override end time

lyrics_render

Render synced captions onto video via ffmpeg drawtext filters. Produces a complete lyrics video.

montaj step lyrics_render --captions captions.json --audio song.mp3
# Render with solid color background

montaj step lyrics_render --captions captions.json --audio song.mp3 --input background.mp4
# Render with background video
Param                   Default   Description
--captions <json>       required  Word-synced captions from lyrics_sync
--audio <file>          required  Audio file
--input <video>         -         Optional background video
--position <pos>        center    Text position
--color <hex>           white     Text color
--fontsize <n>          auto      Font size
--preview-duration <s>  -         Render only first N seconds (for quick preview)

VFX Steps

remove_bg

Remove video background using RVM (Robust Video Matting). Outputs a ProRes 4444 .mov with alpha channel for final render and a VP9 WebM for browser preview.

montaj step remove_bg --input clip.mp4

montaj step remove_bg --inputs clip0.mp4 clip1.mp4
# Multiple clips

montaj step remove_bg --input clip.mp4 --model rvm_resnet50
# Higher quality model

montaj step remove_bg --input clip.mp4 --downsample 0.5
# Downsample for faster processing
Param                  Default          Description
--inputs <files>       -                Multiple clips to process
--model <name>         rvm_mobilenetv3  Model: rvm_mobilenetv3 (faster) or rvm_resnet50 (higher quality)
--downsample <factor>  -                Downsample factor for faster processing
--progress             -                Show progress (recommended for long-running operations)
--workers <n>          2                Max concurrent encodes

Requirements

Requires montaj install rvm — installs torch, torchvision, av (pip) + RVM model weights.

Output

The step produces two files and updates the project item:

  • nobg_src — ProRes 4444 .mov with alpha channel (for final render)
  • nobg_preview_src — VP9 WebM (for browser preview — Chrome supports VP9 alpha; ProRes does not play in browsers)
  • Sets remove_bg: true on the item

Usage in Workflows

Used in the floating_head workflow:

  1. materialize_cut — encode the trim spec to an actual video file
  2. remove_bg — remove background from the materialized file
  3. Render engine composites the alpha-channel .mov over the layers beneath

Important: remove_bg requires an actual video file — pass the output of materialize_cut, not a trim spec. This step is long-running (minutes per clip) — always run in the background with --progress so you can monitor status.

Render Behavior

At render time, when an item on any track at index 1 or higher (tracks[1+]) has remove_bg: true and nobg_src is set, the render engine uses the ProRes 4444 .mov (with alpha) in place of the original src. The alpha channel is preserved through the ffmpeg filter graph and composited over the layers beneath.
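The source selection rule can be written as a small predicate. The Python sketch below assumes items are plain dicts carrying the src, remove_bg, and nobg_src fields named above. The function name is hypothetical and the render engine's real code is not shown here.

```python
def render_source(item: dict, track_index: int) -> str:
    """Pick the file the renderer should read for this item."""
    if track_index >= 1 and item.get("remove_bg") and item.get("nobg_src"):
        return item["nobg_src"]  # alpha-channel ProRes 4444 .mov
    return item["src"]           # original source


item = {"src": "clip.mp4", "remove_bg": True, "nobg_src": "clip_nobg.mov"}
print(render_source(item, 1))  # clip_nobg.mov
print(render_source(item, 0))  # clip.mp4
```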


Generation Steps

These steps call external APIs (Kling, Gemini, OpenAI) and require montaj install connectors and montaj install credentials.

kling_generate

Generate video via Kling — text-to-video, image-to-video, or reference-guided generation. Supports two models.

montaj kling-generate --prompt "a calico cat walking through a sunlit kitchen, cinematic" --out /tmp/cat.mp4

montaj kling-generate --prompt "slow zoom in" --first-frame frame.png --out /tmp/zoom.mp4

montaj kling-generate --prompt "character walks left" --first-frame start.png --last-frame end.png --out /tmp/walk.mp4

montaj kling-generate --prompt "same style" --ref-image style1.png --ref-image style2.png --out /tmp/styled.mp4

montaj kling-generate --prompt "..." --out /tmp/pro.mp4 --mode pro --duration 10 --aspect-ratio 9:16

montaj kling-generate --multi-shot --shot-type customize --multi-prompt '<json>' --out /tmp/batch.mp4
Param                     Default   Description
--prompt <text>           required  Generation prompt
--out <path>              required  Output file path
--first-frame <img>       -         Starting image for image-to-video
--last-frame <img>        -         Ending image (requires --first-frame)
--ref-image <img>         -         Reference image (repeatable, max 7)
--duration <3-15>         -         Video duration in seconds
--negative-prompt <text>  -         What to avoid
--sound <on|off>          -         Enable/disable sound
--aspect-ratio <ratio>    -         16:9, 9:16, 1:1
--mode <std|pro>          std       Standard (cheaper/faster) or Pro (higher quality)
--model <name>            auto      kling-v3-omni (3-15s, audio) or kling-video-o1 (5/10s, best quality)
--multi-shot              -         Enable multi-shot batch mode (up to 6 scenes)
--shot-type <type>        -         customize or intelligence (multi-shot only)
--multi-prompt <json>     -         Per-shot prompts as JSON array (multi-shot only)

Models

Model           Duration        Audio              Multi-shot  Notes
kling-v3-omni   3-15s           Yes (sound: "on")  Yes         Flexible durations, audio generation
kling-video-o1  5s or 10s only  No                 No          Highest visual quality

The step auto-upgrades to kling-video-o1 when --duration is 5 or 10 and --sound is off.
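The auto-selection rule can be sketched in a few lines of Python. The duration and sound conditions come from the sentence above. Excluding multi-shot requests from the upgrade is an inference from the Models table rather than a documented rule, and the function name is hypothetical.

```python
def pick_model(duration: int, sound: str, multi_shot: bool = False) -> str:
    """Choose a Kling model when --model is left on auto."""
    if not 3 <= duration <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    # kling-video-o1 only covers silent 5 s or 10 s single-shot clips.
    if duration in (5, 10) and sound == "off" and not multi_shot:
        return "kling-video-o1"
    return "kling-v3-omni"


print(pick_model(5, "off"))  # kling-video-o1
print(pick_model(5, "on"))   # kling-v3-omni
print(pick_model(7, "off"))  # kling-v3-omni
```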

analyze_media

Analyze a media file (video, audio, or image) with Gemini Flash. Supports description, timestamps, and structured output.

montaj analyze-media clip.mp4 --prompt "Describe the scene in 2 sentences."

montaj analyze-media song.mp3 --prompt "Transcribe with timestamps."

montaj analyze-media photo.jpg --prompt "Return JSON: {subject, mood, dominant_colors}" --json-output

montaj analyze-media clip.mp4 --prompt "..." --model gemini-2.5-pro
# Override model

montaj analyze-media clip.mp4 --prompt "..." --out analysis.txt
# Write to file
Param            Default           Description
<input>          required          Video, audio, or image file (positional)
--prompt <text>  required          Analysis prompt
--model <id>     gemini-2.5-flash  Model override
--json-output    -                 Request structured JSON response from the model
--out <path>     -                 Write output to file

Note: Images under approximately 18 MB take a fast inline path (no Files API round-trip).

generate_image

Generate an image via Gemini or OpenAI — text-to-image or reference-conditioned.

montaj generate-image --prompt "portrait, studio lighting" --out /tmp/portrait.png

montaj generate-image --prompt "same character, profile view" --ref-image /tmp/portrait.png --out /tmp/profile.png

montaj generate-image --prompt "red apple on white table" --provider openai --out /tmp/apple.png

montaj generate-image --prompt "..." --provider gemini --aspect-ratio 9:16 --out /tmp/tall.png
Param                   Default   Description
--prompt <text>         required  Generation prompt
--out <path>            required  Output file path
--provider <name>       gemini    Provider: gemini or openai
--ref-image <img>       -         Reference image (repeatable)
--size <WxH>            -         Image dimensions
--aspect-ratio <ratio>  -         Aspect ratio (Gemini only)

generate_music

Generate music via Gemini Lyria 3 Clip — produces approximately 30 seconds of audio from a text prompt.

montaj generate-music --prompt "upbeat electronic, 120 bpm" --out /tmp/music.wav

montaj generate-music --prompt "acoustic guitar, mellow" --with-vocals --out /tmp/song.wav
Param            Default   Description
--prompt <text>  required  Music description
--out <path>     required  Output file path (.wav)
--with-vocals    false     Include vocals in generation
--seed <n>       -         Random seed for reproducibility

generate_voiceover

Generate speech audio via Kling TTS or Gemini TTS.

montaj generate-voiceover --text "Welcome to our farm" --out /tmp/vo.wav

montaj generate-voiceover --text-file script.txt --voice Kore --vendor gemini --out /tmp/vo.wav

montaj generate-voiceover --text "..." --vendor kling --out /tmp/vo.mp3
Param               Default  Description
--text <str>        -        Text to speak (mutually exclusive with --text-file)
--text-file <path>  -        Text file to read
--voice <name>      auto     Voice name (vendor-specific)
--vendor <name>     kling    TTS vendor: kling or gemini
--speed <n>         -        Speech speed multiplier
--language <code>   -        Language code
--model <name>      -        Override default model

fetch

Download videos from URLs via yt-dlp.

montaj fetch --url "https://www.tiktok.com/@handle/video/123"

montaj fetch --url "https://www.tiktok.com/@handle" --limit 15 --out ./clips/
Param           Default   Description
--url <str>     required  Video or profile URL
--limit <n>     -         Max number of videos to download
--format <str>  auto      yt-dlp format selector

Models Used

Provider  Use Case          Model
Kling     Video generation  kling-v3-omni, kling-video-o1
Kling     Text-to-speech    kling-tts-v1
Gemini    Media analysis    gemini-2.5-flash
Gemini    Image generation  gemini-3-pro-image-preview
Gemini    Text-to-speech    gemini-2.5-flash-preview-tts
Gemini    Music generation  lyria-3-clip-preview
OpenAI    Image generation  gpt-image-1