
Steps Reference

All built-in steps — inspect, smart cuts, edit, enrich, VFX, and generation.

Steps are the building blocks of Montaj's editing pipeline. Each step is an executable that follows the output convention (result on stdout, errors on stderr). Steps are composable at the shell level and discoverable via CLI, HTTP, and MCP.


Inspect Steps

Inspect steps give the agent (or human) an understanding of the source material before editing begins.

probe

Extract metadata from a video file: duration, resolution, fps, codec, audio channels.

montaj step probe --input clip.mp4
# → JSON: duration, resolution, fps, codec, audio channels

The output is JSON on stdout — the agent uses this to understand what it's working with and make informed editing decisions.
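For example, an agent might branch on that JSON before choosing steps. A minimal Python sketch; the field names (duration, resolution, fps, codec, audio_channels) follow the summary above and are assumptions, not a documented schema:

```python
import json

# Hypothetical consumer of probe output. Field names are assumptions
# based on the summary above, not a documented schema.
def plan_from_probe(probe_json: str) -> dict:
    meta = json.loads(probe_json)
    return {
        "needs_resize": meta["resolution"] != "1080x1920",  # e.g. targeting 9:16
        "long_form": meta["duration"] > 120,                # over two minutes
    }

sample = ('{"duration": 95.4, "resolution": "3840x2160", '
          '"fps": 30, "codec": "hevc", "audio_channels": 2}')
print(plan_from_probe(sample))
# → {'needs_resize': True, 'long_form': False}
```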

snapshot

Generate a contact sheet frame grid — the agent's visual understanding of the clip.

montaj step snapshot --input clip.mp4
# → /path/to/snapshot.png (frame grid contact sheet)

montaj step snapshot --input clip.mp4 --cols 3 --rows 3
# Custom grid size
| Param | Default | Description |
| --- | --- | --- |
| --cols <n> | auto | Number of columns in the grid |
| --rows <n> | auto | Number of rows in the grid |

virtual_to_original

Map virtual-timeline timestamps to original-file timestamps (or the reverse). Useful for inspecting and debugging trim specs.

montaj step virtual_to_original --input spec.json 47.32
# → 95.483  (virtual timestamp → original-file timestamp)

montaj step virtual_to_original --input spec.json 47.32 53.23 66.89
# → one result per line

montaj step virtual_to_original --input spec.json --inverse 95.483
# → 47.320  (original-file timestamp → virtual timestamp)
| Param | Description |
| --- | --- |
| --input <spec.json> | Trim spec file |
| --inverse | Reverse direction: original → virtual |
| Positional args | One or more timestamps to map |
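The mapping is simple arithmetic over the spec's keeps list: the original-file ranges are played back-to-back to form the virtual timeline. A sketch, assuming a spec shaped like {"input": ..., "keeps": [[start, end], ...]} as described in the smart cut section:

```python
def virtual_to_original(keeps, t):
    """Map a virtual-timeline timestamp to the original-file timestamp.

    keeps: [start, end] ranges in original-file time, in play order.
    """
    elapsed = 0.0
    for start, end in keeps:
        length = end - start
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    raise ValueError("timestamp beyond end of virtual timeline")

keeps = [[10.0, 20.0], [90.0, 100.0]]
print(virtual_to_original(keeps, 5.0))    # → 15.0 (inside the first keep)
print(virtual_to_original(keeps, 12.0))   # → 92.0 (10s of keep 1, then 2s into keep 2)
```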

Smart Cut Steps

Smart cut steps analyze audio and produce trim specs — JSON files describing which ranges of the original source to keep. No video encoding happens at this stage.

waveform_trim

Waveform silence analysis — detects silent sections and outputs a trim spec. Near-instant, no encode.

montaj step waveform_trim --input clip.mp4
# → /path/to/clip_spec.json

montaj step waveform_trim --input clip.mp4 --threshold -30 --min-silence 0.3
# Custom sensitivity
| Param | Default | Description |
| --- | --- | --- |
| --threshold <dB> | auto | Volume threshold for silence detection (negative dB value) |
| --min-silence <s> | auto | Minimum silence duration to trigger a cut (seconds) |

Output: a trim spec JSON with input (original file) and keeps (time ranges to keep).
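The threshold and min-silence parameters interact roughly like this sketch, which finds silent runs in a per-frame loudness envelope (illustrative only; the real analyzer works on the decoded waveform):

```python
def silent_ranges(levels_db, frame_s, threshold_db=-30.0, min_silence_s=0.3):
    """Find runs of frames below threshold_db lasting at least min_silence_s."""
    ranges, start = [], None
    for i, level in enumerate(levels_db):
        if level < threshold_db:
            if start is None:
                start = round(i * frame_s, 3)   # silence begins
        elif start is not None:
            end = round(i * frame_s, 3)         # silence ends
            if end - start >= min_silence_s:    # short pauses are kept
                ranges.append([start, end])
            start = None
    if start is not None:
        end = round(len(levels_db) * frame_s, 3)
        if end - start >= min_silence_s:
            ranges.append([start, end])
    return ranges

# 0.1s frames: 0.3s of speech, 0.4s of silence, 0.3s of speech
levels = [-20, -20, -20, -50, -50, -50, -50, -18, -18, -18]
print(silent_ranges(levels, 0.1))   # → [[0.3, 0.7]]
```

The complement of these silent ranges becomes the keeps list in the trim spec.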

rm_fillers

Remove filler words (um, uh, hmm) from speech. Accepts a video file or a trim spec as input and outputs a refined trim spec.

montaj step rm_fillers --input clip.mp4
# → /path/to/clip_spec.json

montaj step rm_fillers --input clip.mp4 --model medium.en
# Higher accuracy, slower
| Param | Default | Description |
| --- | --- | --- |
| --model <name> | base.en | Whisper model: tiny.en, base.en, medium.en, large |

Important: rm_fillers can accept either a video file or a trim spec as input. When given a trim spec, it refines the existing keeps list.
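Refining a keeps list amounts to subtracting the time intervals of detected filler words. A simplified sketch, assuming word timestamps in original-file time (the real step also handles padding and merging):

```python
FILLERS = {"um", "uh", "hmm"}

def remove_fillers(keeps, words):
    """Subtract filler-word intervals from a keeps list.

    words: (word, start, end) tuples in original-file time.
    """
    for word, w_start, w_end in words:
        if word.lower().strip(".,") not in FILLERS:
            continue
        refined = []
        for k_start, k_end in keeps:
            # keep whatever lies outside the filler interval
            if w_start > k_start:
                refined.append([k_start, min(w_start, k_end)])
            if w_end < k_end:
                refined.append([max(w_end, k_start), k_end])
        keeps = [k for k in refined if k[1] - k[0] > 0]
    return keeps

keeps = [[0.0, 10.0]]
words = [("Hello", 0.2, 0.6), ("um", 0.7, 1.1), ("so", 1.3, 1.5)]
print(remove_fillers(keeps, words))   # → [[0.0, 0.7], [1.1, 10.0]]
```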

rm_nonspeech

Remove all non-speech audio (noisy ambient audio). Takes a trim spec as input and outputs a refined trim spec.

montaj step rm_nonspeech --input spec.json

montaj step rm_nonspeech --input spec.json --model base --max-word-gap 0.18 --sentence-edge 0.10
| Param | Default | Description |
| --- | --- | --- |
| --model <name> | base | Whisper model: base, small, medium |
| --max-word-gap <s> | 0.18 | Maximum gap between words before splitting |
| --sentence-edge <s> | 0.10 | Padding at sentence boundaries |

Important: Input should be a trim spec, not a raw video file.
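The two parameters control how word timestamps are grouped into speech ranges: gaps wider than --max-word-gap split a range, and each range gets --sentence-edge of padding. A hypothetical sketch of that grouping:

```python
def speech_keeps(words, max_word_gap=0.18, sentence_edge=0.10):
    """Group word timestamps into padded speech ranges.

    words: (word, start, end) tuples in playback order.
    Illustrative only; not the real implementation.
    """
    if not words:
        return []
    ranges = []
    start, prev_end = words[0][1], words[0][2]
    for _, w_start, w_end in words[1:]:
        if w_start - prev_end > max_word_gap:   # gap too wide: split here
            ranges.append((start, prev_end))
            start = w_start
        prev_end = w_end
    ranges.append((start, prev_end))
    # pad each range at its sentence boundaries
    return [[max(0.0, round(s - sentence_edge, 3)), round(e + sentence_edge, 3)]
            for s, e in ranges]

words = [("take", 1.0, 1.3), ("two", 1.4, 1.6), ("action", 4.0, 4.5)]
print(speech_keeps(words))   # → [[0.9, 1.7], [3.9, 4.6]]
```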

crop_spec

Crop a trim spec to virtual-timeline windows. Outputs a refined trim spec — no encode.

montaj step crop_spec --input spec.json --keep 8.5:14.8
# Crop to a single window

montaj step crop_spec --input spec.json --keep 0:2.4 --keep 13.84:18.33
# Multiple windows — keeps are concatenated in order

montaj step crop_spec --input spec.json --keep 40.28:end
# Open-ended — keep from virtual 40.28s to end of clip
| Param | Description |
| --- | --- |
| --keep <start:end> | Virtual-timeline window to keep (repeatable). Use end sentinel for open-ended. |
| --out <path> | Output path (default: <input>_cropped.json) |
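Cropping intersects each virtual window with the keeps, then maps the overlap back to original-file time. A sketch over the spec shape described above:

```python
def crop_spec(keeps, windows):
    """Crop keeps (original-file ranges) to virtual-timeline windows."""
    # Build (virtual_start, orig_start, length) for each keep.
    spans, v = [], 0.0
    for start, end in keeps:
        spans.append((v, start, end - start))
        v += end - start
    out = []
    for v_start, v_end in windows:
        for v0, o0, length in spans:
            lo, hi = max(v_start, v0), min(v_end, v0 + length)
            if hi > lo:   # window overlaps this keep
                out.append([round(o0 + lo - v0, 3), round(o0 + hi - v0, 3)])
    return out

keeps = [[10.0, 20.0], [90.0, 100.0]]   # a 20s virtual timeline
print(crop_spec(keeps, [[8.5, 14.8]]))  # → [[18.5, 20.0], [90.0, 94.8]]
```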

How Smart Cuts Chain

Smart cut steps form a pipeline where each step refines the trim spec:

waveform_trim(clip.MOV)
  → trim spec (silence removed)

rm_fillers(spec.json)
  → refined spec (fillers removed)

rm_nonspeech(spec.json)
  → refined spec (non-speech removed)

crop_spec(spec.json)
  → cropped spec (virtual-timeline windows)

The original source file path is preserved through the entire chain; nothing is encoded until concat or materialize_cut runs.


Edit Steps

Core video editing operations. Most of these are ffmpeg wrappers with careful codec handling.

trim

Cut by in/out point — extract a segment from a video.

montaj step trim --input clip.mp4 --start 2.5 --end 8.3
# Extract from 2.5s to 8.3s

montaj step trim --input clip.mp4 --start 00:00:02 --end 00:01:30
# HH:MM:SS format also accepted
| Param | Description |
| --- | --- |
| --start <t> | Start time (seconds or HH:MM:SS) |
| --end <t> | End time (seconds or HH:MM:SS) |
| --duration <t> | Duration instead of end time |
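Both time formats reduce to seconds. A small sketch of an accepting parser (illustrative, not Montaj's actual parsing code):

```python
def parse_time(t: str) -> float:
    """Accept plain seconds ("2.5") or HH:MM:SS / MM:SS clock time."""
    parts = t.split(":")
    if len(parts) == 1:
        return float(t)
    seconds = 0.0
    for part in parts:          # fold colon-separated fields left to right
        seconds = seconds * 60 + float(part)
    return seconds

print(parse_time("2.5"))        # → 2.5
print(parse_time("00:01:30"))   # → 90.0
```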

cut

Remove one or more sections from a video and rejoin — the opposite of trim.

montaj step cut --input clip.mp4 --start 3.0 --end 7.5
# Remove a single section and rejoin

montaj step cut --input clip.mp4 --cuts '[[0,1.2],[5.3,7.8]]'
# Remove multiple sections in one ffmpeg pass

montaj step cut --input clip.mp4 --cuts '[[3.0,7.5]]' --spec
# Write a trim spec JSON instead of encoding — use with concat for deferred encode
| Param | Description |
| --- | --- |
| --start <t>, --end <t> | Remove a single section |
| --cuts <json> | Remove multiple sections: [[start, end], ...] |
| --spec | Output a trim spec instead of encoding video |
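Conceptually, cut inverts its sections into keep ranges before rejoining. A sketch assuming sorted, non-overlapping cuts:

```python
def cuts_to_keeps(cuts, duration):
    """Invert --cuts sections into the ranges that survive the cut."""
    keeps, cursor = [], 0.0
    for start, end in cuts:
        if start > cursor:              # material before this cut survives
            keeps.append([cursor, start])
        cursor = end
    if cursor < duration:               # tail after the last cut survives
        keeps.append([cursor, duration])
    return keeps

print(cuts_to_keeps([[0, 1.2], [5.3, 7.8]], 10.0))
# → [[1.2, 5.3], [7.8, 10.0]]
```

With --spec, these keep ranges are what gets written to the trim spec JSON.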

concat

Join clips and apply all trim specs in a single encode pass. This is the only step in the normal pipeline that writes video.

montaj step concat --inputs spec1.json spec2.json

concat reads trim specs (or raw video files), builds a single ffmpeg filter_complex, and produces one output file. All accumulated cuts from the editing pipeline are applied here.

HEVC source files are handled automatically — no pre-conversion needed.

materialize_cut

Encode a trim spec or raw video segment to H.264. Used only when a subsequent step (e.g., remove_bg) requires an actual video file rather than a trim spec.

montaj step materialize_cut spec.json
# Encode a trim spec to video

montaj step materialize_cut clip.mp4 --inpoint 2.0 --outpoint 8.0
# Encode a raw segment

montaj step materialize_cut --inputs clip0.json clip1.json
# Multiple clips — caps at 2 concurrent encodes by default

Uses input-level seeking (-ss/-t before -i) so only the requested segment is decoded.

| Param | Description |
| --- | --- |
| --inpoint <t> | Start time (for raw video input) |
| --outpoint <t> | End time (for raw video input) |
| --inputs <files> | Multiple trim specs or clips |
| --workers <n> | Max concurrent encodes (default: 2) |
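The seeking behavior comes down to ffmpeg argument order. A sketch of the command shape (the flags shown are standard ffmpeg; Montaj's actual encoder settings are not documented here):

```python
def materialize_args(src, inpoint, outpoint, out):
    """Build an ffmpeg command with input-level seeking.

    Placing -ss/-t before -i makes ffmpeg seek in the input, so only
    the requested segment is decoded.
    """
    return [
        "ffmpeg",
        "-ss", str(inpoint),            # seek before decode
        "-t", str(outpoint - inpoint),  # decode only this duration
        "-i", src,
        "-c:v", "libx264",              # H.264, per the step's description
        out,
    ]

args = materialize_args("clip.mp4", 2.0, 8.0, "seg.mp4")
assert args.index("-ss") < args.index("-i")   # seek flags precede the input
```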

resize

Reframe video to a target aspect ratio.

montaj step resize --input clip.mp4 --ratio 9:16    # TikTok / Reels / Shorts
montaj step resize --input clip.mp4 --ratio 1:1     # Instagram
montaj step resize --input clip.mp4 --ratio 16:9    # YouTube
| Param | Description |
| --- | --- |
| --ratio <ratio> | Target aspect ratio: 9:16, 1:1, 16:9 |
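One way to reframe is a centered crop to the target ratio. A sketch of that geometry (the actual step may use a smarter strategy, such as subject-aware reframing):

```python
def center_crop(width, height, ratio):
    """Compute a centered crop (x, y, w, h) for a target ratio like "9:16"."""
    rw, rh = (int(n) for n in ratio.split(":"))
    target = rw / rh
    if width / height > target:           # too wide: crop the sides
        new_w = int(height * target)
        return (width - new_w) // 2, 0, new_w, height
    new_h = int(width / target)           # too tall: crop top and bottom
    return 0, (height - new_h) // 2, width, new_h

print(center_crop(1920, 1080, "9:16"))   # → (656, 0, 607, 1080)
```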

extract_audio

Extract the audio track from a video file.

montaj step extract_audio --input clip.mp4              # default: wav
montaj step extract_audio --input clip.mp4 --format mp3
| Param | Default | Description |
| --- | --- | --- |
| --format <fmt> | wav | Output format: wav, mp3, aac |

Enrich Steps

Enrich steps add information (transcripts, captions) or polish (loudness normalization) to clips.

transcribe

Generate SRT and JSON transcripts with word-level timestamps using local whisper.cpp.

montaj step transcribe --input clip.mp4
# → transcript.json + clip.srt

montaj step transcribe --input clip.mp4 --model medium.en
# Higher accuracy, slower

montaj step transcribe --input clip.mp4 --language es
# Non-English
| Param | Default | Description |
| --- | --- | --- |
| --model <name> | base.en | Whisper model: base.en, medium.en |
| --language <code> | en | Language code for non-English |

When given a trim spec as input, transcribe extracts audio only at the keep ranges, runs whisper on the joined audio, and remaps word timestamps back to the original timeline.

Output includes:

  • JSON transcript with word-level timestamps (start, end, word)
  • SRT file for standard subtitle format
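The SRT side of the output is a plain-text format with millisecond timestamps. A sketch of rendering segments to SRT (illustrative, not Montaj's writer):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT "HH:MM:SS,mmm" timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render (text, start, end) segments as an SRT document."""
    blocks = [
        f"{i}\n{srt_timestamp(s)} --> {srt_timestamp(e)}\n{text}"
        for i, (text, s, e) in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks) + "\n"

print(to_srt([("Hello world", 0.0, 1.2)]))
# 1
# 00:00:00,000 --> 00:00:01,200
# Hello world
```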

caption

Convert a transcript into an animated caption track. Produces data (not pixels) — rendered at review/final render time by the UI and render engine.

montaj step caption --input transcript.json
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style subtitle
| Param | Default | Description |
| --- | --- | --- |
| --style <name> | word-by-word | Caption style: word-by-word, pop, karaoke, subtitle |

Caption Styles

| Style | Description |
| --- | --- |
| word-by-word | One word at a time, spring pop-in animation |
| pop | Segment-at-a-time with scale entry |
| karaoke | Words highlight progressively as they're spoken |
| subtitle | Static line at bottom, segments replace sequentially |

Caption data is stored in project.json as a track with type: "caption" and inline segments with word timestamps:

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}
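A renderer consuming this track can resolve the active word at any playback time from the inline word timestamps. A hypothetical sketch for the karaoke style, operating on the track shape shown above:

```python
def highlighted_word(track, t):
    """Return the word to highlight at playback time t, or None."""
    for seg in track["segments"]:
        if seg["start"] <= t < seg["end"]:
            for w in seg["words"]:
                if w["start"] <= t < w["end"]:
                    return w["word"]
    return None

track = {
    "segments": [{
        "text": "Hello world", "start": 0.0, "end": 1.2,
        "words": [
            {"word": "Hello", "start": 0.0, "end": 0.5},
            {"word": "world", "start": 0.5, "end": 1.2},
        ],
    }]
}
print(highlighted_word(track, 0.7))   # → world
```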

normalize

Loudness normalization — adjust audio levels to meet platform standards.

montaj step normalize --input clip.mp4                       # youtube = -14 LUFS
montaj step normalize --input clip.mp4 --target podcast      # -16 LUFS
montaj step normalize --input clip.mp4 --target broadcast    # -23 LUFS
montaj step normalize --input clip.mp4 --target custom --lufs -18
| Param | Default | Description |
| --- | --- | --- |
| --target <name> | youtube | Target preset: youtube (-14 LUFS), podcast (-16 LUFS), broadcast (-23 LUFS), custom |
| --lufs <n> | | Custom LUFS value (only with --target custom) |
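The underlying arithmetic: the required gain is the difference between target and measured loudness, in dB. A sketch (real loudness normalization, e.g. ffmpeg's loudnorm filter, also constrains true peak and loudness range):

```python
PRESETS = {"youtube": -14.0, "podcast": -16.0, "broadcast": -23.0}

def gain_for(measured_lufs, target="youtube", custom_lufs=None):
    """Gain needed to bring measured loudness to the target preset.

    Returns (gain in dB, linear amplitude factor).
    """
    target_lufs = custom_lufs if target == "custom" else PRESETS[target]
    gain_db = target_lufs - measured_lufs
    linear = 10 ** (gain_db / 20)   # dB to linear amplitude factor
    return gain_db, linear

gain_db, linear = gain_for(-20.0, "youtube")
print(gain_db)   # → 6.0
```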

VFX Steps

remove_bg

Remove video background using RVM (Robust Video Matting). Outputs a ProRes 4444 .mov with alpha channel for final render and a VP9 WebM for browser preview.

montaj step remove_bg --input clip.mp4

montaj step remove_bg --inputs clip0.mp4 clip1.mp4
# Multiple clips

montaj step remove_bg --input clip.mp4 --model rvm_resnet50
# Higher quality model

montaj step remove_bg --input clip.mp4 --downsample 0.5
# Downsample for faster processing
| Param | Default | Description |
| --- | --- | --- |
| --inputs <files> | | Multiple clips to process |
| --model <name> | rvm_mobilenetv3 | Model: rvm_mobilenetv3 (faster) or rvm_resnet50 (higher quality) |
| --downsample <factor> | | Downsample factor for faster processing |
| --progress | | Show progress (recommended for long-running operations) |
| --workers <n> | 2 | Max concurrent encodes |

Requirements

Requires montaj install rvm — installs torch, torchvision, av (pip) + RVM model weights.

Output

The step produces two files and updates the project item:

  • nobg_src — ProRes 4444 .mov with alpha channel (for final render)
  • nobg_preview_src — VP9 WebM (for browser preview — Chrome supports VP9 alpha; ProRes does not play in browsers)
  • Sets remove_bg: true on the item

Usage in Workflows

Used in the floating_head workflow:

  1. materialize_cut — encode the trim spec to an actual video file
  2. remove_bg — remove background from the materialized file
  3. Render engine composites the alpha-channel .mov over the layers beneath

Important: remove_bg requires an actual video file — pass the output of materialize_cut, not a trim spec. This step is long-running (minutes per clip) — always run in the background with --progress so you can monitor status.

Render Behavior

At render time, when a tracks[1+] item has remove_bg: true and nobg_src is set, the render engine uses the ProRes 4444 .mov (with alpha) in place of the original src. The alpha channel is preserved through the ffmpeg filter graph and composited over the layers beneath.
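Per pixel, that compositing is the standard "over" operation, weighted by the foreground's alpha. Shown here for a single RGB pixel:

```python
def composite(fg_rgb, alpha, bg_rgb):
    """Alpha-over compositing for one pixel: fg over bg, weighted by alpha."""
    return tuple(
        round(f * alpha + b * (1 - alpha))
        for f, b in zip(fg_rgb, bg_rgb)
    )

# Half-transparent white foreground over black background → mid gray
print(composite((255, 255, 255), 0.5, (0, 0, 0)))   # → (128, 128, 128)
```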


Generation Steps

These steps call external APIs (Kling, Gemini, OpenAI) and require montaj install connectors and montaj install credentials.

kling_generate

Generate video via Kling v3 Omni — text-to-video, image-to-video, or reference-guided generation.

montaj kling-generate --prompt "a calico cat walking through a sunlit kitchen, cinematic" --out /tmp/cat.mp4

montaj kling-generate --prompt "slow zoom in" --first-frame frame.png --out /tmp/zoom.mp4

montaj kling-generate --prompt "character walks left" --first-frame start.png --last-frame end.png --out /tmp/walk.mp4

montaj kling-generate --prompt "same style" --ref-image style1.png --ref-image style2.png --out /tmp/styled.mp4

montaj kling-generate --prompt "..." --out /tmp/pro.mp4 --mode pro --duration 10 --aspect-ratio 9:16
| Param | Default | Description |
| --- | --- | --- |
| --prompt <text> | required | Generation prompt |
| --out <path> | required | Output file path |
| --first-frame <img> | | Starting image for image-to-video |
| --last-frame <img> | | Ending image (requires --first-frame) |
| --ref-image <img> | | Style reference image (repeatable, max 3) |
| --duration <3-15> | | Video duration in seconds |
| --negative-prompt <text> | | What to avoid |
| --sound <on\|off> | | Enable/disable sound |
| --aspect-ratio <ratio> | | 16:9, 9:16, 1:1 |
| --mode <std\|pro> | std | Standard (cheaper/faster) or Pro (higher quality) |

analyze_media

Analyze a media file (video, audio, or image) with Gemini Flash. Supports description, timestamps, and structured output.

montaj analyze-media clip.mp4 --prompt "Describe the scene in 2 sentences."

montaj analyze-media song.mp3 --prompt "Transcribe with timestamps."

montaj analyze-media photo.jpg --prompt "Return JSON: {subject, mood, dominant_colors}" --json-output

montaj analyze-media clip.mp4 --prompt "..." --model gemini-2.5-pro
# Override model

montaj analyze-media clip.mp4 --prompt "..." --out analysis.txt
# Write to file
| Param | Default | Description |
| --- | --- | --- |
| <input> | required | Video, audio, or image file (positional) |
| --prompt <text> | required | Analysis prompt |
| --model <id> | gemini-2.5-flash | Model override |
| --json-output | | Request structured JSON response from the model |
| --out <path> | | Write output to file |

Note: Images under approximately 18 MB take a fast inline path (no Files API round-trip).

generate_image

Generate an image via Gemini or OpenAI — text-to-image or reference-conditioned.

montaj generate-image --prompt "portrait, studio lighting" --out /tmp/portrait.png

montaj generate-image --prompt "same character, profile view" --ref-image /tmp/portrait.png --out /tmp/profile.png

montaj generate-image --prompt "red apple on white table" --provider openai --out /tmp/apple.png

montaj generate-image --prompt "..." --provider gemini --aspect-ratio 9:16 --out /tmp/tall.png
| Param | Default | Description |
| --- | --- | --- |
| --prompt <text> | required | Generation prompt |
| --out <path> | required | Output file path |
| --provider <name> | gemini | Provider: gemini or openai |
| --ref-image <img> | | Reference image (repeatable) |
| --size <WxH> | | Image dimensions |
| --aspect-ratio <ratio> | | Aspect ratio (Gemini only) |

Models Used

| Provider | Use Case | Model |
| --- | --- | --- |
| Kling | Video generation | kling-v3-omni |
| Gemini | Media analysis | gemini-2.5-flash |
| Gemini | Image generation | gemini-3-pro-image-preview |
| OpenAI | Image generation | gpt-image-1 |