Core Concepts

Architecture, project.json schema, trim specs, and workflows — the foundational concepts behind Montaj.

Architecture

Montaj is a video editing tool harness that mounts on top of your existing agent framework. It is not an agent — it is the toolkit the agent uses. You bring Claude, Cursor, or any agent; Montaj gives it the tools to edit video.

System Overview

┌──────────────────────────────────────────────────────────────┐
│                       LOCAL UI  (ui/)                         │
│                    browser → montaj serve                     │
│                                                              │
│  ┌──────────────┐                    ┌──────────────────┐    │
│  │  1. UPLOAD   │                    │   3. REVIEW      │    │
│  │  drop clips  │                    │  timeline        │    │
│  │  write prompt│                    │  preview player  │    │
│  │  POST /run   │                    │  caption editor  │    │
│  └──────┬───────┘                    └────────┬─────────┘    │
│         │          ┌──────────────┐           │              │
│         │          │ 2. LIVE VIEW │           │              │
│         │          │ SSE stream   │───────────┘              │
│         │          └──────┬───────┘                          │
└─────────┼─────────────────┼──────────────────────────────────┘
          │                 │
          ▼                 │
┌───────────────────────────┴──────────────────────────────────┐
│                      montaj serve                             │
│                  local HTTP + SSE server                       │
│                                                              │
│  POST /api/run    → creates project.json [pending]           │
│  GET  /api/projects → list projects; ?status=pending         │
│  file watcher     → detects project.json writes → SSE        │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│                     AGENT (external)                          │
│                  Claude, Cursor, etc.                         │
│                                                              │
│  reads project.json [pending]                                │
│  reads workflows/<name>.json                                 │
│  calls steps as tools at its own discretion                  │
│  writes project.json as work progresses → file watcher → SSE│
│  marks [draft] when done                                     │
└──────────────────────────────────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │   human review (UI)    │
              │   optional tweaks      │
              └────────────┬───────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │     RENDER PASS        │
              │  React + Puppeteer     │
              │  + ffmpeg              │
              └────────────┬───────────┘
                           │
                           ▼
                      final MP4

Agent Interfaces

Montaj exposes three interfaces for agents to call steps. All are optional. All wrap the same CLI executables.

CLI

The agent runs montaj commands directly via shell access:

montaj trim clip.mp4 --start 2.5 --end 8.3
montaj transcribe clip.mp4 --model base.en
montaj resize clip.mp4 --ratio 9:16

Works with any agent that has shell access — Claude Code, Cursor, or any framework that can execute shell commands.

MCP

Montaj runs as a local MCP server (montaj mcp), started automatically by the MCP client. The agent calls steps as native tools — no shell access required.

{
  "mcpServers": {
    "montaj": { "command": "montaj", "args": ["mcp"] }
  }
}

New steps are picked up automatically — adding steps/my-step.py + .json makes it available as an MCP tool with no extra configuration.

HTTP API

montaj serve exposes a step execution API alongside the browser UI:

POST /api/steps/trim        body: { input: "clip.mp4", start: 2.5, end: 8.3 }
POST /api/steps/transcribe  body: { input: "clip.mp4", model: "medium.en" }
GET  /api/steps             returns: list of available steps with schemas

All API routes are namespaced under /api/ so they never collide with React Router paths.
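
As a rough illustration, the step API could be called from Python with only the standard library. This is a sketch under assumptions: the base URL below is hypothetical (the port montaj serve listens on is not stated here), sending the body as JSON with a Content-Type header is assumed, and the response shape is not documented in this section, so the helper returns the raw body.

```python
import json
import urllib.request

# Assumption: the address montaj serve listens on is not documented here.
BASE_URL = "http://localhost:3000"


def build_step_request(step: str, params: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /api/steps/<step> with a JSON body (assumed encoding)."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/steps/{step}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_step(step: str, params: dict, base_url: str = BASE_URL) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_step_request(step, params, base_url)) as resp:
        return resp.read().decode("utf-8")
```

With montaj serve running, run_step("trim", {"input": "clip.mp4", "start": 2.5, "end": 8.3}) would mirror the first example above.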

Summary

| Interface | Purpose | Used by |
|---|---|---|
| CLI | Step execution | Agents with shell access, humans |
| HTTP API | Step execution | Agents with HTTP access, the browser UI |
| MCP | Step execution | Claude Desktop / Claude Code (native tools) |
| montaj serve | Browser UI, SSE, project lifecycle, HTTP API | Humans, agents |

Directory Structure

Three scopes. Same format at every level:

~/Montaj/                       # workspace — all projects live here
  2024-11-01-my-ad/             # one directory per project
    project.json
    clip1_trimmed.mp4
    ...

~/.montaj/                      # user-global config + extensions
  steps/                        # user custom steps
  workflows/                    # user custom workflows
  config.json                   # global defaults (workspaceDir, model, etc.)
  credentials.json              # API credentials (0600 permissions)

montaj/                         # built-in (ships with montaj)
  steps/                        # native steps
  connectors/                   # external API wrappers
  workflows/                    # bundled workflows

The workspace location defaults to ~/Montaj. Override via ~/.montaj/config.json:

{ "workspaceDir": "/Volumes/FastSSD/Montaj" }

Step Resolution Order

When the agent or CLI calls a step, Montaj resolves it in this order:

  1. Project-local: ./steps/<name>
  2. User-global: ~/.montaj/steps/<name>
  3. Built-in: montaj/steps/<name>

Prefix in workflow files makes scope explicit:

| Prefix | Resolves to |
|---|---|
| montaj/<name> | Built-in steps |
| user/<name> | ~/.montaj/steps/<name> |
| ./steps/<name> | Project-local steps |
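
The resolution order and prefixes above can be sketched as a small lookup function. This is not Montaj's actual resolver: the function name, its parameters, and the assumption that a step is found by a `<name>.py` file inside a `steps/` directory are all illustrative.

```python
from pathlib import Path
from typing import Optional


def resolve_step(uses: str, project_dir: Path, user_dir: Path, builtin_dir: Path) -> Optional[Path]:
    """Return the first matching step script for a `uses` value, or None.

    Assumption: a step is identified by a `<name>.py` file in a `steps/` directory.
    """
    scopes = {
        "./steps/": [project_dir / "steps"],
        "user/": [user_dir / "steps"],
        "montaj/": [builtin_dir / "steps"],
    }
    for prefix, dirs in scopes.items():
        if uses.startswith(prefix):
            # An explicit prefix pins the lookup to a single scope.
            name, search = uses[len(prefix):], dirs
            break
    else:
        # No prefix: project-local first, then user-global, then built-in.
        name = uses
        search = [project_dir / "steps", user_dir / "steps", builtin_dir / "steps"]

    name = name.removesuffix(".py")
    for directory in search:
        candidate = directory / f"{name}.py"
        if candidate.is_file():
            return candidate
    return None
```

A bare name such as "trim" resolves to the project-local copy when one exists, while "montaj/trim" skips any overrides and goes straight to the built-in step.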

Output Convention

All steps follow a strict contract:

  • stdout — the result: file path or JSON. Nothing else.
  • stderr — errors only: {"error":"code","message":"detail"}
  • exit 0 on success, exit 1 on failure

This makes steps composable at the shell level:

FILE=$(montaj step rm_fillers --input clip.mp4 --model base.en)
FILE=$(montaj step trim --input "$FILE" --start 5 --end 90)
FILE=$(montaj step resize --input "$FILE" --ratio 9:16)
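
A custom step that honors this contract might look like the skeleton below. It is a minimal sketch, not a real Montaj step: the step does nothing but validate and echo its input, and the error code "input_not_found" is a made-up example.

```python
import argparse
import json
import os
import sys
from typing import List


def main(argv: List[str]) -> int:
    """Hypothetical step: validate an input path and echo its absolute path."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    args = parser.parse_args(argv)

    if not os.path.isfile(args.input):
        # stderr carries errors only, as a single JSON object.
        err = {"error": "input_not_found", "message": f"no such file: {args.input}"}
        print(json.dumps(err), file=sys.stderr)
        return 1

    # stdout carries the result (here a file path) and nothing else.
    print(os.path.abspath(args.input))
    return 0
```

To run it as a script, end the file with sys.exit(main(sys.argv[1:])). Note that argparse prints usage text and exits with status 2 when arguments are malformed, so a step that must follow the contract strictly would need to handle that case itself.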

Versioning

Two layers:

  • Git (milestone): montaj run initializes the workspace as a git repo. Commits are created automatically at state transitions (pending, draft, human save). montaj checkpoint "<name>" creates a named commit before risky operations.
  • In-memory undo stack (fine-grained): the UI maintains an undo stack for the current review session. Every caption, overlay, or trim edit is undoable without touching disk.

Project Format

project.json is the single format that flows through the entire pipeline. One file, three states.

Lifecycle

| State | Who writes it | What's in it |
|---|---|---|
| pending | montaj run / montaj serve | Project ID, name, clip paths, editing prompt, workflow name. No agent work yet. |
| draft | Agent | Trim points, ordering, captions, overlays. Agent's complete edit. |
| final | Human (via UI) | Reviewed and tweaked. Ready to render. |

Schema

{
  "version": "0.2",
  "id": "<uuid>",
  "status": "pending",
  "workflow": "overlays",
  "editingPrompt": "tight cuts, remove filler, 9:16",
  "settings": {
    "resolution": [1080, 1920],
    "fps": 30
  },
  "tracks": [
    [
      {
        "id": "clip-0",
        "type": "video",
        "src": "/abs/path/clip.mp4",
        "start": 0.0,
        "end": 0.0
      }
    ]
  ],
  "assets": [],
  "audio": {}
}

Identity

Each project gets a UUID (id) at init time — this is the stable identifier. The workspace directory name (~/Montaj/<date>-<name>/) is human-readable but not the identity. The optional name field is a label; it does not need to be unique.

Tracks

tracks is a top-level array of arrays. tracks[0] is the primary video track. Overlay and caption tracks start at index 1.

Video Items

{
  "id": "clip-0",
  "type": "video",
  "src": "/abs/path/to/original.MOV",
  "inPoint": 2.1,
  "outPoint": 8.4,
  "start": 0.0,
  "end": 6.3
}
  • src — absolute path to the real video file (never a spec JSON file)
  • inPoint / outPoint — source file timestamps (what range of the original to use)
  • start / end — position in the output timeline

Overlay Items

{
  "id": "ov-hook",
  "type": "overlay",
  "src": "/abs/path/to/project/overlays/hook.jsx",
  "props": { "text": "She built an AI employee" },
  "start": 0.0,
  "end": 3.0
}

| Field | Required | Description |
|---|---|---|
| type | yes | "overlay" for custom JSX, "image" for static images, "video" for video clips |
| src | yes | Absolute path to the JSX file |
| start / end | yes | Time window in output video (seconds) |
| props | no | Arbitrary data injected as the props global inside the component |
| offsetX / offsetY | no | Position offset as % of frame size |
| scale | no | Uniform scale multiplier |

Caption Track

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}

Caption data (segments + word timestamps) is always inlined in the track — never a file pointer.

Assets

Image files (logos, watermarks). Each has id, absolute src, type: "image", and optional name.

# CLI
montaj run ./clips --prompt "..." --assets logo.png

# HTTP
POST /api/run  body: { "assets": ["/path/logo.png"] }

Audio

The audio field stores music and audio configuration for the project, including music tracks with inPoint/outPoint and ducking settings.

Live Updates

The agent writes project.json as it works — every write is picked up by the file watcher and pushed to the browser via SSE. The timeline builds live as the agent makes decisions.

State Transitions

montaj run       → project.json [pending]
agent completes  → project.json [draft]
human reviews    → project.json [final]
montaj render    → reads [final] → final.mp4
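
The forward-only lifecycle can be expressed as a small helper that rewrites the status field in place. This is an illustrative sketch: the function name and the idea of rejecting any other transition are assumptions, not documented Montaj behavior.

```python
import json
from pathlib import Path

# Allowed status changes, taken from the transitions listed above.
NEXT_STATUS = {"pending": "draft", "draft": "final"}


def advance_status(project_path: Path) -> str:
    """Move project.json to its next lifecycle state and return the new status."""
    project = json.loads(project_path.read_text())
    current = project["status"]
    if current not in NEXT_STATUS:
        raise ValueError(f"cannot advance from status {current!r}")
    project["status"] = NEXT_STATUS[current]
    # Rewriting the file is what the file watcher would pick up.
    project_path.write_text(json.dumps(project, indent=2))
    return project["status"]
```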

Trim Specs

The trim spec architecture is the core innovation in Montaj's editing pipeline. Editing steps output trim specs — not video files. A trim spec describes which ranges of the original source file to keep:

{
  "input": "/abs/path/original.MOV",
  "keeps": [[0.0, 5.3], [6.1, 12.4]]
}

Why This Matters

Before this architecture, every editing step re-encoded the full video. A five-clip workflow running silence removal + filler removal produced fifteen video encodes before the final concat. For 4K HEVC footage this caused multi-minute timeouts per step.

With trim specs, no video is decoded or encoded until concat. Editing steps work on audio only (for analysis) and pass timestamps forward. The entire set of cuts — silence boundaries, filler removals, take selections — is accumulated as trim spec refinements and applied in a single ffmpeg filter_complex pass at concat time.

Data Flow

waveform_trim(clip.MOV)
  → {input: "clip.MOV", keeps: [[2.1, 8.4], [9.0, 15.2]]}

transcribe({input: "clip.MOV", keeps: [...]})
  → extracts audio only at keep ranges
  → runs whisper on the joined audio
  → remaps word timestamps back to original timeline

rm_fillers({input: "clip.MOV", keeps: [...]})
  → extracts audio at keeps, detects fillers
  → subtracts filler timestamps from keeps
  → {input: "clip.MOV", keeps: [[2.1, 7.8], [9.2, 15.2]]}  ← refined

concat({inputs: [spec1.json, spec2.json, ...]})
  → ONE filter_complex per clip, applying all accumulated cuts
  → ONE encode pass total
  → final.mp4
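
Two pieces of arithmetic carry this data flow: subtracting detected ranges from a keeps list, and mapping a timestamp on the joined audio back to the source timeline. The sketch below shows one way to do both. The function names are hypothetical and this is not Montaj's implementation.

```python
from typing import List

Range = List[float]  # [start, end] in seconds on the original source timeline


def subtract_ranges(keeps: List[Range], cuts: List[Range]) -> List[Range]:
    """Remove every cut range from the keeps list, splitting keeps where needed."""
    result = [list(k) for k in keeps]
    for cut_start, cut_end in cuts:
        next_result = []
        for start, end in result:
            if cut_end <= start or cut_start >= end:
                next_result.append([start, end])  # no overlap with this cut
                continue
            if cut_start > start:
                next_result.append([start, cut_start])  # part before the cut
            if cut_end < end:
                next_result.append([cut_end, end])  # part after the cut
        result = next_result
    return result


def remap_to_source(t: float, keeps: List[Range]) -> float:
    """Map a timestamp on the joined (keeps-only) audio back to the source timeline."""
    elapsed = 0.0
    for start, end in keeps:
        length = end - start
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    raise ValueError("timestamp lies beyond the joined audio")
```

Taking the keeps from waveform_trim above and assuming fillers were detected at [7.8, 8.4] and [9.0, 9.2], subtract_ranges returns [[2.1, 7.8], [9.2, 15.2]], which matches the refined spec shown for rm_fillers.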

Trim Spec Rules

  1. Editing steps always receive the original source file path, never a re-encoded intermediate
  2. Trim specs chain — each step refines the keeps list, preserving the original input path throughout
  3. concat and materialize_cut are the only steps that encode video: concat for the normal pipeline; materialize_cut only when a subsequent step (e.g. remove_bg) requires a physical video file
  4. Both encoders use input-level seeking (-ss/-t placed before -i) — ffmpeg seeks at the container level so only the requested segment is decoded
  5. HEVC source files are handled automatically at concat — no pre-conversion needed
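
To make rule 4 concrete, the sketch below builds an ffmpeg argument list with -ss and -t placed before -i for a single keep range. It only illustrates the option ordering. The helper name is hypothetical, and the real concat step applies all cuts in one filter_complex pass rather than one command per range.

```python
from typing import List


def segment_args(src: str, start: float, end: float, out: str) -> List[str]:
    """Build an ffmpeg command that seeks on the input side for one keep range."""
    duration = end - start
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",    # before -i: applies to the input, seek first
        "-t", f"{duration:.3f}",  # before -i: limits how much input is read
        "-i", src,
        out,
    ]
```

The list can be passed to subprocess.run as is. Moving -ss and -t after -i would turn them into output options, and ffmpeg would then decode and discard everything before the start point.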

Trim Spec Steps

These steps produce or refine trim specs:

| Step | Input | Output |
|---|---|---|
| waveform_trim | Video/audio file | New trim spec (silence removed) |
| rm_fillers | Trim spec | Refined trim spec (fillers removed) |
| rm_nonspeech | Trim spec | Refined trim spec (non-speech removed) |
| crop_spec | Trim spec | Cropped trim spec (virtual-timeline windows) |

Trim Spec Consumers

These steps consume trim specs by encoding video:

| Step | What it does |
|---|---|
| concat | Joins clips and applies all trim specs in a single encode pass |
| materialize_cut | Encodes a trim spec to H.264 when a physical file is needed |

Important: The src Field

Any video clip item in project.json (in any track) must have src pointing to a real video file (.MOV, .mp4, etc.) — never a spec JSON file. For clips derived from trim specs, read spec["input"] for src, and spec["keeps"] to derive inPoint/outPoint.

Multi-keep specs expand into multiple clip items, each with their own inPoint/outPoint.
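
The expansion described above can be sketched as follows. The helper is hypothetical: laying the clips end to end on the output timeline, the clip-N id scheme, and rounding to milliseconds are assumptions made for the example.

```python
from typing import Dict, List


def spec_to_clips(spec: Dict, timeline_start: float = 0.0, first_index: int = 0) -> List[Dict]:
    """Expand a trim spec into video items, one per keep range, laid end to end."""
    clips = []
    cursor = timeline_start
    for offset, (in_point, out_point) in enumerate(spec["keeps"]):
        duration = out_point - in_point
        clips.append({
            "id": f"clip-{first_index + offset}",
            "type": "video",
            "src": spec["input"],  # the real video file, never the spec JSON
            "inPoint": in_point,
            "outPoint": out_point,
            "start": round(cursor, 3),
            "end": round(cursor + duration, 3),
        })
        cursor += duration
    return clips
```

For the spec shown earlier, with keeps [[0.0, 5.3], [6.1, 12.4]], this yields two items that share the same src: the first occupies 0.0 to 5.3 on the output timeline and the second 5.3 to 11.6.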


Workflows

Workflows are suggested editing plans — which steps to use and their default params. The agent reads the plan, reads the prompt, and decides the actual execution.

A workflow is not a deterministic execution pipeline. The agent may reorder steps, adjust params, skip steps that don't apply, or add steps not in the list — whatever the prompt and content call for.

Workflow File Format

{
  "name": "overlays",
  "description": "Multi-clip edit — silence trim, transcribe, select best takes, remove fillers, concat, caption, overlays, resize to 9:16.",
  "steps": [
    { "id": "probe",             "uses": "montaj/probe" },
    { "id": "snapshot",          "uses": "montaj/snapshot" },
    { "id": "silence",           "uses": "montaj/waveform_trim",  "foreach": "clips", "params": { "threshold": "-30", "min-silence": 0.3 } },
    { "id": "transcribe",        "uses": "montaj/transcribe",     "foreach": "clips", "needs": ["silence"],           "params": { "model": "base.en" } },
    { "id": "select_takes",      "uses": "montaj/select_takes",                       "needs": ["transcribe"] },
    { "id": "fillers",           "uses": "montaj/rm_fillers",     "foreach": "clips", "needs": ["select_takes"],      "params": { "model": "base.en" } },
    { "id": "concat",            "uses": "montaj/concat",                             "needs": ["fillers"] },
    { "id": "transcribe_concat", "uses": "montaj/transcribe",                         "needs": ["concat"],            "params": { "model": "base.en" } },
    { "id": "caption",           "uses": "montaj/caption",                            "needs": ["transcribe_concat"], "params": { "style": "word-by-word" } },
    { "id": "overlays",          "uses": "montaj/overlay",                            "needs": ["caption"],           "params": { "style": "auto" } },
    { "id": "resize",            "uses": "montaj/resize",                             "needs": ["overlays"],          "params": { "ratio": "9:16" } }
  ]
}

Step Fields

| Field | Required | Description |
|---|---|---|
| id | yes | Unique identifier within the workflow; used in needs references |
| uses | yes | Step to run: montaj/<name>, user/<name>, or ./steps/<name>.py |
| params | no | Default param overrides; only include values that differ from step defaults |
| needs | no | Step IDs that must complete before this step starts. Omit when there are no deps. |
| foreach | no | "clips": run this step once per clip in the project, in parallel |

Parallel Execution

needs is the dependency graph. The agent fires all steps with no unmet needs simultaneously, then re-evaluates after each completes. Steps in the same "wave" run in parallel.

Example execution waves for the overlays workflow:

Wave 1 (parallel): probe, snapshot, silence x N (foreach clips)
Wave 2 (parallel): transcribe x N (foreach clips — needs silence)
Wave 3:            select_takes (needs transcribe)
Wave 4 (parallel): fillers x N (foreach clips — needs select_takes)
Wave 5:            concat (needs fillers)
Wave 6:            transcribe_concat (needs concat)
Wave 7:            caption (needs transcribe_concat)
Wave 8:            overlays (needs caption)
Wave 9:            resize (needs overlays)

foreach: "clips" fans out a step across all clips in the project. The agent runs them as parallel tool calls and collects the outputs before proceeding to steps that need them.
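
The wave grouping can be derived mechanically from the needs fields. The sketch below is a simplified model with a hypothetical function name: it batches steps into whole waves rather than re-evaluating after each individual completion, and it does not model the foreach fan-out.

```python
from typing import Dict, List


def execution_waves(steps: List[Dict]) -> List[List[str]]:
    """Group step IDs into waves; each wave only needs steps from earlier waves."""
    remaining = {step["id"]: set(step.get("needs", [])) for step in steps}
    done: set = set()
    waves = []
    while remaining:
        # A step is ready once every ID in its needs has completed.
        ready = [sid for sid, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("cycle or unknown step ID in needs")
        waves.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return waves
```

Fed the steps array from the workflow file above, it returns nine waves, starting with ["probe", "snapshot", "silence"] and ending with ["resize"].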

Bundled Workflows

| Workflow | Description |
|---|---|
| overlays | Multi-clip edit: silence trim, transcribe, select best takes, remove fillers, overlays. No captions. Default when no --workflow is specified. |
| short_captions | Multi-clip edit: same as overlays plus caption and resize 9:16. |
| clean_cut | Trim and clean only: silence, transcribe, select best takes, remove fillers. No captions, overlays, or resize. |
| animations | Animation-only: no source footage required. Agent builds entirely from overlays and animation sections. |
| explainer | Multi-clip edit with animation sections: same as overlays plus animation sections. No captions. |
| floating_head | Talking-head presenter over a custom background: trim, materialize, RVM background removal. Background in tracks[0], presenter in tracks[1]. |
| lyrics_video | Audio + lyrics aligned with word-synced text video. |
| ai_video | Director agent writes a storyboard from your prompt and references, you approve, scenes are generated via Kling. |

Agent-Authored Steps

Two step names in workflows are not CLI executables — they are tasks the agent performs itself:

montaj/select_takes

The agent reads transcripts from all clips, groups segments by content similarity (repeated takes), selects the best take of each section, and trims each clip accordingly using montaj/trim.
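
The grouping part of this task can be approximated with plain string similarity. The sketch below is only an illustration: in practice the agent uses its own judgment, and the difflib-based comparison, the 0.6 threshold, and the function name are all assumptions. Choosing the best take within each group is left out.

```python
from difflib import SequenceMatcher
from typing import List


def group_takes(segments: List[str], threshold: float = 0.6) -> List[List[int]]:
    """Group segment indices whose text is similar to a group's first member."""
    groups: List[List[int]] = []
    for index, text in enumerate(segments):
        for group in groups:
            reference = segments[group[0]]
            ratio = SequenceMatcher(None, text.lower(), reference.lower()).ratio()
            if ratio >= threshold:
                group.append(index)  # treat as a repeated take of the same section
                break
        else:
            groups.append([index])  # first take of a new section
    return groups
```

Each inner list holds the indices of segments that look like repeated takes of the same section, in the order they were spoken.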

montaj/overlay

The agent writes custom JSX overlay files and adds them to tracks in project.json. There are no built-in overlay templates — every overlay is a custom React component the agent authors.

Deviation Rules

The agent should follow the assigned workflow and deviate only when the prompt explicitly requires it:

  • "no captions" → skip caption
  • "keep it raw" → skip rm_fillers, waveform_trim
  • "YouTube format" → resize 16:9

Managing Workflows

montaj workflow list                    # list all available workflows
montaj workflow new <name>              # scaffold a new workflow file
montaj workflow edit <name>             # open in the node graph UI
montaj workflow run <name> ./clips --prompt "..."  # run a specific workflow

All workflow files are equal — fork any of them, save under a new name, and it becomes available immediately.