Architecture, project.json schema, trim specs, and workflows — the foundational concepts behind Montaj.

Core Concepts

Architecture

Montaj is a video editing tool harness that mounts on top of your existing agent framework. It is not an agent — it is the toolkit the agent uses. You bring Claude, Cursor, or any agent; Montaj gives it the tools to edit video.

System Overview

┌──────────────────────────────────────────────────────────────┐
│                       LOCAL UI  (ui/)                         │
│                    browser → montaj serve                     │
│                                                              │
│  ┌──────────────┐                    ┌──────────────────┐    │
│  │  1. UPLOAD   │                    │   3. REVIEW      │    │
│  │  drop clips  │                    │  timeline        │    │
│  │  write prompt│                    │  preview player  │    │
│  │  POST /run   │                    │  caption editor  │    │
│  └──────┬───────┘                    └────────┬─────────┘    │
│         │          ┌──────────────┐           │              │
│         │          │ 2. LIVE VIEW │           │              │
│         │          │ SSE stream   │───────────┘              │
│         │          └──────┬───────┘                          │
└─────────┼─────────────────┼──────────────────────────────────┘
          │                 │
          ▼                 │
┌───────────────────────────┴──────────────────────────────────┐
│                      montaj serve                             │
│                  local HTTP + SSE server                       │
│                                                              │
│  POST /api/run    → creates project.json [pending]           │
│  GET  /api/projects → list projects; ?status=pending         │
│  file watcher     → detects project.json writes → SSE        │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│                     AGENT (external)                          │
│                  Claude, Cursor, etc.                         │
│                                                              │
│  reads project.json [pending]                                │
│  reads workflows/<name>.json                                 │
│  calls steps as tools at its own discretion                  │
│  writes project.json as work progresses → file watcher → SSE│
│  marks [draft] when done                                     │
└──────────────────────────────────────────────────────────────┘
                          │
                          ▼
              ┌────────────────────────┐
              │   human review (UI)    │
              │   optional tweaks      │
              └────────────┬───────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │     RENDER PASS        │
              │  React + Puppeteer     │
              │  + ffmpeg              │
              └────────────┬───────────┘
                           │
                           ▼
                      final MP4

Agent Interfaces

Montaj exposes three interfaces for agents to call steps. All are optional. All wrap the same CLI executables.

CLI

The agent runs montaj commands directly via shell access:

montaj trim clip.mp4 --start 2.5 --end 8.3
montaj transcribe clip.mp4 --model base.en
montaj resize clip.mp4 --ratio 9:16

Works with any agent that has shell access — Claude Code, Cursor, or any framework that can execute shell commands.

MCP

Montaj runs as a local MCP server (montaj mcp), started automatically by the MCP client. The agent calls steps as native tools — no shell access required.

{
  "mcpServers": {
    "montaj": { "command": "montaj", "args": ["mcp"] }
  }
}

New steps are picked up automatically: adding steps/my-step.py and a matching steps/my-step.json makes the step available as an MCP tool with no extra configuration.

HTTP API

montaj serve exposes a step execution API alongside the browser UI:

POST /api/steps/trim        body: { input: "clip.mp4", start: 2.5, end: 8.3 }
POST /api/steps/transcribe  body: { input: "clip.mp4", model: "medium.en" }
GET  /api/steps             returns: list of available steps with schemas

All API routes are namespaced under /api/ so they never collide with React Router paths.
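
As a minimal sketch, the step API could be called from Python with only the standard library. The base URL `http://localhost:3000` is an assumption (this page does not state which port montaj serve listens on), and `build_step_request` is a hypothetical helper name. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: the address montaj serve listens on is not given on this page.
BASE_URL = "http://localhost:3000"


def build_step_request(step, params, base_url=BASE_URL):
    """Build (but do not send) a POST /api/steps/<step> request with a JSON body."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/steps/{step}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_step_request("trim", {"input": "clip.mp4", "start": 2.5, "end": 8.3})
# To actually run the step against a live server:
#   urllib.request.urlopen(req).read()
```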

Summary

Interface	Purpose	Used by
CLI	Step execution	Agents with shell access, humans
HTTP API	Step execution	Agents with HTTP access, the browser UI
MCP	Step execution	Claude Desktop / Claude Code (native tools)
`montaj serve`	Browser UI, SSE, project lifecycle, HTTP API	Humans, agents

Directory Structure

Three scopes. Same format at every level:

~/Montaj/                       # workspace — all projects live here
  2024-11-01-my-ad/             # one directory per project
    project.json
    clip1_trimmed.mp4
    ...

~/.montaj/                      # user-global config + extensions
  steps/                        # user custom steps
  workflows/                    # user custom workflows
  config.json                   # global defaults (workspaceDir, model, etc.)
  credentials.json              # API credentials (0600 permissions)

montaj/                         # built-in (ships with montaj)
  steps/                        # native steps
  connectors/                   # external API wrappers
  workflows/                    # bundled workflows

The workspace location defaults to ~/Montaj. Override via the MONTAJ_WORKSPACE_DIR env var or ~/.montaj/config.json:

export MONTAJ_WORKSPACE_DIR=/Volumes/FastSSD/Montaj

{ "workspaceDir": "/Volumes/FastSSD/Montaj" }

Precedence: MONTAJ_WORKSPACE_DIR > ~/.montaj/config.json > default. The env var is useful for containerized deployments where dropping a config file is awkward.
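
The precedence rule above can be sketched in a few lines of Python. This is an illustration, not Montaj's actual loader: `resolve_workspace_dir` is a hypothetical name, and treating an unreadable config file as absent is an assumption.

```python
import json
import os
from pathlib import Path


def resolve_workspace_dir(env=None, config_path=None):
    """Return the workspace dir: env var first, then config.json, then ~/Montaj."""
    env = os.environ if env is None else env
    config_path = Path(config_path) if config_path else Path.home() / ".montaj" / "config.json"

    # 1. The environment variable wins when it is set and non-empty.
    from_env = env.get("MONTAJ_WORKSPACE_DIR")
    if from_env:
        return Path(from_env).expanduser()

    # 2. Otherwise use workspaceDir from the config file, if present.
    if config_path.is_file():
        try:
            configured = json.loads(config_path.read_text()).get("workspaceDir")
        except (json.JSONDecodeError, OSError, AttributeError):
            configured = None  # assumption: an unreadable config counts as absent
        if configured:
            return Path(configured).expanduser()

    # 3. Default location.
    return Path.home() / "Montaj"
```

Passing `env` and `config_path` explicitly keeps the function easy to test without touching the real environment.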

Step Resolution Order

When the agent or CLI calls a step, Montaj resolves it in this order:

Project-local — ./steps/<name>
User-global — ~/.montaj/steps/<name>
Built-in — montaj/steps/<name>

A prefix in a workflow file makes the scope explicit:

Prefix	Resolves to
`montaj/<name>`	Built-in steps
`user/<name>`	`~/.montaj/steps/<name>`
`./steps/<name>`	Project-local steps
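
The lookup order and prefixes above could be implemented roughly as follows. This is a sketch under stated assumptions: `resolve_step` is a hypothetical function, the three directories are passed in explicitly, and a step is assumed to be a `<name>.py` file inside the chosen directory.

```python
from pathlib import Path


def resolve_step(ref, project_steps, user_steps, builtin_steps):
    """Return the path of the step script for `ref`, or None if nothing matches."""
    # An explicit prefix pins the lookup to a single scope.
    scopes = {
        "./steps/": [project_steps],
        "user/": [user_steps],
        "montaj/": [builtin_steps],
    }
    # Without a prefix: project-local first, then user-global, then built-in.
    name, search = ref, [project_steps, user_steps, builtin_steps]
    for prefix, dirs in scopes.items():
        if ref.startswith(prefix):
            name, search = ref[len(prefix):], dirs
            break

    # Assumption: steps are stored as <name>.py; accept refs with or without ".py".
    filename = name if name.endswith(".py") else name + ".py"
    for directory in search:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```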

Output Convention

All steps follow a strict contract:

stdout — the result: file path or JSON. Nothing else.
stderr — errors only: {"error":"code","message":"detail"}
exit 0 on success, exit 1 on failure

This makes steps composable at the shell level:

FILE=$(montaj step rm_fillers --input clip.mp4 --model base.en)
FILE=$(montaj step trim --input "$FILE" --start 5 --end 90)
FILE=$(montaj step resize --input "$FILE" --ratio 9:16)
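
A caller that is not a shell can apply the same contract. The sketch below interprets an exit code, stdout, and stderr according to the three rules above; `StepError` and `parse_step_result` are hypothetical names, and the fallback for non-JSON stderr is an assumption.

```python
import json


class StepError(Exception):
    """Raised when a step exits non-zero; carries the code and message from stderr."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_step_result(returncode, stdout, stderr):
    """Return the step result (a dict/list for JSON output, else a path string)."""
    if returncode != 0:
        try:
            payload = json.loads(stderr)
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            # Assumption: if stderr is not the documented JSON object, keep the raw text.
            payload = {"error": "unknown", "message": stderr.strip()}
        raise StepError(payload.get("error", "unknown"), payload.get("message", ""))

    text = stdout.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return text  # a plain file path
    # Only treat objects and arrays as JSON results; anything else is a path.
    return parsed if isinstance(parsed, (dict, list)) else text
```

With `subprocess.run(..., capture_output=True, text=True)`, the three arguments map directly to `returncode`, `stdout`, and `stderr` of the completed process.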

Versioning

Two layers:

Git (milestone) — montaj run initializes the workspace as a git repo. Commits are created automatically at state transitions (pending, draft, human save). montaj checkpoint "<name>" creates a named commit before risky operations.
In-memory undo stack (fine-grained) — the UI maintains an undo stack for the current review session. Every caption, overlay, or trim edit is undoable without touching disk.

Project Format

project.json is the single format that flows through the entire pipeline. One file, three states.

Project Types

Type	Renders to	Notes
`editing`	MP4	Default. Trim, cut, transcribe, composite against source clips.
`music_video`	MP4	Lyrics-synced overlays atop a background video or color.
`ai_video`	MP4	Storyboard-driven scene generation via Kling, with music + voiceover.
`carousel`	N PNG slides	Slide-based design surface for Instagram / TikTok photo posts. No time axis. See Image Carousel.

The schema below describes video projects (editing, music_video, ai_video). Carousel projects share the same header (id, status, projectType, name, workflow, editingPrompt, settings.resolution, assets) but replace tracks/audio/fps with a flat slides[] array — see the carousel page for details.

Lifecycle

State	Who writes it	What's in it
`pending`	`montaj run` / `montaj serve`	Project ID, name, clip paths, editing prompt, workflow name. No agent work yet.
`draft`	Agent	Trim points, ordering, captions, overlays. Agent's complete edit.
`final`	Human (via UI)	Reviewed and tweaked. Ready to render.

Schema

{
  "version": "0.2",
  "id": "<uuid>",
  "status": "pending",
  "workflow": "overlays",
  "editingPrompt": "tight cuts, remove filler, 9:16",
  "settings": {
    "resolution": [1080, 1920],
    "fps": 30
  },
  "tracks": [
    [
      {
        "id": "clip-0",
        "type": "video",
        "src": "/abs/path/clip.mp4",
        "start": 0.0,
        "end": 0.0
      }
    ]
  ],
  "assets": [],
  "audio": {}
}

Identity

Each project gets a UUID (id) at init time — this is the stable identifier. The workspace directory name (~/Montaj/<date>-<name>/) is human-readable but not the identity. The optional name field is a label; it does not need to be unique.
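
To make the identity rule concrete, here is a sketch that builds a pending project mirroring the schema above. `new_pending_project` is a hypothetical helper, and the default resolution, fps, and workflow are simply copied from the example, not confirmed defaults.

```python
import uuid


def new_pending_project(clips, prompt, workflow="overlays", name=None):
    """Build a project dict in the [pending] state with a fresh UUID."""
    project = {
        "version": "0.2",
        "id": str(uuid.uuid4()),  # stable identity, independent of name or directory
        "status": "pending",
        "workflow": workflow,
        "editingPrompt": prompt,
        "settings": {"resolution": [1080, 1920], "fps": 30},
        "tracks": [[
            {"id": f"clip-{i}", "type": "video", "src": src, "start": 0.0, "end": 0.0}
            for i, src in enumerate(clips)
        ]],
        "assets": [],
        "audio": {},
    }
    if name is not None:
        project["name"] = name  # a label only; it does not need to be unique
    return project
```

Two projects created with the same name still get different `id` values, which is why the UUID rather than the name is the stable identifier.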

Tracks

tracks is a top-level array of arrays. tracks[0] is the primary video track. Overlay and caption tracks start at index 1.

Video Items

{
  "id": "clip-0",
  "type": "video",
  "src": "/abs/path/to/original.MOV",
  "inPoint": 2.1,
  "outPoint": 8.4,
  "start": 0.0,
  "end": 6.3
}

src — absolute path to the real video file (never a spec JSON file)
inPoint / outPoint — source file timestamps (what range of the original to use)
start / end — position in the output timeline

Overlay Items

{
  "id": "ov-hook",
  "type": "overlay",
  "src": "/abs/path/to/project/overlays/hook.jsx",
  "props": { "text": "She built an AI employee" },
  "start": 0.0,
  "end": 3.0
}

Field	Required	Description
`type`	yes	`"overlay"` for custom JSX, `"image"` for static images, `"video"` for video clips
`src`	yes	Absolute path to the JSX file, or to the media file for `image` and `video` items
`start` / `end`	yes	Time window in output video (seconds)
`props`	no	Arbitrary data injected as the `props` global inside the component
`offsetX` / `offsetY`	no	Position offset as % of frame size
`scale`	no	Uniform scale multiplier

Caption Track

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}

Caption data (segments + word timestamps) is always inlined in the track — never a file pointer.
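
As an illustration of the inlined shape, the sketch below builds a caption track from a flat list of word timestamps. `build_segments`, `caption_track`, and the `max_words` grouping rule are hypothetical; how Montaj actually splits words into segments is not described here.

```python
def build_segments(words, max_words=4):
    """Group word timestamps into segments of at most `max_words` words."""
    segments = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        segments.append({
            "text": " ".join(w["word"] for w in chunk),
            "start": chunk[0]["start"],  # segment spans first word start...
            "end": chunk[-1]["end"],     # ...to last word end
            "words": chunk,
        })
    return segments


def caption_track(words, style="word-by-word"):
    """Return a caption track with the segment data inlined, never a file pointer."""
    return {
        "id": "captions",
        "type": "caption",
        "style": style,
        "segments": build_segments(words),
    }
```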

Assets

Image files (logos, watermarks). Each has id, absolute src, type: "image", and optional name.

# CLI
montaj run ./clips --prompt "..." --assets logo.png

# HTTP
POST /api/run  body: { "assets": ["/path/logo.png"] }

Audio

The audio field stores music and audio configuration for the project, including music tracks with inPoint/outPoint and ducking settings.

Live Updates

The agent writes project.json as it works — every write is picked up by the file watcher and pushed to the browser via SSE. The timeline builds live as the agent makes decisions.

State Transitions

montaj run       → project.json [pending]
agent completes  → project.json [draft]
human reviews    → project.json [final]
montaj render    → reads [final] → final.mp4
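
The transitions above form a small state machine. A minimal sketch, assuming transitions are strictly forward (whether a final project can return to draft is not stated here); `advance` and `ALLOWED_TRANSITIONS` are hypothetical names.

```python
# Assumption: each state can only move to the next one in the lifecycle.
ALLOWED_TRANSITIONS = {
    "pending": {"draft"},
    "draft": {"final"},
    "final": set(),
}


def advance(project, new_status):
    """Return a copy of the project with its status moved to `new_status`."""
    current = project["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move project from {current!r} to {new_status!r}")
    return {**project, "status": new_status}


def can_render(project):
    """Rendering reads a project only once it is marked final."""
    return project["status"] == "final"
```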

Trim Specs

The trim spec architecture is the core innovation in Montaj's editing pipeline. Editing steps output trim specs — not video files. A trim spec describes which ranges of the original source file to keep:

{
  "input": "/abs/path/original.MOV",
  "keeps": [[0.0, 5.3], [6.1, 12.4]]
}

Why This Matters

Before this architecture, every editing step re-encoded the full video. A five-clip workflow running silence removal and filler removal produced ten video encodes (five clips × two steps) before the final concat. For 4K HEVC footage this caused multi-minute timeouts per step.

With trim specs, no video is decoded or encoded until materialize_cut. Editing steps work on audio only (for analysis) and pass timestamps forward. The entire set of cuts — silence boundaries, filler removals, take selections — is accumulated as trim spec refinements and applied in a single ffmpeg filter_complex pass at encode time.

Data Flow

waveform_trim(clip.MOV)
  → {input: "clip.MOV", keeps: [[2.1, 8.4], [9.0, 15.2]]}

transcribe({input: "clip.MOV", keeps: [...]})
  → extracts audio only at keep ranges
  → runs whisper on the joined audio
  → remaps word timestamps back to original timeline

rm_fillers({input: "clip.MOV", keeps: [...]})
  → extracts audio at keeps, detects fillers
  → subtracts filler timestamps from keeps
  → {input: "clip.MOV", keeps: [[2.1, 7.8], [9.2, 15.2]]}  ← refined

materialize_cut({inputs: [spec1.json, spec2.json, ...]})
  → ONE filter_complex per clip, applying all accumulated cuts
  → ONE encode pass total
  → final.mp4
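
Two pieces of the flow above are plain interval arithmetic: subtracting detected ranges from `keeps`, and mapping a timestamp in the joined audio back to the source timeline. The sketch below shows one way to do both. The function names are hypothetical, and the filler ranges in the usage example are illustrative values chosen to reproduce the refined spec shown above.

```python
def subtract_ranges(keeps, cuts):
    """Remove each [start, end] cut from the keep ranges (all times in source seconds)."""
    result = [list(k) for k in keeps]
    for cut_start, cut_end in cuts:
        next_result = []
        for start, end in result:
            if cut_end <= start or cut_start >= end:
                next_result.append([start, end])  # no overlap: keep unchanged
                continue
            if cut_start > start:
                next_result.append([start, cut_start])  # part before the cut
            if cut_end < end:
                next_result.append([cut_end, end])  # part after the cut
        result = next_result
    return result


def joined_to_source(t, keeps):
    """Map a time in the joined (keeps-only) audio back to the source timeline."""
    offset = 0.0
    for start, end in keeps:
        length = end - start
        if t <= offset + length:
            return start + (t - offset)
        offset += length
    raise ValueError("time is past the end of the joined audio")


keeps = [[2.1, 8.4], [9.0, 15.2]]
refined = subtract_ranges(keeps, [[7.8, 8.4], [9.0, 9.2]])  # illustrative filler ranges
```

Because both functions work on timestamps only, they never need to touch the video stream, and the original `input` path can be carried through unchanged.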

Trim Spec Rules

Editing steps always receive the original source file path, never a re-encoded intermediate
Trim specs chain — each step refines the keeps list, preserving the original input path throughout
materialize_cut is the only step that encodes video — it handles both the normal pipeline encode and cases where a subsequent step (e.g. remove_bg) requires a physical video file
Uses input-level seeking (-ss/-t placed before -i) — ffmpeg seeks at the container level so only the requested segment is decoded
HEVC source files are handled automatically — no pre-conversion needed
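
To show what input-level seeking looks like in practice, the sketch below builds an ffmpeg argument list for one trim spec: each keep range becomes its own `-ss`/`-t`/`-i` input, and the concat filter joins them in one encode. This is not Montaj's actual command. `build_ffmpeg_args` is a hypothetical helper, the `libx264` encoder choice is an assumption, and the filter graph assumes every input has one video and one audio stream.

```python
def build_ffmpeg_args(spec, output):
    """Build an ffmpeg command that encodes all keep ranges of a spec in one pass."""
    args = ["ffmpeg", "-y"]
    labels = []
    for index, (start, end) in enumerate(spec["keeps"]):
        # -ss and -t placed before -i apply to that input (input-level seeking).
        args += ["-ss", f"{start:.3f}", "-t", f"{end - start:.3f}", "-i", spec["input"]]
        labels.append(f"[{index}:v][{index}:a]")
    n = len(spec["keeps"])
    # The concat filter joins n segments into one video and one audio stream.
    filter_graph = "".join(labels) + f"concat=n={n}:v=1:a=1[v][a]"
    args += ["-filter_complex", filter_graph, "-map", "[v]", "-map", "[a]"]
    args += ["-c:v", "libx264", output]  # assumption: H.264 via libx264
    return args


args = build_ffmpeg_args(
    {"input": "/abs/path/original.MOV", "keeps": [[0.0, 5.3], [6.1, 12.4]]},
    "out.mp4",
)
```

The list can be passed to `subprocess.run(args, check=True)` on a machine with ffmpeg installed.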

Trim Spec Steps

These steps produce or refine trim specs:

Step	Input	Output
`waveform_trim`	Video/audio file	New trim spec (silence removed)
`rm_fillers`	Trim spec	Refined trim spec (fillers removed)
`rm_nonspeech`	Trim spec	Refined trim spec (non-speech removed)
`crop_spec`	Trim spec	Cropped trim spec (virtual-timeline windows)

Trim Spec Consumers

These steps consume trim specs by encoding video:

Step	What it does
`materialize_cut`	Encodes a trim spec to H.264 — applies all accumulated cuts in a single ffmpeg pass

Important: The `src` Field

Any video clip item in project.json (in any track) must have src pointing to a real video file (.MOV, .mp4, etc.) — never a spec JSON file. For clips derived from trim specs, read spec["input"] for src, and spec["keeps"] to derive inPoint/outPoint.

Multi-keep specs expand into multiple clip items, each with their own inPoint/outPoint.
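
A sketch of that expansion: each keep range becomes one clip item whose `src` is the spec's `input`. `spec_to_clips` is a hypothetical helper, and placing the clips back to back on the output timeline is an assumption about layout.

```python
def spec_to_clips(spec, id_prefix="clip", timeline_start=0.0):
    """Expand a trim spec into video clip items, one per keep range."""
    clips = []
    cursor = timeline_start
    for index, (in_point, out_point) in enumerate(spec["keeps"]):
        duration = out_point - in_point
        clips.append({
            "id": f"{id_prefix}-{index}",
            "type": "video",
            "src": spec["input"],  # the real video file, never the spec JSON
            "inPoint": in_point,
            "outPoint": out_point,
            # Rounded to milliseconds to hide floating-point noise.
            "start": round(cursor, 3),
            "end": round(cursor + duration, 3),
        })
        cursor += duration
    return clips


clips = spec_to_clips({"input": "/abs/path/original.MOV", "keeps": [[0.0, 5.3], [6.1, 12.4]]})
```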

Workflows

Workflows are suggested editing plans — which steps to use and their default params. The agent reads the plan, reads the prompt, and decides the actual execution.

A workflow is not a deterministic execution pipeline. The agent may reorder steps, adjust params, skip steps that don't apply, or add steps not in the list — whatever the prompt and content call for.

Workflow File Format

{
  "name": "overlays",
  "description": "Multi-clip edit — silence trim, transcribe, select best takes, remove fillers, concat, caption, overlays, resize to 9:16.",
  "steps": [
    { "id": "probe",             "uses": "montaj/probe" },
    { "id": "snapshot",          "uses": "montaj/snapshot" },
    { "id": "silence",           "uses": "montaj/waveform_trim",  "foreach": "clips", "params": { "threshold": "-30", "min-silence": 0.3 } },
    { "id": "transcribe",        "uses": "montaj/transcribe",     "foreach": "clips", "needs": ["silence"],           "params": { "model": "base.en" } },
    { "id": "select_takes",      "uses": "montaj/select_takes",                       "needs": ["transcribe"] },
    { "id": "fillers",           "uses": "montaj/rm_fillers",     "foreach": "clips", "needs": ["select_takes"],      "params": { "model": "base.en" } },
    { "id": "transcribe_final",  "uses": "montaj/transcribe",                         "needs": ["fillers"],           "params": { "model": "base.en" } },
    { "id": "caption",           "uses": "montaj/caption",                            "needs": ["transcribe_final"],  "params": { "style": "word-by-word" } },
    { "id": "overlays",          "uses": "montaj/overlay",                            "needs": ["caption"],           "params": { "style": "auto" } },
    { "id": "resize",            "uses": "montaj/resize",                             "needs": ["overlays"],          "params": { "ratio": "9:16" } }
  ]
}

Step Fields

Field	Required	Description
`id`	yes	Unique identifier within the workflow — used in `needs` references
`uses`	yes	Step to run: `montaj/<name>`, `user/<name>`, or `./steps/<name>.py`
`params`	no	Default param overrides — only include values that differ from step defaults
`needs`	no	Step IDs that must complete before this step starts. Omit when there are no deps.
`foreach`	no	Dotted path into the project — run per entry in that collection (e.g. `"clips"`, `"storyboard.scenes"`)
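
The field rules in this table are easy to check mechanically. A minimal sketch, assuming only what the table states (required `id` and `uses`, unique ids, `needs` pointing at known ids); `validate_workflow` is a hypothetical helper, not a Montaj command.

```python
def validate_workflow(workflow):
    """Return a list of human-readable problems; an empty list means the file looks valid."""
    errors = []
    seen = set()
    steps = workflow.get("steps", [])
    for index, step in enumerate(steps):
        for field in ("id", "uses"):
            if field not in step:
                errors.append(f"step {index}: missing required field '{field}'")
        step_id = step.get("id")
        if step_id in seen:
            errors.append(f"step {index}: duplicate id '{step_id}'")
        if step_id is not None:
            seen.add(step_id)
    # Check needs after collecting all ids so forward references are allowed.
    for step in steps:
        for needed in step.get("needs", []):
            if needed not in seen:
                errors.append(f"step '{step.get('id')}': needs unknown step '{needed}'")
    return errors
```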

Parallel Execution

needs is the dependency graph. The agent fires all steps with no unmet needs simultaneously, then re-evaluates after each completes. Steps in the same "wave" run in parallel.

Example execution waves for the overlays workflow:

Wave 1 (parallel): probe, snapshot, silence x N (foreach clips)
Wave 2 (parallel): transcribe x N (foreach clips — needs silence)
Wave 3:            select_takes (needs transcribe)
Wave 4 (parallel): fillers x N (foreach clips — needs select_takes)
Wave 5:            transcribe_final (needs fillers)
Wave 6:            caption (needs transcribe_final)
Wave 7:            overlays (needs caption)
Wave 8:            resize (needs overlays)

foreach: <path> fans out a step across all entries in a dotted-path collection on the project. Common values: "clips" (all project clips), "storyboard.scenes", "storyboard.imageRefs", "storyboard.styleRefs" (used by the ai_video workflow). The agent runs them as parallel tool calls and collects the outputs before proceeding to steps that need them.
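
The wave grouping can be derived from `needs` alone. The sketch below is a simplification: it groups steps into whole waves, whereas an agent that re-evaluates after each completion may start a step earlier. `execution_waves` is a hypothetical helper, and `OVERLAYS_STEPS` repeats only the ids and dependencies from the workflow file above.

```python
def execution_waves(steps):
    """Group step ids into waves; every step in a wave has all its needs met."""
    remaining = {s["id"]: set(s.get("needs", [])) for s in steps}
    done = set()
    waves = []
    while remaining:
        ready = [step_id for step_id, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("cycle or unknown dependency in needs")
        waves.append(ready)
        done.update(ready)
        for step_id in ready:
            del remaining[step_id]
    return waves


# Ids and needs copied from the overlays workflow file shown earlier.
OVERLAYS_STEPS = [
    {"id": "probe"},
    {"id": "snapshot"},
    {"id": "silence"},
    {"id": "transcribe", "needs": ["silence"]},
    {"id": "select_takes", "needs": ["transcribe"]},
    {"id": "fillers", "needs": ["select_takes"]},
    {"id": "transcribe_final", "needs": ["fillers"]},
    {"id": "caption", "needs": ["transcribe_final"]},
    {"id": "overlays", "needs": ["caption"]},
    {"id": "resize", "needs": ["overlays"]},
]

waves = execution_waves(OVERLAYS_STEPS)
```

A `foreach` step appears once per wave here; fanning it out across N clips multiplies the tool calls inside that wave without changing the wave order.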

Bundled Workflows

Workflow	Description
`overlays`	Multi-clip edit — silence trim, transcribe, select best takes, remove fillers, overlays. No captions. Default when no `--workflow` is specified.
`short_captions`	Multi-clip edit — same as `overlays` plus caption and resize 9:16.
`clean_cut`	Trim and clean only — silence, transcribe, select best takes, remove fillers. No captions, overlays, or resize.
`animations`	Animation-only — no source footage required. Agent builds entirely from overlays and animation sections.
`explainer`	Multi-clip edit with animation sections — same as `overlays` plus animation sections. No captions.
`floating_head`	Talking-head presenter over a custom background — trim, materialize, RVM background removal. Background in `tracks[0]`, presenter in `tracks[1]`.
`lyrics_video`	Audio plus lyrics, rendered as a video with word-synced text.
`ai_video`	Director agent writes a storyboard from your prompt and references (`ai-video-plan` skill), you approve, scenes are generated via Kling with music and voiceover (`ai-video-generate` skill).

Whichever workflow is selected, the agent adapts the plan to the prompt. For example:

"no captions" → skip caption
"keep it raw" → skip rm_fillers, waveform_trim
"YouTube format" → resize 16:9

Managing Workflows

montaj workflow list                    # list all available workflows
montaj workflow new <name>              # scaffold a new workflow file
montaj workflow edit <name>             # open in the node graph UI
montaj workflow run <name> ./clips --prompt "..."  # run a specific workflow

All workflow files are equal — fork any of them, save under a new name, and it becomes available immediately.
