Core Concepts
Architecture, project.json schema, trim specs, and workflows — the foundational concepts behind Montaj.
Architecture
Montaj is a video editing tool harness that mounts on top of your existing agent framework. It is not an agent — it is the toolkit the agent uses. You bring Claude, Cursor, or any agent; Montaj gives it the tools to edit video.
System Overview
┌──────────────────────────────────────────────────────────────┐
│                        LOCAL UI (ui/)                        │
│                    browser → montaj serve                    │
│                                                              │
│  ┌──────────────┐                    ┌──────────────────┐    │
│  │  1. UPLOAD   │                    │   3. REVIEW      │    │
│  │  drop clips  │                    │   timeline       │    │
│  │  write prompt│                    │   preview player │    │
│  │  POST /run   │                    │   caption editor │    │
│  └──────┬───────┘                    └────────┬─────────┘    │
│         │          ┌──────────────┐           │              │
│         │          │ 2. LIVE VIEW │           │              │
│         │          │  SSE stream  │───────────┘              │
│         │          └──────┬───────┘                          │
└─────────┼─────────────────┼──────────────────────────────────┘
          │                 │
          ▼                 │
┌───────────────────────────┴──────────────────────────────────┐
│                         montaj serve                         │
│                   local HTTP + SSE server                    │
│                                                              │
│  POST /api/run     → creates project.json [pending]          │
│  GET /api/projects → list projects; ?status=pending          │
│  file watcher      → detects project.json writes → SSE       │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│                       AGENT (external)                       │
│                     Claude, Cursor, etc.                     │
│                                                              │
│   reads project.json [pending]                               │
│   reads workflows/<name>.json                                │
│   calls steps as tools at its own discretion                 │
│   writes project.json as work progresses → file watcher → SSE│
│   marks [draft] when done                                    │
└──────────────────────────────────────────────────────────────┘
                                │
                                ▼
                   ┌────────────────────────┐
                   │   human review (UI)    │
                   │   optional tweaks      │
                   └────────────┬───────────┘
                                │
                                ▼
                   ┌────────────────────────┐
                   │   RENDER PASS          │
                   │   React + Puppeteer    │
                   │   + ffmpeg             │
                   └────────────┬───────────┘
                                │
                                ▼
                            final MP4
Agent Interfaces
Montaj exposes three interfaces for agents to call steps. All are optional. All wrap the same CLI executables.
CLI
The agent runs montaj commands directly via shell access:
montaj trim clip.mp4 --start 2.5 --end 8.3
montaj transcribe clip.mp4 --model base.en
montaj resize clip.mp4 --ratio 9:16
Works with any agent that has shell access: Claude Code, Cursor, or any framework that can execute shell commands.
MCP
Montaj runs as a local MCP server (montaj mcp), started automatically by the MCP client. The agent calls steps as native tools — no shell access required.
{
  "mcpServers": {
    "montaj": { "command": "montaj", "args": ["mcp"] }
  }
}
New steps are picked up automatically: adding steps/my-step.py + .json makes the step available as an MCP tool with no extra configuration.
HTTP API
montaj serve exposes a step execution API alongside the browser UI:
POST /api/steps/trim        body: { input: "clip.mp4", start: 2.5, end: 8.3 }
POST /api/steps/transcribe  body: { input: "clip.mp4", model: "medium.en" }
GET  /api/steps             returns: list of available steps with schemas
All API routes are namespaced under /api/ so they never collide with React Router paths.
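As a rough illustration, an agent with HTTP access could build a call to the trim step from Python using only the standard library. The base URL below is an assumption (the port montaj serve listens on is not stated here), and the response shape is not specified either, so this sketch only constructs the request and leaves sending it to the caller. The helper name build_step_request is hypothetical.

```python
import json
import urllib.request

# Assumption: montaj serve is reachable here; substitute the real host and port.
BASE_URL = "http://localhost:3000"


def build_step_request(step: str, params: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /api/steps/<step> request carrying the params as a JSON body."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/steps/{step}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_step_request("trim", {"input": "clip.mp4", "start": 2.5, "end": 8.3})
# Sending is left to the caller, for example: urllib.request.urlopen(req)
```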
Summary
| Interface | Purpose | Used by |
|---|---|---|
| CLI | Step execution | Agents with shell access, humans |
| HTTP API | Step execution | Agents with HTTP access, the browser UI |
| MCP | Step execution | Claude Desktop / Claude Code (native tools) |
| montaj serve | Browser UI, SSE, project lifecycle, HTTP API | Humans, agents |
Directory Structure
Three scopes. Same format at every level:
~/Montaj/                    # workspace: all projects live here
  2024-11-01-my-ad/          # one directory per project
    project.json
    clip1_trimmed.mp4
    ...
~/.montaj/                   # user-global config + extensions
  steps/                     # user custom steps
  workflows/                 # user custom workflows
  config.json                # global defaults (workspaceDir, model, etc.)
  credentials.json           # API credentials (0600 permissions)
montaj/                      # built-in (ships with montaj)
  steps/                     # native steps
  connectors/                # external API wrappers
  workflows/                 # bundled workflows
The workspace location defaults to ~/Montaj. Override via ~/.montaj/config.json:
{ "workspaceDir": "/Volumes/FastSSD/Montaj" }
Step Resolution Order
When the agent or CLI calls a step, Montaj resolves it in this order:
- Project-local: ./steps/<name>
- User-global: ~/.montaj/steps/<name>
- Built-in: montaj/steps/<name>
A prefix in workflow files makes the scope explicit:
| Prefix | Resolves to |
|---|---|
| montaj/<name> | Built-in steps |
| user/<name> | ~/.montaj/steps/<name> |
| ./steps/<name> | Project-local steps |
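The lookup described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Montaj's actual resolver: the function name resolve_step is hypothetical, the three directories are passed in explicitly, and a step counts as present when a .py file with that name exists in the directory.

```python
from pathlib import Path
from typing import Optional


def resolve_step(name: str, project_steps: Path, user_steps: Path, builtin_steps: Path) -> Optional[Path]:
    """Return the script path for a step, or None when no scope provides it.

    An explicit prefix pins the scope; a bare name falls through
    project-local, then user-global, then built-in.
    """
    if name.startswith("montaj/"):
        candidates = [builtin_steps / name[len("montaj/"):]]
    elif name.startswith("user/"):
        candidates = [user_steps / name[len("user/"):]]
    elif name.startswith("./steps/"):
        candidates = [project_steps / name[len("./steps/"):]]
    else:
        candidates = [project_steps / name, user_steps / name, builtin_steps / name]
    for candidate in candidates:
        # Assumption: a step is present if <name>.py exists in that directory.
        script = candidate.with_suffix(".py")
        if script.exists():
            return script
    return None
```

With this shape, a user-global trim shadows the built-in one unless the workflow asks for montaj/trim explicitly.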
Output Convention
All steps follow a strict contract:
- stdout: the result (a file path or JSON). Nothing else.
- stderr: errors only, as {"error":"code","message":"detail"}
- exit 0 on success, exit 1 on failure
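A custom step that honors this contract might look roughly like the sketch below. The function name run_step and the error code bad_args are hypothetical, and the step does no real work: it only shows where the result, the error object, and the exit code go.

```python
import json
import sys
from typing import List


def run_step(argv: List[str]) -> int:
    """Hypothetical step body: result on stdout, JSON error on stderr, exit code returned."""
    if len(argv) != 1:
        # "bad_args" is an illustrative error code, not one defined by Montaj.
        error = {"error": "bad_args", "message": "expected exactly one input path"}
        print(json.dumps(error), file=sys.stderr)
        return 1
    # A real step would do its work here; this one just echoes the path back.
    print(argv[0])
    return 0


# As a script entry point, the return value becomes the process exit code:
#     sys.exit(run_step(sys.argv[1:]))
```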
This makes steps composable at the shell level:
FILE=$(montaj step rm_fillers --input clip.mp4 --model base.en)
FILE=$(montaj step trim --input "$FILE" --start 5 --end 90)
FILE=$(montaj step resize --input "$FILE" --ratio 9:16)
Versioning
Two layers:
- Git (milestone): montaj run initializes the workspace as a git repo. Commits are created automatically at state transitions (pending, draft, human save). montaj checkpoint "<name>" creates a named commit before risky operations.
- In-memory undo stack (fine-grained): the UI maintains an undo stack for the current review session. Every caption, overlay, or trim edit is undoable without touching disk.
Project Format
project.json is the single format that flows through the entire pipeline. One file, three states.
Lifecycle
| State | Who writes it | What's in it |
|---|---|---|
| pending | montaj run / montaj serve | Project ID, name, clip paths, editing prompt, workflow name. No agent work yet. |
| draft | Agent | Trim points, ordering, captions, overlays. Agent's complete edit. |
| final | Human (via UI) | Reviewed and tweaked. Ready to render. |
Schema
{
  "version": "0.2",
  "id": "<uuid>",
  "status": "pending",
  "workflow": "overlays",
  "editingPrompt": "tight cuts, remove filler, 9:16",
  "settings": {
    "resolution": [1080, 1920],
    "fps": 30
  },
  "tracks": [
    [
      {
        "id": "clip-0",
        "type": "video",
        "src": "/abs/path/clip.mp4",
        "start": 0.0,
        "end": 0.0
      }
    ]
  ],
  "assets": [],
  "audio": {}
}
Identity
Each project gets a UUID (id) at init time — this is the stable identifier. The workspace directory name (~/Montaj/<date>-<name>/) is human-readable but not the identity. The optional name field is a label; it does not need to be unique.
Tracks
tracks is a top-level array of arrays. tracks[0] is the primary video track. Overlay and caption tracks start at index 1.
Video Items
{
  "id": "clip-0",
  "type": "video",
  "src": "/abs/path/to/original.MOV",
  "inPoint": 2.1,
  "outPoint": 8.4,
  "start": 0.0,
  "end": 6.3
}
- src: absolute path to the real video file (never a spec JSON file)
- inPoint / outPoint: source file timestamps (what range of the original to use)
- start / end: position in the output timeline
Overlay Items
{
  "id": "ov-hook",
  "type": "overlay",
  "src": "/abs/path/to/project/overlays/hook.jsx",
  "props": { "text": "She built an AI employee" },
  "start": 0.0,
  "end": 3.0
}
| Field | Required | Description |
|---|---|---|
| type | yes | "overlay" for custom JSX, "image" for static images, "video" for video clips |
| src | yes | Absolute path to the JSX file |
| start / end | yes | Time window in output video (seconds) |
| props | no | Arbitrary data injected as the props global inside the component |
| offsetX / offsetY | no | Position offset as % of frame size |
| scale | no | Uniform scale multiplier |
Caption Track
{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}
Caption data (segments + word timestamps) is always inlined in the track, never a file pointer.
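To make the segment shape concrete, here is a minimal sketch that assembles one segment from a list of word timestamps. The helper name segment_from_words is hypothetical, and how words are grouped into segments is left to the caller; the sketch only derives text, start, and end from the words it is given.

```python
from typing import Dict, List


def segment_from_words(words: List[Dict]) -> Dict:
    """Build one caption segment from a non-empty, time-ordered word list."""
    if not words:
        raise ValueError("a segment needs at least one word")
    return {
        "text": " ".join(w["word"] for w in words),
        "start": words[0]["start"],   # segment starts with its first word
        "end": words[-1]["end"],      # and ends with its last word
        "words": words,               # word timestamps stay inline
    }
```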
Assets
Image files (logos, watermarks). Each has id, absolute src, type: "image", and optional name.
# CLI
montaj run ./clips --prompt "..." --assets logo.png
# HTTP
POST /api/run body: { "assets": ["/path/logo.png"] }
Audio
The audio field stores music and audio configuration for the project, including music tracks with inPoint/outPoint and ducking settings.
Live Updates
The agent writes project.json as it works — every write is picked up by the file watcher and pushed to the browser via SSE. The timeline builds live as the agent makes decisions.
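Because every write is observed, an agent may want to avoid exposing a half-written file to the watcher. One common approach, shown here as an assumption rather than something Montaj is documented to require, is to write to a temporary file in the same directory and rename it into place. The helper name write_project is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_project(path: Path, project: dict) -> None:
    """Write project.json via a temp file so readers never see partial JSON."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(project, handle, indent=2)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        os.unlink(tmp_name)
        raise
```

Whether the file watcher reacts to a rename the same way it reacts to an in-place write is worth verifying before relying on this pattern.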
State Transitions
montaj run       →  project.json [pending]
agent completes  →  project.json [draft]
human reviews    →  project.json [final]
montaj render    →  reads [final] → final.mp4
Trim Specs
The trim spec architecture is the core innovation in Montaj's editing pipeline. Editing steps output trim specs — not video files. A trim spec describes which ranges of the original source file to keep:
{
  "input": "/abs/path/original.MOV",
  "keeps": [[0.0, 5.3], [6.1, 12.4]]
}
Why This Matters
Before this architecture, every editing step re-encoded the full video. A five-clip workflow running silence removal + filler removal produced fifteen video encodes before the final concat. For 4K HEVC footage this caused multi-minute timeouts per step.
With trim specs, no video is decoded or encoded until concat. Editing steps work on audio only (for analysis) and pass timestamps forward. The entire set of cuts — silence boundaries, filler removals, take selections — is accumulated as trim spec refinements and applied in a single ffmpeg filter_complex pass at concat time.
Data Flow
waveform_trim(clip.MOV)
  → {input: "clip.MOV", keeps: [[2.1, 8.4], [9.0, 15.2]]}
transcribe({input: "clip.MOV", keeps: [...]})
  → extracts audio only at keep ranges
  → runs whisper on the joined audio
  → remaps word timestamps back to original timeline
rm_fillers({input: "clip.MOV", keeps: [...]})
  → extracts audio at keeps, detects fillers
  → subtracts filler timestamps from keeps
  → {input: "clip.MOV", keeps: [[2.1, 7.8], [9.2, 15.2]]}  ← refined
concat({inputs: [spec1.json, spec2.json, ...]})
  → ONE filter_complex per clip, applying all accumulated cuts
  → ONE encode pass total
  → final.mp4
Trim Spec Rules
- Editing steps always receive the original source file path, never a re-encoded intermediate
- Trim specs chain: each step refines the keeps list, preserving the original input path throughout
- concat and materialize_cut are the only steps that encode video: concat for the normal pipeline; materialize_cut only when a subsequent step (e.g. remove_bg) requires a physical video file
- Both encoders use input-level seeking (-ss / -t placed before -i), so ffmpeg seeks at the container level and only the requested segment is decoded
- HEVC source files are handled automatically at concat; no pre-conversion needed
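To illustrate input-level seeking combined with a single filter_complex, the sketch below builds an ffmpeg argument list for one trim spec. It is not the command Montaj actually runs: the function name concat_args is hypothetical, encoder options are omitted, and it assumes the source has exactly one video and one audio stream.

```python
from typing import List, Sequence


def concat_args(src: str, keeps: Sequence[Sequence[float]], out: str) -> List[str]:
    """Build an ffmpeg argv that joins the keep ranges of one source in one encode."""
    args = ["ffmpeg"]
    for start, end in keeps:
        # -ss/-t placed before -i: seek on the input, read only this range.
        args += ["-ss", f"{start:.3f}", "-t", f"{end - start:.3f}", "-i", src]
    # Each keep range is its own input; feed them to the concat filter in order.
    pads = "".join(f"[{i}:v][{i}:a]" for i in range(len(keeps)))
    graph = f"{pads}concat=n={len(keeps)}:v=1:a=1[v][a]"
    return args + ["-filter_complex", graph, "-map", "[v]", "-map", "[a]", out]
```

The list can be handed to subprocess.run as is, which avoids shell quoting problems with paths that contain spaces.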
Trim Spec Steps
These steps produce or refine trim specs:
| Step | Input | Output |
|---|---|---|
| waveform_trim | Video/audio file | New trim spec (silence removed) |
| rm_fillers | Trim spec | Refined trim spec (fillers removed) |
| rm_nonspeech | Trim spec | Refined trim spec (non-speech removed) |
| crop_spec | Trim spec | Cropped trim spec (virtual-timeline windows) |
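Refining a trim spec comes down to interval arithmetic on the keeps list. The sketch below shows the two operations the Data Flow example implies: subtracting detected ranges from keeps, and mapping a timestamp measured on the joined audio back to the source timeline. Both function names are hypothetical and the real steps may handle edge cases differently.

```python
from typing import List, Tuple

Range = Tuple[float, float]


def subtract(keeps: List[Range], cuts: List[Range]) -> List[Range]:
    """Remove every cut range from the keep ranges; both are in source time."""
    result = list(keeps)
    for cut_start, cut_end in cuts:
        next_result = []
        for start, end in result:
            if cut_end <= start or cut_start >= end:
                next_result.append((start, end))        # no overlap
                continue
            if start < cut_start:
                next_result.append((start, cut_start))  # part before the cut
            if cut_end < end:
                next_result.append((cut_end, end))      # part after the cut
        result = next_result
    return result


def to_source_time(t: float, keeps: List[Range]) -> float:
    """Map a timestamp on the joined audio back to the source timeline."""
    offset = 0.0
    for start, end in keeps:
        length = end - start
        if t <= offset + length:
            return start + (t - offset)
        offset += length
    raise ValueError("timestamp lies beyond the joined audio")
```

With keeps of (2.1, 8.4) and (9.0, 15.2), subtracting fillers at (7.8, 8.4) and (9.0, 9.2) yields the refined keeps from the Data Flow example. Filler times detected on the joined audio would first go through to_source_time.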
Trim Spec Consumers
These steps consume trim specs by encoding video:
| Step | What it does |
|---|---|
| concat | Joins clips and applies all trim specs in a single encode pass |
| materialize_cut | Encodes a trim spec to H.264 when a physical file is needed |
Important: The src Field
Any video clip item in project.json (in any track) must have src pointing to a real video file (.MOV, .mp4, etc.) — never a spec JSON file. For clips derived from trim specs, read spec["input"] for src, and spec["keeps"] to derive inPoint/outPoint.
Multi-keep specs expand into multiple clip items, each with their own inPoint/outPoint.
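The expansion from a multi-keep spec to clip items can be sketched as follows. The function name spec_to_clips and the clip-N id scheme are hypothetical, and the sketch assumes clips are laid back to back on the output timeline, so each item's end minus start equals its outPoint minus inPoint.

```python
from typing import Dict, List


def spec_to_clips(spec: Dict, timeline_start: float = 0.0) -> List[Dict]:
    """Expand one trim spec into video items, one per keep range."""
    clips = []
    cursor = timeline_start
    for index, (in_point, out_point) in enumerate(spec["keeps"]):
        duration = out_point - in_point
        clips.append({
            "id": f"clip-{index}",   # hypothetical id scheme
            "type": "video",
            "src": spec["input"],    # the real video file, never the spec
            "inPoint": in_point,
            "outPoint": out_point,
            "start": cursor,
            "end": cursor + duration,
        })
        cursor += duration           # next clip starts where this one ends
    return clips
```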
Workflows
Workflows are suggested editing plans — which steps to use and their default params. The agent reads the plan, reads the prompt, and decides the actual execution.
A workflow is not a deterministic execution pipeline. The agent may reorder steps, adjust params, skip steps that don't apply, or add steps not in the list — whatever the prompt and content call for.
Workflow File Format
{
"name": "overlays",
"description": "Multi-clip edit — silence trim, transcribe, select best takes, remove fillers, concat, caption, overlays, resize to 9:16.",
"steps": [
{ "id": "probe", "uses": "montaj/probe" },
{ "id": "snapshot", "uses": "montaj/snapshot" },
{ "id": "silence", "uses": "montaj/waveform_trim", "foreach": "clips", "params": { "threshold": "-30", "min-silence": 0.3 } },
{ "id": "transcribe", "uses": "montaj/transcribe", "foreach": "clips", "needs": ["silence"], "params": { "model": "base.en" } },
{ "id": "select_takes", "uses": "montaj/select_takes", "needs": ["transcribe"] },
{ "id": "fillers", "uses": "montaj/rm_fillers", "foreach": "clips", "needs": ["select_takes"], "params": { "model": "base.en" } },
{ "id": "concat", "uses": "montaj/concat", "needs": ["fillers"] },
{ "id": "transcribe_concat", "uses": "montaj/transcribe", "needs": ["concat"], "params": { "model": "base.en" } },
{ "id": "caption", "uses": "montaj/caption", "needs": ["transcribe_concat"], "params": { "style": "word-by-word" } },
{ "id": "overlays", "uses": "montaj/overlay", "needs": ["caption"], "params": { "style": "auto" } },
{ "id": "resize", "uses": "montaj/resize", "needs": ["overlays"], "params": { "ratio": "9:16" } }
]
}
Step Fields
| Field | Required | Description |
|---|---|---|
| id | yes | Unique identifier within the workflow; used in needs references |
| uses | yes | Step to run: montaj/<name>, user/<name>, or ./steps/<name>.py |
| params | no | Default param overrides. Only include values that differ from step defaults. |
| needs | no | Step IDs that must complete before this step starts. Omit when there are no deps. |
| foreach | no | "clips": run this step once per clip in the project, in parallel |
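A quick consistency check over these fields could look like the sketch below. It is illustrative only: the function name workflow_problems is hypothetical, and it checks just the rules stated in the table (required fields, unique ids, needs pointing at known ids).

```python
from typing import Dict, List


def workflow_problems(workflow: Dict) -> List[str]:
    """Return human-readable problems found in a workflow's steps; empty means OK."""
    problems = []
    seen = set()
    steps = workflow.get("steps", [])
    for step in steps:
        for field in ("id", "uses"):
            if field not in step:
                problems.append(f"missing required field '{field}' in {step}")
        step_id = step.get("id")
        if step_id is None:
            continue
        if step_id in seen:
            problems.append(f"duplicate step id '{step_id}'")
        seen.add(step_id)
    # Second pass: every needs entry must name a step id seen above.
    for step in steps:
        for dep in step.get("needs", []):
            if dep not in seen:
                problems.append(f"step '{step.get('id')}' needs unknown step '{dep}'")
    return problems
```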
Parallel Execution
needs is the dependency graph. The agent fires all steps with no unmet needs simultaneously, then re-evaluates after each completes. Steps in the same "wave" run in parallel.
Example execution waves for the overlays workflow:
Wave 1 (parallel): probe, snapshot, silence x N (foreach clips)
Wave 2 (parallel): transcribe x N (foreach clips — needs silence)
Wave 3: select_takes (needs transcribe)
Wave 4 (parallel): fillers x N (foreach clips — needs select_takes)
Wave 5: concat (needs fillers)
Wave 6: transcribe_concat (needs concat)
Wave 7: caption (needs transcribe_concat)
Wave 8: overlays (needs caption)
Wave 9: resize (needs overlays)
foreach: "clips" fans out a step across all clips in the project. The agent runs them as parallel tool calls and collects the outputs before proceeding to steps that need them.
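The wave grouping above follows mechanically from the needs graph. A minimal sketch, assuming each step is a dict with an id and an optional needs list (the function name execution_waves is hypothetical, and foreach fan-out is treated as a single step id):

```python
from typing import Dict, List


def execution_waves(steps: List[Dict]) -> List[List[str]]:
    """Group step ids into waves; a step is ready once all of its needs are done."""
    pending = {step["id"]: set(step.get("needs", [])) for step in steps}
    done: set = set()
    waves = []
    while pending:
        ready = [step_id for step_id, needs in pending.items() if needs <= done]
        if not ready:
            raise ValueError("cycle or unknown dependency in needs")
        waves.append(ready)
        done.update(ready)
        for step_id in ready:
            del pending[step_id]
    return waves
```

Run against the overlays workflow, this yields nine waves, starting with probe, snapshot, and silence together. Waves are a simplification: an agent that re-evaluates after each individual completion can start a step before the rest of its wave has finished.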
Bundled Workflows
| Workflow | Description |
|---|---|
| overlays | Multi-clip edit: silence trim, transcribe, select best takes, remove fillers, overlays. No captions. Default when no --workflow is specified. |
| short_captions | Multi-clip edit: same as overlays plus caption and resize 9:16. |
| clean_cut | Trim and clean only: silence, transcribe, select best takes, remove fillers. No captions, overlays, or resize. |
| animations | Animation-only: no source footage required. Agent builds entirely from overlays and animation sections. |
| explainer | Multi-clip edit with animation sections: same as overlays plus animation sections. No captions. |
| floating_head | Talking-head presenter over a custom background: trim, materialize, RVM background removal. Background in tracks[0], presenter in tracks[1]. |
| lyrics_video | Audio + lyrics aligned with word-synced text video. |
| ai_video | Director agent writes a storyboard from your prompt and references, you approve, scenes are generated via Kling. |
Agent-Authored Steps
Two step names in workflows are not CLI executables — they are tasks the agent performs itself:
montaj/select_takes
The agent reads transcripts from all clips, groups segments by content similarity (repeated takes), selects the best take of each section, and trims each clip accordingly using montaj/trim.
montaj/overlay
The agent writes custom JSX overlay files and adds them to tracks in project.json. There are no built-in overlay templates — every overlay is a custom React component the agent authors.
Deviation Rules
The agent should follow the assigned workflow and deviate only when the prompt explicitly requires it:
- "no captions" → skip caption
- "keep it raw" → skip rm_fillers, waveform_trim
- "YouTube format" → resize 16:9
Managing Workflows
montaj workflow list                                # list all available workflows
montaj workflow new <name>                          # scaffold a new workflow file
montaj workflow edit <name>                         # open in the node graph UI
montaj workflow run <name> ./clips --prompt "..."   # run a specific workflow
All workflow files are equal: fork any of them, save under a new name, and it becomes available immediately.