Core Concepts
Architecture, project.json schema, trim specs, and workflows — the foundational concepts behind Montaj.
Architecture
Montaj is a video editing tool harness that mounts on top of your existing agent framework. It is not an agent — it is the toolkit the agent uses. You bring Claude, Cursor, or any agent; Montaj gives it the tools to edit video.
System Overview
┌──────────────────────────────────────────────────────────────┐
│ LOCAL UI (ui/) │
│ browser → montaj serve │
│ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ 1. UPLOAD │ │ 3. REVIEW │ │
│ │ drop clips │ │ timeline │ │
│ │ write prompt│ │ preview player │ │
│ │ POST /run │ │ caption editor │ │
│ └──────┬───────┘ └────────┬─────────┘ │
│ │ ┌──────────────┐ │ │
│ │ │ 2. LIVE VIEW │ │ │
│ │ │ SSE stream │───────────┘ │
│ │ └──────┬───────┘ │
└─────────┼─────────────────┼──────────────────────────────────┘
│ │
▼ │
┌───────────────────────────┴──────────────────────────────────┐
│ montaj serve │
│ local HTTP + SSE server │
│ │
│ POST /api/run → creates project.json [pending] │
│ GET /api/projects → list projects; ?status=pending │
│ file watcher → detects project.json writes → SSE │
└─────────────────────────┬────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ AGENT (external) │
│ Claude, Cursor, etc. │
│ │
│ reads project.json [pending] │
│ reads workflows/<name>.json │
│ calls steps as tools at its own discretion │
│ writes project.json as work progresses → file watcher → SSE│
│ marks [draft] when done │
└──────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────┐
│ human review (UI) │
│ optional tweaks │
└────────────┬───────────┘
│
▼
┌────────────────────────┐
│ RENDER PASS │
│ React + Puppeteer │
│ + ffmpeg │
└────────────┬───────────┘
│
▼
final MP4
Agent Interfaces
Montaj exposes three interfaces for agents to call steps. All are optional. All wrap the same CLI executables.
CLI
The agent runs montaj commands directly via shell access:
montaj trim clip.mp4 --start 2.5 --end 8.3
montaj transcribe clip.mp4 --model base.en
montaj resize clip.mp4 --ratio 9:16
Works with any agent that has shell access: Claude Code, Cursor, or any framework that can execute shell commands.
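For an agent framework that drives the shell from Python, the CLI calls above might be wrapped as in the minimal sketch below. It assumes `montaj` is on `PATH` and that the `trim` command takes the arguments shown in the example; `build_trim_command` and `run_command` are hypothetical helper names, not part of Montaj.

```python
import subprocess


def build_trim_command(clip: str, start: float, end: float) -> list[str]:
    """Build the argv for the `montaj trim` example shown above."""
    return ["montaj", "trim", clip, "--start", str(start), "--end", str(end)]


def run_command(argv: list[str]) -> str:
    """Run a montaj command and return its stdout.

    Assumes montaj is installed; raises CalledProcessError on a non-zero exit.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Passing the arguments as a list rather than a shell string avoids quoting problems with clip paths that contain spaces.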
MCP
Montaj runs as a local MCP server (montaj mcp), started automatically by the MCP client. The agent calls steps as native tools — no shell access required.
{
"mcpServers": {
"montaj": { "command": "montaj", "args": ["mcp"] }
}
}
New steps are picked up automatically: adding steps/my-step.py and its .json file makes the step available as an MCP tool with no extra configuration.
HTTP API
montaj serve exposes a step execution API alongside the browser UI:
POST /api/steps/trim body: { input: "clip.mp4", start: 2.5, end: 8.3 }
POST /api/steps/transcribe body: { input: "clip.mp4", model: "medium.en" }
GET /api/steps returns: list of available steps with schemas
All API routes are namespaced under /api/ so they never collide with React Router paths.
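A step call over HTTP might be prepared like the sketch below, using only the Python standard library. The base URL is an assumption (the address and port of `montaj serve` are not stated here), as is the JSON content type; `build_step_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_step_request(base_url: str, step: str, body: dict) -> urllib.request.Request:
    """Prepare a POST to /api/steps/<step> with a JSON body.

    base_url is whatever address `montaj serve` is listening on (assumed).
    """
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/steps/{step}",
        data=data,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )
```

Sending the prepared request is then a matter of `urllib.request.urlopen(request)` against a running server.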
Summary
| Interface | Purpose | Used by |
|---|---|---|
| CLI | Step execution | Agents with shell access, humans |
| HTTP API | Step execution | Agents with HTTP access, the browser UI |
| MCP | Step execution | Claude Desktop / Claude Code (native tools) |
| montaj serve | Browser UI, SSE, project lifecycle, HTTP API | Humans, agents |
Directory Structure
Three scopes. Same format at every level:
~/Montaj/ # workspace — all projects live here
2024-11-01-my-ad/ # one directory per project
project.json
clip1_trimmed.mp4
...
~/.montaj/ # user-global config + extensions
steps/ # user custom steps
workflows/ # user custom workflows
config.json # global defaults (workspaceDir, model, etc.)
credentials.json # API credentials (0600 permissions)
montaj/ # built-in (ships with montaj)
steps/ # native steps
connectors/ # external API wrappers
workflows/ # bundled workflows
The workspace location defaults to ~/Montaj. Override via the MONTAJ_WORKSPACE_DIR env var or ~/.montaj/config.json:
export MONTAJ_WORKSPACE_DIR=/Volumes/FastSSD/Montaj
{ "workspaceDir": "/Volumes/FastSSD/Montaj" }
Precedence: MONTAJ_WORKSPACE_DIR > ~/.montaj/config.json > default. The env var is useful for containerized deployments where dropping a config file is awkward.
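The precedence rule can be sketched in a few lines of Python. This is an illustration of the stated order, not Montaj's own resolver; `resolve_workspace_dir` and its parameters are hypothetical names.

```python
import json
import os
from pathlib import Path


def resolve_workspace_dir(env=None, config_path=None) -> Path:
    """Return the workspace directory: env var, then config.json, then ~/Montaj."""
    env = os.environ if env is None else env
    if config_path is None:
        config_path = Path.home() / ".montaj" / "config.json"
    config_path = Path(config_path)

    # 1. The environment variable wins when it is set and non-empty.
    from_env = env.get("MONTAJ_WORKSPACE_DIR")
    if from_env:
        return Path(from_env)

    # 2. Otherwise use workspaceDir from the config file, if present.
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if config.get("workspaceDir"):
            return Path(config["workspaceDir"])

    # 3. Fall back to the documented default.
    return Path.home() / "Montaj"
```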
Step Resolution Order
When the agent or CLI calls a step, Montaj resolves it in this order:
- Project-local: ./steps/<name>
- User-global: ~/.montaj/steps/<name>
- Built-in: montaj/steps/<name>
Prefix in workflow files makes scope explicit:
| Prefix | Resolves to |
|---|---|
| montaj/<name> | Built-in steps |
| user/<name> | ~/.montaj/steps/<name> |
| ./steps/<name> | Project-local steps |
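The unprefixed lookup order could be implemented roughly as follows. This sketch assumes each step is a `<name>.py` file inside a `steps/` directory at each scope, and it does not handle the explicit prefixes from the table; `resolve_step` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_step(name: str, project_dir, user_dir, builtin_dir) -> Optional[Path]:
    """Return the first matching step file, searching project, user, then built-in."""
    search_order = [
        Path(project_dir) / "steps",  # project-local
        Path(user_dir) / "steps",     # user-global, e.g. ~/.montaj
        Path(builtin_dir) / "steps",  # built-in, ships with montaj
    ]
    for steps_dir in search_order:
        candidate = steps_dir / f"{name}.py"  # assumed file layout
        if candidate.is_file():
            return candidate
    return None
```

Because the search stops at the first hit, a project-local step with the same name shadows a user-global or built-in one.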
Output Convention
All steps follow a strict contract:
- stdout: the result (file path or JSON). Nothing else.
- stderr: errors only, as {"error":"code","message":"detail"}
- exit 0 on success, exit 1 on failure
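A custom Python step might honour this contract with a small wrapper like the one below. It is a sketch: `run_step` is a hypothetical helper, and using the exception class name as the error code is an assumption, since the set of valid codes is not listed here.

```python
import json
import sys


def run_step(work, stdout=sys.stdout, stderr=sys.stderr) -> int:
    """Run `work()` and report per the contract; returns the exit code."""
    try:
        result = work()
    except Exception as exc:
        # Errors go to stderr as a single JSON object; nothing is written to stdout.
        json.dump({"error": type(exc).__name__, "message": str(exc)}, stderr)
        stderr.write("\n")
        return 1
    # The result (a path string, or JSON for structured output) is the only stdout content.
    stdout.write(result if isinstance(result, str) else json.dumps(result))
    stdout.write("\n")
    return 0
```

A step script would end with something like `sys.exit(run_step(main))`, keeping any progress logging out of stdout.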
This makes steps composable at the shell level:
FILE=$(montaj step rm_fillers --input clip.mp4 --model base.en)
FILE=$(montaj step trim --input "$FILE" --start 5 --end 90)
FILE=$(montaj step resize --input "$FILE" --ratio 9:16)
Versioning
Two layers:
- Git (milestone): montaj run initializes the workspace as a git repo. Commits are created automatically at state transitions (pending, draft, human save). montaj checkpoint "<name>" creates a named commit before risky operations.
- In-memory undo stack (fine-grained): the UI maintains an undo stack for the current review session. Every caption, overlay, or trim edit is undoable without touching disk.
Project Format
project.json is the single format that flows through the entire pipeline. One file, three states.
Project Types
| Type | Renders to | Notes |
|---|---|---|
| editing | MP4 | Default. Trim, cut, transcribe, composite against source clips. |
| music_video | MP4 | Lyrics-synced overlays atop a background video or color. |
| ai_video | MP4 | Storyboard-driven scene generation via Kling, with music + voiceover. |
| carousel | N PNG slides | Slide-based design surface for Instagram / TikTok photo posts. No time axis. See Image Carousel. |
The schema below describes video projects (editing, music_video, ai_video). Carousel projects share the same header (id, status, projectType, name, workflow, editingPrompt, settings.resolution, assets) but replace tracks/audio/fps with a flat slides[] array — see the carousel page for details.
Lifecycle
| State | Who writes it | What's in it |
|---|---|---|
| pending | montaj run / montaj serve | Project ID, name, clip paths, editing prompt, workflow name. No agent work yet. |
| draft | Agent | Trim points, ordering, captions, overlays. Agent's complete edit. |
| final | Human (via UI) | Reviewed and tweaked. Ready to render. |
Schema
{
"version": "0.2",
"id": "<uuid>",
"status": "pending",
"workflow": "overlays",
"editingPrompt": "tight cuts, remove filler, 9:16",
"settings": {
"resolution": [1080, 1920],
"fps": 30
},
"tracks": [
[
{
"id": "clip-0",
"type": "video",
"src": "/abs/path/clip.mp4",
"start": 0.0,
"end": 0.0
}
]
],
"assets": [],
"audio": {}
}
Identity
Each project gets a UUID (id) at init time — this is the stable identifier. The workspace directory name (~/Montaj/<date>-<name>/) is human-readable but not the identity. The optional name field is a label; it does not need to be unique.
Tracks
tracks is a top-level array of arrays. tracks[0] is the primary video track. Overlay and caption tracks start at index 1.
Video Items
{
"id": "clip-0",
"type": "video",
"src": "/abs/path/to/original.MOV",
"inPoint": 2.1,
"outPoint": 8.4,
"start": 0.0,
"end": 6.3
}
- src: absolute path to the real video file (never a spec JSON file)
- inPoint / outPoint: source file timestamps (what range of the original to use)
- start / end: position in the output timeline
Overlay Items
{
"id": "ov-hook",
"type": "overlay",
"src": "/abs/path/to/project/overlays/hook.jsx",
"props": { "text": "She built an AI employee" },
"start": 0.0,
"end": 3.0
}
| Field | Required | Description |
|---|---|---|
| type | yes | "overlay" for custom JSX, "image" for static images, "video" for video clips |
| src | yes | Absolute path to the JSX file |
| start / end | yes | Time window in output video (seconds) |
| props | no | Arbitrary data injected as the props global inside the component |
| offsetX / offsetY | no | Position offset as % of frame size |
| scale | no | Uniform scale multiplier |
Caption Track
{
"id": "captions",
"type": "caption",
"style": "word-by-word",
"segments": [
{
"text": "Hello world",
"start": 0.0,
"end": 1.2,
"words": [
{ "word": "Hello", "start": 0.0, "end": 0.5 },
{ "word": "world", "start": 0.5, "end": 1.2 }
]
}
]
}
Caption data (segments + word timestamps) is always inlined in the track, never a file pointer.
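The relationship between a segment and its words can be shown with a short sketch that builds segments from a flat word list. Grouping by a fixed word count is an assumption made for illustration; how Montaj or the agent actually chooses segment boundaries is not specified here, and `words_to_segments` is a hypothetical name.

```python
def words_to_segments(words, max_words=4):
    """Group word timestamps into caption segments of at most `max_words` words."""
    segments = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        segments.append({
            "text": " ".join(w["word"] for w in chunk),
            "start": chunk[0]["start"],  # segment starts with its first word
            "end": chunk[-1]["end"],     # and ends with its last word
            "words": chunk,
        })
    return segments
```

Each segment's start and end are derived from its words, so the segment window and the word timings cannot drift apart.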
Assets
Image files (logos, watermarks). Each has id, absolute src, type: "image", and optional name.
# CLI
montaj run ./clips --prompt "..." --assets logo.png
# HTTP
POST /api/run body: { "assets": ["/path/logo.png"] }
Audio
The audio field stores music and audio configuration for the project, including music tracks with inPoint/outPoint and ducking settings.
Live Updates
The agent writes project.json as it works — every write is picked up by the file watcher and pushed to the browser via SSE. The timeline builds live as the agent makes decisions.
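Since every write is observed, an agent writing project.json from Python may want each write to be all-or-nothing. The sketch below writes to a temporary file and renames it into place; whether Montaj's file watcher requires this, or reacts to the temporary file, is not stated here, so treat it as a defensive pattern rather than a documented requirement. `write_project` is a hypothetical helper.

```python
import json
import os
import tempfile


def write_project(path: str, project: dict) -> None:
    """Write project.json atomically so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(project, handle, indent=2)
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file if anything failed
        raise
```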
State Transitions
montaj run → project.json [pending]
agent completes → project.json [draft]
human reviews → project.json [final]
montaj render → reads [final] → final.mp4
Trim Specs
The trim spec architecture is the core innovation in Montaj's editing pipeline. Editing steps output trim specs — not video files. A trim spec describes which ranges of the original source file to keep:
{
"input": "/abs/path/original.MOV",
"keeps": [[0.0, 5.3], [6.1, 12.4]]
}
Why This Matters
Before this architecture, every editing step re-encoded the full video. A five-clip workflow running silence removal + filler removal produced fifteen video encodes before the final concat. For 4K HEVC footage this caused multi-minute timeouts per step.
With trim specs, no video is decoded or encoded until materialize_cut. Editing steps work on audio only (for analysis) and pass timestamps forward. The entire set of cuts — silence boundaries, filler removals, take selections — is accumulated as trim spec refinements and applied in a single ffmpeg filter_complex pass at encode time.
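A trim spec refinement amounts to interval subtraction on the keeps list. The sketch below shows one way to do it; `subtract_ranges` and `refine_spec` are hypothetical names, and the actual steps may implement this differently.

```python
def subtract_ranges(keeps, removals):
    """Remove each [start, end] in `removals` from the keep ranges."""
    result = [list(k) for k in keeps]
    for cut_start, cut_end in removals:
        next_result = []
        for keep_start, keep_end in result:
            if cut_end <= keep_start or cut_start >= keep_end:
                next_result.append([keep_start, keep_end])  # no overlap, keep as-is
                continue
            if keep_start < cut_start:
                next_result.append([keep_start, cut_start])  # part before the cut
            if cut_end < keep_end:
                next_result.append([cut_end, keep_end])      # part after the cut
        result = next_result
    return result


def refine_spec(spec, removals):
    """Return a new trim spec with the same input path and narrowed keeps."""
    return {"input": spec["input"], "keeps": subtract_ranges(spec["keeps"], removals)}
```

Only timestamps change between steps; the `input` path is carried through untouched, which is what lets the final encode read from the original file.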
Data Flow
waveform_trim(clip.MOV)
→ {input: "clip.MOV", keeps: [[2.1, 8.4], [9.0, 15.2]]}
transcribe({input: "clip.MOV", keeps: [...]})
→ extracts audio only at keep ranges
→ runs whisper on the joined audio
→ remaps word timestamps back to original timeline
rm_fillers({input: "clip.MOV", keeps: [...]})
→ extracts audio at keeps, detects fillers
→ subtracts filler timestamps from keeps
→ {input: "clip.MOV", keeps: [[2.1, 7.8], [9.2, 15.2]]} ← refined
materialize_cut({inputs: [spec1.json, spec2.json, ...]})
→ ONE filter_complex per clip, applying all accumulated cuts
→ ONE encode pass total
→ final.mp4
Trim Spec Rules
- Editing steps always receive the original source file path, never a re-encoded intermediate
- Trim specs chain: each step refines the keeps list, preserving the original input path throughout
- materialize_cut is the only step that encodes video. It handles both the normal pipeline encode and cases where a subsequent step (e.g. remove_bg) requires a physical video file
- Uses input-level seeking (-ss / -t placed before -i): ffmpeg seeks at the container level so only the requested segment is decoded
- HEVC source files are handled automatically; no pre-conversion needed
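To make the input-level seeking rule concrete, here is a sketch that turns one trim spec into an ffmpeg argument list, opening the source once per keep range and joining the pieces with the concat filter. This is not Montaj's actual command: the filter graph, the `libx264` choice, and the assumption that every source has an audio stream are illustrative, and `build_cut_command` is a hypothetical name.

```python
def build_cut_command(spec, output):
    """Build an ffmpeg argv that applies all keeps of one spec in a single encode."""
    argv = ["ffmpeg", "-y"]
    for start, end in spec["keeps"]:
        duration = round(end - start, 3)
        # -ss / -t placed before -i apply to that input, so ffmpeg seeks
        # before decoding instead of decoding and discarding frames.
        argv += ["-ss", str(start), "-t", str(duration), "-i", spec["input"]]
    count = len(spec["keeps"])
    streams = "".join(f"[{i}:v][{i}:a]" for i in range(count))
    graph = f"{streams}concat=n={count}:v=1:a=1[v][a]"
    argv += ["-filter_complex", graph, "-map", "[v]", "-map", "[a]"]
    argv += ["-c:v", "libx264", output]  # H.264 output, assumed encoder
    return argv
```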
Trim Spec Steps
These steps produce or refine trim specs:
| Step | Input | Output |
|---|---|---|
| waveform_trim | Video/audio file | New trim spec (silence removed) |
| rm_fillers | Trim spec | Refined trim spec (fillers removed) |
| rm_nonspeech | Trim spec | Refined trim spec (non-speech removed) |
| crop_spec | Trim spec | Cropped trim spec (virtual-timeline windows) |
Trim Spec Consumers
These steps consume trim specs by encoding video:
| Step | What it does |
|---|---|
| materialize_cut | Encodes a trim spec to H.264, applying all accumulated cuts in a single ffmpeg pass |
Important: The src Field
Any video clip item in project.json (in any track) must have src pointing to a real video file (.MOV, .mp4, etc.) — never a spec JSON file. For clips derived from trim specs, read spec["input"] for src, and spec["keeps"] to derive inPoint/outPoint.
Multi-keep specs expand into multiple clip items, each with their own inPoint/outPoint.
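The expansion from a trim spec to clip items can be sketched as below. The clip fields follow the Video Items schema; placing the clips back to back from a starting offset, the id scheme, and rounding to milliseconds are assumptions, and `spec_to_clips` is a hypothetical name.

```python
def spec_to_clips(spec, id_prefix="clip", timeline_start=0.0):
    """Expand a trim spec into one video item per keep range, laid end to end."""
    clips = []
    cursor = timeline_start
    for index, (in_point, out_point) in enumerate(spec["keeps"]):
        duration = out_point - in_point
        clips.append({
            "id": f"{id_prefix}-{index}",
            "type": "video",
            "src": spec["input"],   # always the real source file, never the spec
            "inPoint": in_point,    # range within the source file
            "outPoint": out_point,
            "start": round(cursor, 3),            # position in the output timeline
            "end": round(cursor + duration, 3),
        })
        cursor += duration
    return clips
```

For a keep of [2.1, 8.4] this yields a clip with start 0.0 and end 6.3, matching the video item example earlier in this page.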
Workflows
Workflows are suggested editing plans — which steps to use and their default params. The agent reads the plan, reads the prompt, and decides the actual execution.
A workflow is not a deterministic execution pipeline. The agent may reorder steps, adjust params, skip steps that don't apply, or add steps not in the list — whatever the prompt and content call for.
Workflow File Format
{
"name": "overlays",
"description": "Multi-clip edit — silence trim, transcribe, select best takes, remove fillers, concat, caption, overlays, resize to 9:16.",
"steps": [
{ "id": "probe", "uses": "montaj/probe" },
{ "id": "snapshot", "uses": "montaj/snapshot" },
{ "id": "silence", "uses": "montaj/waveform_trim", "foreach": "clips", "params": { "threshold": "-30", "min-silence": 0.3 } },
{ "id": "transcribe", "uses": "montaj/transcribe", "foreach": "clips", "needs": ["silence"], "params": { "model": "base.en" } },
{ "id": "select_takes", "uses": "montaj/select_takes", "needs": ["transcribe"] },
{ "id": "fillers", "uses": "montaj/rm_fillers", "foreach": "clips", "needs": ["select_takes"], "params": { "model": "base.en" } },
{ "id": "transcribe_final", "uses": "montaj/transcribe", "needs": ["fillers"], "params": { "model": "base.en" } },
{ "id": "caption", "uses": "montaj/caption", "needs": ["transcribe_final"], "params": { "style": "word-by-word" } },
{ "id": "overlays", "uses": "montaj/overlay", "needs": ["caption"], "params": { "style": "auto" } },
{ "id": "resize", "uses": "montaj/resize", "needs": ["overlays"], "params": { "ratio": "9:16" } }
]
}
Step Fields
| Field | Required | Description |
|---|---|---|
| id | yes | Unique identifier within the workflow, used in needs references |
| uses | yes | Step to run: montaj/<name>, user/<name>, or ./steps/<name>.py |
| params | no | Default param overrides. Only include values that differ from step defaults |
| needs | no | Step IDs that must complete before this step starts. Omit when there are no deps. |
| foreach | no | Dotted path into the project. Runs once per entry in that collection (e.g. "clips", "storyboard.scenes") |
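Resolving a dotted foreach path is a simple nested lookup, sketched below. It assumes the collection lives in the project dict under exactly those keys; how Montaj maps a value such as "clips" onto the project is not shown here, and `resolve_foreach` is a hypothetical name.

```python
def resolve_foreach(project: dict, path: str) -> list:
    """Follow a dotted path such as "storyboard.scenes" and return the collection."""
    node = project
    for key in path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    if not isinstance(node, list):
        raise TypeError(f"foreach path {path!r} does not point at a list")
    return node
```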
Parallel Execution
needs is the dependency graph. The agent fires all steps with no unmet needs simultaneously, then re-evaluates after each completes. Steps in the same "wave" run in parallel.
Example execution waves for the overlays workflow:
Wave 1 (parallel): probe, snapshot, silence x N (foreach clips)
Wave 2 (parallel): transcribe x N (foreach clips — needs silence)
Wave 3: select_takes (needs transcribe)
Wave 4 (parallel): fillers x N (foreach clips — needs select_takes)
Wave 5: transcribe_final (needs fillers)
Wave 6: caption (needs transcribe_final)
Wave 7: overlays (needs caption)
Wave 8: resize (needs overlays)
foreach: <path> fans out a step across all entries in a dotted-path collection on the project. Common values: "clips" (all project clips), "storyboard.scenes", "storyboard.imageRefs", "storyboard.styleRefs" (used by the ai_video workflow). The agent runs them as parallel tool calls and collects the outputs before proceeding to steps that need them.
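The waves listed above fall out of a level-by-level walk of the needs graph. The sketch below computes them from a workflow's steps array; it is a simplification, since the agent re-evaluates after each individual step completes rather than waiting for a whole wave, and `execution_waves` is a hypothetical name.

```python
def execution_waves(steps):
    """Group step ids into waves: each wave holds steps whose needs are all done."""
    remaining = {s["id"]: set(s.get("needs", [])) for s in steps}
    done = set()
    waves = []
    while remaining:
        # A step is ready once every id it needs has completed.
        ready = [sid for sid, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("cycle or unknown step id in needs")
        waves.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return waves
```

Steps with a foreach field appear once per wave here; the fan-out across clips happens inside that wave.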
Bundled Workflows
| Workflow | Description |
|---|---|
| overlays | Multi-clip edit: silence trim, transcribe, select best takes, remove fillers, overlays. No captions. Default when no --workflow is specified. |
| short_captions | Multi-clip edit: same as overlays plus caption and resize 9:16. |
| clean_cut | Trim and clean only: silence, transcribe, select best takes, remove fillers. No captions, overlays, or resize. |
| animations | Animation-only: no source footage required. Agent builds entirely from overlays and animation sections. |
| explainer | Multi-clip edit with animation sections: same as overlays plus animation sections. No captions. |
| floating_head | Talking-head presenter over a custom background: trim, materialize, RVM background removal. Background in tracks[0], presenter in tracks[1]. |
| lyrics_video | Audio + lyrics aligned with word-synced text video. |
| ai_video | Director agent writes a storyboard from your prompt and references (ai-video-plan skill), you approve, scenes are generated via Kling with music and voiceover (ai-video-generate skill). |
Agent-Authored Steps
Two step names in workflows are not CLI executables — they are tasks the agent performs itself:
montaj/select_takes
The agent reads transcripts from all clips, groups segments by content similarity (repeated takes), selects the best take of each section, and trims each clip accordingly using montaj/trim.
montaj/overlay
The agent writes custom JSX overlay files and adds them to tracks in project.json. There are no built-in overlay templates — every overlay is a custom React component the agent authors.
Deviation Rules
The agent should follow the assigned workflow and deviate only when the prompt explicitly requires it:
- "no captions" → skip caption
- "keep it raw" → skip rm_fillers, waveform_trim
- "YouTube format" → resize 16:9
Managing Workflows
montaj workflow list # list all available workflows
montaj workflow new <name> # scaffold a new workflow file
montaj workflow edit <name> # open in the node graph UI
montaj workflow run <name> ./clips --prompt "..." # run a specific workflow
All workflow files are equal: fork any of them, save under a new name, and it becomes available immediately.