The render pipeline — React + Puppeteer + ffmpeg turns project.json into a final MP4 (or N PNG slides for carousels).

Render Engine

The render engine lives in render/ and turns a project.json whose status is "final" into either a final MP4 (video projects) or N still PNG slides (carousel projects). It reads the project, renders the visual content via React + Puppeteer, and then either composites it with source footage via ffmpeg (video) or screenshots each slide directly (carousel).

How Rendering Works

Invocation

montaj render <project>
# or directly:
node render/render.js          <project.json> [--out <path>] [--workers <n>] [--clean]
node render/render-carousel.js --project-json <project.json> [--out <path>] [--clean]

Project status must be "final" before rendering. The render is non-destructive — source files are never modified.

Render dispatch by project type

montaj render and the HTTP endpoint POST /api/projects/{id}/render both read projectType from project.json and dispatch:

`projectType`	Renderer	Output
`editing`, `music_video`, `ai_video`	`render.js`	`<project>/render/final.mp4`
`carousel`	`render-carousel.js`	`<project>/render/slide_NN.png` + `manifest.json`
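
The dispatch table above can be read as a simple lookup. The following is a minimal sketch, not the engine's actual code: the function name pick_renderer and the "status" field name are assumptions made for illustration, while the projectType values and renderer script names come from the table.

```python
# Hypothetical lookup mirroring the dispatch table above.
RENDERERS = {
    "editing": "render.js",
    "music_video": "render.js",
    "ai_video": "render.js",
    "carousel": "render-carousel.js",
}


def pick_renderer(project: dict) -> str:
    """Return the renderer script name for a parsed project.json."""
    # Assumption: the project status lives in a top-level "status" field.
    if project.get("status") != "final":
        raise ValueError("project status must be 'final' before rendering")
    project_type = project.get("projectType")
    if project_type not in RENDERERS:
        raise ValueError(f"unknown projectType: {project_type!r}")
    return RENDERERS[project_type]
```

Rejecting unknown project types up front is a design choice of this sketch; the real dispatcher may fall back to a default renderer instead.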

The video pipeline (Pipeline section below) covers the MP4 path. Carousel rendering is much simpler:

1. Read project.slides[].
2. For each slide: bundle its image + overlay elements as JSX via esbuild, load in Puppeteer, screenshot at the project's native resolution, write slide_NN.png (zero-padded, 1-based).
3. Write manifest.json listing each slide's filename + dimensions.
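
The file naming and manifest steps above might look roughly like this. It is a sketch under stated assumptions: two-digit padding is inferred from the slide_NN.png pattern, and the manifest key names ("slides", "file", "width", "height") are hypothetical, since the real manifest schema is not documented here.

```python
import json


def slide_filename(index: int) -> str:
    """1-based, zero-padded slide name, e.g. slide_01.png."""
    if index < 1:
        raise ValueError("slide indices are 1-based")
    return f"slide_{index:02d}.png"


def build_manifest(slide_count: int, width: int, height: int) -> str:
    """Serialize a manifest listing each slide's filename and dimensions."""
    # Assumption: key names are illustrative, not the renderer's real schema.
    slides = [
        {"file": slide_filename(i), "width": width, "height": height}
        for i in range(1, slide_count + 1)
    ]
    return json.dumps({"slides": slides}, indent=2)
```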

No segments, no ffmpeg, no audio. The carousel render modal in the UI streams the renderer's log lines and, on completion, opens a full-screen overlay with every slide as a clickable thumbnail and a Download all (.zip) button (GET /api/projects/{id}/render-zip). The zip excludes manifest.json — that's a renderer-side artifact for tooling, not user-facing.

Pipeline

project.json
    │
    ├─ 1. Validate + resolve paths
    ├─ 2. Collect segment specs + video/image items
    ├─ 3. Normalize pre-pass (project working color space)
    ├─ 4. Process video items (remove_bg if flagged)
    ├─ 5. Bundle JSX → HTML  (one per overlay/caption)
    ├─ 6. Render HTML → FFV1/MKV  (Puppeteer pool)
    ├─ 7. Segment-based compose → concat
    └─ 8. Mix audio tracks → final.mp4

Project Color Space

Each project has an explicit working color space stored at settings.colorSpace in project.json. This setting drives the codec, pixel format, and color metadata the entire render pipeline emits — from the normalize pre-pass through the segment encoder to the final concat.

Three color spaces are supported:

Key	Encoder	Pixel format	Transfer	Typical source
`sdr_bt709`	`libx264`	`yuv420p`	`bt709`	most non-HDR footage
`hdr_hlg`	`libx265`	`yuv420p10le`	`arib-std-b67`	iPhone "HDR Video" default
`hdr_pq`	`libx265`	`yuv420p10le`	`smpte2084`	iPhone "Dolby Vision", HDR10

Smart-detect at init. When clips are added to a project (montaj run or montaj init), each clip's color transfer is probed and the project color space is the modal (most common) value across all clips. Outliers are converted on the fly — HDR sources in an SDR project are tonemapped per-segment, SDR sources in an HDR project are stretched into the HDR container. This matches the FCP/Resolve pattern: one SDR clip dropped into an iPhone-HDR project is treated as SDR-graded content shown on an HDR canvas, not a reason to flip the whole project down to SDR.

All clips HLG → hdr_hlg.
All clips PQ → hdr_pq.
27 HLG + 1 SDR → hdr_hlg (modal wins; the 1 SDR clip is stretched into HLG on the fly).
27 SDR + 1 HLG → sdr_bt709 (modal wins; the 1 HLG clip is tonemapped).
Tied modes (no clear majority) use tiebreaks: HLG tied with PQ → hdr_pq (larger gamut, clean HLG→PQ conversion); SDR tied with HDR → sdr_bt709 (the conservative choice: tonemapping down is well-defined, whereas stretching SDR up into HDR is a creative decision that should not be made without a signal of intent).
No clips probed → sdr_bt709 default.
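
The smart-detect rules above can be sketched as a modal vote with explicit tiebreaks. This is an illustrative sketch, not the code in project/init.py: the function name and the transfer-to-key mapping table are assumptions, and treating unrecognized transfers as SDR is a guess made here.

```python
from collections import Counter

# Transfer characteristic (as probed) -> project color space key.
TRANSFER_TO_SPACE = {
    "bt709": "sdr_bt709",
    "arib-std-b67": "hdr_hlg",
    "smpte2084": "hdr_pq",
}


def detect_color_space(transfers: list) -> str:
    """Pick the working color space from the probed transfer of each clip."""
    # Assumption: unknown or missing transfers are counted as SDR.
    spaces = [TRANSFER_TO_SPACE.get(t, "sdr_bt709") for t in transfers]
    if not spaces:
        return "sdr_bt709"  # no clips probed
    counts = Counter(spaces)
    top = max(counts.values())
    tied = {space for space, n in counts.items() if n == top}
    if len(tied) == 1:
        return tied.pop()  # clear modal winner
    if "sdr_bt709" in tied:
        return "sdr_bt709"  # SDR tied with HDR: conservative choice
    return "hdr_pq"  # HLG tied with PQ
```

For example, 27 HLG clips plus 1 SDR clip yield hdr_hlg, while one HLG clip and one PQ clip yield hdr_pq.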

Override. Pass --color-space {sdr_bt709|hdr_hlg|hdr_pq|auto} to montaj init (default auto runs the smart-detect rules above), or include "colorSpace" in the HTTP intake JSON, to force a specific working space regardless of source detection.

Legacy projects. Projects created before settings.colorSpace existed (or any project where the field was deleted) trigger the same smart-detect rules at render time. The detected value is written back to project.json so subsequent renders skip the detection step. This means an older project will render correctly on first run with no manual migration — the engine fills in what's missing.

Per-color-space behavior in the segment encoder. SDR projects emit libx264 yuv420p with bt709 color metadata; HDR projects emit libx265 yuv420p10le with bt2020nc colorimetry plus the appropriate transfer (arib-std-b67 for HLG, smpte2084 for PQ with static HDR10 mastering metadata). Sources whose color space conflicts with the project are converted at the per-item filter chain in the segment encoder (zscale-based tonemap for HDR→SDR; stretch into HDR container for SDR→HDR; HLG↔PQ via zscale transfer-curve conversion).

Step 3 — Normalize Pre-Pass

The normalize pre-pass is color-space-aware. A source is conformant when its color transfer and bit depth match the project's working color space, and its keyframe interval is ≤ 2.0s (required for the segment encoder's input-level fast seek). When all three hold, the source passes through with no transcode — iPhone HDR HLG clips in an hdr_hlg project are essentially a no-op at intake.
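
The three-part conformance test above could be expressed as follows. This is a sketch only: the ProbeResult type and its field names are hypothetical, and the bit depths (8 for yuv420p, 10 for yuv420p10le) are derived from the pixel formats in the color-space table.

```python
from dataclasses import dataclass

MAX_KEYFRAME_INTERVAL_S = 2.0

# Project color space -> (expected transfer, expected bit depth).
PROJECT_SPECS = {
    "sdr_bt709": ("bt709", 8),
    "hdr_hlg": ("arib-std-b67", 10),
    "hdr_pq": ("smpte2084", 10),
}


@dataclass
class ProbeResult:
    """Hypothetical container for the values probed from one source file."""
    transfer: str
    bit_depth: int
    max_keyframe_interval_s: float


def is_conformant(probe: ProbeResult, color_space: str) -> bool:
    """True when the source can pass through normalize with no transcode."""
    transfer, bit_depth = PROJECT_SPECS[color_space]
    return (
        probe.transfer == transfer
        and probe.bit_depth == bit_depth
        and probe.max_keyframe_interval_s <= MAX_KEYFRAME_INTERVAL_S
    )
```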

When a source conflicts, normalize emits the project's working format using the encoder, pixel format, and color args from the color-space spec:

sdr_bt709 project: libx264 -pix_fmt yuv420p with bt709 stream metadata. HDR sources are tonemapped via zscale + tonemap (with a bare-tonemap fallback when zscale is missing — accompanied by a loud warning).
hdr_hlg project: libx265 -pix_fmt yuv420p10le with bt2020nc / arib-std-b67 stream metadata.
hdr_pq project: libx265 -pix_fmt yuv420p10le with bt2020nc / smpte2084 stream metadata + static HDR10 mastering metadata.

All paths emit AAC 48 kHz audio and force IDR keyframes every ~1s.

Resolution is preserved. Source clips remain at their native resolution through the entire pipeline; the segment encoder scales per-item at compose time via the scale= filter in encode-segment.js. This avoids the permanent quality loss of intake-time downscaling and preserves headroom for crops, zooms, and re-frames.

Parallel execution: Both the init-time and render-time normalize pre-pass loops run with a concurrency cap of 2. Memory-heavy 4K HDR encodes are the worst case; 2 workers stay within bounds on systems with ≥8GB free RAM. The cap applies to both libx264 (SDR projects) and libx265 (HDR projects), since both are preset-bound CPU encodes.

Normalization creates _normalized_<colorSpace>.mp4 files alongside the originals (e.g. clip_normalized_sdr_bt709.mp4 or clip_normalized_hdr_hlg.mp4) — originals are never modified. Namespacing by color space lets a project flip between SDR and HDR without colliding with cached normalize output. The lib/normalize.py module backs this and is also used at ingest time (project/init.py) and for AI-generated clips.

Cached output is reused across renders. When the deterministic _normalized_<colorSpace>.mp4 output already exists and its mtime is at least as fresh as the source, the render-time pre-pass skips the re-encode entirely. Replacing or re-recording a source file (which advances its mtime) correctly invalidates the cache. Net result: legacy projects pay the normalize cost once, not on every render.
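
The naming and cache-freshness rules above can be sketched like this. It is not the code in lib/normalize.py: the function names are hypothetical, and the sketch assumes the output sits next to the source with the source's stem as the filename prefix, as in the clip_normalized_sdr_bt709.mp4 example.

```python
from pathlib import Path


def normalized_path(source: Path, color_space: str) -> Path:
    """Deterministic output path alongside the original, namespaced by color space."""
    return source.with_name(f"{source.stem}_normalized_{color_space}.mp4")


def needs_normalize(source: Path, color_space: str) -> bool:
    """True when there is no cached output, or the source is newer than it."""
    out = normalized_path(source, color_space)
    if not out.exists():
        return True
    # Reuse the cache only when its mtime is at least as fresh as the source.
    return out.stat().st_mtime < source.stat().st_mtime
```

Because the color space is part of the filename, switching a project from sdr_bt709 to hdr_hlg resolves to a different path and never reuses the SDR output by mistake.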

After normalization, every source entering the compose pipeline conforms to the project's working color space. The segment encoder still handles per-item scaling at compose time, and applies in-line color conversion for any source that arrives in a different color space than the project. Resolution is intentionally NOT unified at intake.

Step 5 — JSX Bundling

Each overlay/caption JSX component is compiled into a self-contained HTML page using esbuild. The page exposes window.__setFrame(n) so Puppeteer can drive it frame-by-frame.

Step 6 — Puppeteer Rendering

A pool of N Chromium browsers (default: os.cpus().length) renders segments in parallel.

Per-job flow:

1. Open a new page, set viewport to design resolution (1080 x 1920)
2. Navigate to the bundled HTML file
3. For each frame: call window.__setFrame(f), wait for paint confirmation, screenshot to PNG
4. Encode PNG sequence to FFV1 in MKV container
5. If segment exceeds chunk size, split into chunks and concatenate after encoding

Step 7 — Segment-Based Compositing

Compositing uses a segment-based pipeline with three stages:

normalized video items + Puppeteer segments
    │
    ├─ 1. segment-plan.js   → plan segments at clip/overlay boundaries
    ├─ 2. encode-segment.js → encode each segment independently
    └─ 3. ffmpeg concat      → join segments via concat demuxer

Segment planning — the timeline is divided into segments at every clip and overlay boundary. Each segment is a contiguous time range where the set of active layers does not change.

Segment encoding — each segment is encoded independently with its own ffmpeg call. Items at non-project resolution are scaled by the per-item scale= filter — this is what enables source-resolution preservation at intake. Segments are encoded in parallel.

Concat — all segments are joined via the ffmpeg concat demuxer with -c:v copy (no re-encode). This is near-instant.
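
The segment-planning rule described above (cut at every clip and overlay boundary, so the active layer set is constant within a segment) can be sketched as follows. This is an illustration, not segment-plan.js: the function name, the item shape, and the choice to drop ranges with no active layer are assumptions.

```python
def plan_segments(items: list) -> list:
    """Split the timeline at every item boundary.

    items: dicts with "id", "start", and "end" (seconds).
    Returns (start, end, active_ids) tuples in timeline order.
    """
    # Every start and end time is a potential cut point.
    cuts = sorted({t for item in items for t in (item["start"], item["end"])})
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        # An item is active when it covers the whole range between two cuts.
        active = [
            item["id"]
            for item in items
            if item["start"] <= start and item["end"] >= end
        ]
        if active:  # Assumption: ranges with no active layer are skipped.
            segments.append((start, end, active))
    return segments
```

With a clip spanning 0 to 10 seconds and an overlay spanning 0 to 3 seconds, the plan has two segments: 0 to 3 with both layers, and 3 to 10 with the clip alone.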

Step 8 — Audio Mixing

Independent audio tracks (music, voiceover, sound effects) are mixed in a final pass via mix-audio.js. It handles volume, ducking (sidechaincompress), delay offsets, and in/out points.

Three Render Stages (Summary)

montaj render runs three stages:

Normalize + base video — normalize all sources, then trim and prepare source clips. Canvas projects (no video track) generate a synthetic black base from overlay durations.
Overlay segments — each JSX overlay is bundled with esbuild, rendered frame-by-frame in headless Chromium, and encoded to lossless FFV1/MKV. Segments are rendered at design resolution (1080 x 1920) regardless of output resolution.
Compose + mix — segment-based pipeline encodes each timeline segment independently, concats them, then mixes audio tracks.

Output Encoding

Per-segment encoding follows the project's color space (see Project Color Space above). Within a single render every segment shares one codec and pixel format, so the concat demuxer can stream-copy video without re-encoding:

SDR projects (sdr_bt709): libx264 -preset fast -crf 18 -pix_fmt yuv420p with bt709 stream-level color metadata.
HDR projects (hdr_hlg, hdr_pq): libx265 -preset fast -crf 22 -pix_fmt yuv420p10le with bt2020nc colorimetry plus the project's transfer curve (arib-std-b67 for HLG, smpte2084 + static HDR10 mastering metadata for PQ).

Per-frame setparams and per-stream color args come from the color-space spec, ensuring downstream players read the same colorimetry the encoder produced.
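
As a rough illustration of the per-color-space settings above, the encoder arguments could be assembled like this. The function name is hypothetical, the use of bt2020 for -color_primaries in HDR projects is an assumption (the text only names bt2020nc colorimetry), and the static HDR10 mastering metadata for PQ is deliberately left out because its exact values are not given here.

```python
def encoder_args(color_space: str) -> list:
    """Build video encoder args for one segment from the project color space."""
    if color_space == "sdr_bt709":
        return [
            "-c:v", "libx264", "-preset", "fast", "-crf", "18",
            "-pix_fmt", "yuv420p",
            "-color_primaries", "bt709", "-color_trc", "bt709",
            "-colorspace", "bt709",
        ]
    # HDR projects differ only in the transfer curve.
    transfers = {"hdr_hlg": "arib-std-b67", "hdr_pq": "smpte2084"}
    if color_space not in transfers:
        raise ValueError(f"unsupported color space: {color_space!r}")
    return [
        "-c:v", "libx265", "-preset", "fast", "-crf", "22",
        "-pix_fmt", "yuv420p10le",
        # Assumption: BT.2020 primaries accompany the bt2020nc matrix.
        "-color_primaries", "bt2020", "-color_trc", transfers[color_space],
        "-colorspace", "bt2020nc",
    ]
```

Since every segment in a render calls this with the same color space, all segments share one codec and pixel format, which is what lets the concat step stream-copy.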

Intermediate Files

<project>/
└── render/
    ├── segments/           Puppeteer FFV1/MKV files + composed segment files
    │   ├── <id>-chunk-0.mkv    (Puppeteer renders)
    │   ├── seg-000.mp4         (composed segments)
    │   └── ...
    └── final.mp4           Final output

Intermediate Puppeteer renders (the FFV1/MKV files) are kept by default and reused on re-runs; use --clean to delete them after compositing. Composed segment files (seg-NNN.mp4) are cleaned up after concat by default; set MONTAJ_KEEP_SEGMENTS=1 to preserve them for debugging.

Clip Seeking

Each video clip is fed as:

-ss <inPoint> -to <outPoint> -i <src>

Use -to <outPoint> (an absolute file timestamp), not -t <duration>. Fast seek (-ss before -i) lands on the nearest keyframe before inPoint, and -t measures duration from that keyframe: if the keyframe is 0.3s early, the clip ends 0.3s early and is silently trimmed short. -to stops at the absolute file timestamp regardless of where the seek landed.
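
A small sketch of the argument construction, plus the arithmetic behind the trimming problem. Both function names are hypothetical; the second function only models the failure mode described above and does not call ffmpeg.

```python
def clip_input_args(src: str, in_point: float, out_point: float) -> list:
    """Build the input-side ffmpeg args for one clip, using -to rather than -t."""
    if out_point <= in_point:
        raise ValueError("outPoint must be greater than inPoint")
    return ["-ss", f"{in_point:.3f}", "-to", f"{out_point:.3f}", "-i", src]


def end_if_t_counts_from_keyframe(
    keyframe: float, in_point: float, out_point: float
) -> float:
    """File timestamp where the clip would end if -t <duration> were
    measured from the keyframe the seek landed on."""
    return keyframe + (out_point - in_point)
```

With inPoint 10.0, outPoint 15.0, and a keyframe at 9.7, a duration of 5.0 counted from the keyframe ends at about 14.7, which is 0.3s short of the intended out point.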

Custom Overlays

Overlays are custom JSX components written by the agent. There are no built-in overlay templates — every overlay is a React component the agent writes, styled to the editing prompt and brand context.

Overlay Item in project.json

{
  "id": "ov-hook",
  "type": "overlay",
  "src": "/abs/path/to/project/overlays/hook.jsx",
  "props": { "text": "She built an AI employee" },
  "start": 0.0,
  "end": 3.0
}

Field	Required	Description
`type`	yes	`"overlay"` for custom JSX
`src`	yes	Absolute path to the JSX file
`start` / `end`	yes	Time window in output video (seconds)
`props`	no	Arbitrary data injected as the `props` global
`offsetX` / `offsetY`	no	Position offset as % of frame size (set by UI drag)
`scale`	no	Uniform scale multiplier (set by UI resize)

offsetX, offsetY, and scale are applied by the render engine as a CSS transform on the component container. The JSX component itself is unaware of them.

Component Globals

Overlay components have access to these globals:

Global	Description
`frame`	Current frame number
`fps`	Frames per second
`props`	Data from the overlay item
`interpolate(frame, inputRange, outputRange)`	Map frame number to any value
`spring({ frame, fps, config })`	Physics-based easing (mass, stiffness, damping)
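
To make the interpolate global concrete, here is a sketch of the mapping it describes, written in Python for illustration only (the real helper is a JavaScript global available to overlay components). Linear mapping over a two-point range and clamping outside the input range are assumptions of this sketch, not documented behavior.

```python
def interpolate(frame: float, input_range: tuple, output_range: tuple) -> float:
    """Linearly map a frame number from input_range onto output_range."""
    (x0, x1), (y0, y1) = input_range, output_range
    if x1 == x0:
        raise ValueError("input_range must span at least one frame")
    t = (frame - x0) / (x1 - x0)
    # Assumption: values outside the input range are clamped to its ends.
    t = max(0.0, min(1.0, t))
    return y0 + t * (y1 - y0)
```

For instance, a fade-in over the first 30 frames maps frame 15 to an opacity of 0.5 with interpolate(15, (0, 30), (0.0, 1.0)).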

How Overlays Are Rendered

1. The JSX file is bundled into a self-contained HTML page by esbuild
2. Puppeteer opens the page in headless Chrome at 1080 x 1920 (transparent background)
3. For each frame n: call window.__setFrame(n), then screenshot to PNG with alpha
4. ffmpeg encodes the PNG sequence into a transparent FFV1/MKV video segment
5. The segment is composited onto the source footage in the segment-based compose pipeline

Parallelism

Overlay rendering is CPU-bound. Two levels of parallelism:

Segment-level — all overlay segments are independent and rendered simultaneously by a worker pool
Frame chunking — segments above 1,000 frames (~33s at 30fps) are split into chunks, each rendered by a separate worker

caption track (18,000 frames) → 18 chunks × 1,000 frames → 18 workers
lower-third (135 frames)      → 1 chunk → 1 worker
flash (9 frames)              → 1 chunk → 1 worker

Configurable via ~/.montaj/config.json:

{ "render": { "workers": 8, "chunkSize": 1000 } }
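
The frame-chunking rule can be sketched as a simple range split. The function name is hypothetical, and half-open [start, end) frame ranges are a convention chosen for this sketch.

```python
def split_into_chunks(total_frames: int, chunk_size: int = 1000) -> list:
    """Split a segment into [start, end) frame ranges of at most chunk_size frames."""
    if total_frames <= 0 or chunk_size <= 0:
        raise ValueError("total_frames and chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_frames))
        for start in range(0, total_frames, chunk_size)
    ]
```

An 18,000-frame caption track splits into 18 chunks, while a 135-frame lower-third or a 9-frame flash stays as a single chunk.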

Browser Recycling

Each Puppeteer worker restarts its browser every 5 jobs. After many segments, browser processes accumulate memory and can start timing out. Recycling flushes that state.

Design Resolution

Overlays are always rendered at 1080 x 1920 regardless of output resolution. The pipeline upscales at compose time for higher resolutions (e.g., 2x for 4K).

Caption Templates

Caption templates are pre-built React components referenced by style name. Unlike overlays (which are always custom JSX), captions use built-in templates selected by the agent or user.

Available Styles

Style	Description
`word-by-word`	One word at a time, spring pop-in animation
`pop`	Segment-at-a-time with scale entry animation
`karaoke`	Words highlight progressively as they're spoken
`subtitle`	Static line at bottom, segments replace sequentially

How Captions Work

1. The transcribe step generates word-level timestamps via whisper.cpp
2. The caption step converts the transcript into a caption track with a chosen style
3. Caption data is stored inline in project.json (not as a file pointer)
4. At render time, the template component receives the caption segments and renders frame-by-frame via Puppeteer

Caption Data Format

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}

Each segment contains:

text — the full segment text
start / end — segment time window (seconds)
words — individual word timestamps for styles that animate per-word
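
To show how a per-word style might consume this data, the sketch below looks up the word being spoken at a given frame. It is illustrative only (the real templates are React components): the function name is hypothetical, and treating each time window as half-open, so a word stops being active exactly at its end time, is an assumption.

```python
def active_word(segments: list, frame: int, fps: float):
    """Return the word spoken at the given frame, or None if nothing is active."""
    t = frame / fps  # frame number -> seconds
    for segment in segments:
        if segment["start"] <= t < segment["end"]:
            for word in segment.get("words", []):
                if word["start"] <= t < word["end"]:
                    return word["word"]
    return None
```

Using the example segment above at 30 fps, frame 6 (0.2s) falls inside "Hello" and frame 20 (about 0.67s) falls inside "world".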

Rendering

Caption templates produce the same output as custom overlays: rendered frame-by-frame by Puppeteer, composited into the video by ffmpeg. The template component uses interpolate and spring utilities for animation.

Choosing a Style

Via CLI:

montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style subtitle

Via the editing prompt:

"add word-by-word captions"
"karaoke-style captions, bold text"

Editing Captions

In the UI review phase, you can:

Click a caption segment to edit its text inline
Drag segments to adjust their timing
Change the caption style

Changes update the segments array in the caption track of project.json.

GPU Acceleration

The Montaj render pipeline is mostly CPU-bound. GPU acceleration applies at one specific step.

Where GPU Applies

Step	Bound	GPU notes
Puppeteer frame rendering	CPU	Parallelism is the lever
ffmpeg segment encoding	CPU	Limited GPU filter support
ffmpeg intermediate encode (PNG to FFV1)	CPU	Alpha formats lack hwaccel support
Final encode	GPU	VideoToolbox (macOS), NVENC (NVIDIA), VAAPI (Intel/Linux). Codec follows the project's color space — H.264 for SDR projects, HEVC 10-bit for HDR projects.

Available hardware encoders are detected and used automatically, which can speed up the final encode by 5-10x.

CPU-Bound Stages

Puppeteer Frame Rendering

The main bottleneck. Each overlay/caption segment is rendered frame-by-frame in headless Chrome. Two parallelism strategies:

Segment-level — all segments rendered simultaneously via a worker pool (default: CPU core count)
Frame chunking — segments above 1,000 frames are split into chunks for parallel rendering

ffmpeg Compositing

Each segment is composited independently — the per-segment filter graph is simple since all sources are pre-normalized. GPU filters exist but are limited.

Configuration

Control worker count and chunk size via ~/.montaj/config.json:

{
  "render": {
    "workers": 8,
    "chunkSize": 1000
  }
}

Or via CLI flags:

montaj render --workers 4
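
A minimal sketch of reading these settings with defaults applied. The function name is hypothetical, and both the fallback to defaults when the file is missing and the default of one worker per CPU core (taken from the stated pool default) are assumptions about how the engine resolves its configuration.

```python
import json
import os
from pathlib import Path


def load_render_config(path=None) -> dict:
    """Read the "render" block of the Montaj config, filling in defaults."""
    config_path = Path(path) if path else Path.home() / ".montaj" / "config.json"
    defaults = {"workers": os.cpu_count() or 1, "chunkSize": 1000}
    try:
        with open(config_path, encoding="utf-8") as f:
            user = json.load(f).get("render", {})
    except FileNotFoundError:
        user = {}  # Assumption: a missing config file means "use defaults".
    return {**defaults, **user}
```

A --workers value passed on the command line would then take precedence over the value returned here.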

Color Space Handling

Color space is a per-project setting (settings.colorSpace), not a fixed pipeline assumption. iPhone HDR clips (HEVC, BT.2020/HLG 10-bit) flow through an HDR project with no transformation — the working format already matches their native color space. In an SDR project they're tonemapped via zscale + tonemap (with a fallback path when zscale is missing). Sources whose color space conflicts with the project are converted at intake or, for late-arriving items, in the segment encoder.

Run montaj doctor to verify your ffmpeg has zscale support. zscale is required for any cross-color-space conversion (HDR↔SDR, HLG↔PQ); on systems without it, only direct passes (matching project color space) and a degraded HDR→SDR fallback work. montaj install ffmpeg rebuilds ffmpeg with libzimg on macOS.

Troubleshooting

Issue	Cause	Solution
Slow render	Too many Puppeteer workers saturating memory	Reduce `--workers`
Browser timeout errors	Memory pressure from many segments	Browser auto-recycles every 5 jobs; reduce workers if still failing
Mixed HDR/SDR color shifts	HDR and SDR sources with mismatched color metadata in the same compose	Fixed by project-color-space contract — every source is converted to the project's working color space (at intake or per-item in the segment encoder) before composing
Degraded colors after normalization	`zscale` filter missing from ffmpeg	Run `montaj doctor` to check; `montaj install ffmpeg` to rebuild with libzimg
Clips trimmed short at cut points	`-t duration` measured from keyframe, not `inPoint`	Fixed by using `-to outPoint` instead
