Render Engine
The render pipeline — React + Puppeteer + ffmpeg turns project.json into a final MP4 (or N PNG slides for carousels).
The render engine lives in render/ and turns a project.json with status "final" into either a final MP4 (video projects) or N still PNG slides (carousel projects). It reads the project, renders the visual content via React + Puppeteer, and then either composites it with source footage via ffmpeg (video) or screenshots each slide directly (carousel).
How Rendering Works
Invocation
montaj render <project>
# or directly:
node render/render.js <project.json> [--out <path>] [--workers <n>] [--clean]
node render/render-carousel.js --project-json <project.json> [--out <path>] [--clean]
Project status must be "final" before rendering. The render is non-destructive: source files are never modified.
Render dispatch by project type
montaj render and the HTTP endpoint POST /api/projects/{id}/render both read projectType from project.json and dispatch:
| projectType | Renderer | Output |
|---|---|---|
| editing, music_video, ai_video | render.js | <project>/render/final.mp4 |
| carousel | render-carousel.js | <project>/render/slide_NN.png + manifest.json |
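The dispatch in the table above can be sketched as a small lookup. This is an illustration only: the pickRenderer function and the status field name are assumptions, not Montaj's actual code.

```javascript
// Hypothetical dispatch helper. The projectType values and script names
// come from the table above; the function itself and the `status` field
// name are assumptions made for illustration.
const RENDERERS = new Map([
  ['editing', 'render.js'],
  ['music_video', 'render.js'],
  ['ai_video', 'render.js'],
  ['carousel', 'render-carousel.js'],
]);

function pickRenderer(project) {
  // Rendering is only allowed once the project is marked final.
  if (project.status !== 'final') {
    throw new Error(`project status is "${project.status}", expected "final"`);
  }
  const script = RENDERERS.get(project.projectType);
  if (!script) {
    throw new Error(`unknown projectType: ${project.projectType}`);
  }
  return script;
}

console.log(pickRenderer({ status: 'final', projectType: 'carousel' }));
```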
The video pipeline (Pipeline section below) covers the MP4 path. Carousel rendering is much simpler:
- Read project.slides[].
- For each slide: bundle its image + overlay elements as JSX via esbuild, load in Puppeteer, screenshot at the project's native resolution, write slide_NN.png (zero-padded, 1-based).
- Write manifest.json listing each slide's filename + dimensions.
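The naming and manifest steps above might look roughly like the following. The manifest shape ({ slides: [{ file, width, height }] }) and both helper names are assumptions; the source only states that the manifest lists each slide's filename and dimensions.

```javascript
// Sketch only: helper names and the manifest shape are hypothetical.
function slideFilename(index, total) {
  // 1-based and zero-padded to two digits (slide_01.png, slide_02.png, ...).
  // Widening the padding past 99 slides is an assumption.
  const width = Math.max(2, String(total).length);
  return `slide_${String(index + 1).padStart(width, '0')}.png`;
}

function buildManifest(slides, resolution) {
  return {
    slides: slides.map((_, i) => ({
      file: slideFilename(i, slides.length),
      width: resolution.width,
      height: resolution.height,
    })),
  };
}

console.log(JSON.stringify(buildManifest([{}, {}], { width: 1080, height: 1350 }), null, 2));
```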
No segments, no ffmpeg, no audio. The carousel render modal in the UI streams the renderer's log lines and, on completion, opens a full-screen overlay with every slide as a clickable thumbnail and a Download all (.zip) button (GET /api/projects/{id}/render-zip). The zip excludes manifest.json — that's a renderer-side artifact for tooling, not user-facing.
Pipeline
project.json
│
├─ 1. Validate + resolve paths
├─ 2. Collect segment specs + video/image items
├─ 3. Normalize pre-pass (project working color space)
├─ 4. Process video items (remove_bg if flagged)
├─ 5. Bundle JSX → HTML (one per overlay/caption)
├─ 6. Render HTML → FFV1/MKV (Puppeteer pool)
├─ 7. Segment-based compose → concat
└─ 8. Mix audio tracks → final.mp4
Project Color Space
Each project has an explicit working color space stored at settings.colorSpace in project.json. This setting drives the codec, pixel format, and color metadata the entire render pipeline emits — from the normalize pre-pass through the segment encoder to the final concat.
Three color spaces are supported:
| Key | Encoder | Pixel format | Transfer | Typical source |
|---|---|---|---|---|
| sdr_bt709 | libx264 | yuv420p | bt709 | most non-HDR footage |
| hdr_hlg | libx265 | yuv420p10le | arib-std-b67 | iPhone "HDR Video" default |
| hdr_pq | libx265 | yuv420p10le | smpte2084 | iPhone "Dolby Vision", HDR10 |
Smart-detect at init. When clips are added to a project (montaj run or montaj init), each clip's color transfer is probed and the project color space is the modal (most common) value across all clips. Outliers are converted on the fly — HDR sources in an SDR project are tonemapped per-segment, SDR sources in an HDR project are stretched into the HDR container. This matches the FCP/Resolve pattern: one SDR clip dropped into an iPhone-HDR project is treated as SDR-graded content shown on an HDR canvas, not a reason to flip the whole project down to SDR.
- All clips HLG → hdr_hlg.
- All clips PQ → hdr_pq.
- 27 HLG + 1 SDR → hdr_hlg (modal wins; the 1 SDR clip is stretched into HLG on the fly).
- 27 SDR + 1 HLG → sdr_bt709 (modal wins; the 1 HLG clip is tonemapped).
- Tied modes (no clear majority) use tiebreaks: HLG+PQ tied → hdr_pq (larger gamut, clean HLG→PQ conversion); SDR tied with HDR → sdr_bt709 (conservative: tonemap-down is well-defined, inverse-stretch is creative when there's no signal of intent).
- No clips probed → sdr_bt709 default.
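The rules above can be expressed as a short function. This is a minimal sketch, assuming the probe yields one transfer string per clip; treating unknown transfer tags as SDR is also an assumption, and detectColorSpace is a hypothetical name.

```javascript
// Maps a probed color transfer to a project color space key.
const TRANSFER_TO_SPACE = new Map([
  ['bt709', 'sdr_bt709'],
  ['arib-std-b67', 'hdr_hlg'],
  ['smpte2084', 'hdr_pq'],
]);

function detectColorSpace(transfers) {
  const counts = { sdr_bt709: 0, hdr_hlg: 0, hdr_pq: 0 };
  for (const t of transfers) {
    // Assumption: unknown or missing transfer tags count as SDR.
    const space = TRANSFER_TO_SPACE.get(t) || 'sdr_bt709';
    counts[space] += 1;
  }
  const max = Math.max(counts.sdr_bt709, counts.hdr_hlg, counts.hdr_pq);
  if (max === 0) return 'sdr_bt709'; // no clips probed
  const top = Object.keys(counts).filter((k) => counts[k] === max);
  if (top.length === 1) return top[0]; // clear modal value
  if (top.includes('sdr_bt709')) return 'sdr_bt709'; // SDR tied with HDR
  return 'hdr_pq'; // HLG tied with PQ
}

const clips = Array(27).fill('arib-std-b67').concat(['bt709']);
console.log(detectColorSpace(clips));
```

The tiebreak order is encoded by checking for SDR first, so a three-way tie also resolves to sdr_bt709, which seems consistent with the conservative rule but is not spelled out in the list.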
Override. Pass --color-space {sdr_bt709|hdr_hlg|hdr_pq|auto} to montaj init (default auto runs the smart-detect rules above), or include "colorSpace" in the HTTP intake JSON, to force a specific working space regardless of source detection.
Legacy projects. Projects created before settings.colorSpace existed (or any project where the field was deleted) trigger the same smart-detect rules at render time. The detected value is written back to project.json so subsequent renders skip the detection step. This means an older project will render correctly on first run with no manual migration — the engine fills in what's missing.
Per-color-space behavior in the segment encoder. SDR projects emit libx264 yuv420p with bt709 color metadata; HDR projects emit libx265 yuv420p10le with bt2020nc colorimetry plus the appropriate transfer (arib-std-b67 for HLG, smpte2084 for PQ with static HDR10 mastering metadata). Sources whose color space conflicts with the project are converted at the per-item filter chain in the segment encoder (zscale-based tonemap for HDR→SDR; stretch into HDR container for SDR→HDR; HLG↔PQ via zscale transfer-curve conversion).
Step 3 — Normalize Pre-Pass
The normalize pre-pass is color-space-aware. A source is conformant when its color transfer and bit depth match the project's working color space, and its keyframe interval is ≤ 2.0s (required for the segment encoder's input-level fast seek). When all three hold, the source passes through with no transcode — iPhone HDR HLG clips in an hdr_hlg project are essentially a no-op at intake.
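As a rough illustration of the three-part conformance test, the sketch below checks a probe result against the project color space. The probe field names (transfer, bitDepth, maxKeyframeInterval) and the isConformant helper are hypothetical; the expected bit depths follow from the pixel formats in the color space table (8-bit for yuv420p, 10-bit for yuv420p10le).

```javascript
// Expected transfer and bit depth per working color space.
const EXPECTED = new Map([
  ['sdr_bt709', { transfer: 'bt709', bitDepth: 8 }],
  ['hdr_hlg', { transfer: 'arib-std-b67', bitDepth: 10 }],
  ['hdr_pq', { transfer: 'smpte2084', bitDepth: 10 }],
]);
const MAX_KEYFRAME_INTERVAL_S = 2.0;

// `probe` is a hypothetical object describing one source file.
function isConformant(probe, colorSpace) {
  const want = EXPECTED.get(colorSpace);
  if (!want) throw new Error(`unknown color space: ${colorSpace}`);
  return (
    probe.transfer === want.transfer &&
    probe.bitDepth === want.bitDepth &&
    probe.maxKeyframeInterval <= MAX_KEYFRAME_INTERVAL_S
  );
}

const iphoneClip = { transfer: 'arib-std-b67', bitDepth: 10, maxKeyframeInterval: 1.0 };
console.log(isConformant(iphoneClip, 'hdr_hlg')); // passes through untouched
console.log(isConformant(iphoneClip, 'sdr_bt709')); // needs a transcode
```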
When a source conflicts, normalize emits the project's working format using the encoder, pixel format, and color args from the color-space spec:
- sdr_bt709 project: libx264 -pix_fmt yuv420p with bt709 stream metadata. HDR sources are tonemapped via zscale + tonemap (with a bare-tonemap fallback when zscale is missing, accompanied by a loud warning).
- hdr_hlg project: libx265 -pix_fmt yuv420p10le with bt2020nc / arib-std-b67 stream metadata.
- hdr_pq project: libx265 -pix_fmt yuv420p10le with bt2020nc / smpte2084 stream metadata + static HDR10 mastering metadata.
All paths emit AAC 48 kHz audio and force IDR keyframes every ~1s.
Resolution is preserved. Source clips remain at their native resolution through the entire pipeline; the segment encoder scales per-item at compose time via the scale= filter in encode-segment.js. This avoids the permanent quality loss of intake-time downscaling and preserves headroom for crops, zooms, and re-frames.
Parallel execution: Both the init-time and render-time pre-pass normalize loops run with a concurrency cap of 2. Memory-heavy 4K HDR encodes are the worst case; two workers stay within bounds on systems with ≥8GB free RAM. The cap applies to both libx264 (SDR projects) and libx265 (HDR projects), since both are preset-bound CPU encodes.
Normalization creates _normalized_<colorSpace>.mp4 files alongside the originals (e.g. clip_normalized_sdr_bt709.mp4 or clip_normalized_hdr_hlg.mp4) — originals are never modified. Namespacing by color space lets a project flip between SDR and HDR without colliding with cached normalize output. The lib/normalize.py module backs this and is also used at ingest time (project/init.py) and for AI-generated clips.
Cached output is reused across renders. When the deterministic _normalized_<colorSpace>.mp4 output already exists and its mtime is at least as fresh as the source, the render-time pre-pass skips the re-encode entirely. Replacing or re-recording a source file (which advances its mtime) correctly invalidates the cache. Net result: legacy projects pay the normalize cost once, not on every render.
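A minimal sketch of the naming and freshness rule described above. The helper names are hypothetical, and the actual logic lives in lib/normalize.py rather than in JavaScript; this only illustrates the rule.

```javascript
const path = require('node:path');

// Builds the namespaced output name, e.g. clip.mov -> clip_normalized_hdr_hlg.mp4.
function normalizedPath(src, colorSpace) {
  const { dir, name } = path.parse(src);
  return path.join(dir, `${name}_normalized_${colorSpace}.mp4`);
}

// The cached file is reusable when it exists and is at least as fresh as the source.
// cachedMtimeMs is null when the normalized file does not exist yet.
function canReuse(sourceMtimeMs, cachedMtimeMs) {
  return cachedMtimeMs !== null && cachedMtimeMs >= sourceMtimeMs;
}

console.log(normalizedPath('clip.mov', 'sdr_bt709'));
console.log(canReuse(1000, 2000), canReuse(3000, 2000), canReuse(1000, null));
```

In practice the two timestamps could come from fs.statSync(file).mtimeMs, with a missing cached file mapped to null.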
After normalization, every source entering the compose pipeline conforms to the project's working color space. The segment encoder still handles per-item scaling at compose time, and applies in-line color conversion for any source that arrives in a different color space than the project. Resolution is intentionally NOT unified at intake.
Step 5 — JSX Bundling
Each overlay/caption JSX component is compiled into a self-contained HTML page using esbuild. The page exposes window.__setFrame(n) so Puppeteer can drive it frame-by-frame.
Step 6 — Puppeteer Rendering
A pool of N Chromium browsers (default: os.cpus().length) renders each segment in parallel.
Per-job flow:
- Open a new page, set viewport to design resolution (1080 x 1920)
- Navigate to the bundled HTML file
- For each frame: call window.__setFrame(f), wait for paint confirmation, screenshot to PNG
- Encode PNG sequence to FFV1 in MKV container
- If segment exceeds chunk size, split into chunks and concatenate after encoding
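The per-job flow above could be sketched like this. The renderFrames helper, its options, and the frame file naming are hypothetical; the page argument is expected to be a Puppeteer Page, and the sketch assumes window.__setFrame returns (or resolves) only once the frame has been painted. Encoding and chunk concatenation are left out.

```javascript
// Sketch of the frame loop only. `page` is a Puppeteer Page, or any
// object exposing the same four async methods.
async function renderFrames(page, { url, frameCount, outDir, width = 1080, height = 1920 }) {
  await page.setViewport({ width, height });
  await page.goto(url);
  const files = [];
  for (let f = 0; f < frameCount; f += 1) {
    // Runs in the browser context; assumed to settle after paint.
    await page.evaluate((n) => window.__setFrame(n), f);
    const file = `${outDir}/frame-${String(f).padStart(6, '0')}.png`;
    // omitBackground keeps the alpha channel in the PNG.
    await page.screenshot({ path: file, omitBackground: true });
    files.push(file);
  }
  return files;
}
```

With Puppeteer installed, a page can be obtained through puppeteer.launch() followed by browser.newPage(), then passed to the function together with a file:// URL for the bundled HTML.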
Step 7 — Segment-Based Compositing
Compositing uses a segment-based pipeline with three stages:
normalized video items + Puppeteer segments
│
├─ 1. segment-plan.js → plan segments at clip/overlay boundaries
├─ 2. encode-segment.js → encode each segment independently
└─ 3. ffmpeg concat → join segments via concat demuxer
Segment planning: the timeline is divided into segments at every clip and overlay boundary. Each segment is a contiguous time range where the set of active layers does not change.
Segment encoding: each segment is encoded independently with its own ffmpeg call. Items at non-project resolution are scaled by the per-item scale= filter, which is what enables source-resolution preservation at intake. Segments are encoded in parallel.
Concat: all segments are joined via the ffmpeg concat demuxer with -c:v copy (no re-encode). This is near-instant.
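The segment planning stage described above amounts to cutting the timeline at every start and end time. The following is a minimal sketch under that reading, not the contents of segment-plan.js; the item shape ({ id, start, end }) and the choice to drop ranges with no active item are assumptions.

```javascript
// Splits a timeline into ranges where the set of active items is constant.
function planSegments(items) {
  // Every start/end time is a potential cut point.
  const cuts = [...new Set(items.flatMap((it) => [it.start, it.end]))].sort((a, b) => a - b);
  const segments = [];
  for (let i = 0; i < cuts.length - 1; i += 1) {
    const start = cuts[i];
    const end = cuts[i + 1];
    // An item is active when it covers the whole [start, end) range.
    const active = items
      .filter((it) => it.start <= start && it.end >= end)
      .map((it) => it.id);
    // Assumption: ranges with nothing active are dropped.
    if (active.length > 0) segments.push({ start, end, active });
  }
  return segments;
}

console.log(planSegments([
  { id: 'clip-1', start: 0, end: 5 },
  { id: 'ov-hook', start: 0, end: 3 },
  { id: 'clip-2', start: 5, end: 8 },
]));
```

For this input the plan has three segments: 0 to 3 with the clip and overlay, 3 to 5 with the clip alone, and 5 to 8 with the second clip.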
Step 8 — Audio Mixing
Independent audio tracks (music, voiceover, sound effects) are mixed in a final pass via mix-audio.js. This pass handles volume, ducking (sidechaincompress), delay offsets, and in/out points.
Three Render Stages (Summary)
montaj render runs three stages:
- Normalize + base video — normalize all sources, then trim and prepare source clips. Canvas projects (no video track) generate a synthetic black base from overlay durations.
- Overlay segments — each JSX overlay is bundled with esbuild, rendered frame-by-frame in headless Chromium, and encoded to lossless FFV1/MKV. Segments are rendered at design resolution (1080 x 1920) regardless of output resolution.
- Compose + mix — segment-based pipeline encodes each timeline segment independently, concats them, then mixes audio tracks.
Output Encoding
Per-segment encoding follows the project's color space (see Project Color Space above). Within a single render every segment shares one codec and pixel format, so the concat demuxer can stream-copy video without re-encoding:
- SDR projects (sdr_bt709): libx264 -preset fast -crf 18 -pix_fmt yuv420p with bt709 stream-level color metadata.
- HDR projects (hdr_hlg, hdr_pq): libx265 -preset fast -crf 22 -pix_fmt yuv420p10le with bt2020nc colorimetry plus the project's transfer curve (arib-std-b67 for HLG, smpte2084 + static HDR10 mastering metadata for PQ).
Per-frame setparams and per-stream color args come from the color-space spec, ensuring downstream players read the same colorimetry the encoder produced.
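One way to picture the color-space spec is as a lookup that produces ffmpeg encoder arguments. This is a sketch, not the engine's spec object: the encoderArgs name and the spec field names are hypothetical, the bt2020 primaries value is an assumption (the source names only bt2020nc colorimetry), and the static HDR10 mastering metadata for PQ is omitted.

```javascript
// Codec, CRF, pixel format and transfer come from the text above.
// The `primaries` values are an assumption for illustration.
const COLOR_SPACES = new Map([
  ['sdr_bt709', { codec: 'libx264', crf: 18, pixFmt: 'yuv420p', primaries: 'bt709', transfer: 'bt709', matrix: 'bt709' }],
  ['hdr_hlg', { codec: 'libx265', crf: 22, pixFmt: 'yuv420p10le', primaries: 'bt2020', transfer: 'arib-std-b67', matrix: 'bt2020nc' }],
  ['hdr_pq', { codec: 'libx265', crf: 22, pixFmt: 'yuv420p10le', primaries: 'bt2020', transfer: 'smpte2084', matrix: 'bt2020nc' }],
]);

function encoderArgs(colorSpace) {
  const s = COLOR_SPACES.get(colorSpace);
  if (!s) throw new Error(`unknown color space: ${colorSpace}`);
  return [
    '-c:v', s.codec, '-preset', 'fast', '-crf', String(s.crf),
    '-pix_fmt', s.pixFmt,
    // Stream-level color metadata (PQ mastering metadata not shown).
    '-color_primaries', s.primaries, '-color_trc', s.transfer, '-colorspace', s.matrix,
  ];
}

console.log(encoderArgs('hdr_hlg').join(' '));
```

Because every segment in a render uses the same entry, the outputs share one codec and pixel format, which is what lets the concat step stream-copy.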
Intermediate Files
<project>/
└── render/
├── segments/ Puppeteer FFV1/MKV files + composed segment files
│ ├── <id>-chunk-0.mkv (Puppeteer renders)
│ ├── seg-000.mp4 (composed segments)
│ └── ...
└── final.mp4 Final output
Intermediate files are kept by default and reused on re-runs; use --clean to delete them after compositing. The exception is the composed segment files (seg-NNN.mp4), which are removed after concat unless MONTAJ_KEEP_SEGMENTS=1 is set, which is useful for debugging.
Clip Seeking
Each video clip is fed as:
-ss <inPoint> -to <outPoint> -i <src>
Use -to outPoint (absolute file timestamp), not -t duration. Fast seek (-ss before -i) lands on the nearest keyframe before inPoint. -t measures duration from that keyframe, so if the keyframe is 0.3s early, the clip is silently trimmed short. -to stops at the absolute file timestamp regardless.
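The input arguments above can be assembled with a small helper. The clipInputArgs name and the clip object shape are hypothetical; the flags and their order are the ones shown in the text.

```javascript
// Builds the per-clip ffmpeg input arguments. -ss and -to both precede -i,
// so ffmpeg treats them as input options for that file.
function clipInputArgs({ src, inPoint, outPoint }) {
  if (!(outPoint > inPoint)) {
    throw new Error('outPoint must be greater than inPoint');
  }
  return ['-ss', String(inPoint), '-to', String(outPoint), '-i', src];
}

console.log(clipInputArgs({ src: 'clip.mp4', inPoint: 1.5, outPoint: 4 }).join(' '));
```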
Custom Overlays
Overlays are custom JSX components written by the agent. There are no built-in overlay templates — every overlay is a React component the agent writes, styled to the editing prompt and brand context.
Overlay Item in project.json
{
"id": "ov-hook",
"type": "overlay",
"src": "/abs/path/to/project/overlays/hook.jsx",
"props": { "text": "She built an AI employee" },
"start": 0.0,
"end": 3.0
}
| Field | Required | Description |
|---|---|---|
| type | yes | "overlay" for custom JSX |
| src | yes | Absolute path to the JSX file |
| start / end | yes | Time window in output video (seconds) |
| props | no | Arbitrary data injected as the props global |
| offsetX / offsetY | no | Position offset as % of frame size (set by UI drag) |
| scale | no | Uniform scale multiplier (set by UI resize) |
offsetX, offsetY, and scale are applied by the render engine as a CSS transform on the component container. The JSX component itself is unaware of them.
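A sketch of how such a transform string might be built from an overlay item. The containerTransform name and the exact transform order are assumptions; the source only says the three fields become a CSS transform on the container.

```javascript
// Hypothetical helper. Percentages in translate() resolve against the
// element's own size, so this assumes the container fills the frame.
function containerTransform({ offsetX = 0, offsetY = 0, scale = 1 } = {}) {
  return `translate(${offsetX}%, ${offsetY}%) scale(${scale})`;
}

console.log(containerTransform({ offsetX: 10, offsetY: -5, scale: 1.2 }));
console.log(containerTransform());
```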
Component Globals
Overlay components have access to these globals:
| Global | Description |
|---|---|
| frame | Current frame number |
| fps | Frames per second |
| props | Data from the overlay item |
| interpolate(frame, inputRange, outputRange) | Map frame number to any value |
| spring({ frame, fps, config }) | Physics-based easing (mass, stiffness, damping) |
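To make the interpolate signature concrete, here is a standalone linear mapping with the same argument order. It is not the engine's implementation: it is named linearInterpolate to keep it distinct from the real global, supports only two-point ranges, and clamping outside the input range is an assumption.

```javascript
// Maps `frame` from inputRange to outputRange linearly.
function linearInterpolate(frame, inputRange, outputRange) {
  const [inStart, inEnd] = inputRange;
  const [outStart, outEnd] = outputRange;
  // Assumption: values outside the input range are clamped.
  const t = Math.min(1, Math.max(0, (frame - inStart) / (inEnd - inStart)));
  return outStart + t * (outEnd - outStart);
}

// Fade opacity from 0 to 1 over the first 30 frames.
console.log(linearInterpolate(15, [0, 30], [0, 1]));
```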
How Overlays Are Rendered
- The JSX file is bundled into a self-contained HTML page by esbuild
- Puppeteer opens the page in headless Chrome at 1080 x 1920 (transparent background)
- For each frame: window.__setFrame(n) advances the component to frame n, then the page is screenshotted to PNG with alpha
- ffmpeg encodes the PNG sequence into a transparent FFV1/MKV video segment
- The segment is composited onto the source footage in the segment-based compose pipeline
Parallelism
Overlay rendering is CPU-bound. Two levels of parallelism:
- Segment-level — all overlay segments are independent and rendered simultaneously by a worker pool
- Frame chunking — segments above 1,000 frames (~33s at 30fps) are split into chunks, each rendered by a separate worker
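The chunking rule can be sketched as a simple range split. The planChunks name and the half-open { start, end } shape are hypothetical; the default of 1,000 frames matches the chunk size mentioned above.

```javascript
// Splits a frame count into chunks of at most `chunkSize` frames.
// `end` is exclusive.
function planChunks(totalFrames, chunkSize = 1000) {
  const chunks = [];
  for (let start = 0; start < totalFrames; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize, totalFrames) });
  }
  return chunks;
}

console.log(planChunks(18000).length); // caption track
console.log(planChunks(135)); // lower-third fits in one chunk
```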
caption track (18,000 frames) → 18 chunks × 1,000 frames → 18 workers
lower-third (135 frames) → 1 chunk → 1 worker
flash (9 frames) → 1 chunk → 1 worker
Configurable via ~/.montaj/config.json:
{ "render": { "workers": 8, "chunkSize": 1000 } }
Browser Recycling
Each Puppeteer worker restarts its browser every 5 jobs. After many segments, browser processes accumulate memory and can start timing out. Recycling flushes that state.
Design Resolution
Overlays are always rendered at 1080 x 1920 regardless of output resolution. The pipeline upscales at compose time for higher resolutions (e.g., 2x for 4K).
Caption Templates
Caption templates are pre-built React components referenced by style name. Unlike overlays (which are always custom JSX), captions use built-in templates selected by the agent or user.
Available Styles
| Style | Description |
|---|---|
| word-by-word | One word at a time, spring pop-in animation |
| pop | Segment-at-a-time with scale entry animation |
| karaoke | Words highlight progressively as they're spoken |
| subtitle | Static line at bottom, segments replace sequentially |
How Captions Work
- transcribe step generates word-level timestamps via whisper.cpp
- caption step converts the transcript into a caption track with a chosen style
- Caption data is stored inline in project.json (not as a file pointer)
- At render time, the template component receives the caption segments and renders frame-by-frame via Puppeteer
Caption Data Format
{
"id": "captions",
"type": "caption",
"style": "word-by-word",
"segments": [
{
"text": "Hello world",
"start": 0.0,
"end": 1.2,
"words": [
{ "word": "Hello", "start": 0.0, "end": 0.5 },
{ "word": "world", "start": 0.5, "end": 1.2 }
]
}
]
}
Each segment contains:
- text: the full segment text
- start / end: segment time window (seconds)
- words: individual word timestamps for styles that animate per-word
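As an illustration of how a per-word style could use this data, the sketch below finds the word being spoken at a given frame. The activeWordAt helper is hypothetical and treating time windows as half-open (start inclusive, end exclusive) is an assumption.

```javascript
// Returns the word active at `frame`, or null when nothing is spoken.
function activeWordAt(segments, frame, fps) {
  const t = frame / fps;
  for (const seg of segments) {
    if (t < seg.start || t >= seg.end) continue;
    const word = seg.words.find((w) => t >= w.start && t < w.end);
    return word ? word.word : null;
  }
  return null;
}

const segments = [{
  text: 'Hello world',
  start: 0.0,
  end: 1.2,
  words: [
    { word: 'Hello', start: 0.0, end: 0.5 },
    { word: 'world', start: 0.5, end: 1.2 },
  ],
}];
console.log(activeWordAt(segments, 6, 30)); // t = 0.2s
console.log(activeWordAt(segments, 18, 30)); // t = 0.6s
```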
Rendering
Caption templates produce the same output as custom overlays: rendered frame-by-frame by Puppeteer, composited into the video by ffmpeg. The template component uses interpolate and spring utilities for animation.
Choosing a Style
Via CLI:
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style subtitle
Via the editing prompt:
"add word-by-word captions"
"karaoke-style captions, bold text"
Editing Captions
In the UI review phase, you can:
- Click a caption segment to edit its text inline
- Drag segments to adjust their timing
- Change the caption style
Changes update the segments array in the caption track of project.json.
GPU Acceleration
The Montaj render pipeline is mostly CPU-bound. GPU acceleration applies at one specific step.
Where GPU Applies
| Step | Bound | GPU |
|---|---|---|
| Puppeteer frame rendering | CPU | Parallelism is the lever |
| ffmpeg segment encoding | CPU | Limited GPU filter support |
| ffmpeg intermediate encode (PNG to FFV1) | CPU | Alpha formats lack hwaccel support |
| Final encode | GPU | VideoToolbox (macOS), NVENC (NVIDIA), VAAPI (Intel/Linux). Codec follows the project's color space — H.264 for SDR projects, HEVC 10-bit for HDR projects. |
Available hardware encoders are detected and used automatically for the final encode, providing a 5-10x speedup on that step.
CPU-Bound Stages
Puppeteer Frame Rendering
The main bottleneck. Each overlay/caption segment is rendered frame-by-frame in headless Chrome. Two parallelism strategies:
- Segment-level — all segments rendered simultaneously via a worker pool (default: CPU core count)
- Frame chunking — segments above 1,000 frames are split into chunks for parallel rendering
ffmpeg Compositing
Each segment is composited independently — the per-segment filter graph is simple since all sources are pre-normalized. GPU filters exist but are limited.
Configuration
Control worker count and chunk size via ~/.montaj/config.json:
{
"render": {
"workers": 8,
"chunkSize": 1000
}
}
Or via CLI flags:
montaj render --workers 4
Color Space Handling
Color space is a per-project setting (settings.colorSpace), not a fixed pipeline assumption. iPhone HDR clips (HEVC, BT.2020/HLG 10-bit) flow through an HDR project with no transformation — the working format already matches their native color space. In an SDR project they're tonemapped via zscale + tonemap (with a fallback path when zscale is missing). Sources whose color space conflicts with the project are converted at intake or, for late-arriving items, in the segment encoder.
Run montaj doctor to verify your ffmpeg has zscale support. zscale is required for any cross-color-space conversion (HDR↔SDR, HLG↔PQ); on systems without it, only direct passes (matching project color space) and a degraded HDR→SDR fallback work. montaj install ffmpeg rebuilds ffmpeg with libzimg on macOS.
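For a manual check outside montaj doctor, the filter list printed by ffmpeg -filters can be scanned for zscale. This is a sketch of one possible check, not a description of how montaj doctor works; hasFilter is a hypothetical helper and the sample text is an abbreviated illustration of the listing format.

```javascript
// Scans `ffmpeg -filters` output for a filter name. Each filter line has
// the form "<flags> <name> <io> <description>", so the name is the
// second whitespace-separated token.
function hasFilter(filtersOutput, name) {
  return filtersOutput
    .split('\n')
    .some((line) => line.trim().split(/\s+/)[1] === name);
}

// Abbreviated sample of the listing format, for illustration only.
const sample = [
  'Filters:',
  ' ... tonemap           V->V       Conversion to/from different dynamic ranges.',
  ' ... zscale            V->V       Apply resizing, colorspace and bit depth conversion.',
].join('\n');
console.log(hasFilter(sample, 'zscale'));
```

On a real system the input string could come from child_process.execFileSync('ffmpeg', ['-hide_banner', '-filters'], { encoding: 'utf8' }).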
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Slow render | Too many Puppeteer workers saturating memory | Reduce --workers |
| Browser timeout errors | Memory pressure from many segments | Browser auto-recycles every 5 jobs; reduce workers if still failing |
| Mixed HDR/SDR color shifts | HDR and SDR sources with mismatched color metadata in the same compose | Fixed by project-color-space contract — every source is converted to the project's working color space (at intake or per-item in the segment encoder) before composing |
| Degraded colors after normalization | zscale filter missing from ffmpeg | Run montaj doctor to check; montaj install ffmpeg to rebuild with libzimg |
| Clips trimmed short at cut points | -t duration measured from keyframe, not inPoint | Fixed by using -to outPoint instead |