Render Engine
The render pipeline — React + Puppeteer + ffmpeg — turns project.json into a final MP4.
The render engine lives in render/ and turns a project.json with status "final" into a finished MP4. It reads the captions and overlays tracks, renders each item as a transparent video segment via React + Puppeteer, then composites everything with the source footage via ffmpeg.
How Rendering Works
Invocation
```bash
montaj render
# or directly:
node render/render.js <project.json> [--out <path>] [--workers <n>] [--clean]
```

Project status must be "final" before rendering. The render is non-destructive — source files are never modified.
Pipeline
```text
project.json
 │
 ├─ 1. Validate + resolve paths
 ├─ 2. Collect segment specs + video/image items
 ├─ 3. Process video items (remove_bg if flagged)
 ├─ 4. Bundle JSX → HTML (one per overlay/caption)
 ├─ 5. Render HTML → FFV1/MKV (Puppeteer pool)
 ├─ 6. Probe source video dimensions
 └─ 7. Compose → final.mp4
```

Step 4 — JSX Bundling
Each overlay/caption JSX component is compiled into a self-contained HTML page using esbuild. The page exposes window.__setFrame(n) so Puppeteer can drive it frame-by-frame.
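The page-side contract can be sketched as a hypothetical, simplified harness. Only `window.__setFrame(n)` comes from the docs; `registerRenderer` is an invented name, and the real bundle presumably re-renders the React component on each frame, which is elided here:

```javascript
// Hypothetical sketch of the harness esbuild bundles into each overlay page.
// registerRenderer is an invented name; only window.__setFrame(n) is from
// the docs. The real harness would re-render the React component here.
let currentFrame = 0;
let renderFrame = (n) => {}; // replaced by the bundled component's renderer

function registerRenderer(fn) {
  renderFrame = fn;
}

// Puppeteer calls this via page.evaluate before each screenshot.
function __setFrame(n) {
  currentFrame = n;
  renderFrame(n); // triggers a re-render in the real harness
  return currentFrame;
}
```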
Step 5 — Puppeteer Rendering
A pool of N Chromium browsers (default: os.cpus().length) renders each segment in parallel.
Per-job flow:
- Open a new page, set viewport to design resolution (1080 x 1920)
- Navigate to the bundled HTML file
- For each frame: call window.__setFrame(f), wait for paint confirmation, screenshot to PNG
- Encode the PNG sequence to FFV1 in an MKV container
- If the segment exceeds the chunk size, split it into chunks and concatenate after encoding
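The PNG-to-FFV1 encode might be invoked roughly as below. This is a sketch: the file layout and frame-number pattern are assumptions, though `ffv1` and `yuva420p` are real ffmpeg names:

```javascript
// Builds hypothetical ffmpeg arguments for encoding a PNG sequence with
// alpha into a lossless FFV1/MKV segment. Paths and the %05d pattern are
// illustrative assumptions.
function ffv1EncodeArgs(pngDir, fps, outPath) {
  return [
    '-framerate', String(fps),
    '-i', `${pngDir}/frame_%05d.png`,
    '-c:v', 'ffv1',          // lossless intra-frame codec
    '-pix_fmt', 'yuva420p',  // keep the alpha channel
    outPath,
  ];
}
```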
Step 7 — Compositing
A single ffmpeg filter_complex command layers everything onto a black canvas:
```text
color=black (canvas)
 │
 ├─ [each video item] → scale → overlay (enable=between(t,...))
 │    └─ audio: adelay → amix
 │
 ├─ [each Puppeteer segment] → format=yuva420p → scale → overlay
 │
 └─ [music if present] → volume → amix (+ sidechaincompress if ducking)
```

Three Render Stages
montaj render runs three stages:
- Base video — trims and concatenates source clips via ffmpeg stream-copy. Canvas projects (no video track) generate a synthetic black base from overlay durations.
- Overlay segments — each JSX overlay is bundled with esbuild, rendered frame-by-frame in headless Chromium, and encoded to lossless FFV1/MKV. Segments are rendered at design resolution (1080 x 1920) regardless of output resolution.
- Compose — a single ffmpeg filter_complex overlays all segments onto the base. For 4K output, segments are upscaled 2x before compositing.
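One overlay step of the compose filter might be assembled like the sketch below. The stream labels are invented, but `overlay` and the `enable='between(t,...)'` expression are real ffmpeg constructs:

```javascript
// Hypothetical builder for a single overlay step of the filter_complex
// string. baseLabel/segLabel/outLabel are invented label names; chaining
// works by feeding one step's output label in as the next step's base.
function overlayFilter(baseLabel, segLabel, outLabel, start, end) {
  return `[${baseLabel}][${segLabel}]overlay=enable='between(t,${start},${end})'[${outLabel}]`;
}
```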
Output Encoding
Final H.264 MP4:
```text
libx264 -preset fast -crf 18 -pix_fmt yuv420p
```

HDR source clips (bt2020/HLG) are composed in 10-bit (yuv420p10le) with full bt2020 color metadata preserved end-to-end.
Intermediate Files
```text
<project>/
└── render/
    ├── segments/            Puppeteer FFV1/MKV files (wiped each run)
    ├── output_chunk0.mkv    Lossless compose intermediates (if chunked)
    └── final.mp4            Final output
```

Intermediate files are kept by default and reused on re-runs. Use --clean to delete them after compositing.
Chunked Compositing
When a project has more than 5 video items, the compose step splits the timeline into 30-second passes, renders each as a lossless intermediate, then concatenates them.
Custom Overlays
Overlays are custom JSX components written by the agent. There are no built-in overlay templates — every overlay is a React component the agent writes, styled to the editing prompt and brand context.
Overlay Item in project.json
```json
{
  "id": "ov-hook",
  "type": "overlay",
  "src": "/abs/path/to/project/overlays/hook.jsx",
  "props": { "text": "She built an AI employee" },
  "start": 0.0,
  "end": 3.0
}
```

| Field | Required | Description |
|---|---|---|
| type | yes | "overlay" for custom JSX |
| src | yes | Absolute path to the JSX file |
| start / end | yes | Time window in output video (seconds) |
| props | no | Arbitrary data injected as the props global |
| offsetX / offsetY | no | Position offset as % of frame size (set by UI drag) |
| scale | no | Uniform scale multiplier (set by UI resize) |
offsetX, offsetY, and scale are applied by the render engine as a CSS transform on the component container. The JSX component itself is unaware of them.
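That mapping might look like the sketch below; the exact formula is an assumption, only the field semantics (percent of frame size, uniform scale) come from the docs:

```javascript
// Sketch of how the engine might turn offsetX/offsetY (percent of frame)
// and scale into the CSS transform on the overlay container. The formula
// is an assumption.
const DESIGN = { width: 1080, height: 1920 }; // design resolution

function containerTransform({ offsetX = 0, offsetY = 0, scale = 1 }) {
  const dx = (offsetX / 100) * DESIGN.width;
  const dy = (offsetY / 100) * DESIGN.height;
  return `translate(${dx}px, ${dy}px) scale(${scale})`;
}
```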
Component Globals
Overlay components have access to these globals:
| Global | Description |
|---|---|
| frame | Current frame number |
| fps | Frames per second |
| props | Data from the overlay item |
| interpolate(frame, inputRange, outputRange) | Map a frame number to any value range |
| spring({ frame, fps, config }) | Physics-based easing (mass, stiffness, damping) |
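Minimal sketches of the two animation helpers, assuming Remotion-style semantics (linear mapping with clamping; a damped spring integrated per frame). The real implementations and default constants may differ:

```javascript
// Linear map from inputRange to outputRange, clamped at the ends.
// Assumes two-element ranges; the real helper may support more stops.
function interpolate(frame, [inMin, inMax], [outMin, outMax]) {
  const t = Math.min(Math.max((frame - inMin) / (inMax - inMin), 0), 1);
  return outMin + t * (outMax - outMin);
}

// Damped spring from 0 toward 1, stepped with semi-implicit Euler at 1/fps.
// The default mass/stiffness/damping values are assumptions.
function spring({ frame, fps, config = {} }) {
  const { mass = 1, stiffness = 100, damping = 10 } = config;
  const dt = 1 / fps;
  let x = 0; // position (animation value)
  let v = 0; // velocity
  for (let i = 0; i < frame; i++) {
    const accel = (stiffness * (1 - x) - damping * v) / mass;
    v += accel * dt;
    x += v * dt;
  }
  return x;
}
```

A component might use them like `const opacity = interpolate(frame, [0, 15], [0, 1]);` for a half-second fade-in at 30fps.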
How Overlays Are Rendered
- The JSX file is bundled into a self-contained HTML page by esbuild
- Puppeteer opens the page in headless Chrome at 1080 x 1920 (transparent background)
- For each frame: window.__setFrame(n) increments, screenshot to PNG with alpha
- ffmpeg encodes the PNG sequence into a transparent FFV1/MKV video segment
- The segment is composited onto the source footage in the final ffmpeg filter graph
Parallelism
Overlay rendering is CPU-bound. Two levels of parallelism:
- Segment-level — all overlay segments are independent and rendered simultaneously by a worker pool
- Frame chunking — segments above 1,000 frames (~33s at 30fps) are split into chunks, each rendered by a separate worker
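The chunk split described above can be sketched as follows (assuming half-open frame ranges; the helper name is invented):

```javascript
// Splits a segment into chunks of at most chunkSize frames; each chunk is
// a half-open [start, end) frame range rendered by its own worker.
function frameChunks(totalFrames, chunkSize = 1000) {
  const chunks = [];
  for (let start = 0; start < totalFrames; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize, totalFrames) });
  }
  return chunks;
}
```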
```text
caption track (18,000 frames) → 18 chunks × 1,000 frames → 18 workers
lower-third  (135 frames)     → 1 chunk → 1 worker
flash        (9 frames)       → 1 chunk → 1 worker
```

Configurable via ~/.montaj/config.json:

```json
{ "render": { "workers": 8, "chunkSize": 1000 } }
```

Browser Recycling
Each Puppeteer worker restarts its browser every 5 jobs. After many segments, browser processes accumulate memory and can start timing out. Recycling flushes that state.
Design Resolution
Overlays are always rendered at 1080 x 1920 regardless of output resolution. The pipeline upscales at compose time for higher resolutions (e.g., 2x for 4K).
Caption Templates
Caption templates are pre-built React components referenced by style name. Unlike overlays (which are always custom JSX), captions use built-in templates selected by the agent or user.
Available Styles
| Style | Description |
|---|---|
word-by-word | One word at a time, spring pop-in animation |
pop | Segment-at-a-time with scale entry animation |
karaoke | Words highlight progressively as they're spoken |
subtitle | Static line at bottom, segments replace sequentially |
How Captions Work
- The transcribe step generates word-level timestamps via whisper.cpp
- The caption step converts the transcript into a caption track with a chosen style
- Caption data is stored inline in project.json (not as a file pointer)
- At render time, the template component receives the caption segments and renders frame-by-frame via Puppeteer
Caption Data Format
```json
{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}
```

Each segment contains:
- text — the full segment text
- start / end — segment time window (seconds)
- words — individual word timestamps, for styles that animate per-word
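A per-word style like word-by-word or karaoke needs to find the word active at a given time. A minimal sketch, using the sample segment shape (the helper name is invented):

```javascript
// Returns the word active at time t (seconds) within a caption segment,
// or null between words / outside the segment.
function activeWordAt(segment, t) {
  return segment.words.find((w) => t >= w.start && t < w.end) || null;
}

// Sample segment matching the documented caption data format.
const segment = {
  text: 'Hello world',
  start: 0.0,
  end: 1.2,
  words: [
    { word: 'Hello', start: 0.0, end: 0.5 },
    { word: 'world', start: 0.5, end: 1.2 },
  ],
};
```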
Rendering
Caption templates produce the same output as custom overlays: rendered frame-by-frame by Puppeteer, composited into the video by ffmpeg. The template component uses interpolate and spring utilities for animation.
Choosing a Style
Via CLI:
```bash
montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style subtitle
```

Via the editing prompt:

```text
"add word-by-word captions"
"karaoke-style captions, bold text"
```

Editing Captions
In the UI review phase, you can:
- Click a caption segment to edit its text inline
- Drag segments to adjust their timing
- Change the caption style
Changes update the segments array in the caption track of project.json.
GPU Acceleration
The Montaj render pipeline is mostly CPU-bound. GPU acceleration applies at one specific step.
Where GPU Applies
| Step | Bottleneck | GPU role |
|---|---|---|
| Puppeteer frame rendering | CPU | Parallelism is the lever |
| ffmpeg compositing (filter graph) | CPU | Limited GPU filter support |
| ffmpeg intermediate encode (PNG to FFV1) | CPU | Alpha formats lack hwaccel support |
| Final H.264 encode | GPU | VideoToolbox (macOS), NVENC (NVIDIA), VAAPI (Intel/Linux) |
ffmpeg detects and uses available hardware encoders automatically, providing a 5-10x speedup on the final encode.
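A platform-to-encoder mapping could look like the sketch below. The encoder names are real ffmpeg encoders matching the table above, but the selection logic is illustrative only (for instance, running on Windows does not imply an NVIDIA GPU):

```javascript
// Maps a Node process.platform value to a hardware H.264 encoder name.
// Illustrative assumption: real detection would probe ffmpeg's encoder
// list rather than trust the platform alone.
function hwEncoderFor(platform) {
  switch (platform) {
    case 'darwin': return 'h264_videotoolbox'; // macOS VideoToolbox
    case 'win32':  return 'h264_nvenc';        // assumes an NVIDIA GPU
    case 'linux':  return 'h264_vaapi';        // Intel/Linux VAAPI
    default:       return 'libx264';           // software fallback
  }
}
```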
CPU-Bound Stages
Puppeteer Frame Rendering
The main bottleneck. Each overlay/caption segment is rendered frame-by-frame in headless Chrome. Two parallelism strategies:
- Segment-level — all segments rendered simultaneously via a worker pool (default: CPU core count)
- Frame chunking — segments above 1,000 frames are split into chunks for parallel rendering
ffmpeg Compositing
The filter graph overlays all Puppeteer segments onto the base video. GPU filters exist but are limited — this remains CPU-bound.
Configuration
Control worker count and chunk size via ~/.montaj/config.json:
```json
{
  "render": {
    "workers": 8,
    "chunkSize": 1000
  }
}
```

Or via CLI flags:

```bash
montaj render --workers 4
```

HDR Handling
iPhone HEVC source footage (BT.2020/HLG 10-bit) is handled automatically. The pipeline:
- Composes in 10-bit (yuv420p10le)
- Preserves BT.2020 color primaries
- Corrects the YCbCr matrix (bt2020nc to bt709) via setparams in the filter graph
- Preserves the HLG transfer function
No tone-mapping filters are applied — passing the 10-bit pixels through unchanged and correcting only the color metadata produces better visual output than tone-mapping.
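That metadata-only correction might translate into a filter fragment like the sketch below. `arib-std-b67` is ffmpeg's name for the HLG transfer function, but the exact option set is an assumption:

```javascript
// Builds a hypothetical filter fragment: compose in 10-bit, keep BT.2020
// primaries and the HLG transfer function, retag only the YCbCr matrix.
function hdrFilterFragment() {
  return [
    'format=yuv420p10le',
    'setparams=color_primaries=bt2020:color_trc=arib-std-b67:colorspace=bt709',
  ].join(',');
}
```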
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Slow render | Too many Puppeteer workers saturating memory | Reduce --workers |
| Browser timeout errors | Memory pressure from many segments | Browser auto-recycles every 5 jobs; reduce workers if still failing |
| Pale/yellowish output | Color metadata not propagated correctly | Handled automatically via setparams in the filter graph |