
Render Engine

The render pipeline — React + Puppeteer + ffmpeg turns project.json into a final MP4.

The render engine lives in render/ and turns a project.json with status "final" into the final MP4. It reads the caption and overlay tracks, renders each item as a transparent video segment via React + Puppeteer, then composites everything with the source footage via ffmpeg.


How Rendering Works

Invocation

montaj render
# or directly:
node render/render.js <project.json> [--out <path>] [--workers <n>] [--clean]

Project status must be "final" before rendering. The render is non-destructive — source files are never modified.

Pipeline

project.json

    ├─ 1. Validate + resolve paths
    ├─ 2. Collect segment specs + video/image items
    ├─ 3. Process video items (remove_bg if flagged)
    ├─ 4. Bundle JSX → HTML  (one per overlay/caption)
    ├─ 5. Render HTML → FFV1/MKV  (Puppeteer pool)
    ├─ 6. Probe source video dimensions
    └─ 7. Compose → final.mp4

Step 4 — JSX Bundling

Each overlay/caption JSX component is compiled into a self-contained HTML page using esbuild. The page exposes window.__setFrame(n) so Puppeteer can drive it frame-by-frame.
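
A minimal sketch of the page shell such a bundle might produce (the name buildOverlayPage and the window.__rerender hook are illustrative assumptions, not Montaj internals):

```javascript
// Sketch: wrap bundled component JS in a self-contained HTML page that
// exposes window.__setFrame(n). The bundle is assumed to define
// window.__rerender() when it mounts the component.
function buildOverlayPage(bundledJs, props) {
  return `<!doctype html>
<html><body style="margin:0;background:transparent">
<div id="root"></div>
<script>
  window.props = ${JSON.stringify(props)};
  window.frame = 0;
  window.__setFrame = (n) => { window.frame = n; window.__rerender(); };
</script>
<script>${bundledJs}</script>
</body></html>`;
}
```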

Step 5 — Puppeteer Rendering

A pool of N Chromium browsers (default: os.cpus().length) renders each segment in parallel.

Per-job flow:

  1. Open a new page, set viewport to design resolution (1080 x 1920)
  2. Navigate to the bundled HTML file
  3. For each frame: call window.__setFrame(f), wait for paint confirmation, screenshot to PNG
  4. Encode PNG sequence to FFV1 in MKV container
  5. If segment exceeds chunk size, split into chunks and concatenate after encoding

Step 7 — Compositing

A single ffmpeg filter_complex command layers everything onto a black canvas:

color=black (canvas)

    ├─ [each video item]  → scale → overlay (enable=between(t,...))
    │       └─ audio: adelay → amix

    ├─ [each Puppeteer segment]  → format=yuva420p → scale → overlay

    └─ [music if present]  → volume → amix (+ sidechaincompress if ducking)
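
As a rough sketch of how such a graph might be assembled (the helper name and stream labels are illustrative; the overlay and between() syntax is standard ffmpeg):

```javascript
// Sketch: build the filter_complex fragment that layers one segment onto
// the running canvas for its [start, end) time window.
function overlayFilter(baseLabel, segLabel, outLabel, start, end) {
  return `[${baseLabel}][${segLabel}]overlay=format=auto:` +
         `enable='between(t,${start},${end})'[${outLabel}]`;
}

// A black canvas with two segments layered in sequence:
const graph = [
  "color=black:s=1080x1920[c0]",
  overlayFilter("c0", "seg0", "c1", 0, 3),
  overlayFilter("c1", "seg1", "c2", 3, 7.5),
].join(";");
```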

Three Render Stages

montaj render runs three stages:

  1. Base video — trims and concatenates source clips via ffmpeg stream-copy. Canvas projects (no video track) generate a synthetic black base from overlay durations.
  2. Overlay segments — each JSX overlay is bundled with esbuild, rendered frame-by-frame in headless Chromium, and encoded to lossless FFV1/MKV. Segments are rendered at design resolution (1080 x 1920) regardless of output resolution.
  3. Compose — a single ffmpeg filter_complex overlays all segments onto the base. For 4K output, segments are upscaled 2x before compositing.

Output Encoding

Final H.264 MP4:

libx264 -preset fast -crf 18 -pix_fmt yuv420p

HDR source clips (bt2020/HLG) are composed in 10-bit (yuv420p10le) with full bt2020 color metadata preserved end-to-end.
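
A sketch of how the encode arguments might branch on HDR detection (the flag names are real ffmpeg/libx264 options; the selection logic itself is an assumption):

```javascript
// Sketch: choose final-encode arguments depending on whether the source
// is HDR (bt2020/HLG). The exact argument set Montaj builds may differ.
function finalEncodeArgs(isHdr) {
  const args = ["-c:v", "libx264", "-preset", "fast", "-crf", "18"];
  if (isHdr) {
    args.push("-pix_fmt", "yuv420p10le",
              "-color_primaries", "bt2020",
              "-color_trc", "arib-std-b67");   // HLG transfer
  } else {
    args.push("-pix_fmt", "yuv420p");
  }
  return args;
}
```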

Intermediate Files

<project>/
└── render/
    ├── segments/           Puppeteer FFV1/MKV files (wiped each run)
    ├── output_chunk0.mkv   Lossless compose intermediates (if chunked)
    └── final.mp4           Final output

Intermediate files are kept by default and reused on re-runs. Use --clean to delete them after compositing.

Chunked Compositing

When a project has more than 5 video items, the compose step splits the timeline into 30-second passes, renders each as a lossless intermediate, then concatenates them.
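
The pass-splitting math can be sketched as follows (assuming plain fixed 30-second windows; the real splitter may align boundaries to item edges):

```javascript
// Sketch: split a timeline into fixed-length compose passes.
function composePasses(durationSec, passSec = 30) {
  const passes = [];
  for (let t = 0; t < durationSec; t += passSec) {
    passes.push({ start: t, end: Math.min(t + passSec, durationSec) });
  }
  return passes;
}

composePasses(75);
// → [ {start: 0, end: 30}, {start: 30, end: 60}, {start: 60, end: 75} ]
```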


Custom Overlays

Overlays are custom JSX components written by the agent. There are no built-in overlay templates — every overlay is a React component the agent writes, styled to the editing prompt and brand context.

Overlay Item in project.json

{
  "id": "ov-hook",
  "type": "overlay",
  "src": "/abs/path/to/project/overlays/hook.jsx",
  "props": { "text": "She built an AI employee" },
  "start": 0.0,
  "end": 3.0
}

Field              Required  Description
type               yes       "overlay" for custom JSX
src                yes       Absolute path to the JSX file
start / end        yes       Time window in output video (seconds)
props              no        Arbitrary data injected as the props global
offsetX / offsetY  no        Position offset as % of frame size (set by UI drag)
scale              no        Uniform scale multiplier (set by UI resize)

offsetX, offsetY, and scale are applied by the render engine as a CSS transform on the component container. The JSX component itself is unaware of them.
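
The resulting container transform might look like this (the function name is illustrative):

```javascript
// Sketch: CSS transform applied to the component container.
// offsetX/offsetY are percentages of frame size; scale is a multiplier.
function containerTransform({ offsetX = 0, offsetY = 0, scale = 1 }) {
  return `translate(${offsetX}%, ${offsetY}%) scale(${scale})`;
}

containerTransform({ offsetX: 10, offsetY: -5, scale: 1.2 });
// → "translate(10%, -5%) scale(1.2)"
```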

Component Globals

Overlay components have access to these globals:

Global                                       Description
frame                                        Current frame number
fps                                          Frames per second
props                                        Data from the overlay item
interpolate(frame, inputRange, outputRange)  Map frame number to any value
spring({ frame, fps, config })               Physics-based easing (mass, stiffness, damping)
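
A minimal linear interpolate consistent with the signature above might look like this (the actual utility may support multi-stop ranges and easing):

```javascript
// Sketch: map a frame number to an output value, clamping outside the
// input range.
function interpolate(frame, [inMin, inMax], [outMin, outMax]) {
  const t = Math.min(1, Math.max(0, (frame - inMin) / (inMax - inMin)));
  return outMin + t * (outMax - outMin);
}

interpolate(15, [0, 30], [0, 1]);   // → 0.5
interpolate(45, [0, 30], [0, 1]);   // → 1 (clamped)
```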

How Overlays Are Rendered

  1. The JSX file is bundled into a self-contained HTML page by esbuild
  2. Puppeteer opens the page in headless Chrome at 1080 x 1920 (transparent background)
  3. For each frame: window.__setFrame(n) increments, screenshot to PNG with alpha
  4. ffmpeg encodes the PNG sequence into a transparent FFV1/MKV video segment
  5. The segment is composited onto the source footage in the final ffmpeg filter graph

Parallelism

Overlay rendering is CPU-bound. Two levels of parallelism:

  • Segment-level — all overlay segments are independent and rendered simultaneously by a worker pool
  • Frame chunking — segments above 1,000 frames (~33s at 30fps) are split into chunks, each rendered by a separate worker
caption track (18,000 frames) → 18 chunks × 1,000 frames → 18 workers
lower-third (135 frames)      → 1 chunk → 1 worker
flash (9 frames)              → 1 chunk → 1 worker
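
The chunk counts above follow directly from the chunk size:

```javascript
// Sketch: chunk-count math behind the examples above (chunkSize 1000).
function chunkCount(frames, chunkSize = 1000) {
  return Math.max(1, Math.ceil(frames / chunkSize));
}

chunkCount(18000);  // → 18
chunkCount(135);    // → 1
chunkCount(9);      // → 1
```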

Configurable via ~/.montaj/config.json:

{ "render": { "workers": 8, "chunkSize": 1000 } }

Browser Recycling

Each Puppeteer worker restarts its browser every 5 jobs. After many segments, browser processes accumulate memory and can start timing out. Recycling flushes that state.
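
The per-worker recycling decision can be sketched as follows (the class name is illustrative; the real pool would also close and relaunch the Chromium instance):

```javascript
// Sketch: track jobs per worker and signal a browser restart every
// N jobs (5 by default, matching the behavior described above).
class WorkerState {
  constructor(recycleEvery = 5) {
    this.recycleEvery = recycleEvery;
    this.jobs = 0;
  }
  // Returns true when the browser should be restarted before the next job.
  finishJob() {
    this.jobs += 1;
    return this.jobs % this.recycleEvery === 0;
  }
}
```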

Design Resolution

Overlays are always rendered at 1080 x 1920 regardless of output resolution. The pipeline upscales at compose time for higher resolutions (e.g., 2x for 4K).


Caption Templates

Caption templates are pre-built React components referenced by style name. Unlike overlays (which are always custom JSX), captions use built-in templates selected by the agent or user.

Available Styles

Style         Description
word-by-word  One word at a time, spring pop-in animation
pop           Segment-at-a-time with scale entry animation
karaoke       Words highlight progressively as they're spoken
subtitle      Static line at bottom, segments replace sequentially

How Captions Work

  1. transcribe step generates word-level timestamps via whisper.cpp
  2. caption step converts the transcript into a caption track with a chosen style
  3. Caption data is stored inline in project.json (not as a file pointer)
  4. At render time, the template component receives the caption segments and renders frame-by-frame via Puppeteer

Caption Data Format

{
  "id": "captions",
  "type": "caption",
  "style": "word-by-word",
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.2,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5 },
        { "word": "world", "start": 0.5, "end": 1.2 }
      ]
    }
  ]
}

Each segment contains:

  • text — the full segment text
  • start / end — segment time window (seconds)
  • words — individual word timestamps for styles that animate per-word
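
For a per-word style like karaoke, the active word at output time t can be found with a lookup like this (the function name is illustrative):

```javascript
// Sketch: find the index of the word being spoken at time t (seconds),
// or -1 if no word is active.
function activeWordIndex(segment, t) {
  return segment.words.findIndex(w => t >= w.start && t < w.end);
}

const seg = {
  text: "Hello world",
  start: 0.0, end: 1.2,
  words: [
    { word: "Hello", start: 0.0, end: 0.5 },
    { word: "world", start: 0.5, end: 1.2 },
  ],
};
activeWordIndex(seg, 0.7);   // → 1
```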

Rendering

Caption templates produce the same output as custom overlays: rendered frame-by-frame by Puppeteer, composited into the video by ffmpeg. The template component uses interpolate and spring utilities for animation.

Choosing a Style

Via CLI:

montaj step caption --input transcript.json --style word-by-word
montaj step caption --input transcript.json --style karaoke
montaj step caption --input transcript.json --style pop
montaj step caption --input transcript.json --style subtitle

Via the editing prompt:

"add word-by-word captions"
"karaoke-style captions, bold text"

Editing Captions

In the UI review phase, you can:

  • Click a caption segment to edit its text inline
  • Drag segments to adjust their timing
  • Change the caption style

Changes update the segments array in the caption track of project.json.


GPU Acceleration

The Montaj render pipeline is mostly CPU-bound. GPU acceleration applies at one specific step.

Where GPU Applies

Step                                      Bound  GPU
Puppeteer frame rendering                 CPU    Parallelism is the lever
ffmpeg compositing (filter graph)         CPU    Limited GPU filter support
ffmpeg intermediate encode (PNG to FFV1)  CPU    Alpha formats lack hwaccel support
Final H.264 encode                        GPU    VideoToolbox (macOS), NVENC (NVIDIA), VAAPI (Intel/Linux)

ffmpeg detects and uses available hardware encoders automatically, providing a 5-10x speedup on the final encode.
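
A sketch of platform-based encoder selection (the encoder names are real ffmpeg encoders; the detection logic here is a simplification — real code should probe ffmpeg's encoder list at runtime):

```javascript
// Sketch: pick a hardware H.264 encoder by platform, falling back to
// software libx264 when no hardware path applies.
function pickH264Encoder(platform, hasNvidia = false) {
  if (platform === "darwin") return "h264_videotoolbox";  // macOS
  if (hasNvidia) return "h264_nvenc";                     // NVIDIA
  if (platform === "linux") return "h264_vaapi";          // Intel/Linux
  return "libx264";                                       // CPU fallback
}
```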

CPU-Bound Stages

Puppeteer Frame Rendering

The main bottleneck. Each overlay/caption segment is rendered frame-by-frame in headless Chrome. Two parallelism strategies:

  1. Segment-level — all segments rendered simultaneously via a worker pool (default: CPU core count)
  2. Frame chunking — segments above 1,000 frames are split into chunks for parallel rendering

ffmpeg Compositing

The filter graph overlays all Puppeteer segments onto the base video. GPU filters exist but are limited — this remains CPU-bound.

Configuration

Control worker count and chunk size via ~/.montaj/config.json:

{
  "render": {
    "workers": 8,
    "chunkSize": 1000
  }
}

Or via CLI flags:

montaj render --workers 4

HDR Handling

iPhone HEVC source footage (BT.2020/HLG 10-bit) is handled automatically. The pipeline:

  • Composes in 10-bit (yuv420p10le)
  • Preserves BT.2020 color primaries
  • Corrects the YCbCr matrix (bt2020nc to bt709) via setparams in the filter graph
  • Preserves HLG transfer function

No tone-mapping filters are applied — passing the pixels through untouched with correct color metadata produces better visual output than tone-mapping.
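
The matrix correction might appear in the graph as a fragment like this (the option names are standard ffmpeg setparams options; the exact placement in Montaj's graph is an assumption):

```javascript
// Sketch: setparams fragment tagging frames with the corrected bt709
// YCbCr matrix while keeping bt2020 primaries and the HLG
// (arib-std-b67) transfer. setparams only rewrites metadata; no pixel
// conversion happens.
const hdrSetparams =
  "setparams=colorspace=bt709:color_primaries=bt2020:color_trc=arib-std-b67";
```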

Troubleshooting

Issue                   Cause                                         Solution
Slow render             Too many Puppeteer workers saturating memory  Reduce --workers
Browser timeout errors  Memory pressure from many segments            Browser auto-recycles every 5 jobs; reduce workers if still failing
Pale/yellowish output   Color metadata not propagated correctly       Handled automatically via setparams in the filter graph