VOD Deep Dive Part 4: Container Formats — .mp4 Is Not a Codec
Containers vs codecs, MP4 internals (Box structure), the faststart trap, fragmented MP4, CMAF for unified HLS+DASH, segment length trade-offs, and subtitle formats.
This is Part 4 of the VOD Streaming Deep Dive series.
The Most Common Misconception
Here’s a mistake every beginner makes:
❌ “The video codec is MP4.”
✅ “The video container is MP4. The codec is H.264 (or H.265, AV1, etc.).”
MP4 is a container — a box. The box can hold any supported codec stream.
┌──────────────────────────────────────┐
│ MP4 Container (a .mp4 file) │
│ ┌────────────────────────────────┐ │
│ │ Video stream: H.264/H.265/AV1 │ │
│ └────────────────────────────────┘ │
│ ┌────────────────────────────────┐ │
│ │ Audio stream: AAC / Opus / AC-3 │ │
│ └────────────────────────────────┘ │
│ ┌────────────────────────────────┐ │
│ │ Subtitle stream: TTML / WebVTT │ │
│ └────────────────────────────────┘ │
│ ┌────────────────────────────────┐ │
│ │ Metadata: title, duration, index│ │
│ └────────────────────────────────┘ │
└──────────────────────────────────────┘
Think of MP4 as a shipping box. H.264 is the product inside. The same box can hold a phone (H.264), a shirt (H.265), or fruit (AV1).
Why Do We Need Containers?
Can’t we just store raw encoded bitstreams?
No. A raw bitstream on its own does not tell the player:
- How to synchronize video and audio playback
- How to align subtitles
- Where to start reading when the user seeks to 1:00
- How many audio tracks exist
- What codec is used, with what parameters
Containers organize all of this so the player can read data along a timeline.
Major Container Formats
| Container | Extension | Creator | Typical contents | Best for |
|---|---|---|---|---|
| MP4 / fMP4 | .mp4 .m4s .m4a | MPEG | H.264/H.265/AV1 + AAC | Most universal, streaming default |
| MOV | .mov | Apple | Nearly anything | macOS production, mezzanine transfer |
| MKV | .mkv | Matroska.org | Nearly anything | HD downloads, open-source community |
| WebM | .webm | Google | VP9/AV1 + Opus | Web (non-iOS) |
| MPEG-TS | .ts | MPEG | H.264/HEVC + AAC/AC-3 | Broadcast, legacy HLS |
| FLV | .flv | Adobe | H.264 + AAC | Retired (RTMP legacy) |
For VOD, you realistically only need two: MP4/fMP4 (streaming) and MOV (mezzanine transfer).
MP4 / ISO BMFF Internals
MP4’s formal standard is ISO Base Media File Format (ISO BMFF), ISO/IEC 14496-12. Internally, it’s composed of nested units called Boxes (also called atoms):
File start
│
├── [ftyp] File Type Box ← Tells the player "this is MP4"
│
├── [moov] Movie Box ← Metadata "table of contents":
│ ├── [mvhd] Movie Header (total duration, timescale)
│ ├── [trak] Track Box (one per audio/video/subtitle track)
│ │ ├── [tkhd] Track Header
│ │ └── [mdia] Media
│ │ ├── [mdhd] Media Header
│ │ ├── [hdlr] Handler (vide/soun/subt)
│ │ └── [minf]
│ │ └── [stbl] Sample Table ← byte offset of every frame
│ └── [trak] ... (more tracks)
│
└── [mdat] Media Data Box ← The actual audio/video binary data
└── (large binary blob)
Two key concepts:
- moov: The metadata/directory. Tells the player “frame 1 is at byte 12345, frame 2 is at byte 13800…”
- mdat: The actual data. Pure H.264 + AAC binary.
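To make the Box structure concrete, here is a minimal Python sketch that walks the top-level boxes of a file. It relies only on the standard ISO BMFF header layout (a 4-byte big-endian size, a 4-byte type, and an optional 64-bit size when the 32-bit size is 1). The names iter_boxes and make_box are illustrative helpers, not part of any library, and the sketch does not descend into nested boxes such as trak.

```python
import io
import struct


def iter_boxes(stream):
    """Yield (box_type, offset, size) for each top-level box in a seekable binary stream."""
    offset = 0
    while True:
        stream.seek(offset)
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # size == 1 means a 64-bit "largesize" follows the type field
            large = stream.read(8)
            if len(large) < 8:
                raise ValueError(f"truncated box header at offset {offset}")
            size = struct.unpack(">Q", large)[0]
            header_len = 16
        elif size == 0:
            # size == 0 means the box runs to the end of the file
            size = stream.seek(0, io.SEEK_END) - offset
        if size < header_len:
            raise ValueError(f"corrupt box at offset {offset}")
        yield box_type.decode("latin-1"), offset, size
        offset += size


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size header (demo helper, not a real muxer)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Usage on a real file (path is an assumption):
#   with open("input.mp4", "rb") as f:
#       for box_type, offset, size in iter_boxes(f):
#           print(box_type, offset, size)
```

Because only headers are read and payloads are skipped with seek, this stays fast even on multi-gigabyte files.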
The Faststart Trap
By default, MP4 encoders place the moov box at the end of the file (because the full timestamp index isn’t known until encoding completes).
Default MP4:
┌─────────┬─────────────────────────────────┬──────┐
│ ftyp │ mdat (99.9%) │ moov │
└─────────┴─────────────────────────────────┴──────┘
tens of bytes hundreds of MB/GB tens of KB
The problem: a web player must read moov before it can play. With moov at the end, the player has to either download the whole file first or make extra HTTP range requests to fetch the tail, so playback starts late.
Solution: Faststart (moov at the front)
Faststart MP4:
┌─────────┬──────┬──────────────────────────────┐
│ ftyp │ moov │ mdat │
└─────────┴──────┴──────────────────────────────┘
↑
After reading this (< 1 MB), playback can begin
ffmpeg -i input.mov -c copy -movflags +faststart output.mp4
VOD rule: always apply faststart before publishing. Without it, users see a long blank screen after pressing play.
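If you want to verify the rule in a publishing pipeline, one possible check is to read the top-level box order and confirm that moov comes before mdat. The sketch below is an assumption about how such a check could look; top_level_types and is_faststart are hypothetical names, and the code only inspects box headers.

```python
import io
import struct


def top_level_types(stream) -> list[str]:
    """Return top-level box types in file order, reading only box headers."""
    types, offset = [], 0
    while True:
        stream.seek(offset)
        header = stream.read(16)
        if len(header) < 8:
            break
        size, box_type = struct.unpack_from(">I4s", header)
        types.append(box_type.decode("latin-1"))
        min_size = 8
        if size == 1 and len(header) == 16:
            size = struct.unpack_from(">Q", header, 8)[0]  # 64-bit largesize
            min_size = 16
        if size < min_size:
            break  # size 0 (box runs to end of file) or a corrupt header: stop scanning
        offset += size
    return types


def is_faststart(stream) -> bool:
    """True when the moov box appears before the first mdat box."""
    types = top_level_types(stream)
    if "moov" not in types or "mdat" not in types:
        raise ValueError("expected both moov and mdat at the top level")
    return types.index("moov") < types.index("mdat")
```

A typical call would be is_faststart(open("output.mp4", "rb")), where the file name is whatever your pipeline produced.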
Fragmented MP4 (fMP4): The Streaming Choice
Traditional MP4 has a problem: the moov box describes the entire file. For a multi-hour movie, moov grows to megabytes — slow to parse and expensive to update.
fMP4 (Fragmented MP4) splits the file into many small fragments, each with its own mini-moov (called moof):
fMP4 structure:
┌──────┬──────┐ ┌──────┬──────┐ ┌──────┬──────┐ ┌──────┬──────┐
│ moov │ - │ │ moof │ mdat │ │ moof │ mdat │ │ moof │ mdat │
│ init │ │ │ frag1│frag1 │ │ frag2│frag2 │ │ frag3│frag3 │
└──────┴──────┘ └──────┴──────┘ └──────┴──────┘ └──────┴──────┘
Standalone
init segment Each fragment is independently decodable
Two core benefits:
- Independent segments: Each moof+mdat pair can be stored as a separate file (.m4s). Exactly what streaming protocols need.
- Fast startup: Startup only requires a tiny init segment (~tens of KB), not the entire moov.
Modern HLS, DASH, and CMAF are all based on fMP4. MPEG-TS (legacy HLS) is being phased out.
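As a rough illustration of "each moof+mdat pair can be a separate file", the following sketch splits a fragmented MP4 held in memory into an init segment and a list of fragments. It assumes the input contains only ftyp, moov, moof, and mdat at the top level; real packagers also handle boxes such as styp, sidx, and mfra, which this sketch would simply leave attached to the preceding fragment. split_fmp4 is a hypothetical name.

```python
import struct


def split_fmp4(data: bytes) -> tuple[bytes, list[bytes]]:
    """Split fragmented MP4 bytes into (init_segment, [fragment, ...])."""
    moof_offsets = []  # byte offset of every top-level moof box
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        if size == 1:
            if pos + 16 > len(data):
                raise ValueError(f"truncated box header at offset {pos}")
            size = struct.unpack_from(">Q", data, pos + 8)[0]  # 64-bit largesize
        elif size == 0:
            size = len(data) - pos  # box runs to the end of the data
        if size < 8 or pos + size > len(data):
            raise ValueError(f"corrupt or truncated box at offset {pos}")
        if box_type == b"moof":
            moof_offsets.append(pos)
        pos += size
    if not moof_offsets:
        raise ValueError("no moof box found: input is not fragmented")
    # Everything before the first moof (ftyp + moov) is the init segment.
    init_segment = data[: moof_offsets[0]]
    # Each fragment runs from one moof to the next moof (or to the end).
    bounds = moof_offsets + [len(data)]
    fragments = [data[a:b] for a, b in zip(bounds, bounds[1:])]
    return init_segment, fragments
```

Concatenating the init segment and all fragments gives back the original bytes, which is why a player can fetch them as separate files and feed them to the decoder in order.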
CMAF: One File to Rule HLS and DASH
Historically, Apple pushed HLS (with TS segments) and the rest of the industry pushed DASH (with fMP4 segments). Same video, stored twice.
CMAF (Common Media Application Format), standardized in 2018, fixes this:
HLS and DASH share the same fMP4 files. Only the manifest differs.
One set of CMAF fMP4 segments:
┌──────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ init │ │ seg1 │ │ seg2 │ │ seg3 │ ← Only one copy on disk
└──────┘ └───────┘ └───────┘ └───────┘
↑ ↑
┌────────┴───────┐ ┌──┴──────────┐
│ HLS master.m3u8│ │ DASH .mpd │
│ points to same │ │ points to │
│ segments │ │ same segments│
└────────────────┘ └─────────────┘
Benefits: roughly half the segment storage, better CDN cache hit rates (HLS and DASH clients request the same objects), and a single transcode.
Trade-off: every target platform must support CMAF (modern ones do), and to share encrypted segments across DRM systems you must use the cbcs encryption scheme, because it is the one FairPlay accepts.
For projects started after 2020: use CMAF directly. Don’t do “HLS TS + DASH fMP4” dual publishing.
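To show what "only the manifest differs" means in practice, here is a small sketch that describes the same segment files twice: once as an HLS media playlist and once as a DASH SegmentTemplate. The function names are hypothetical, the file names follow the init.mp4 / seg_0000.m4s pattern used later in this article, and the output is deliberately minimal (a complete MPD needs more surrounding elements than the single SegmentTemplate shown here).

```python
import math
import re


def hls_media_playlist(init_uri: str, segments: list[tuple[str, float]]) -> str:
    """Build a minimal VOD media playlist for fMP4 segments."""
    target = math.ceil(max(duration for _, duration in segments))
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:7",
        f"#EXT-X-TARGETDURATION:{target}",
        "#EXT-X-PLAYLIST-TYPE:VOD",
        f'#EXT-X-MAP:URI="{init_uri}"',  # points at the shared init segment
    ]
    for uri, duration in segments:
        lines += [f"#EXTINF:{duration:.3f},", uri]
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


def dash_segment_template(init_uri: str, media_template: str,
                          seconds: int, start_number: int = 0) -> str:
    """Build the SegmentTemplate element that addresses the same files in an MPD."""
    return (
        f'<SegmentTemplate timescale="1" duration="{seconds}" '
        f'startNumber="{start_number}" '
        f'initialization="{init_uri}" media="{media_template}"/>'
    )


def expand_dash_template(media_template: str, start_number: int, count: int) -> list[str]:
    """Expand $Number$ / $Number%0Nd$ the way a DASH client would."""
    def substitute(number: int) -> str:
        return re.sub(
            r"\$Number(%0\d+d)?\$",
            lambda m: (m.group(1) or "%d") % number,
            media_template,
        )
    return [substitute(n) for n in range(start_number, start_number + count)]
```

With three 4-second segments, the URIs listed in the HLS playlist and the URLs produced by expanding seg_$Number%04d$.m4s are identical, so both manifests resolve to the one copy on disk.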
Segment Length: How Long Should Each Slice Be?
| Segment length | Pros | Cons | Best for |
|---|---|---|---|
| 1–2 sec | Fast startup, quick ABR adaptation | Many files, many HTTP requests | Short-form video, low-latency live |
| 2–4 sec | Balanced | No strong drawbacks | Common general-purpose default (about 4 s); Apple's HLS authoring guidance suggests 6 s |
| 6–10 sec | Fewer HTTP requests, CDN-friendly | Slow startup, coarse seeking | Long movies, traditional broadcast |
Short-form video apps should use 2-second segments (users swipe frequently). Long-form VOD should use 4–6 seconds.
Segment length must be an integer multiple of the GOP duration — see Part 1 on keyframes.
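The alignment rule is simple arithmetic, sketched below. GOP duration is the GOP size in frames divided by the frame rate, and the segment length divided by the GOP duration must come out as a whole number. The function names are illustrative, and the frame rates in the examples are assumptions, since the right value depends on your source.

```python
from fractions import Fraction


def gop_duration(gop_frames: int, fps) -> Fraction:
    """GOP duration in seconds. Pass fps as an int or Fraction, e.g. Fraction(30000, 1001)."""
    return Fraction(gop_frames) / Fraction(fps)


def gops_per_segment(segment_seconds, gop_frames: int, fps) -> int:
    """Return how many whole GOPs fit in one segment, or raise if they do not align."""
    ratio = Fraction(segment_seconds) / gop_duration(gop_frames, fps)
    if ratio.denominator != 1 or ratio < 1:
        raise ValueError(
            f"{segment_seconds}s segments do not align with a "
            f"{float(gop_duration(gop_frames, fps)):.3f}s GOP"
        )
    return ratio.numerator
```

For example, a 60-frame GOP at 30 fps lasts 2 seconds, so a 4-second segment holds exactly 2 GOPs. The same 60-frame GOP at 25 fps lasts 2.4 seconds, which does not divide 4 seconds evenly, so gops_per_segment(4, 60, 25) raises an error.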
Subtitles in Containers
Three approaches:
Sidecar (External)
Subtitles as separate files:
video.mp4
video.en.vtt ← English subtitles
video.zh.vtt ← Chinese subtitles
The HLS/DASH manifest references these files. Easy to add languages; changing subtitles doesn’t require re-encoding video.
Embedded
Subtitles as a track inside the MP4. One file contains everything.
Burned-In (Hardcoded)
Subtitles rendered directly into the video pixels. Cannot be turned off.
Recommendation: VOD with multi-language support → Sidecar WebVTT. Short-form video where subtitles are part of the creative → burned-in.
Common Subtitle Formats
| Format | Key feature | Used in |
|---|---|---|
| SRT | Simplest: text + timestamps | Universal |
| WebVTT | SRT enhanced (styling, positioning) | HTML5 / HLS standard |
| TTML / IMSC1 | XML, complex layout | DASH, broadcast |
| ASS / SSA | Powerful styling, animation | Anime community |
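Since WebVTT is close to SRT, simple SRT files can often be converted with very little code. The sketch below covers only the two differences that matter for plain cues: the WEBVTT header line and a dot instead of a comma before the milliseconds. It does not translate SRT styling tags or positioning, and srt_to_webvtt is a hypothetical helper name.

```python
import re

# Matches the comma-separated milliseconds in an SRT timestamp, e.g. 00:00:01,000
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert plain SRT text to WebVTT (header + timestamp separator only)."""
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    converted = []
    for line in body.split("\n"):
        if "-->" in line:
            # Only touch timing lines so commas in the cue text are preserved.
            line = _SRT_TIME.sub(r"\1.\2", line)
        converted.append(line)
    return "WEBVTT\n\n" + "\n".join(converted) + "\n"
```

The numeric cue counters from the SRT file are kept, because WebVTT allows an optional identifier line before each cue's timing line.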
Hands-On: Container Operations with ffmpeg
Inspect what’s inside an MP4
ffprobe -v error -show_streams input.mp4
Lists all video / audio / subtitle streams.
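If you need the same information inside a script, ffprobe can emit JSON with -of json. The sketch below wraps the command above and reduces the output to one tuple per stream. probe_streams and summarize_streams are hypothetical names, and the code assumes ffprobe is on the PATH.

```python
import json
import subprocess


def summarize_streams(ffprobe_json: str) -> list[tuple[int, str, str]]:
    """Reduce ffprobe's JSON output to (index, codec_type, codec_name) per stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [(s.get("index"), s.get("codec_type"), s.get("codec_name")) for s in streams]


def probe_streams(path: str) -> list[tuple[int, str, str]]:
    """Run ffprobe on a file and summarize its streams."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ffprobe fails
    )
    return summarize_streams(result.stdout)
```

Calling probe_streams("input.mp4") on a typical file would return one entry for the video stream and one for each audio or subtitle stream, which makes the container-versus-codec distinction easy to see: one file, several codecs.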
Convert .mov to fMP4 for streaming (no re-encoding)
# +faststart is omitted: with empty_moov, the moov box is already written at the front
ffmpeg -i input.mov \
-c copy \
-movflags +frag_keyframe+empty_moov+default_base_moof \
output_fragmented.mp4
Slice into HLS segments (fMP4 format)
ffmpeg -i input.mp4 \
-c:v libx264 -preset slow -crf 22 -g 60 -keyint_min 60 -sc_threshold 0 \
-c:a aac -b:a 128k \
-f hls \
-hls_time 4 \
-hls_segment_type fmp4 \
-hls_playlist_type vod \
-hls_list_size 0 \
-hls_segment_filename "seg_%04d.m4s" \
output.m3u8
Output:
output.m3u8 ← HLS manifest
init.mp4 ← CMAF init segment
seg_0000.m4s
seg_0001.m4s
seg_0002.m4s
...
This is a minimal working HLS stream. Host it on any web server (even python3 -m http.server) and play it with Safari or hls.js.
Key Takeaways
- Container ≠ Codec. MP4 is a container; H.264 is a codec.
- MP4 internals are Box-based: moov is the directory, mdat is the data.
- Always apply -movflags +faststart for VOD: it moves moov to the front for progressive playback.
- fMP4 splits the file into independent fragments, the foundation of modern streaming.
- CMAF lets HLS and DASH share one set of fMP4 files: roughly half the storage and better CDN cache hit rates.
- TS is legacy. New projects should use fMP4/CMAF.
- Segment length: short-form video → 2s; long-form VOD → 4–6s.
- Subtitles: prefer sidecar WebVTT for multi-language.
Previous: Part 3: Audio Fundamentals