VOD Deep Dive Part 1: Video Fundamentals — What Is a Video, Really?

The first installment of our 12-part VOD streaming series. Learn what video actually is at the byte level — pixels, resolution, frame rates, bitrate, I/P/B frames, GOP, color spaces, and HDR.

zhuermu · 20 min
Tags: vod, streaming, video, resolution, frame-rate, bitrate

This is Part 1 of the VOD Streaming Deep Dive series — a 12-part technical guide covering everything from raw pixels to global-scale delivery.


Questions You’ve Probably Never Thought About

You use video every day, but have you ever wondered:

  • What actually happens between tapping “play” and seeing the first frame?
  • Why does a 2-hour Netflix movie weigh only 2 GB, while 2 hours of raw iPhone footage is 20 GB?
  • Why does video “get blurry” on a weak connection instead of just freezing?
  • Why can iPhones play HLS natively but not DASH?
  • Why can’t you copy a downloaded Netflix movie to someone else’s phone?
  • How do short-form video apps achieve near-instant playback when you swipe?

By the end of this series, you’ll be able to answer every one of these.


What Is Video on Demand (VOD)?

Video on Demand (VOD) means exactly what it says: the user watches whatever they want, whenever they want. The video file was recorded and stored on a server long before playback.

The counterpart is live streaming:

                      | VOD                                          | Live Streaming
----------------------+----------------------------------------------+---------------------------------------------------
Content source        | Pre-recorded files                           | Camera/encoder producing in real time
Seekable?             | Yes                                          | No (or limited DVR window)
Examples              | Netflix, YouTube, Bilibili, online courses   | Sports broadcasts, e-commerce live, game streaming
Engineering challenge | Deliver to the most users at the lowest cost | Keep latency low, encode in real time

Short-form video (TikTok, YouTube Shorts) is also VOD. Although it feels real-time, every clip is a pre-uploaded recording. Each clip is cut into segments a few seconds long and lined up by a recommendation algorithm, which is what makes the feed feel endless.

This series focuses on VOD, but most of the technology (codecs, containers, protocols, DRM) applies to live streaming too.


The VOD Journey: From Camera to Your Screen

A video goes through six stages to reach your phone:

①Capture/Upload     ②Transcode           ③Package
┌─────────┐        ┌─────────┐         ┌─────────┐
│ Director │ ────►  │ Compress │ ──────► │ Cut into │
│ uploads  │        │ into many│         │ small    │
│ raw file │        │ qualities│         │ segments │
└─────────┘        └─────────┘         └─────────┘

     ┌───────────────────────────────────────┘


④Store in Cloud     ⑤CDN Distribution    ⑥Playback
┌─────────┐        ┌─────────┐         ┌─────────┐
│ Put into │ ────►  │ Copy to  │ ──────► │ Auto-   │
│ object   │        │ nearest  │         │ select  │
│ storage  │        │ data     │         │ quality │
│ (S3 etc) │        │ center   │         │ & play  │
└─────────┘        └─────────┘         └─────────┘

Each stage maps to a chapter in this series:

Stage            | Problem it solves                                                     | Series part
-----------------+-----------------------------------------------------------------------+-----------------------
① Capture/Upload | How to reliably send large files to the server                        | Part 11
② Transcode      | How to compress a 20 GB master down to 200 MB and still look great    | Part 2, Part 3
③ Package        | How to combine video + audio + subtitles and slice into segments      | Part 4, Part 5
④ Storage        | How to store massive amounts of video cheaply                         | Part 11, Part 12
⑤ CDN            | How to make it fast for users worldwide                               | Part 7
⑥ Playback       | How to adapt to network speed and prevent piracy                      | Part 6, Part 8, Part 9

And the thread running through everything — how do you know if users are having a good experience? — is Part 10: QoE Metrics.


A Video Is Just a Stack of Photos

This is the single most important sentence in this chapter:

A video = a sequence of images played rapidly + an audio track.

When you watch a video, your brain sees:

Frame 1   Frame 2   Frame 3   Frame 4   Frame 5  ...
┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐
│    │   │    │   │    │   │    │   │    │
│ 🚗 │   │ 🚗 │   │ 🚗 │   │ 🚗 │   │ 🚗 │
│    │   │    │   │    │   │    │   │    │
└────┘   └────┘   └────┘   └────┘   └────┘
         (car shifts slightly right)

           ▼  Play 30 images per second → you see "smooth driving"

Each image is called a frame.


Pixels and Resolution

Zoom into any image far enough and you’ll see tiny squares — each one records a single color. That square is a pixel.

  • A 1920×1080 image has 1920 columns × 1080 rows = 2,073,600 pixels (~2 megapixels).
  • Each pixel stores a color value that takes a few bytes.

Resolution is just the pixel dimensions. The common labels:

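Label          | Resolution  | Total pixels | Relative size
---------------+-------------+--------------+--------------
240p           | 426 × 240   | ~100K        | 1x (baseline)
360p           | 640 × 360   | ~230K        | 2.3x
480p (SD)      | 854 × 480   | ~410K        | 4.0x
720p (HD)      | 1280 × 720  | ~920K        | 9.0x
1080p (FHD)    | 1920 × 1080 | ~2M          | 20x
1440p (2K)     | 2560 × 1440 | ~3.7M        | 36x
2160p (4K UHD) | 3840 × 2160 | ~8.3M        | 81x
4320p (8K)     | 7680 × 4320 | ~33.2M       | 325x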

Note: “4K” has two flavors — UHD 4K (consumer: 3840×2160) and DCI 4K (cinema: 4096×2160).

Portrait mobile video uses a 9:16 ratio (e.g., 720×1280), the inverse of landscape 16:9 (1920×1080).
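To make the pixel arithmetic concrete, here is a small sketch that recomputes the totals for the common labels. The dictionary and function names are my own for illustration; ratios are computed from exact pixel counts, so they can differ slightly from rounded figures.

```python
# Dimensions for the common resolution labels discussed above.
RESOLUTIONS = {
    "240p": (426, 240),
    "360p": (640, 360),
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
    "4320p": (7680, 4320),
}


def pixel_count(width, height):
    """Total number of pixels in one frame."""
    return width * height


# Use the exact 240p pixel count as the 1x baseline.
baseline = pixel_count(*RESOLUTIONS["240p"])

for label, (w, h) in RESOLUTIONS.items():
    total = pixel_count(w, h)
    print(f"{label:>6}: {total:>10,} pixels  {total / baseline:6.1f}x")
```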


How Pixels Store Color: RGB, YUV, and Bit Depth

RGB

The most intuitive method: store Red, Green, Blue intensity per pixel.

  • Black = R:0 G:0 B:0
  • White = R:255 G:255 B:255
  • Each channel uses 8 bits (1 byte, 0–255), so one RGB pixel = 3 bytes.

Quick math: a single 1080p RGB frame = 1920 × 1080 × 3 bytes ≈ 6.2 MB. At 30 fps, that’s 186 MB/sec — a 2-hour movie would be 1.3 TB uncompressed!

That’s why video must be compressed.
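The quick math above is easy to reproduce. This sketch assumes 3 bytes per pixel (8-bit RGB) and decimal units (1 MB = 10^6 bytes); the function name is hypothetical.

```python
def raw_rgb_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Return (bytes per frame, bytes per second, total bytes) for uncompressed video."""
    per_frame = width * height * bytes_per_pixel
    per_second = per_frame * fps
    return per_frame, per_second, per_second * seconds


per_frame, per_second, total = raw_rgb_bytes(1920, 1080, fps=30, seconds=2 * 3600)
print(f"{per_frame / 1e6:.1f} MB per frame")
print(f"{per_second / 1e6:.1f} MB per second")
print(f"{total / 1e12:.2f} TB for two hours")
```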

YUV (The Video Industry Standard)

Video uses YUV (also written YCbCr):

  • Y (Luma): How bright the pixel is (0 = black, 255 = white)
  • U, V (Chroma): What color the pixel is

Why not just use RGB? Because:

The human eye is far more sensitive to brightness than to color.

YUV exploits this: you can record less color information with virtually no perceived difference.
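As a rough sketch of how RGB becomes luma plus chroma, the snippet below uses the BT.709 luma weights. Treat the choice of BT.709 as an assumption for illustration (other standards use different weights), and note that inputs are normalized to 0.0–1.0 rather than 0–255.

```python
# BT.709 luma weights (assumed here for illustration).
KR, KB = 0.2126, 0.0722
KG = 1.0 - KR - KB


def rgb_to_ycbcr(r, g, b):
    """Convert normalized RGB (0.0-1.0) to Y in 0..1 and Cb/Cr in -0.5..0.5."""
    y = KR * r + KG * g + KB * b      # brightness: green contributes the most
    cb = (b - y) / (2 * (1 - KB))     # how far blue is from the brightness
    cr = (r - y) / (2 * (1 - KR))     # how far red is from the brightness
    return y, cb, cr


print(rgb_to_ycbcr(1.0, 1.0, 1.0))  # white: full luma, chroma near zero
print(rgb_to_ycbcr(1.0, 0.0, 0.0))  # pure red: low luma, strong Cr
```

Any gray pixel ends up with chroma close to zero, which is why the two chroma channels can be stored at lower resolution with little visible loss.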

Chroma Subsampling

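Scheme | Description                              | Data vs 4:4:4 | Used in
-------+------------------------------------------+---------------+------------------------------
4:4:4  | Full Y/U/V per pixel                     | 100%          | Film post-production
4:2:2  | Two adjacent pixels share one U/V pair   | 67%           | Broadcast, professional
4:2:0  | Four adjacent pixels share one U/V pair  | 50%           | Nearly all consumer streaming
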
   Luma Y (all kept)          Chroma U/V (one per 2×2 block)
   ┌──┬──┬──┬──┐             ┌─────┬─────┐
   │Y │Y │Y │Y │             │     │     │
   ├──┼──┼──┼──┤             │ UV  │ UV  │
   │Y │Y │Y │Y │             │     │     │
   ├──┼──┼──┼──┤             ├─────┼─────┤
   │Y │Y │Y │Y │             │     │     │
   ├──┼──┼──┼──┤             │ UV  │ UV  │
   │Y │Y │Y │Y │             │     │     │
   └──┴──┴──┴──┘             └─────┴─────┘
   16 Y values                4 UV pairs

   RGB 4:4:4 = 16 × 3 = 48 bytes
   YUV 4:2:0 = 16 + 4 + 4 = 24 bytes (half the data)

The trade-off: sharp red text on a pure black background may show slight color bleeding. But 99% of natural scenes look identical.
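The byte counts in the diagram generalize to any frame size. Here is a minimal sketch, assuming one byte per sample (8-bit) and planar storage; the names are illustrative only.

```python
# (horizontal divisor, vertical divisor) applied to each chroma plane.
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def frame_bytes(width, height, scheme, bytes_per_sample=1):
    """Bytes for one frame: a full-size Y plane plus two subsampled chroma planes."""
    hdiv, vdiv = SUBSAMPLING[scheme]
    chroma_w = (width + hdiv - 1) // hdiv   # round up for odd dimensions
    chroma_h = (height + vdiv - 1) // vdiv
    return (width * height + 2 * chroma_w * chroma_h) * bytes_per_sample


full = frame_bytes(1920, 1080, "4:4:4")
for scheme in SUBSAMPLING:
    size = frame_bytes(1920, 1080, scheme)
    print(f"{scheme}: {size:,} bytes ({size / full:.0%} of 4:4:4)")
```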

Bit Depth

How many bits per channel:

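Bit depth | Range per channel | Colors per pixel | Used in
----------+-------------------+------------------+--------------------------------------
8-bit     | 0–255             | 16.7M            | Most consumer streaming
10-bit    | 0–1023            | 1.07B            | Required for HDR; Netflix 4K, Blu-ray
12-bit    | 0–4095            | 68.7B            | Film masters, Dolby Vision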

8-bit is usually fine, but on smooth gradients (e.g., a blue sky transitioning from deep to light blue), you get visible banding — unnatural step-like boundaries. HDR content needs 10-bit to eliminate this.
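Banding is easier to see in numbers. The sketch below counts how many distinct code values are available across a narrow brightness range (a stand-in for that blue sky). The 0.40–0.50 range and the sample count are arbitrary assumptions for illustration.

```python
def colors_per_pixel(bits):
    """Number of representable colors with three channels of the given depth."""
    return (2 ** bits) ** 3


def code_value(value, bits):
    """Quantize a normalized 0.0-1.0 value to an integer code."""
    return round(value * (2 ** bits - 1))


def distinct_steps(lo, hi, bits, samples=10_000):
    """Count the distinct codes a smooth ramp from lo to hi lands on."""
    return len({code_value(lo + (hi - lo) * i / (samples - 1), bits)
                for i in range(samples)})


for bits in (8, 10, 12):
    print(f"{bits}-bit: {colors_per_pixel(bits):,} colors, "
          f"{distinct_steps(0.40, 0.50, bits)} steps across the gradient")
```

With roughly four times as many steps across the same gradient, each 10-bit step is small enough that the boundaries are much harder to spot.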


Frame Rate (fps)

fps = frames per second.

  • 24 fps: Cinema standard since the 1920s. Gives that “film look.”
  • 25 / 50 fps: PAL television (Europe, China).
  • 29.97 / 30 fps: NTSC (North America, Japan). Default for most phone recordings.
  • 60 fps: Gaming, sports, YouTube high-frame-rate.
  • 120 / 240 fps: Slow motion, professional capture.

Why is 24 fps enough for movies? Human “persistence of vision” kicks in around 16 fps — your brain already sees continuous motion. 24 fps was the 1920s sweet spot of “smooth enough + saves the most film stock.” But for fast action (sports, gaming), 60+ fps is needed to avoid motion blur.

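Watch out for 29.97 fps: it's not a typo. When NTSC added color, the frame rate was lowered by 0.1% (to exactly 30000/1001 fps) so the new color subcarrier would not interfere with the audio carrier, while the signal stayed compatible with existing black-and-white sets.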
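Because 29.97 is really the fraction 30000/1001, it is safer to keep frame rates as exact rationals than as floats. A small sketch using Python's standard `fractions` module (the helper names are my own):

```python
import math
from fractions import Fraction

NTSC_30 = Fraction(30000, 1001)   # what "29.97 fps" actually means
TRUE_30 = Fraction(30)


def frame_timestamp(frame_index, fps):
    """Exact presentation time of a frame, in seconds, as a Fraction."""
    return frame_index / fps


def frames_in(seconds, fps):
    """Number of whole frames that fit in the given duration."""
    return math.floor(seconds * fps)


print(float(NTSC_30))                        # close to 29.97, not exactly
print(frame_timestamp(30000, NTSC_30))       # exactly 1001 seconds
print(frames_in(3600, TRUE_30) - frames_in(3600, NTSC_30), "frames fewer per hour")
```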

Higher frame rate = larger file. 60 fps is roughly 1.7× the size of 30 fps at the same resolution and quality.


Bitrate: How Much Data Per Second

Bitrate is the number of bits consumed per second of video.

  • kbps (kilobits/sec): 1 Mbps = 1000 kbps
  • Mbps (megabits/sec): the common unit

File size ≈ bitrate × duration:

1 Mbps × 60 seconds ÷ 8 (bits to bytes) ≈ 7.5 MB
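The same formula as a tiny helper, assuming decimal megabits and megabytes and ignoring audio and container overhead (the function name is hypothetical):

```python
def file_size_mb(bitrate_mbps, seconds):
    """Approximate file size in MB: bitrate x duration, divided by 8 bits per byte."""
    return bitrate_mbps * seconds / 8


print(file_size_mb(1, 60))           # the one-minute example above
print(file_size_mb(4.5, 2 * 3600))   # a two-hour video at 4.5 Mbps
```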

Typical Bitrates (H.264)

Resolution | Recommended bitrate | 1 min file size
-----------+---------------------+----------------
240p       | 0.3–0.5 Mbps        | ~3 MB
360p       | 0.5–0.8 Mbps        | ~5 MB
480p       | 0.8–1.2 Mbps        | ~8 MB
720p       | 1.5–3 Mbps          | ~15 MB
1080p      | 3–6 Mbps            | ~30 MB
4K         | 15–30 Mbps          | ~150 MB

CBR / VBR / CRF

Three rate-control modes:

Mode                       | Meaning                                             | Analogy
---------------------------+-----------------------------------------------------+-------------------------------------------------
CBR (Constant Bitrate)     | Fixed bits per second                               | Always ordering exactly 2 dishes
VBR (Variable Bitrate)     | More bits for complex scenes, fewer for simple ones | Big eater orders more, light eater orders less
CRF (Constant Rate Factor) | Quality stays constant, bitrate adapts              | No matter what you order, you eat until 80% full

VOD favors VBR or CRF — better quality at the same file size. Live streaming favors CBR — predictable bitrate for stable network transmission.
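If you encode with ffmpeg and libx264, the three modes roughly correspond to different flag combinations. The sketch below only builds argument lists. The target bitrate, the 1.5× peak cap, and the 2× buffer size are illustrative assumptions rather than recommendations, and x264 with matching min/max rates is near-constant rather than strictly constant.

```python
def x264_rate_args(mode, bitrate_kbps=3000, crf=23):
    """Build ffmpeg video arguments for one rate-control mode (illustrative values)."""
    rate = f"{bitrate_kbps}k"
    buf = f"{2 * bitrate_kbps}k"             # assumed buffer: 2x the target rate
    if mode == "cbr":
        # Pin minimum, target, and maximum to the same value.
        return ["-c:v", "libx264", "-b:v", rate,
                "-minrate", rate, "-maxrate", rate, "-bufsize", buf]
    if mode == "vbr":
        # Average around the target, but allow peaks up to 1.5x (assumed cap).
        peak = f"{int(bitrate_kbps * 1.5)}k"
        return ["-c:v", "libx264", "-b:v", rate, "-maxrate", peak, "-bufsize", buf]
    if mode == "crf":
        # Target a quality level and let the bitrate float.
        return ["-c:v", "libx264", "-crf", str(crf)]
    raise ValueError(f"unknown mode: {mode}")


for mode in ("cbr", "vbr", "crf"):
    print("ffmpeg -i input.mp4", " ".join(x264_rate_args(mode)), f"out_{mode}.mp4")
```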


I-Frames, P-Frames, B-Frames: The Core of Video Compression

This is the most critical concept in this chapter. Understand it, and everything in the codec chapter falls into place.

Why Can Video Be Compressed So Aggressively?

Imagine a video of someone sitting on a couch watching TV:

Frame 1: Person on couch, TV playing animation
Frame 2: Person on couch, TV playing animation (TV image changes slightly)
Frame 3: Person blinks, TV playing animation
Frame 4: Person on couch, TV playing animation

In a scene like this, nearly all pixels are identical from one frame to the next. Storing every frame in full is a massive waste.

The smart approach:

  • Occasionally store a “complete snapshot”
  • The rest of the time, only store “what changed since the last frame”
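Here is a toy version of that idea: store the first frame whole, then only the pixels that changed. This is a deliberately simplified sketch; real codecs predict blocks of pixels with motion compensation rather than diffing individual pixels, and the "I"/"P" labels here are just borrowed for illustration.

```python
def encode(frames):
    """Store frame 0 in full; store each later frame as {pixel_index: new_value}."""
    encoded = [("I", list(frames[0]))]
    for prev, cur in zip(frames, frames[1:]):
        delta = {i: v for i, (p, v) in enumerate(zip(prev, cur)) if p != v}
        encoded.append(("P", delta))
    return encoded


def decode(encoded):
    """Rebuild every frame by applying each delta to the previously decoded frame."""
    frames = []
    for kind, payload in encoded:
        if kind == "I":
            frame = list(payload)
        else:
            frame = list(frames[-1])
            for i, v in payload.items():
                frame[i] = v
        frames.append(frame)
    return frames


# Three tiny 8-pixel "frames": only one or two pixels change each time.
frames = [
    [5, 5, 5, 5, 9, 9, 9, 9],
    [5, 5, 5, 5, 9, 9, 7, 9],
    [5, 5, 6, 5, 9, 9, 7, 8],
]
encoded = encode(frames)
print(encoded)
print(decode(encoded) == frames)
```

Notice that a delta frame is useless on its own: decoding it requires the frame before it, which is exactly the dependency the frame types below formalize.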

Three Frame Types

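Type               | Full name     | Content                                          | Size     | Can decode independently?
-------------------+---------------+--------------------------------------------------+----------+-------------------------------------
I-frame (keyframe) | Intra-coded   | A complete image (like a JPEG)                   | Large    | Yes
P-frame            | Predicted     | "Difference from a previous frame"               | Small    | No, needs the reference frame first
B-frame            | Bidirectional | "Difference from both previous and next frames"  | Smallest | No, needs both reference frames
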
Timeline →
 I - P - P - P - B - P - P - B - I - P - P - P ...
 ▲                               ▲
 Keyframe                        Next keyframe
 (appears every N frames)

IDR Frames

An IDR frame (Instantaneous Decoder Refresh) is a special I-frame: all subsequent frames are forbidden from referencing anything before it. IDR frames are "safe start points." When you seek to the middle of a video, the player typically jumps back to the closest IDR frame at or before the target, starts decoding there, and then moves forward to the exact position.
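A minimal sketch of that seek logic, assuming the player already has a sorted list of IDR timestamps (for example from an index in the container). Picking the last IDR at or before the target is one common approach; the function name and the sample timestamps are hypothetical.

```python
from bisect import bisect_right


def seek_start(idr_times, target):
    """Return the timestamp of the last IDR frame at or before the seek target."""
    i = bisect_right(idr_times, target) - 1
    return idr_times[max(i, 0)]   # clamp to the first IDR for early targets


idr_times = [0.0, 2.0, 4.0, 6.0, 8.0]   # one IDR every 2 seconds (assumed)
target = 5.3
start = seek_start(idr_times, target)
print(f"seek to {target}s: decode from {start}s, discard {target - start:.1f}s of frames")
```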

GOP (Group of Pictures)

A GOP is a run of frames that starts with an I-frame and continues up to, but not including, the next one:

 ┌──── GOP 1 ────┐ ┌──── GOP 2 ────┐ ┌──── GOP 3 ...
  I  P  B  P  P  B  I  P  B  P  P  B  I  P ...

                    New IDR starts here

GOP length determines segmentation granularity:

  • Short GOP (1–2 sec): Fine segments, fast seeking and startup; slightly larger files (more I-frames)
  • Long GOP (4–10 sec): Smaller files, but slower seeking

Short-form video apps typically use short GOPs (1–2s) because users swipe frequently between clips. Feature-length VOD can use longer GOPs to save bandwidth.
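To get a feel for the trade-off, this sketch converts a GOP duration into a frame count (encoders such as ffmpeg take GOP size in frames via `-g`) and counts how many keyframes a video of a given length would contain. The durations are example values only.

```python
import math


def gop_frames(fps, gop_seconds):
    """GOP length in frames for a given frame rate and GOP duration."""
    return round(fps * gop_seconds)


def keyframe_count(duration_seconds, gop_seconds):
    """Number of keyframes needed to cover the whole duration."""
    return math.ceil(duration_seconds / gop_seconds)


two_hours = 2 * 3600
for gop_seconds in (2, 10):
    print(f"{gop_seconds}s GOP at 30 fps: {gop_frames(30, gop_seconds)} frames per GOP, "
          f"{keyframe_count(two_hours, gop_seconds)} keyframes in a 2-hour video")
```

Five times as many large I-frames is where the "slightly larger files" cost of short GOPs comes from.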


Color Spaces and HDR

Color Spaces

The same numeric RGB values display different actual colors under different standards:

Standard | Used in                 | Gamut size
---------+-------------------------+---------------------------
sRGB     | Web, computers          | Baseline
BT.709   | HDTV, 1080p streaming   | ≈ sRGB
BT.2020  | HDR, 4K/8K              | ~72% larger than BT.709
DCI-P3   | Cinema, Apple ecosystem | Between BT.709 and BT.2020

HDR: Brighter Brights, Darker Darks, More Colors

Traditional SDR peaks at ~100 nits. HDR reaches 1,000–4,000 nits peak brightness, combined with 10-bit depth + BT.2020 gamut:

  • Stars in a night sky appear brighter
  • Shadow details are preserved
  • Colors are more saturated without clipping

Major HDR formats:

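Format       | By                                    | Key feature
-------------+---------------------------------------+-----------------------------------------------------------------
HDR10        | Consumer Technology Association (CTA) | Royalty-free; static metadata per movie; mandatory on UHD Blu-ray
HDR10+       | Samsung / Amazon                      | Dynamic metadata per scene
Dolby Vision | Dolby                                 | 12-bit, dynamic metadata; highest quality; royalty required
HLG          | BBC / NHK                             | Compatible with SDR displays; preferred for broadcast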

Be aware: HDR video on an SDR display won’t magically look better. Without tone mapping, it looks washed out and gray.


Hands-On: Inspect a Video with ffprobe

# macOS / Linux
brew install ffmpeg  # or: apt install ffmpeg

# Inspect a video
ffprobe -v error -show_streams -select_streams v:0 myvideo.mp4

Typical output:

codec_name=h264            # Codec (H.264) — see Part 2
profile=High               # Encoding profile
width=1920
height=1080                # Resolution: 1080p
r_frame_rate=30000/1001    # Frame rate: 29.97 fps
pix_fmt=yuv420p            # Pixel format: YUV 4:2:0, 8-bit
color_space=bt709          # Color space: SDR
bit_rate=4500000           # Bitrate: 4.5 Mbps

After reading this chapter, you should understand every line.
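If you would rather read those fields from a script, ffprobe can emit JSON with `-of json`. The sketch below wraps the same command and pulls out the fields discussed in this chapter. The helper names are my own, and `bit_rate` is treated as optional because not every stream reports it.

```python
import json
import subprocess
from fractions import Fraction


def summarize(ffprobe_json):
    """Extract the key video properties from ffprobe's JSON output."""
    stream = json.loads(ffprobe_json)["streams"][0]
    bit_rate = stream.get("bit_rate")            # may be missing for some files
    return {
        "codec": stream["codec_name"],
        "resolution": f'{stream["width"]}x{stream["height"]}',
        "fps": round(float(Fraction(stream["r_frame_rate"])), 2),
        "pix_fmt": stream.get("pix_fmt"),
        "bitrate_mbps": int(bit_rate) / 1e6 if bit_rate else None,
    }


def probe(path):
    """Run ffprobe on the first video stream of a file (requires ffmpeg installed)."""
    cmd = ["ffprobe", "-v", "error", "-select_streams", "v:0",
           "-show_streams", "-of", "json", path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return summarize(result.stdout)


# Sample input mirroring the output shown above, so this runs without a video file.
sample = json.dumps({"streams": [{
    "codec_name": "h264", "width": 1920, "height": 1080,
    "r_frame_rate": "30000/1001", "pix_fmt": "yuv420p", "bit_rate": "4500000",
}]})
print(summarize(sample))
```

With ffmpeg installed, `probe("myvideo.mp4")` would return the same kind of dictionary for a real file.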


Key Takeaways

  1. Video = a sequence of images + audio. Each image is a frame.
  2. Each frame is made of pixels; resolution is the pixel dimensions.
  3. The video world uses YUV 4:2:0 (half the data of RGB, imperceptible difference).
  4. Bit depth: 8-bit is standard; HDR requires 10-bit.
  5. Frame rate: 24 fps (cinema) / 30 fps (TV) / 60 fps (gaming/sports).
  6. Bitrate = data per second. VBR/CRF is preferred for VOD.
  7. I/P/B frames are how video achieves 50–100× compression.
  8. GOP = the group between keyframes. Short-form video uses short GOPs (1–2s).
  9. HDR = 10-bit + wider gamut + higher brightness — fundamentally different from SDR.

Three Pairs of Concepts You’ll See Everywhere

Before diving deeper, pin these down:

  1. Codec ≠ Container — H.264 is a compression algorithm (codec); MP4 is a file format (container). An .mp4 file can hold H.264 or H.265 or AV1.

  2. Protocol ≠ Packaging — HLS and DASH are “how to deliver” rules (protocols); fMP4 and TS are “how to slice and wrap” formats (packaging).

  3. Encryption ≠ DRM — HLS AES-128 is lightweight encryption (key leaks = game over). DRM is an entire system: key distribution + device restrictions + output protection.

All three pairs are covered in detail throughout this series.


Next up: How does a 4K movie fit in 5 GB?
Part 2: Video Codecs (H.264, H.265, and AV1)