VOD Deep Dive Part 3: Audio Fundamentals — Making Sound Small
How digital audio works: sampling rates, bit depth, channels, AAC vs Opus vs Dolby Atmos, multi-language tracks, loudness normalization, and practical ffmpeg recipes.
This is Part 3 of the VOD Streaming Deep Dive series.
How Sound Becomes Digital
Sound is air vibration — a continuous waveform. Computers can only store numbers, not continuous waves. Two steps are needed:
- Sampling: Measure the wave’s height at regular intervals
- Quantization: Convert each measurement into a number
Amplitude
▲
│ ● ● ● Sample points
│ ● ● ●
│ ● ●
│ ● ● ●
└──────────────────────► Time
↑ ↑ ↑ Sample N times per second — N is the "sample rate"
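As a rough illustration of the two steps above, the sketch below samples a sine wave at fixed intervals and rounds each measurement to a signed integer. The function name `sample_and_quantize` and the 1 kHz test tone are assumptions chosen for this example, not part of any library.

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, bit_depth, num_samples):
    """Sample a unit-amplitude sine wave and quantize each sample to a signed integer."""
    max_code = 2 ** (bit_depth - 1) - 1  # largest positive code, e.g. 32767 for 16-bit
    codes = []
    for n in range(num_samples):
        t = n / sample_rate_hz                           # sampling: time of the n-th measurement
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # continuous value in [-1, 1]
        codes.append(round(amplitude * max_code))        # quantization: nearest integer code
    return codes

# One full cycle of a 1 kHz tone at 48 kHz is exactly 48 samples.
samples = sample_and_quantize(freq_hz=1000, sample_rate_hz=48000, bit_depth=16, num_samples=48)
print(samples[:4])
```

Real encoders start from PCM data shaped like `samples`; everything a codec does later is about storing those integers in fewer bits.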
Sample Rate
Unit: Hz (hertz — samples per second)
| Sample rate | Use case |
|---|---|
| 8 kHz | Telephone voice |
| 16 kHz | Speech recognition, VoIP (Zoom/Teams) |
| 22.05 kHz | Retro games, AM radio |
| 44.1 kHz | CD audio, most music distribution |
| 48 kHz | Video industry default (film, streaming, broadcast) |
| 96 kHz | Hi-fi recording |
| 192 kHz | Professional studio |
Nyquist theorem: To reproduce a frequency F, the sample rate must be greater than 2F. Human hearing tops out around 20 kHz, so 44.1 kHz and 48 kHz cover the audible range with a small margin left for the anti-aliasing filter.
For VOD, standardize on 48 kHz. If your source is 44.1 kHz, resample during transcoding with -ar 48000.
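To make the Nyquist limit concrete, here is a minimal sketch of what happens to a pure tone above half the sample rate when no anti-aliasing filter is applied: it folds back into the audible band. The helper names `nyquist_limit` and `alias_frequency` are illustrative, not from any library.

```python
def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency a pure tone appears at after sampling without an anti-alias filter."""
    folded = freq_hz % sample_rate_hz
    # Anything above the Nyquist limit mirrors back below it.
    return min(folded, sample_rate_hz - folded)

print(nyquist_limit(48000))           # limit for the video-industry default rate
print(alias_frequency(20000, 48000))  # below the limit: unchanged
print(alias_frequency(30000, 48000))  # above the limit: folds back to a lower frequency
```

This is why a low-pass filter comes before sampling or downsampling: content above the limit does not disappear, it turns into audible artifacts.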
Bit Depth
Bit depth is the number of bits used to store each sample:
| Bit depth | Quantization levels | Use case |
|---|---|---|
| 8-bit | 256 | Retro games, telephony |
| 16-bit | 65,536 | CD, consumer streaming |
| 24-bit | ~16.7M | Professional recording |
| 32-bit float | Astronomical | Audio production internal format |
Most video audio is 16-bit, 48 kHz.
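The level counts in the table follow directly from the bit depth, and so does the theoretical dynamic range of integer PCM (about 6 dB per bit). A small sketch, with function names chosen for illustration only:

```python
import math

def quantization_levels(bit_depth):
    """Number of distinct amplitude values an integer sample can take."""
    return 2 ** bit_depth

def dynamic_range_db(bit_depth):
    """Theoretical ratio between the largest and smallest step, in decibels."""
    return 20 * math.log10(quantization_levels(bit_depth))

for bits in (8, 16, 24):
    print(bits, quantization_levels(bits), round(dynamic_range_db(bits), 1))
```

The formula applies to integer formats only; 32-bit float stores an exponent as well, which is why its range is so much larger.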
Channels
A channel is an independent audio signal, usually destined for one speaker position:
| Channels | Name | Configuration | Used in |
|---|---|---|---|
| 1.0 | Mono | Single channel | Telephony, old TV |
| 2.0 | Stereo | Left + Right | Music, most video |
| 5.1 | Surround | Front L + Center + Front R + Rear L + Rear R + LFE (.1 = subwoofer) | Cinema, home theater |
| 7.1 | Surround | 5.1 + two side channels | Premium home theater |
| 7.1.4 | Atmos etc. | 7.1 + 4 overhead channels | Dolby Atmos |
5.1 surround layout (top-down view, listener in the middle; SL/SR are the rear left and right channels):
FL ──── C ──── FR
│              │
│      🧑      │
│              │
SL ─────────── SR
      LFE
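Sample rate, bit depth, and channel count together fix the size of uncompressed PCM audio. The sketch below multiplies the three; it assumes every channel, including the LFE channel, is stored at full rate, so 5.1 counts as 6 channels. The function name is hypothetical.

```python
def pcm_bitrate_bps(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM bitrate: samples per second x bits per sample x channels."""
    return sample_rate_hz * bit_depth * channels

stereo = pcm_bitrate_bps(48000, 16, 2)    # 2.0 stereo
surround = pcm_bitrate_bps(48000, 16, 6)  # 5.1 = 5 full-range channels + LFE
print(stereo / 1000, "kbps stereo")
print(surround / 1000, "kbps 5.1")
```

These raw figures are the baseline that lossy codecs are measured against.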
Audio Bitrate: How Many kbps Is Enough?
Audio bitrate is also measured in bits per second, but it is much smaller than video bitrate: typically 5–10% of it.
| Bitrate | Perception | Typical use |
|---|---|---|
| 32 kbps | Voice OK, music broken | Extreme low bandwidth |
| 64 kbps | Voice clear, music passable | Low-bitrate scenarios |
| 96 kbps | Music acceptable | Broadcast, YouTube default |
| 128 kbps | Music sounds good | Streaming default |
| 192 kbps | High fidelity | Premium music streaming |
| 256 kbps | Audiophile-grade | Apple Music |
| 320 kbps | MP3 maximum | Music enthusiasts |
| Lossless (FLAC) | Transparent | Hi-fi niche |
For VOD: stereo AAC at 128 kbps is the correct answer for the vast majority of scenarios.
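To get a feel for what 128 kbps means in storage terms, the following sketch converts a bitrate and duration into megabytes and compares it with uncompressed 16-bit, 48 kHz stereo PCM (48000 × 16 × 2 = 1536 kbps). The two-hour duration and the helper names are assumptions for illustration.

```python
def stream_size_mb(bitrate_kbps, duration_s):
    """Approximate payload size in megabytes (10^6 bytes), ignoring container overhead."""
    return bitrate_kbps * 1000 * duration_s / 8 / 1_000_000

def compression_ratio(pcm_kbps, encoded_kbps):
    """How many times smaller the encoded stream is than raw PCM."""
    return pcm_kbps / encoded_kbps

two_hours = 2 * 60 * 60
aac_mb = stream_size_mb(128, two_hours)
pcm_mb = stream_size_mb(1536, two_hours)
print(aac_mb, pcm_mb, compression_ratio(1536, 128))
```

A two-hour film's stereo AAC track comes out at a little over 100 MB, which is small next to the video but still worth getting right across a large catalog.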
Major Audio Codecs
AAC (Advanced Audio Coding) — The Streaming Default
- By: MPEG (the ISO/IEC group that co-developed H.264 with ITU-T)
- Year: 1997
- Compatibility: every video platform, browser, and phone
- Variants:
  - AAC-LC (Low Complexity): Most common. HLS/DASH default.
  - HE-AAC (High Efficiency): Better at low bitrates (<64 kbps)
  - HE-AAC v2: HE-AAC + parametric stereo, decent at 48 kbps
MP3 — Retired
Classic but less efficient than AAC. Original patents expired in 2017. No reason to use MP3 in new projects.
Opus — The Web Newcomer
- Open-source, royalty-free
- Excellent from 6 kbps (voice) to 510 kbps (music)
- WebRTC default, used by Discord
- But HLS/DASH compatibility lags behind AAC; limited iOS/Safari support
Dolby Family — Cinema-Grade
| Codec | Use case |
|---|---|
| AC-3 (Dolby Digital) | 5.1 surround, Blu-ray, legacy HDTV |
| E-AC-3 / DD+ (Dolby Digital Plus) | 5.1/7.1, streaming movies |
| Dolby Atmos (E-AC-3 + JOC or AC-4) | Spatial audio, premium platforms |
Dolby Atmos on Netflix, Disney+, and Apple TV+ is a hallmark of premium subscriptions.
FLAC / ALAC — Lossless
Lossless compression typically shrinks PCM audio to about 50–70% of its original size while preserving the data bit for bit. Used in the Apple Music lossless tier and audiophile contexts. Not practical for video streaming because the bitrate is too high.
Multi-Language Audio Tracks
A single video file can carry multiple audio tracks:
MP4 file
├── video track (H.264)
├── audio track 1 (AAC, English)
├── audio track 2 (AAC, Chinese)
├── audio track 3 (AAC, Japanese)
└── subtitle track (WebVTT)
Streaming protocols (HLS/DASH) support independent audio track delivery — the player only downloads the language the user selected:
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",LANGUAGE="en",NAME="English",DEFAULT=YES,URI="audio/en/index.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",LANGUAGE="zh",NAME="中文",URI="audio/zh/index.m3u8"
More on this in Part 5: Streaming Protocols.
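The playlist entries above are regular enough to generate from a list of languages. This is a minimal sketch that reproduces those two lines; the function name `ext_x_media_audio` and the `audio/<lang>/index.m3u8` path scheme are assumptions taken from the example, not a requirement of HLS.

```python
def ext_x_media_audio(language, name, uri, group_id="audio", default=False):
    """Build one #EXT-X-MEDIA line describing an alternate audio rendition."""
    attrs = [
        "TYPE=AUDIO",
        f'GROUP-ID="{group_id}"',
        f'LANGUAGE="{language}"',
        f'NAME="{name}"',
    ]
    if default:
        attrs.append("DEFAULT=YES")  # only emitted for the default track
    attrs.append(f'URI="{uri}"')
    return "#EXT-X-MEDIA:" + ",".join(attrs)

# (language code, display name, is default)
tracks = [("en", "English", True), ("zh", "中文", False)]
lines = [
    ext_x_media_audio(lang, name, f"audio/{lang}/index.m3u8", default=is_default)
    for lang, name, is_default in tracks
]
print("\n".join(lines))
```

Adding a language then means adding one tuple to `tracks` and packaging one more audio rendition.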
Loudness Normalization
Ever noticed the volume spike when a commercial cuts in? That’s because different content has wildly different loudness levels.
Loudness normalization adjusts all content to a uniform perceived loudness level (not peak volume).
Common Standards
| Standard | Target loudness | Used by |
|---|---|---|
| EBU R128 | -23 LUFS | European broadcast |
| ATSC A/85 | -24 LUFS | North American broadcast |
| Spotify | -14 LUFS | Music streaming |
| Apple Music | about -16 LUFS | Music streaming (Sound Check on) |
| YouTube | -14 LUFS | Default |
| Short-form / mobile | -16 to -14 LUFS | Phone speaker range |
LUFS (Loudness Units relative to Full Scale) is the unit of perceived loudness defined by ITU-R BS.1770; ATSC documents call the same unit LKFS.
ffmpeg Loudness Normalization
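# Normalize audio to -14 LUFS (single pass); -ar 48000 keeps the output at 48 kHz,
# since loudnorm may otherwise upsample to 192 kHz
ffmpeg -i input.mp4 -af loudnorm=I=-14:TP=-1.5:LRA=11 -ar 48000 -c:v copy output.mp4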
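The core idea behind normalization can be sketched as a simple gain calculation: measure the integrated loudness, subtract it from the target, and scale the samples by the matching linear factor. This is a simplification under stated assumptions (the measured value of -23 LUFS is made up, and the function names are hypothetical); the loudnorm filter also handles true-peak limiting and loudness range, which this sketch ignores.

```python
def gain_db(measured_lufs, target_lufs):
    """Static gain, in dB, needed to move content from its measured loudness to the target."""
    return target_lufs - measured_lufs

def db_to_linear(db):
    """Convert a gain in decibels to an amplitude multiplier."""
    return 10 ** (db / 20)

# Hypothetical example: broadcast-level content being prepared for a -14 LUFS platform.
needed_db = gain_db(measured_lufs=-23.0, target_lufs=-14.0)
factor = db_to_linear(needed_db)
print(needed_db, round(factor, 3))
```

A positive gain this large can push peaks past full scale, which is why real normalizers pair the gain with a true-peak limit such as the TP=-1.5 setting above.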
Hands-On: Inspect and Transcode Audio
Check audio tracks in a video
ffprobe -v error -show_streams -select_streams a input.mp4
Typical output (abridged to the most useful fields):
codec_name=aac
sample_rate=48000
channels=2
channel_layout=stereo
bit_rate=128000
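When checking many files, it helps to read that output programmatically. The sketch below parses ffprobe's default key=value lines into a dictionary and checks them against the 48 kHz stereo AAC target. It assumes a single audio stream (later streams would overwrite earlier keys), and `parse_ffprobe_fields` is a hypothetical helper, not part of ffprobe.

```python
def parse_ffprobe_fields(text):
    """Turn ffprobe's default key=value lines into a dict, skipping section markers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line or line.startswith("["):
            continue  # skip blank lines and [STREAM] / [/STREAM] wrappers
        key, _, value = line.partition("=")
        fields[key] = value
    return fields

sample_output = """\
[STREAM]
codec_name=aac
sample_rate=48000
channels=2
channel_layout=stereo
bit_rate=128000
[/STREAM]
"""

info = parse_ffprobe_fields(sample_output)
is_vod_standard = (
    info["codec_name"] == "aac"
    and int(info["sample_rate"]) == 48000
    and int(info["channels"]) == 2
)
print(info["bit_rate"], is_vod_standard)
```

Files that fail the check are candidates for the transcode step that follows.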
Standardize to AAC 48 kHz 128 kbps stereo
ffmpeg -i input.mov \
-c:a aac -b:a 128k -ar 48000 -ac 2 \
-c:v copy \
output.mp4
- -c:a aac: Audio codec AAC
- -b:a 128k: 128 kbps bitrate
- -ar 48000: 48 kHz sample rate
- -ac 2: 2 channels (stereo)
- -c:v copy: Copy video stream as-is (saves time)
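For batch jobs, the same command can be assembled as an argument list, which avoids shell quoting problems with unusual filenames. This is a sketch only: `build_audio_transcode_args` is a hypothetical helper, and it only builds and prints the command rather than running ffmpeg.

```python
import shlex

def build_audio_transcode_args(src, dst, bitrate="128k", sample_rate=48000, channels=2):
    """Assemble the ffmpeg arguments for AAC audio with the video stream copied as-is."""
    return [
        "ffmpeg", "-i", src,
        "-c:a", "aac",            # audio codec
        "-b:a", bitrate,          # audio bitrate
        "-ar", str(sample_rate),  # sample rate in Hz
        "-ac", str(channels),     # channel count
        "-c:v", "copy",           # do not re-encode video
        dst,
    ]

args = build_audio_transcode_args("input.mov", "output.mp4")
print(shlex.join(args))
# To actually run it: subprocess.run(args, check=True)
```

Keeping the defaults in one function makes it easy to apply the same audio settings across a whole library.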
Key Takeaways
- Digital audio is defined by its sample rate (temporal density) and bit depth (amplitude precision).
- VOD default: 48 kHz sample rate, 16-bit depth.
- Consumer streaming defaults to stereo (2.0); cinema uses 5.1 / Atmos.
- AAC-LC at 128 kbps is the default audio setting for VOD.
- A single video file can carry multiple audio tracks (multi-language).
- Loudness normalization (for example -23 LUFS under EBU R128 for broadcast, or about -14 LUFS for online streaming) prevents the “ads are too loud” problem.
Previous: Part 2: Video Codecs