VOD Deep Dive Part 10: QoE Metrics — How to Measure What Users Actually Feel

QoE vs QoS, six core metrics (VST, RBR, VSF, EBVS, VPF, Avg Bitrate), data pipelines, multi-dimensional drill-down, troubleshooting cases, and when to buy vs build.

zhuermu · 20 min
Tags: vod, streaming, qoe, monitoring, mux, conviva

This is Part 10 of the VOD Streaming Deep Dive series.


QoE vs QoS: Two Often-Confused Terms

| Abbreviation | Full name | Definition | Perspective |
|---|---|---|---|
| QoS | Quality of Service | Objective network/infrastructure performance (bandwidth, packet loss, latency) | Network engineering / Ops |
| QoE | Quality of Experience | User-perceived quality | Product / User research |

QoS says “I’m delivering 10 Mbps.” QoE says “Can users actually watch without buffering?”

The same QoS can produce very different QoE depending on player implementation, ABR strategy, and encoding parameters.

We care about QoE. Running a video platform without QoE data is like operating a highway system without measuring traffic congestion.


Six Core QoE Metrics

1. Video Startup Time (VST)

Definition: Seconds from user pressing play to first frame rendered.

Targets:

  • Short-form mobile video: P50 < 300ms, P95 < 800ms
  • Long-form VOD: P50 < 1s, P95 < 2s

VST is the metric users are most sensitive to: few will tolerate a 2-second black screen.
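The targets above are percentiles, not averages, so a handful of very slow starts shows up in P95 instead of being smoothed away. A minimal sketch of the calculation, assuming the nearest-rank method and made-up sample values (`percentile` and `vst_ms` are illustrative names, not part of any SDK):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical startup times in milliseconds, one per session.
vst_ms = [180, 220, 250, 260, 310, 340, 420, 610, 790, 2400]

p50 = percentile(vst_ms, 50)   # the typical session
p95 = percentile(vst_ms, 95)   # the slow tail
print(f"VST P50={p50}ms P95={p95}ms")
```

In this sample the median looks healthy while P95 is dominated by the single 2.4-second start, which is exactly the kind of session a mean would hide.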

2. Rebuffering Ratio (RBR)

Definition: rebuffer_time / (rebuffer_time + play_time)

Example: User watched 60 seconds, buffered for 3 seconds total. RBR = 3/(60+3) = 4.8%.

Target: < 0.5% (excellent), < 1% (acceptable).
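The formula and targets can be sketched in a few lines. The function names and the "poor" label for anything at or above 1% are assumptions made for this example:

```python
def rebuffering_ratio(rebuffer_sec, play_sec):
    """RBR = rebuffer_time / (rebuffer_time + play_time)."""
    total = rebuffer_sec + play_sec
    if total <= 0:
        return 0.0  # nothing was played or buffered, so there is nothing to measure
    return rebuffer_sec / total

def rate_rbr(rbr):
    """Map an RBR value onto the targets above; 'poor' is this sketch's own label."""
    if rbr < 0.005:
        return "excellent"
    if rbr < 0.01:
        return "acceptable"
    return "poor"

rbr = rebuffering_ratio(rebuffer_sec=3, play_sec=60)
print(f"{rbr:.1%} -> {rate_rbr(rbr)}")
```

Note that the denominator includes the rebuffering time itself, so the ratio can never exceed 100%.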

Impacts retention directly: Conviva research shows that every 1% increase in RBR reduces watch time by 2–5%.

3. Video Start Failure (VSF)

Definition: User triggered playback but the first frame never rendered (due to 404, CORS, DRM error, etc.).

Target: < 1%

Conviva further subdivides this:

  • VSF-T (Technical): Failed due to technical reasons — counts as QoE failure
  • VSF-B (Business): Failed for business reasons (no subscription, geo-blocked) — excluded from QoE

4. Exit Before Video Start (EBVS)

Definition: User triggered playback but left voluntarily before the first frame — not an error, just impatience.

Target: < 3%

Strongly correlated with VST. Slow startup → high EBVS → poor retention.
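VSF and EBVS both describe a play attempt that never reached its first frame; what separates them is whether an error occurred, and what kind. A rough sketch of that classification, assuming hypothetical error codes and the choice (made here, not stated by Conviva) to drop VSF-B attempts from the denominator:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical error codes that count as business, not technical, failures.
BUSINESS_ERRORS = {"no_subscription", "geo_blocked"}

@dataclass
class StartAttempt:
    first_frame_rendered: bool
    error_code: Optional[str] = None  # set only when an error stopped startup

def classify_start(attempt):
    if attempt.first_frame_rendered:
        return "started"
    if attempt.error_code is None:
        return "EBVS"    # no error: the user left before the first frame
    if attempt.error_code in BUSINESS_ERRORS:
        return "VSF-B"   # excluded from QoE
    return "VSF-T"       # counts as a QoE failure

def start_rates(attempts):
    """VSF and EBVS rates over all attempts except business failures."""
    outcomes = [classify_start(a) for a in attempts]
    counted = [o for o in outcomes if o != "VSF-B"]
    if not counted:
        return {"VSF": 0.0, "EBVS": 0.0}
    n = len(counted)
    return {"VSF": counted.count("VSF-T") / n, "EBVS": counted.count("EBVS") / n}

attempts = (
    [StartAttempt(True)] * 7
    + [StartAttempt(False)]                  # user gave up
    + [StartAttempt(False, "drm_error")]     # technical failure
    + [StartAttempt(False, "geo_blocked")]   # business failure
)
print(start_rates(attempts))
```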

5. Video Playback Failure (VPF)

Definition: Playback crashed mid-stream (decode error, expired certificate, CDN stream cut).

Target: < 0.5%

6. Average Bitrate

Definition: Time-weighted average of bitrate tiers during actual playback.

Purpose: Measures whether users are actually seeing acceptable quality. If 70% of users average 480p, it could mean:

  • Network conditions are generally poor
  • ABR algorithm is too conservative
  • High tiers weren’t transcoded
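"Time-weighted" means each bitrate tier counts in proportion to how long it was actually on screen. A minimal sketch, assuming the player can report `(bitrate_bps, seconds_played)` pairs per session (that input shape is an assumption for this example):

```python
def average_bitrate(intervals):
    """Time-weighted mean bitrate over (bitrate_bps, seconds_played) pairs."""
    total_sec = sum(sec for _, sec in intervals)
    if total_sec <= 0:
        return 0.0
    return sum(bps * sec for bps, sec in intervals) / total_sec

# Hypothetical session: a short ramp-up, a long stretch mid-tier, a short stretch top-tier.
playback = [(800_000, 10), (2_500_000, 40), (5_000_000, 10)]

avg_bps = average_bitrate(playback)
print(f"{avg_bps / 1e6:.2f} Mbps")
```

A plain mean of the three tiers would weight the 10-second top tier as heavily as the 40 seconds spent mid-tier and overstate what the user saw.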

Supporting Metrics

| Metric | Description |
|---|---|
| Rebuffer Frequency | Rebuffer events per minute of playback (target: < 0.1/min) |
| Bitrate Switching | Number and magnitude of quality switches (stability preferred) |
| Video Complete Rate | Percentage of users who finish the video (business metric) |
| Time to Key Decode | Time to acquire DRM license |
| First Byte Time | Time until first byte arrives from CDN |

Conviva’s SPI: A Composite Index

SPI (Streaming Performance Index): Conviva’s composite KPI representing the percentage of sessions with “good or very good” experience.

A session qualifies as “good” when it simultaneously meets:

  • No VSF-T or VPF-T errors
  • No or minimal rebuffering (CIRR, Conviva's Connection Induced Rebuffering Ratio, below threshold)
  • Average bitrate meets the screen-size quality bar
  • Video Start Time within acceptable range
  • User didn’t wait excessively before exit

Single metrics can mislead (e.g., low RBR but extremely low bitrate). SPI provides a holistic view.
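The "all conditions at once" idea is easy to sketch for a self-built pipeline. The code below is not Conviva's SPI definition: the thresholds are placeholders, plain RBR stands in for CIRR, and the "waited excessively before exit" condition is simplified to an EBVS flag.

```python
from dataclasses import dataclass

# Placeholder thresholds for illustration only; not Conviva's actual values.
MAX_RBR = 0.01
MAX_VST_SEC = 2.0

@dataclass
class SessionSummary:
    technical_failure: bool     # a VSF-T or VPF-T occurred
    exited_before_start: bool   # EBVS
    rebuffering_ratio: float
    vst_sec: float
    avg_bitrate_bps: float
    min_bitrate_bps: float      # quality bar for this session's screen size

def is_good_session(s):
    """A session is good only if every condition holds simultaneously."""
    return (
        not s.technical_failure
        and not s.exited_before_start
        and s.rebuffering_ratio <= MAX_RBR
        and s.vst_sec <= MAX_VST_SEC
        and s.avg_bitrate_bps >= s.min_bitrate_bps
    )

def good_session_ratio(sessions):
    if not sessions:
        return 0.0
    return sum(is_good_session(s) for s in sessions) / len(sessions)

sessions = [
    SessionSummary(False, False, 0.002, 0.9, 3_000_000, 1_500_000),  # good
    SessionSummary(False, False, 0.000, 0.8, 400_000, 1_500_000),    # no stalls, bitrate too low
    SessionSummary(True, False, 0.000, 0.0, 0, 1_500_000),           # technical failure
    SessionSummary(False, False, 0.030, 1.1, 3_000_000, 1_500_000),  # too much rebuffering
]
print(f"good sessions: {good_session_ratio(sessions):.0%}")
```

The second session is the "low RBR but extremely low bitrate" case: it would look perfect on an RBR chart yet fails the composite check.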


Multi-Dimensional Drill-Down

Never look at just “overall RBR.” Always slice by dimensions:

| Dimension | Examples |
|---|---|
| Geography | Country / State / City / ISP |
| Device | OS version, model, chipset, screen size |
| Network | WiFi / 4G / 5G, throughput range |
| CDN | Provider, PoP, Shield |
| Content | Title, resolution tier, codec, duration |
| Time | Hour, day, week |
| User | New/returning, paid/free, region |

Standard troubleshooting pattern:

Overall RBR spiked to 2% → cause unknown
  ↓ Drill by CDN → CDN-A RBR 5%, CDN-B RBR 0.3%
  ↓ Drill CDN-A by region → Mumbai RBR 12%
  ↓ Drill Mumbai by time → 19:00-22:00 peak spike
  → Conclusion: CDN-A Mumbai PoP degraded during evening peak
  → Action: Route India traffic to CDN-B
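Each drill step above is a group-by on one dimension followed by a filter on the worst group. A small in-memory sketch of that pattern, with invented session rows and field names (`cdn`, `city`, `rebuffer_sec`, `play_sec`); in production this would typically be a query against the warehouse:

```python
from collections import defaultdict

def rbr_by(sessions, dimension):
    """RBR per value of one dimension, aggregated by summing times across sessions."""
    totals = defaultdict(lambda: [0.0, 0.0])  # value -> [rebuffer_sec, play_sec]
    for s in sessions:
        bucket = totals[s[dimension]]
        bucket[0] += s["rebuffer_sec"]
        bucket[1] += s["play_sec"]
    return {
        value: (rb / (rb + play) if rb + play > 0 else 0.0)
        for value, (rb, play) in totals.items()
    }

def worst(sessions, dimension):
    """The dimension value with the highest RBR."""
    rates = rbr_by(sessions, dimension)
    return max(rates, key=rates.get)

# Hypothetical session summaries.
sessions = [
    {"cdn": "CDN-A", "city": "Mumbai", "rebuffer_sec": 12.0, "play_sec": 88.0},
    {"cdn": "CDN-A", "city": "Delhi", "rebuffer_sec": 1.0, "play_sec": 99.0},
    {"cdn": "CDN-B", "city": "Mumbai", "rebuffer_sec": 0.3, "play_sec": 99.7},
    {"cdn": "CDN-B", "city": "Delhi", "rebuffer_sec": 0.3, "play_sec": 99.7},
]

bad_cdn = worst(sessions, "cdn")                       # step 1: drill by CDN
on_bad_cdn = [s for s in sessions if s["cdn"] == bad_cdn]
bad_city = worst(on_bad_cdn, "city")                   # step 2: drill that CDN by city
print(bad_cdn, bad_city, rbr_by(on_bad_cdn, "city"))
```

Summing rebuffer and play time before dividing, rather than averaging per-session ratios, keeps very short sessions from dominating the result.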

Data Pipeline: From Client to Dashboard

Typical Architecture

┌──────────────┐
│ Client SDK    │
│ (iOS/Android/ │── HTTPS batch POST every 5-10s
│  Web)         │   (events JSON)
└──────────────┘
       │
       ▼
┌──────────────┐
│ Ingestion     │   nginx / ALB / API Gateway / CloudFront
│ (Edge)        │   with rate limiting + auth
└──────────────┘
       │
       ▼
┌──────────────┐
│   Kafka       │   Persistent message queue
│  Topic: qoe   │   Partitioned by session_id
└──────────────┘
       │
       ├─────────────────────┐
       ▼                     ▼
┌──────────────┐      ┌──────────────┐
│ Flink / Spark │      │ ClickHouse / │   Real-time data warehouse
│ Streaming     │      │ BigQuery     │
│ (aggregation) │      │              │
└──────────────┘      └──────────────┘
       │                     │
       ▼                     ▼
┌──────────────┐      ┌──────────────┐
│  Alerting     │      │ BI Dashboard │   Grafana, Tableau, Looker
│ (PagerDuty)   │      │              │
└──────────────┘      └──────────────┘

Event Schema

Every event includes:

{
  "event": "video_rebuffer_start",
  "session_id": "uuid-...",
  "user_id": "u-12345",
  "video_id": "ep-789",
  "timestamp": 1715084800123,
  "player_version": "2.3.4",
  "device": {
    "os": "iOS",
    "os_version": "17.4",
    "model": "iPhone 15 Pro"
  },
  "network": {
    "type": "cellular",
    "carrier": "Verizon",
    "effective_type": "4g"
  },
  "cdn": "cloudfront",
  "bitrate": 2500000,
  "buffer_level_sec": 0.8,
  "position_sec": 45.2
}

Batch vs Real-Time

Don’t send one HTTP request per event: at 100K DAU × 100 events per user, that is 10M requests per day.

Batch strategy: Accumulate events for up to 10 seconds or until 50 are queued, whichever comes first, then send them in one POST.
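The batch strategy amounts to a buffer with two flush triggers. A minimal sketch, assuming a hypothetical `send` callable that performs the POST and an injectable `clock` so the logic can be tested; this version checks the age limit only when a new event arrives, whereas a real SDK would also flush from a timer and on app backgrounding:

```python
import json
import time

class EventBatcher:
    """Buffers events and flushes when the batch is full or old enough."""

    def __init__(self, send, max_events=50, max_age_sec=10.0, clock=time.monotonic):
        self._send = send              # hypothetical: takes one JSON string, does the POST
        self._max_events = max_events
        self._max_age_sec = max_age_sec
        self._clock = clock
        self._buffer = []
        self._oldest = 0.0             # clock reading when the current batch started

    def add(self, event):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(event)
        too_many = len(self._buffer) >= self._max_events
        too_old = self._clock() - self._oldest >= self._max_age_sec
        if too_many or too_old:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        payload = json.dumps({"events": self._buffer})
        self._buffer = []
        self._send(payload)

# Usage with a stand-in sender that just records payloads.
outbox = []
batcher = EventBatcher(outbox.append, max_events=2)
batcher.add({"event": "video_rebuffer_start", "position_sec": 45.2})
batcher.add({"event": "video_rebuffer_end", "position_sec": 45.2})
print(len(outbox), "request sent for 2 events")
```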


Real-World Troubleshooting Cases

Case 1: Overall VST Spike

Monday 9 AM: VST P50 jumped from 400ms to 1.2s

├── Drill by OS → Android VST spiked to 2s, iOS normal

├── Drill by app version → v3.4.5 all 2s, v3.4.4 normal

├── Check changelog → v3.4.5 introduced a new player library

└── Action: Emergency hotfix / rollback to v3.4.4

Case 2: Single Title Rebuffering

New show Episode 3: RBR anomalously high at 5%

├── Drill by CDN → All CDNs high (not a CDN issue)

├── Check segments → One segment is 20 MB (others are 2 MB)

├── Check encoding log → 10-second action scene caused bitrate spike

└── Fix: Re-transcode with a MaxBitrate cap to limit peak bitrate
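The "check segments" step in Case 2 can be automated by comparing every segment against the typical size for that rendition. A rough sketch, assuming segment sizes are already available as a list of byte counts and using an arbitrary 3× median cutoff (both the input shape and the factor are assumptions):

```python
from statistics import median

def oversized_segments(sizes_bytes, factor=3.0):
    """Indexes of segments larger than `factor` times the median segment size."""
    if not sizes_bytes:
        return []
    typical = median(sizes_bytes)
    return [i for i, size in enumerate(sizes_bytes) if size > factor * typical]

# Hypothetical rendition: ~2 MB segments with one 20 MB spike at index 5.
sizes = [2_000_000] * 5 + [20_000_000] + [2_100_000] * 4

print(oversized_segments(sizes))
```

Using the median rather than the mean keeps the outlier itself from inflating the baseline it is compared against.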

Case 3: Regional Conversion Drop

India new-user first-hour completion rate dropped from 30% to 15%

├── Drill by VST → India VST P50 rose from 0.8s to 3s

├── Drill by CDN → CDN-A edge node latency elevated in India

├── Ping test → CDN-A Mumbai PoP latency 400ms for 4 hours

└── Action: Route India traffic to CDN-B, escalate to CDN-A support

Build vs Buy: Mux, Conviva, or Self-Built?

Managed Services

| Service | Strengths |
|---|---|
| Mux | Developer-friendly, simple integration, ~$1.25/1K sessions |
| Conviva | Enterprise-grade, most comprehensive, most expensive |
| Datadog RUM | Integrated APM in one platform |
| NPAW (YOUBORA) | Strong in European markets |

Pros: Hours to integrate, dashboards out of the box, zero maintenance.
Cons: Expensive at scale, data lives with a third party, limited customization.

Self-Built

Pros: Full customization, data can be joined with business metrics (orders, retention), cost advantage at scale.
Cons: High development/maintenance cost, multi-platform SDK consistency is hard.

Common Evolution

  • Early stage: Buy Mux — get usable dashboards fast
  • At scale: Self-built pipeline + keep Mux as a benchmark for comparison

Client SDK Best Practices

Don’t Slow Down Playback

The QoE SDK itself must not degrade the experience:

  • Report on a separate low-priority thread
  • Network failures: silent retry, never block UI
  • SDK crash must not bring down the app
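These three rules can be sketched as a bounded queue drained by a background worker. Everything here is illustrative: `send` is a hypothetical upload callable, Python threads have no priority setting (a mobile SDK would use the platform's low-priority queue), and failures are simply swallowed where a real SDK would retry with backoff.

```python
import queue
import threading

class BackgroundReporter:
    """Hands events to a worker thread so the playback path never waits on the network."""

    def __init__(self, send, max_pending=1000):
        self._send = send
        self._pending = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, event):
        """Called from the player: never blocks and never raises."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            return False  # drop the event rather than stall playback

    def _run(self):
        while True:
            event = self._pending.get()
            try:
                self._send(event)
            except Exception:
                pass  # reporting errors must never reach the app
            finally:
                self._pending.task_done()

    def wait_idle(self):
        """Block until queued events are processed (for tests and shutdown only)."""
        self._pending.join()

# Usage with a stand-in sender that records events.
received = []
reporter = BackgroundReporter(received.append)
reporter.report({"event": "video_start"})
reporter.wait_idle()
print(received)
```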

Offline Compensation

Users may finish watching while offline, so events must be held until the device is back online:

  • While offline, events are written to local SQLite or a file
  • When connectivity returns, they are batch-uploaded in FIFO order
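A local FIFO queue of this kind fits in a single SQLite table. A minimal sketch using Python's `sqlite3` module; the table layout and the `upload` callable (returns `True` on success) are assumptions for this example:

```python
import json
import sqlite3

class OfflineQueue:
    """Persists events locally and uploads them oldest-first once online."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self._db.commit()

    def enqueue(self, event):
        self._db.execute("INSERT INTO events (body) VALUES (?)", (json.dumps(event),))
        self._db.commit()

    def drain(self, upload, batch_size=50):
        """Upload in insertion order; rows are deleted only after a successful upload."""
        uploaded = 0
        while True:
            rows = self._db.execute(
                "SELECT id, body FROM events ORDER BY id LIMIT ?", (batch_size,)
            ).fetchall()
            if not rows:
                return uploaded
            if not upload([json.loads(body) for _, body in rows]):
                return uploaded  # still offline or server error: keep rows for later
            self._db.execute("DELETE FROM events WHERE id <= ?", (rows[-1][0],))
            self._db.commit()
            uploaded += len(rows)

# Usage: queue three events, then drain them into a stand-in uploader.
offline = OfflineQueue()
for seq in range(3):
    offline.enqueue({"event": "heartbeat", "seq": seq})
sent_batches = []
count = offline.drain(lambda batch: sent_batches.append(batch) or True, batch_size=2)
print(count, sent_batches)
```

Deleting only after a confirmed upload means a crash mid-upload can resend a batch, so the server side should tolerate duplicates.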

Clock Alignment

Device clocks can be inaccurate:

  • Use server timestamps (HTTP Date header) as reference
  • Events carry relative time (delta_ms from session start)
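Combining the two rules gives a simple correction: estimate the device's clock offset once from a server response, then place each event at session start plus its delta. A sketch under stated assumptions: the function names are invented, the request round trip is assumed symmetric, and because the HTTP Date header has one-second resolution the offset is only accurate to about a second.

```python
from email.utils import parsedate_to_datetime

def clock_offset_ms(date_header, device_send_ms, device_recv_ms):
    """Server clock minus device clock, estimated at the midpoint of one request."""
    server_ms = parsedate_to_datetime(date_header).timestamp() * 1000
    device_mid_ms = (device_send_ms + device_recv_ms) / 2
    return server_ms - device_mid_ms

def to_server_time_ms(session_start_device_ms, delta_ms, offset_ms):
    """Convert an event's delta from session start into server-aligned epoch milliseconds."""
    return session_start_device_ms + offset_ms + delta_ms

# Hypothetical device whose clock runs 5 seconds behind the server.
offset = clock_offset_ms(
    "Tue, 07 May 2024 12:26:40 GMT",
    device_send_ms=1715084794900,
    device_recv_ms=1715084795100,
)
event_ms = to_server_time_ms(1715084795000, delta_ms=1500, offset_ms=offset)
print(offset, event_ms)
```

Because events carry only a delta, a user changing the device clock mid-session does not reorder them.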

Sampling at Scale

At massive scale, 100% reporting is too expensive:

  • Critical error events: Always 100% reported
  • Normal events: Sample at 10–30%
  • Hash by user_id to ensure all-or-nothing per user (preserves session analysis)
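The all-or-nothing property comes from making the sampling decision a pure function of `user_id`. A minimal sketch, assuming a hypothetical `critical` flag on error events; it uses SHA-256 rather than Python's built-in `hash()`, which is randomized per process for strings and would give different answers on each run:

```python
import hashlib

def in_sample(user_id, rate):
    """Deterministically map a user to [0, 1) and keep those below the sampling rate."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def should_report(event, rate=0.2):
    """Critical events always go out; normal events only for sampled users."""
    if event.get("critical"):
        return True
    return in_sample(event["user_id"], rate)

print(should_report({"user_id": "u-12345", "event": "video_rebuffer_start"}))
print(should_report({"user_id": "u-12345", "event": "video_playback_failure", "critical": True}))
```

The same function can run on every platform's SDK, so a given user is either fully in or fully out of the sample everywhere. Dashboards then need to scale sampled counts by `1 / rate`.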

Essential Dashboards

Dashboard 1: Global Overview

  • DAU, total play sessions
  • VST P50 / P95
  • RBR, VSF, VPF
  • SPI (composite score)
  • Top 10 countries drill-down

Dashboard 2: CDN Health

  • Per-CDN RBR, VST, error rate
  • CDN comparison panel (same time window)
  • CDN edge node map

Dashboard 3: Content Quality

  • New title first-24-hour quality metrics
  • Per-title completion rate and RBR
  • Anomalous title alerts

Dashboard 4: Device and Version

  • Error rate by app version
  • VST by OS version
  • RBR by device model

The QoE Optimization Loop

QoE data isn’t for passive observation — it drives engineering decisions:

       Data reveals problem
              ↓
       Locate root cause
       (CDN? Encoding? ABR?)
              ↓
       Try fix + A/B test
              ↓
       Verify QoE improved
              ↓
       Ship to 100% + keep monitoring

A weekly QoE review is standard practice on mature video teams.


Key Takeaways

  1. QoE measures user-perceived experience, not network metrics.
  2. Six core metrics: VST / RBR / VSF / EBVS / VPF / Average Bitrate.
  3. Conviva’s SPI is a composite “good experience session ratio.”
  4. Data must be sliced by multiple dimensions — a single global number can’t locate problems.
  5. Standard pipeline: Client SDK -> Kafka -> Flink/ClickHouse -> BI.
  6. Start with Mux/Conviva in early stages; build in-house when scale justifies it.
  7. QoE data drives decisions — review weekly.

Previous: Part 9: Video Players

Next: Part 11: End-to-End Workflow