What are the ODS, DWD, DWS, and ADS data warehouse layers?

ODS (Operational Data Store) holds raw ingested data. DWD (Detail) cleanses and enriches it. DWS (Summary) aggregates into metrics. ADS (Application) creates wide feature tables and ML samples ready for downstream consumption.

How does a single click event flow through the entire data pipeline?

Client SDK → API Gateway → Lambda → MSK topic → consumed by both Firehose (→ S3/Iceberg for offline) and Flink (→ DynamoDB for real-time features). The same event serves training and inference simultaneously.

阅读中文版 →

Big Data on AWS Deep Dive (Part 6): End-to-End Data Pipeline — From Source to Feature Store

Connect all the dots: trace a click event from client SDK through API Gateway, MSK, Firehose, S3, warehouse layers (ODS→DWD→DWS→ADS), to DynamoDB for real-time serving.

zhuermu · May 15, 2025 · 10 min

big-dataawsdata-pipelineend-to-endevent-drivendata-warehouse-layers

Chapters 03 through 05 covered each service in detail. This chapter connects them all together, mapping the architecture to a real-world social app use case.

No new services are introduced here. The goal is: after reading this chapter, you can explain the entire data-side architecture using a single diagram.

Customer Scenario Recap

Key facts from the customer’s solution document:

Business: Social app targeting user recommendations (follow / feed / People You May Know)
Data sources: MySQL (business database), ES (search), DocumentDB (documents), client-side event tracking
DAU: 1 million+
DocumentDB version: 5.0 (supports Change Streams)
Latency requirements: T+1 (offline-first; real-time considered for a later phase)
Existing Kafka: None (must be built from scratch)

End-to-End Architecture Diagram

End to End

The entire pipeline is divided into 5 layers:

Layer	Role
Layer 1: Data Sources	Aurora MySQL / DocumentDB / OpenSearch / Client-side events / Third-party data
Layer 2: Ingestion Channels	Aurora Zero-ETL / DMS / OpenSearch Ingestion / API GW + MSK + Firehose / EventBridge
Layer 3: S3 + Iceberg ODS	14+ Iceberg tables, grouped by data source
Layer 4: Layered Processing	DWD / DWS / ADS, orchestrated by MWAA
Layer 5: Downstream Consumers	BI / ML training / Online layer sync / Real-time pipeline

Pipeline Decomposition: 5 Independent Data Flows

Breaking the entire pipeline into 5 relatively independent sub-pipelines makes it easier to understand:

Pipeline A: Business Database CDC (Aurora to Data Lake)

Aurora MySQL (orders, users, posts, follows)
     | binlog
     v
Aurora Zero-ETL to SageMaker Lakehouse  (sub-second latency)
     | (AWS-managed, zero ops)
     v
S3 Tables (Iceberg) + auto-registered in Glue Catalog
     |
     v
ods_user / ods_post / ods_follow / ods_order

Key points:

Choose Zero-ETL over traditional DMS (Zero-ETL is the recommended path for Aurora)
Data lands in S3 Tables (AWS-native Iceberg storage) — you cannot see underlying files in the console, but Athena/SageMaker can read them directly
Glue Catalog auto-registers tables; schema evolves with the source

Pipeline B: DocumentDB CDC

DocumentDB 5.0 (Change Streams enabled)
     | change streams
     v
DMS Replication Task (source = DocDB, target = S3)
     |
     v
S3 dms-raw/docdb/<collection>/...parquet
     |
     v
Glue Job hourly MERGE → ods_doc_user / ods_doc_msg (Iceberg)

Key points:

DocumentDB must be version 4.0+ with Change Streams enabled (increases source DB I/O)
DMS output is raw Parquet, not Iceberg — a Glue Job is needed to merge into Iceberg tables
The s3://.../dms-raw/ directory is a staging area; production tables live in s3://.../warehouse/ods/

Pipeline C: OpenSearch to S3

OpenSearch Service (managed)
     | scroll API / PIT
     v
OpenSearch Ingestion Pipeline (yaml)
     |
     v
S3 ods_es_search/ (Parquet)

Key points:

First confirm necessity — if ES only contains a search replica of MySQL data, ingesting from MySQL is more direct
DMS does not support ES as a source
For self-managed ES, use Logstash + S3 output instead

Pipeline D: Event Tracking (Core Pipeline, Two Phases)

The customer’s current requirement is T+1 with no Kafka infrastructure. Phase 1 should not rush into MSK — wait until Phase 3 when real-time features are actually needed. Below shows both the simplified version (Phase 1) and the target architecture (Phase 3+).

Phase 1: Simplified Version (Sufficient for T+1)

Client SDK → API Gateway (HTTP API) → Lambda (auth + enrichment) → Firehose → S3 ods_event

Characteristics:

Fully serverless, minimal ops burden
Monthly cost approximately $4K (API Gateway is the largest component)
Drawback: Firehose delivers to a single destination (S3); going real-time later requires architectural changes

Phase 3+: Target Architecture (When Real-Time Features Go Live)

Client SDK
     |
     v
API Gateway (HTTP API, auth)
     |
     v
Lambda (enrichment: server_ts, geo, ip, app_ver)
     |
     v
Amazon MSK (Kafka, topic=events, 12 partitions, across 3 AZs)
     |
     |-- Consumer Group "offline"  --> Firehose --> S3 ods_event (Parquet)
     |-- Consumer Group "realtime" --> Managed Flink --> DynamoDB user_realtime_features
     +-- Consumer Group "risk"     --> Lambda fraud detection

Migration cost: Change the Lambda delivery target from Firehose to MSK; downstream attaches Firehose with Kafka source (supported since 2024 — MSK as Firehose source). The application layer requires no changes.

Key points:

API Gateway uses HTTP API (approximately 70% cheaper than REST API)
Lambda is essential (auth + enrichment + server_ts)
MSK value: Multi-subscriber message bus — the same data is consumed independently by N downstream consumers + supports data replay
Schema managed via Glue Schema Registry (essential once event tracking exceeds 50+ event types)

The timeline below illustrates the dual-path consumption:

Timeline

The same click event: lands in DynamoDB within 500ms for recommendation serving, and lands in S3 within 60 seconds for model training.

Pipeline E: Third-Party Data

EventBridge Schedule (hourly / daily)
     |
     v
Lambda (calls third-party HTTP APIs)
     |
     v
S3 ods_3rd_channel/dt=.../*.json

Key points:

Use EventBridge rather than cron or Lambda Scheduled — it is more standard
Failures must retry; recommend small batches with idempotent writes

Layered Processing (Daily Routine)

Orchestration

Orchestrated by an MWAA Airflow DAG, running daily in the early morning:

02:00 Wait for Zero-ETL / DMS daily data completeness (sentinel task)
02:30 |- Glue Job: dwd_user_action_clean       (cleanse + IP→geo + join user dim)
      +- Glue Job: dwd_post_enrich              (post + tags + engagement counts)
03:00 |- Athena CTAS: dws_user_daily            (user daily metric aggregation)
      |- Athena CTAS: dws_post_daily            (post daily metrics)
      +- Athena CTAS: dws_pair_interaction      (user-pair interaction)
03:30 |- EMR Serverless: ads_user_features     (user 100+ dimension feature wide table)
      |- EMR Serverless: ads_post_features     (content features)
      |- EMR Serverless: ads_sample_follow     (follow prediction samples, PIT-correct)
      +- EMR Serverless: ads_recall_u2u_cf     (collaborative filtering recall pool)
04:30 |- Glue Job: sync_user_features_to_ddb   (write to DynamoDB)
      +- Glue Job: sync_recall_pool_to_ddb     (write to DynamoDB)
05:00 |- SageMaker Training Job: train_recall_two_tower
      +- SageMaker Training Job: train_rank_lightgbm
06:00 Lambda: deploy SageMaker Endpoint (blue/green or canary)
06:30 DQ Check + report (success / failure → Slack / email)

Each step:

Retries up to 2 times on failure
If still failing → alert on-call engineer
Overall DAG SLA: 7 hours

Data Asset Inventory

Below is the core Iceberg table inventory for the customer scenario:

ODS Layer (Source Mirrors)

Table	Source	Primary Key	Partition
ods_user	Aurora users	user_id	dt
ods_post	Aurora posts	post_id	dt, hr
ods_follow	Aurora follows	(follower_id, followee_id)	dt
ods_doc_user	DocumentDB user_profile	user_id	dt
ods_doc_msg	DocumentDB messages	msg_id	dt, hr
ods_event	Event tracking (MSK→Firehose)	event_id	dt, hr, event_type

DWD Layer (Detail — Cleansed and Enriched)

Table	Description
dwd_user_action	Events + user dimensions + IP→geo
dwd_post	Post detail + tags + engagement counts
dwd_user_relation	Follow relationship slowly changing dimension (SCD) table

DWS Layer (Summary — Light Aggregation)

Table	Description
dws_user_daily	User daily: impressions, clicks, follows, dwell time
dws_post_daily	Post daily: impressions, clicks, likes, shares
dws_pair_interaction	User-pair interaction cumulative values (for U2U collaborative filtering)

ADS Layer (Application / Data Mart)

Table	Description	Purpose
ads_user_features	User 100+ dimension feature wide table	Recommendation features + online sync
ads_post_features	Content features	Recommendation features
ads_sample_follow	Follow prediction sample table (label + feature PIT snapshot)	Model training
ads_sample_ctr	Click-through rate prediction sample table	Model training
ads_recall_u2u_cf	Collaborative filtering recall pool (user → top-K candidates)	Online recall sync

Key Design Decision Review

Looking back at the entire architecture, every decision has a rationale:

Decision	Rationale
Use Lakehouse (S3 + Iceberg) instead of Redshift	ML-friendly + cost-effective + extensible
Aurora uses Zero-ETL instead of DMS	Native Aurora path, sub-second latency, zero ops
DocumentDB uses DMS	DocumentDB does not have Zero-ETL
Event tracking: API GW + MSK + Firehose	MSK multi-subscription supports both offline and real-time
ETL: EMR Serverless primary + Athena CTAS for lightweight jobs	Optimal cost combination
Orchestration: MWAA	Complex task dependencies; Airflow has strong expressiveness
Data format: Parquet + Iceberg	Columnar compression + ACID + Time Travel + PIT
Schema management: Glue Schema Registry	Essential for 50+ event types

Original vs. Improved Architecture Comparison

#	Customer’s Original Approach	Improved Approach	Improvement
1	MySQL → S3 (method unspecified)	Aurora Zero-ETL → Lakehouse	Recommended path, sub-second, zero ops
2	ES → S3 using DMS	OpenSearch Ingestion	DMS does not support ES as a source
3	DocumentDB → S3	DMS + DocumentDB Change Streams	Must enable Change Streams first; note the cost impact
4	S3 + Athena data warehouse	Add Iceberg + Parquet + partitioning	Solves UPDATE / DELETE / PIT requirements
5	ODS → DWD layering	Explicit 4 layers + engine combination + Iceberg PIT	PIT correctness is mandatory for recommendation scenarios
6	API GW → Firehose → S3	API GW → Lambda → MSK → Firehose	MSK supports multi-subscription; reserves capacity for real-time pipeline

Chapter Summary

At this point, the data-side architecture is fully covered. You should now be able to:

Draw the complete data pipeline diagram
Explain the role of every service in each pipeline
Articulate “why X was chosen over Y” for each decision
Map everything to the customer’s actual data asset inventory

The next four chapters move into the ML side (recommendation system):

Chapter 07: Recommendation system fundamentals (recall / ranking / features / models)
Chapter 08: Online features and recall storage (DynamoDB / Redis / OpenSearch kNN / Neptune)
Chapter 09: SageMaker and the ML platform (Feature Store / Training / Endpoint)
Chapter 10: Complete end-to-end architecture + cost estimation

References

Apache Airflow documentation — Apache
AWS Glue Developer Guide — AWS Documentation
Apache Iceberg — Apache