Big Data on AWS Deep Dive (Part 6): End-to-End Data Pipeline — From Source to Feature Store
Connect all the dots: trace a click event from client SDK through API Gateway, MSK, Firehose, S3, warehouse layers (ODS→DWD→DWS→ADS), to DynamoDB for real-time serving.
Chapters 03 through 05 covered each service in detail. This chapter connects them all together, mapping the architecture to a real-world social app use case.
No new services are introduced here. The goal is: after reading this chapter, you can explain the entire data-side architecture using a single diagram.
Customer Scenario Recap
Key facts from the customer’s solution document:
- Business: Social app targeting user recommendations (follow / feed / People You May Know)
- Data sources: MySQL (business database), ES (search), DocumentDB (documents), client-side event tracking
- DAU: 1 million+
- DocumentDB version: 5.0 (supports Change Streams)
- Latency requirements: T+1 (offline-first; real-time considered for a later phase)
- Existing Kafka: None (must be built from scratch)
End-to-End Architecture Diagram
The entire pipeline is divided into 5 layers:
| Layer | Role |
|---|---|
| Layer 1: Data Sources | Aurora MySQL / DocumentDB / OpenSearch / Client-side events / Third-party data |
| Layer 2: Ingestion Channels | Aurora Zero-ETL / DMS / OpenSearch Ingestion / API GW + MSK + Firehose / EventBridge |
| Layer 3: S3 + Iceberg ODS | 14+ Iceberg tables, grouped by data source |
| Layer 4: Layered Processing | DWD / DWS / ADS, orchestrated by MWAA |
| Layer 5: Downstream Consumers | BI / ML training / Online layer sync / Real-time pipeline |
Pipeline Decomposition: 5 Independent Data Flows
Breaking the entire pipeline into 5 relatively independent sub-pipelines makes it easier to understand:
Pipeline A: Business Database CDC (Aurora to Data Lake)
Aurora MySQL (orders, users, posts, follows)
| binlog
v
Aurora Zero-ETL to SageMaker Lakehouse (sub-second latency)
| (AWS-managed, zero ops)
v
S3 Tables (Iceberg) + auto-registered in Glue Catalog
|
v
ods_user / ods_post / ods_follow / ods_order
Key points:
- Choose Zero-ETL over traditional DMS (Zero-ETL is the recommended path for Aurora)
- Data lands in S3 Tables (AWS-native Iceberg storage) — you cannot see underlying files in the console, but Athena/SageMaker can read them directly
- Glue Catalog auto-registers tables; schema evolves with the source
Pipeline B: DocumentDB CDC
DocumentDB 5.0 (Change Streams enabled)
| change streams
v
DMS Replication Task (source = DocDB, target = S3)
|
v
S3 dms-raw/docdb/<collection>/...parquet
|
v
Glue Job hourly MERGE → ods_doc_user / ods_doc_msg (Iceberg)
Key points:
- DocumentDB must be version 4.0+ with Change Streams enabled (increases source DB I/O)
- DMS output is raw Parquet, not Iceberg — a Glue Job is needed to merge into Iceberg tables
- The
s3://.../dms-raw/directory is a staging area; production tables live ins3://.../warehouse/ods/
Pipeline C: OpenSearch to S3
OpenSearch Service (managed)
| scroll API / PIT
v
OpenSearch Ingestion Pipeline (yaml)
|
v
S3 ods_es_search/ (Parquet)
Key points:
- First confirm necessity — if ES only contains a search replica of MySQL data, ingesting from MySQL is more direct
- DMS does not support ES as a source
- For self-managed ES, use Logstash + S3 output instead
Pipeline D: Event Tracking (Core Pipeline, Two Phases)
The customer’s current requirement is T+1 with no Kafka infrastructure. Phase 1 should not rush into MSK — wait until Phase 3 when real-time features are actually needed. Below shows both the simplified version (Phase 1) and the target architecture (Phase 3+).
Phase 1: Simplified Version (Sufficient for T+1)
Client SDK → API Gateway (HTTP API) → Lambda (auth + enrichment) → Firehose → S3 ods_event
Characteristics:
- Fully serverless, minimal ops burden
- Monthly cost approximately $4K (API Gateway is the largest component)
- Drawback: Firehose delivers to a single destination (S3); going real-time later requires architectural changes
Phase 3+: Target Architecture (When Real-Time Features Go Live)
Client SDK
|
v
API Gateway (HTTP API, auth)
|
v
Lambda (enrichment: server_ts, geo, ip, app_ver)
|
v
Amazon MSK (Kafka, topic=events, 12 partitions, across 3 AZs)
|
|-- Consumer Group "offline" --> Firehose --> S3 ods_event (Parquet)
|-- Consumer Group "realtime" --> Managed Flink --> DynamoDB user_realtime_features
+-- Consumer Group "risk" --> Lambda fraud detection
Migration cost: Change the Lambda delivery target from Firehose to MSK; downstream attaches Firehose with Kafka source (supported since 2024 — MSK as Firehose source). The application layer requires no changes.
Key points:
- API Gateway uses HTTP API (approximately 70% cheaper than REST API)
- Lambda is essential (auth + enrichment + server_ts)
- MSK value: Multi-subscriber message bus — the same data is consumed independently by N downstream consumers + supports data replay
- Schema managed via Glue Schema Registry (essential once event tracking exceeds 50+ event types)
The timeline below illustrates the dual-path consumption:
The same click event: lands in DynamoDB within 500ms for recommendation serving, and lands in S3 within 60 seconds for model training.
Pipeline E: Third-Party Data
EventBridge Schedule (hourly / daily)
|
v
Lambda (calls third-party HTTP APIs)
|
v
S3 ods_3rd_channel/dt=.../*.json
Key points:
- Use EventBridge rather than cron or Lambda Scheduled — it is more standard
- Failures must retry; recommend small batches with idempotent writes
Layered Processing (Daily Routine)
Orchestrated by an MWAA Airflow DAG, running daily in the early morning:
02:00 Wait for Zero-ETL / DMS daily data completeness (sentinel task)
02:30 |- Glue Job: dwd_user_action_clean (cleanse + IP→geo + join user dim)
+- Glue Job: dwd_post_enrich (post + tags + engagement counts)
03:00 |- Athena CTAS: dws_user_daily (user daily metric aggregation)
|- Athena CTAS: dws_post_daily (post daily metrics)
+- Athena CTAS: dws_pair_interaction (user-pair interaction)
03:30 |- EMR Serverless: ads_user_features (user 100+ dimension feature wide table)
|- EMR Serverless: ads_post_features (content features)
|- EMR Serverless: ads_sample_follow (follow prediction samples, PIT-correct)
+- EMR Serverless: ads_recall_u2u_cf (collaborative filtering recall pool)
04:30 |- Glue Job: sync_user_features_to_ddb (write to DynamoDB)
+- Glue Job: sync_recall_pool_to_ddb (write to DynamoDB)
05:00 |- SageMaker Training Job: train_recall_two_tower
+- SageMaker Training Job: train_rank_lightgbm
06:00 Lambda: deploy SageMaker Endpoint (blue/green or canary)
06:30 DQ Check + report (success / failure → Slack / email)
Each step:
- Retries up to 2 times on failure
- If still failing → alert on-call engineer
- Overall DAG SLA: 7 hours
Data Asset Inventory
Below is the core Iceberg table inventory for the customer scenario:
ODS Layer (Source Mirrors)
| Table | Source | Primary Key | Partition |
|---|---|---|---|
| ods_user | Aurora users | user_id | dt |
| ods_post | Aurora posts | post_id | dt, hr |
| ods_follow | Aurora follows | (follower_id, followee_id) | dt |
| ods_doc_user | DocumentDB user_profile | user_id | dt |
| ods_doc_msg | DocumentDB messages | msg_id | dt, hr |
| ods_event | Event tracking (MSK→Firehose) | event_id | dt, hr, event_type |
DWD Layer (Detail — Cleansed and Enriched)
| Table | Description |
|---|---|
| dwd_user_action | Events + user dimensions + IP→geo |
| dwd_post | Post detail + tags + engagement counts |
| dwd_user_relation | Follow relationship slowly changing dimension (SCD) table |
DWS Layer (Summary — Light Aggregation)
| Table | Description |
|---|---|
| dws_user_daily | User daily: impressions, clicks, follows, dwell time |
| dws_post_daily | Post daily: impressions, clicks, likes, shares |
| dws_pair_interaction | User-pair interaction cumulative values (for U2U collaborative filtering) |
ADS Layer (Application / Data Mart)
| Table | Description | Purpose |
|---|---|---|
| ads_user_features | User 100+ dimension feature wide table | Recommendation features + online sync |
| ads_post_features | Content features | Recommendation features |
| ads_sample_follow | Follow prediction sample table (label + feature PIT snapshot) | Model training |
| ads_sample_ctr | Click-through rate prediction sample table | Model training |
| ads_recall_u2u_cf | Collaborative filtering recall pool (user → top-K candidates) | Online recall sync |
Key Design Decision Review
Looking back at the entire architecture, every decision has a rationale:
| Decision | Rationale |
|---|---|
| Use Lakehouse (S3 + Iceberg) instead of Redshift | ML-friendly + cost-effective + extensible |
| Aurora uses Zero-ETL instead of DMS | Native Aurora path, sub-second latency, zero ops |
| DocumentDB uses DMS | DocumentDB does not have Zero-ETL |
| Event tracking: API GW + MSK + Firehose | MSK multi-subscription supports both offline and real-time |
| ETL: EMR Serverless primary + Athena CTAS for lightweight jobs | Optimal cost combination |
| Orchestration: MWAA | Complex task dependencies; Airflow has strong expressiveness |
| Data format: Parquet + Iceberg | Columnar compression + ACID + Time Travel + PIT |
| Schema management: Glue Schema Registry | Essential for 50+ event types |
Original vs. Improved Architecture Comparison
| # | Customer’s Original Approach | Improved Approach | Improvement |
|---|---|---|---|
| 1 | MySQL → S3 (method unspecified) | Aurora Zero-ETL → Lakehouse | Recommended path, sub-second, zero ops |
| 2 | ES → S3 using DMS | OpenSearch Ingestion | DMS does not support ES as a source |
| 3 | DocumentDB → S3 | DMS + DocumentDB Change Streams | Must enable Change Streams first; note the cost impact |
| 4 | S3 + Athena data warehouse | Add Iceberg + Parquet + partitioning | Solves UPDATE / DELETE / PIT requirements |
| 5 | ODS → DWD layering | Explicit 4 layers + engine combination + Iceberg PIT | PIT correctness is mandatory for recommendation scenarios |
| 6 | API GW → Firehose → S3 | API GW → Lambda → MSK → Firehose | MSK supports multi-subscription; reserves capacity for real-time pipeline |
Chapter Summary
At this point, the data-side architecture is fully covered. You should now be able to:
- Draw the complete data pipeline diagram
- Explain the role of every service in each pipeline
- Articulate “why X was chosen over Y” for each decision
- Map everything to the customer’s actual data asset inventory
The next four chapters move into the ML side (recommendation system):
- Chapter 07: Recommendation system fundamentals (recall / ranking / features / models)
- Chapter 08: Online features and recall storage (DynamoDB / Redis / OpenSearch kNN / Neptune)
- Chapter 09: SageMaker and the ML platform (Feature Store / Training / Endpoint)
- Chapter 10: Complete end-to-end architecture + cost estimation