Big Data on AWS Deep Dive (Part 6): End-to-End Data Pipeline — From Source to Feature Store

Connect all the dots: trace a click event from client SDK through API Gateway, MSK, Firehose, S3, warehouse layers (ODS→DWD→DWS→ADS), to DynamoDB for real-time serving.

zhuermu · · 10 min
big-dataawsdata-pipelineend-to-endevent-drivendata-warehouse-layers

Chapters 03 through 05 covered each service in detail. This chapter connects them all together, mapping the architecture to a real-world social app use case.

No new services are introduced here. The goal is: after reading this chapter, you can explain the entire data-side architecture using a single diagram.

Customer Scenario Recap

Key facts from the customer’s solution document:

  • Business: Social app targeting user recommendations (follow / feed / People You May Know)
  • Data sources: MySQL (business database), ES (search), DocumentDB (documents), client-side event tracking
  • DAU: 1 million+
  • DocumentDB version: 5.0 (supports Change Streams)
  • Latency requirements: T+1 (offline-first; real-time considered for a later phase)
  • Existing Kafka: None (must be built from scratch)

End-to-End Architecture Diagram

End to End

The entire pipeline is divided into 5 layers:

LayerRole
Layer 1: Data SourcesAurora MySQL / DocumentDB / OpenSearch / Client-side events / Third-party data
Layer 2: Ingestion ChannelsAurora Zero-ETL / DMS / OpenSearch Ingestion / API GW + MSK + Firehose / EventBridge
Layer 3: S3 + Iceberg ODS14+ Iceberg tables, grouped by data source
Layer 4: Layered ProcessingDWD / DWS / ADS, orchestrated by MWAA
Layer 5: Downstream ConsumersBI / ML training / Online layer sync / Real-time pipeline

Pipeline Decomposition: 5 Independent Data Flows

Breaking the entire pipeline into 5 relatively independent sub-pipelines makes it easier to understand:

Pipeline A: Business Database CDC (Aurora to Data Lake)

Aurora MySQL (orders, users, posts, follows)
     | binlog
     v
Aurora Zero-ETL to SageMaker Lakehouse  (sub-second latency)
     | (AWS-managed, zero ops)
     v
S3 Tables (Iceberg) + auto-registered in Glue Catalog
     |
     v
ods_user / ods_post / ods_follow / ods_order

Key points:

  • Choose Zero-ETL over traditional DMS (Zero-ETL is the recommended path for Aurora)
  • Data lands in S3 Tables (AWS-native Iceberg storage) — you cannot see underlying files in the console, but Athena/SageMaker can read them directly
  • Glue Catalog auto-registers tables; schema evolves with the source

Pipeline B: DocumentDB CDC

DocumentDB 5.0 (Change Streams enabled)
     | change streams
     v
DMS Replication Task (source = DocDB, target = S3)
     |
     v
S3 dms-raw/docdb/<collection>/...parquet
     |
     v
Glue Job hourly MERGE → ods_doc_user / ods_doc_msg (Iceberg)

Key points:

  • DocumentDB must be version 4.0+ with Change Streams enabled (increases source DB I/O)
  • DMS output is raw Parquet, not Iceberg — a Glue Job is needed to merge into Iceberg tables
  • The s3://.../dms-raw/ directory is a staging area; production tables live in s3://.../warehouse/ods/

Pipeline C: OpenSearch to S3

OpenSearch Service (managed)
     | scroll API / PIT
     v
OpenSearch Ingestion Pipeline (yaml)
     |
     v
S3 ods_es_search/ (Parquet)

Key points:

  • First confirm necessity — if ES only contains a search replica of MySQL data, ingesting from MySQL is more direct
  • DMS does not support ES as a source
  • For self-managed ES, use Logstash + S3 output instead

Pipeline D: Event Tracking (Core Pipeline, Two Phases)

The customer’s current requirement is T+1 with no Kafka infrastructure. Phase 1 should not rush into MSK — wait until Phase 3 when real-time features are actually needed. Below shows both the simplified version (Phase 1) and the target architecture (Phase 3+).

Phase 1: Simplified Version (Sufficient for T+1)

Client SDK → API Gateway (HTTP API) → Lambda (auth + enrichment) → Firehose → S3 ods_event

Characteristics:

  • Fully serverless, minimal ops burden
  • Monthly cost approximately $4K (API Gateway is the largest component)
  • Drawback: Firehose delivers to a single destination (S3); going real-time later requires architectural changes

Phase 3+: Target Architecture (When Real-Time Features Go Live)

Client SDK
     |
     v
API Gateway (HTTP API, auth)
     |
     v
Lambda (enrichment: server_ts, geo, ip, app_ver)
     |
     v
Amazon MSK (Kafka, topic=events, 12 partitions, across 3 AZs)
     |
     |-- Consumer Group "offline"  --> Firehose --> S3 ods_event (Parquet)
     |-- Consumer Group "realtime" --> Managed Flink --> DynamoDB user_realtime_features
     +-- Consumer Group "risk"     --> Lambda fraud detection

Migration cost: Change the Lambda delivery target from Firehose to MSK; downstream attaches Firehose with Kafka source (supported since 2024 — MSK as Firehose source). The application layer requires no changes.

Key points:

  • API Gateway uses HTTP API (approximately 70% cheaper than REST API)
  • Lambda is essential (auth + enrichment + server_ts)
  • MSK value: Multi-subscriber message bus — the same data is consumed independently by N downstream consumers + supports data replay
  • Schema managed via Glue Schema Registry (essential once event tracking exceeds 50+ event types)

The timeline below illustrates the dual-path consumption:

Timeline

The same click event: lands in DynamoDB within 500ms for recommendation serving, and lands in S3 within 60 seconds for model training.

Pipeline E: Third-Party Data

EventBridge Schedule (hourly / daily)
     |
     v
Lambda (calls third-party HTTP APIs)
     |
     v
S3 ods_3rd_channel/dt=.../*.json

Key points:

  • Use EventBridge rather than cron or Lambda Scheduled — it is more standard
  • Failures must retry; recommend small batches with idempotent writes

Layered Processing (Daily Routine)

Orchestration

Orchestrated by an MWAA Airflow DAG, running daily in the early morning:

02:00 Wait for Zero-ETL / DMS daily data completeness (sentinel task)
02:30 |- Glue Job: dwd_user_action_clean       (cleanse + IP→geo + join user dim)
      +- Glue Job: dwd_post_enrich              (post + tags + engagement counts)
03:00 |- Athena CTAS: dws_user_daily            (user daily metric aggregation)
      |- Athena CTAS: dws_post_daily            (post daily metrics)
      +- Athena CTAS: dws_pair_interaction      (user-pair interaction)
03:30 |- EMR Serverless: ads_user_features     (user 100+ dimension feature wide table)
      |- EMR Serverless: ads_post_features     (content features)
      |- EMR Serverless: ads_sample_follow     (follow prediction samples, PIT-correct)
      +- EMR Serverless: ads_recall_u2u_cf     (collaborative filtering recall pool)
04:30 |- Glue Job: sync_user_features_to_ddb   (write to DynamoDB)
      +- Glue Job: sync_recall_pool_to_ddb     (write to DynamoDB)
05:00 |- SageMaker Training Job: train_recall_two_tower
      +- SageMaker Training Job: train_rank_lightgbm
06:00 Lambda: deploy SageMaker Endpoint (blue/green or canary)
06:30 DQ Check + report (success / failure → Slack / email)

Each step:

  • Retries up to 2 times on failure
  • If still failing → alert on-call engineer
  • Overall DAG SLA: 7 hours

Data Asset Inventory

Below is the core Iceberg table inventory for the customer scenario:

ODS Layer (Source Mirrors)

TableSourcePrimary KeyPartition
ods_userAurora usersuser_iddt
ods_postAurora postspost_iddt, hr
ods_followAurora follows(follower_id, followee_id)dt
ods_doc_userDocumentDB user_profileuser_iddt
ods_doc_msgDocumentDB messagesmsg_iddt, hr
ods_eventEvent tracking (MSK→Firehose)event_iddt, hr, event_type

DWD Layer (Detail — Cleansed and Enriched)

TableDescription
dwd_user_actionEvents + user dimensions + IP→geo
dwd_postPost detail + tags + engagement counts
dwd_user_relationFollow relationship slowly changing dimension (SCD) table

DWS Layer (Summary — Light Aggregation)

TableDescription
dws_user_dailyUser daily: impressions, clicks, follows, dwell time
dws_post_dailyPost daily: impressions, clicks, likes, shares
dws_pair_interactionUser-pair interaction cumulative values (for U2U collaborative filtering)

ADS Layer (Application / Data Mart)

TableDescriptionPurpose
ads_user_featuresUser 100+ dimension feature wide tableRecommendation features + online sync
ads_post_featuresContent featuresRecommendation features
ads_sample_followFollow prediction sample table (label + feature PIT snapshot)Model training
ads_sample_ctrClick-through rate prediction sample tableModel training
ads_recall_u2u_cfCollaborative filtering recall pool (user → top-K candidates)Online recall sync

Key Design Decision Review

Looking back at the entire architecture, every decision has a rationale:

DecisionRationale
Use Lakehouse (S3 + Iceberg) instead of RedshiftML-friendly + cost-effective + extensible
Aurora uses Zero-ETL instead of DMSNative Aurora path, sub-second latency, zero ops
DocumentDB uses DMSDocumentDB does not have Zero-ETL
Event tracking: API GW + MSK + FirehoseMSK multi-subscription supports both offline and real-time
ETL: EMR Serverless primary + Athena CTAS for lightweight jobsOptimal cost combination
Orchestration: MWAAComplex task dependencies; Airflow has strong expressiveness
Data format: Parquet + IcebergColumnar compression + ACID + Time Travel + PIT
Schema management: Glue Schema RegistryEssential for 50+ event types

Original vs. Improved Architecture Comparison

#Customer’s Original ApproachImproved ApproachImprovement
1MySQL → S3 (method unspecified)Aurora Zero-ETL → LakehouseRecommended path, sub-second, zero ops
2ES → S3 using DMSOpenSearch IngestionDMS does not support ES as a source
3DocumentDB → S3DMS + DocumentDB Change StreamsMust enable Change Streams first; note the cost impact
4S3 + Athena data warehouseAdd Iceberg + Parquet + partitioningSolves UPDATE / DELETE / PIT requirements
5ODS → DWD layeringExplicit 4 layers + engine combination + Iceberg PITPIT correctness is mandatory for recommendation scenarios
6API GW → Firehose → S3API GW → Lambda → MSK → FirehoseMSK supports multi-subscription; reserves capacity for real-time pipeline

Chapter Summary

At this point, the data-side architecture is fully covered. You should now be able to:

  • Draw the complete data pipeline diagram
  • Explain the role of every service in each pipeline
  • Articulate “why X was chosen over Y” for each decision
  • Map everything to the customer’s actual data asset inventory

The next four chapters move into the ML side (recommendation system):

  • Chapter 07: Recommendation system fundamentals (recall / ranking / features / models)
  • Chapter 08: Online features and recall storage (DynamoDB / Redis / OpenSearch kNN / Neptune)
  • Chapter 09: SageMaker and the ML platform (Feature Store / Training / Endpoint)
  • Chapter 10: Complete end-to-end architecture + cost estimation