Why use Parquet instead of CSV or JSON for big data?

Parquet is a columnar format that only reads needed columns, supports predicate pushdown, and achieves 5-10x compression vs CSV. Analytics queries on Parquet scan far less data, reducing both time and cost.

What does Apache Iceberg add on top of Parquet files?

Iceberg adds ACID transactions, UPDATE/DELETE support, schema evolution, time travel, and partition evolution to files stored on S3 — turning a data lake into a lakehouse without moving data.

阅读中文版 →

Big Data on AWS Deep Dive (Part 2): S3, Parquet, and Apache Iceberg Explained

Master the storage foundation of modern data lakes — S3 object storage, Parquet columnar format, and how Iceberg adds ACID transactions to files on S3.

zhuermu · May 11, 2025 · 14 min

big-dataawss3parquetapache-icebergdata-lakecolumnar-storage

This chapter covers two foundational topics of data lakes in depth:

Where data lives (S3)

What format data takes (Parquet columnar storage + Iceberg table format)

These two decisions underpin every higher-level service — Athena, Glue, SageMaker, and beyond.

Amazon S3: The Bedrock of the Data Lake

What It Is

S3 (Simple Storage Service) is AWS’s earliest and most important service (launched 2006). At its core, it is a globally distributed object store: you upload a file of any size (up to 5 TB per object), assign it a key (path), and retrieve it later using that key.

s3://my-bucket/warehouse/ods/orders/dt=2026-05-10/part-0001.parquet
   |           |                            |             |
   bucket      path prefix                  partition dir  file

Key Properties (Why Data Lakes Choose S3)

Property	Details	What It Means for a Data Lake
11 nines durability	Probability of losing one object per year: ~0.000000001%	Data won’t disappear
Virtually unlimited scale	A single bucket can hold EB-scale data with automatic sharding	PB-scale data needs no manual sharding
Pay-per-use	You pay only for the GB stored	Archiving historical data is cheap
Strong consistency (since 2020)	A GET immediately after a PUT always returns the latest version	No stale-read surprises
Multiple storage classes (Standard / IA / Glacier)	Hot-to-cold tiering	Old data automatically moves to cheaper tiers
API-friendly	HTTP REST + AWS SDKs	Every engine can read and write

Storage Classes and Hot/Cold Tiering

S3 isn’t a single tier — it offers multiple storage classes with price and retrieval latency varying by orders of magnitude:

Storage Class	Price (us-east-1)	Retrieval Latency	Best For
S3 Standard	~$0.023/GB/mo	Milliseconds	Current hot data
S3 Intelligent-Tiering	~$0.023/GB/mo (auto-downtiers)	Milliseconds to minutes	Default recommendation when access patterns are unknown
S3 Standard-IA (Infrequent Access)	~$0.0125/GB/mo	Milliseconds	Occasional access
S3 Glacier Instant Retrieval	~$0.004/GB/mo	Milliseconds	Monthly access
S3 Glacier Flexible / Deep Archive	$0.0036 / $0.00099/GB/mo	Minutes to hours	Regulatory archives

Practical tip: Enable Intelligent-Tiering as the default for all warehouse buckets. Let S3 automatically move objects based on access frequency. One configuration change can save 30-50% annually.

S3 = “File System” or “Object Store”?

Many people instinctively treat S3 like a file system. S3 is not a file system — it is a key-value store that only supports whole-object reads and writes. You cannot append or modify a single byte in place.

This constraint drives many upstream design decisions:

Data lake files follow a write-once, read-many (WORM) pattern
Want to UPDATE a single row? You must rewrite the entire file — which is precisely why you need Iceberg to manage this (see the Iceberg section below)

About S3 Tables (GA December 2024, Evolving Through 2025-2026)

Standard S3: you can see all .parquet files in the console
S3 Tables: AWS-native “Iceberg table storage” — the console shows tables, not underlying files. AWS automatically handles compaction, expired snapshot cleanup, and metadata management

2025-2026 new capabilities:

Cross-region replication (DR)
Intelligent-Tiering support (automatic hot/cold tiering)
Bedrock Knowledge Bases can read S3 Tables directly for structured retrieval
Bidirectional sync with Glue Data Catalog

The Zero-ETL to SageMaker Lakehouse path in production architectures lands on S3 Tables.

File Formats: CSV vs JSON vs Parquet vs ORC

S3 is a file store — but what format lives inside those files matters enormously.

Candidate Formats

Format	Type	Storage Efficiency	Query Performance	Human Readability
CSV	Row-oriented text	Low	Low	Excellent (human-readable)
JSON / JSON Lines	Row-oriented text	Low	Low	Good
Avro	Row-oriented binary	Medium	Medium	Low
Parquet	Columnar binary	Very High	Very High	Low
ORC	Columnar binary	Very High	High	Low

Verdict: For data warehouses and data lakes, default to Parquet.

Why? Because analytical queries primarily select a few columns, scan massive rows, and aggregate — exactly what columnar storage excels at.

Row Store vs Column Store: Why OLAP Demands Columnar

Row vs Columnar

A Concrete Example

Imagine an events table with 100 million rows and 50 columns. You run:

SELECT city, SUM(amount) FROM events WHERE dt='2026-05-01' GROUP BY city;

Row store (CSV / MySQL):

Must read all 50 columns of every row just to extract city and amount
Even though you only need 2 columns, you read all 50 columns of data
I/O waste: 96%

Column store (Parquet):

The city column is stored contiguously, as is the amount column
Reading only city + amount saves 96% of I/O
Since all values in a column share the same type, compression ratios are extreme (e.g., city has many repeated values — gzip/snappy can compress to 1/10th)
Modern CPUs can use SIMD vectorized computation (processing 8 values per instruction)

Real-world benchmarks: same data, CSV 100 GB compresses to ~15 GB in Parquet with Snappy; the same query runs 5-20x faster on Athena with 80%+ cost reduction (billed per bytes scanned).

Parquet Internal Structure (Simplified)

+------------------------------------------+
|  File Header (PAR1)                      |
+------------------------------------------+
|  Row Group 1 (~128 MB of rows)           |
|    Column Chunk: user_id  [encoded data] |
|    Column Chunk: city     [encoded data] |
|    Column Chunk: amount   [encoded data] |
|  ...                                     |
+------------------------------------------+
|  Row Group 2                             |
|  ...                                     |
+------------------------------------------+
|  File Footer:                            |
|    schema                                |
|    min/max for each column chunk         | <-- predicate pushdown relies on this
|    compression and encoding info         |
+------------------------------------------+

Key design decisions:

Row Groups: Rows are split into ~128 MB groups; within each group data is stored columnar (balancing scan throughput and random access)
Per-column compression/encoding: Each column can use the optimal algorithm for its data characteristics (dictionary, RLE, bit-packing)
Column statistics: Each column chunk records min/max/null count, enabling the query engine to skip entire column chunks

Predicate Pushdown

This is the killer feature of columnar formats. Consider:

SELECT * FROM events WHERE user_id = 99999;

When the execution engine reads a Parquet file:

It reads the footer and sees that this file’s user_id range is [100000, 200000]
The entire file is skipped — not a single byte of row data is read

Similarly:

File-level min/max can skip entire files
Row Group-level min/max can skip entire row groups
Page-level min/max can skip pages

After layer upon layer of skipping, the engine may physically read only 1% of the data.

Parquet Best Practices (Must-Do)

Target file size: 128 MB to 512 MB
- Too small: excessive file count, metadata overhead, slow queries
- Too large: poor parallelism
Partition by dt (date)
- Path: .../events/dt=2026-05-10/part-001.parquet
- WHERE dt='2026-05-10' directly targets the directory, skipping all other dates
Secondary partition by business dimension (e.g., event_type, app_id), but keep cardinality low (fewer than a few thousand) — otherwise you get a small-file explosion
Run periodic compaction to merge small files into larger ones (Glue’s built-in jobs or Iceberg’s OPTIMIZE command)

Table Formats: Turning an S3 Folder into a Database

So far, we’ve made the data lake cheap and fast. But there’s one critical gap: S3 files don’t support UPDATE or DELETE.

Why is this a big deal? Consider real-world scenarios:

The ODS layer receives MySQL CDC events requiring UPSERT (same user_id arriving means update the record)
Business requirements demand “delete all data for a given user” (GDPR / data privacy regulations)
The DWD layer needs data backfill or bug-fix rewrites of specific partitions

With only S3 + Parquet, all of these require rewriting entire partitions — cost and complexity explode.

Solution: Add a table format layer on top of S3 files. Three candidates exist:

Table Format	Created By	AWS Integration	Key Characteristic
Apache Iceberg	Netflix, then Apache	Native AWS first-class support	Rigorous design, painless schema evolution
Apache Hudi	Uber, then Apache	Good	Write-friendly (Merge-on-Read)
Delta Lake	Databricks	Limited	Best in Databricks ecosystem; open-source version has fewer features

Verdict: On AWS, choose Iceberg. Athena, Glue, EMR, Redshift Spectrum, and SageMaker all support it natively.

Apache Iceberg Deep Dive

Iceberg Internals

Iceberg’s Layered Metadata Architecture

Iceberg’s key design: on top of data files, it adds a Catalog pointer + three layers of metadata files, each stored as objects on S3.

Catalog (Glue)                       <-- Layer 0: mutable pointer
    |
    +--points to-->  metadata.json (version v3)       <-- Layer 1: table metadata (schema, partition spec, snapshot list)
                |
                +--points to-->  manifest list      <-- Layer 2: which manifests compose the snapshot
                              |
                              +--points to-->  manifest        <-- Layer 3: min/max + path for each data file
                                          |
                                          +--points to-->  data.parquet (actual data)

Each write operation:

Writes new Parquet data files
Writes a new manifest registering those files
Writes a new snapshot referencing the batch of manifests
Writes a new metadata.json pointing the current snapshot to the new version
Updates the Catalog (Glue) pointer to the new metadata.json

The entire process is atomic (the final step is a single KV write) — this is how Iceberg achieves ACID guarantees.

How UPDATE / DELETE Works

Iceberg provides two strategies:

Copy on Write (COW) — the default strategy:

Updating one row means reading the entire Parquet file containing that row, modifying it, writing a new file, and marking the old file as obsolete
Slow writes, fast reads

Merge on Read (MOR):

Updating one row means writing a delete file (“this row is deleted”) plus a new data file (containing the updated row)
Fast writes, reads require merge
Suitable for high-frequency update scenarios, but requires periodic compaction

Time Travel (A Critical Capability)

Every write generates a snapshot. Files referenced by old snapshots are not immediately deleted (retention period is configurable, default 5 days).

-- Query the table at a specific point in time
SELECT * FROM ads_user_features 
FOR TIMESTAMP AS OF '2026-05-01 00:00:00';

-- Query a specific snapshot version
SELECT * FROM ads_user_features 
FOR VERSION AS OF 2934856;

-- Roll back after an accidental delete
ALTER TABLE ads_user_features 
EXECUTE rollback_to_snapshot(2934856);

Why Recommendation Systems Require Iceberg: Point-in-Time Correctness

Point-in-Time (PIT) correctness is the most common pitfall in recommendation system feature engineering.

The problem: Training samples must use feature values as they existed at the moment the event occurred, not the latest values.

Example:

User A clicked a video on May 1 (a positive sample)
On May 1, User A’s interest tag was “Food”
On May 5, User A’s interest tag was updated to “Travel” (by online learning)
On May 6, you train the model and fetch User A’s tag from ads_user_features — you get “Travel”
Training the “clicked a food video” sample with “Travel” as a feature is a feature leakage — the model learns incorrect patterns

Common Misconceptions About Iceberg Time Travel

Many articles suggest:

-- This syntax is NOT valid (Athena/Spark/Trino all reject it)
SELECT ... 
FROM ads_user_features FOR TIMESTAMP AS OF s.event_ts
JOIN ads_sample s ON ...

Reality: In the Iceberg / SQL:2011 specification, FOR TIMESTAMP AS OF only accepts literal constants or bind parameters — not column references. Iceberg Time Travel cannot perform row-level PIT joins — it’s designed for “rewind the entire table to a single point in time.”

Three Correct PIT Implementations

Approach A: Daily Feature Snapshot Partitions (Recommended, Most Common)

Design ads_user_features as a daily-partitioned table with a full snapshot each day:

-- Training sample table contains (user_id, item_id, event_ts, event_dt, label)
SELECT s.label, u.tag, u.age
FROM   ads_sample_follow s
JOIN   ads_user_features_daily u
       ON u.user_id = s.user_id
       AND u.dt    = s.event_dt;   -- align with the day the event occurred

Keep N days of daily partitions (use Iceberg’s expire_snapshots + partition retention to control costs).

Approach B: Slowly Changing Dimension Type 2 (Accurate to the Second)

-- ads_user_features_history(user_id, tag, age, valid_from, valid_to)
-- Each feature change inserts a new row
SELECT s.label, u.tag
FROM   ads_sample_follow s
JOIN   ads_user_features_history u
       ON s.user_id = u.user_id
       AND s.event_ts >= u.valid_from
       AND s.event_ts <  u.valid_to;

Approach C: SageMaker Feature Store

Feature Store provides a built-in PIT retrieval API (get_record(record_id, event_time)). Under the hood, it uses Iceberg + event-time indexing — AWS handles the complexity of approaches A/B for you.

What Iceberg Time Travel Actually Excels At

Although it cannot do row-level PIT joins, Time Travel is extremely useful for:

Data rollback: rollback_to_snapshot after accidental UPDATEs or DELETEs
Table-level auditing: Compare “the table at midnight yesterday” vs “the table at midnight today”
Reproducible training: Pin a snapshot; six months later you can still produce the exact same training dataset

SELECT * FROM ads_user_features 
FOR TIMESTAMP AS OF TIMESTAMP '2026-05-01 00:00:00';   -- literal constant

SELECT * FROM ads_user_features 
FOR VERSION AS OF 2934856;                              -- snapshot id

Schema Evolution

Adding columns, renaming columns, reordering columns — Iceberg handles all of these without rewriting existing data:

ALTER TABLE events ADD COLUMN device_id STRING;             -- add column
ALTER TABLE events RENAME COLUMN ip TO client_ip;            -- rename
ALTER TABLE events ALTER COLUMN amount TYPE DECIMAL(20,4);   -- widen type (compatible direction)

Old Parquet files remain untouched. The new column reads as NULL from old files, and the renamed column continues to work. This was impossible in the Hive-table era.

Practical Usage

-- Create an Iceberg table in Athena
CREATE TABLE poc_social_layla.ods_event (
  event_id   STRING,
  user_id    BIGINT,
  event_type STRING,
  event_ts   TIMESTAMP,
  payload    STRING,
  dt         STRING
)
PARTITIONED BY (dt)
LOCATION 's3://my-bucket/warehouse/ods/event/'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',
  'format'     = 'parquet',
  'write_compression' = 'snappy'
);

-- UPDATE / DELETE / MERGE just like a traditional database
UPDATE ods_event SET event_type = 'view' WHERE event_type = 'expo';

DELETE FROM ods_event WHERE user_id = 99 AND dt = '2026-05-10';

MERGE INTO ods_event t
USING staging_event s ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN INSERT VALUES (s.*);

Physical Directory Layout: A Production Architecture

s3://my-bucket/warehouse/
+-- ods/
|   +-- ods_user/        <-- mirror of MySQL users table (CDC)
|   +-- ods_event/       <-- raw event stream (Firehose landing)
|   +-- ods_post/        <-- MySQL posts table
+-- dwd/
|   +-- dwd_user_action/ <-- events joined with user/IP dimensions
|   +-- dwd_post/        <-- enriched post details
+-- dws/
|   +-- dws_user_daily/  <-- daily aggregated user metrics
+-- ads/
|   +-- ads_user_features/    <-- recommendation feature wide table
|   +-- ads_sample_follow/    <-- follow-event training samples
|   +-- ads_recall_u2u_cf/    <-- collaborative filtering recall pool
+-- athena-results/      <-- Athena query result staging

Every directory contains an Iceberg table. Each table’s metadata files are registered in the Glue Data Catalog.

Chapter Summary

Concept	One-Liner
S3	The physical foundation of the data lake — virtually unlimited capacity, pay-per-GB, 11 nines durability
Row store vs Column store	OLAP demands columnar: 80%+ I/O savings, superior compression
Parquet	The standard file format for AWS data lakes
Predicate pushdown	Parquet column statistics let engines skip entire files — critical for performance
Iceberg	A “table format” layer on top of S3 files that adds ACID, UPDATE/DELETE, and Time Travel
PIT correctness	Recommendation systems must use daily snapshot partitions or SCD Type 2 to avoid feature leakage

Next up: how data moves from source systems into S3.

References

Apache Parquet — Apache
Apache ORC — Apache
Apache Avro — Apache
Apache Iceberg — Apache