
Apache Iceberg Explained: The Open Table Format That Won

Written by 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.


Apache Iceberg is an open table format that makes Parquet files in object storage behave like a database table — with ACID transactions, schema evolution, hidden partitioning, and time travel. It is the format that Snowflake, BigQuery, Databricks, Trino, and DuckDB all read and write, making it the de facto open standard for lakehouses in 2026.

This guide explains how Iceberg works under the hood, why it won the table format war, how to run it in production without breaking your ingestion pipelines, and the maintenance operations — compaction, expiration, vacuum — that every Iceberg deployment has to get right eventually.

What Is Apache Iceberg?

Iceberg is a metadata layer on top of Parquet or ORC files. The table is a tree of metadata files (snapshots, manifests, manifest lists) that point to the actual data files. Every write produces a new snapshot, making the table ACID-compliant without locking and enabling time travel queries against any historical snapshot.

The project started at Netflix in 2017 to solve Hive's limitations: Hive tracked table state through directory listings, which could not express partitioning beyond the directory layout, could not evolve schemas safely, and could not support atomic writes. Iceberg replaced the directory listing with explicit metadata files and unlocked the rest. It became an Apache top-level project in 2020.

How Iceberg Works

An Iceberg table is a pointer to the current metadata file. That file lists snapshots; each snapshot points to a manifest list; each manifest list references manifest files; each manifest lists data files with per-column statistics. When you query, the engine reads the metadata, prunes files using those statistics, and reads only the Parquet files that might contain matching rows. This is why Iceberg queries are fast even at petabyte scale.
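The statistics-based pruning can be sketched in plain Python. This is a toy model, not the Iceberg library: the DataFile class, the manifest contents, and the single `ts` column are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_ts: int  # per-column min recorded in the manifest
    max_ts: int  # per-column max recorded in the manifest

# Toy manifest: three Parquet files with min/max stats for a "ts" column.
manifest = [
    DataFile("s3://bucket/a.parquet", min_ts=100, max_ts=199),
    DataFile("s3://bucket/b.parquet", min_ts=200, max_ts=299),
    DataFile("s3://bucket/c.parquet", min_ts=300, max_ts=399),
]

def prune(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate ts BETWEEN lo AND hi."""
    return [f for f in manifest if f.max_ts >= lo and f.min_ts <= hi]

# A query for ts BETWEEN 250 AND 320 only needs to open two of the three files.
candidates = prune(manifest, 250, 320)
print([f.path for f in candidates])  # a.parquet is skipped without ever being read
```

Real manifests carry min/max values, null counts, and value counts for every column, so the same overlap check applies to any predicate the engine can push down.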

Writes follow the same pattern in reverse. A writer stages new data files, then atomically updates the metadata pointer to a new snapshot that references them. Readers querying during the write see the old snapshot until the swap completes, giving you serializable isolation without locks.
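The commit protocol is a compare-and-swap on the metadata pointer. A minimal sketch of the idea, assuming a toy in-memory catalog rather than the real Iceberg commit path:

```python
import threading

class Catalog:
    """Toy catalog: the table is just a pointer to the current snapshot metadata."""
    def __init__(self, snapshot):
        self._lock = threading.Lock()
        self._current = snapshot

    def current(self):
        return self._current

    def commit(self, expected, new):
        # Compare-and-swap: succeed only if nobody committed since we read `expected`.
        with self._lock:
            if self._current is not expected:
                return False  # a concurrent writer won; caller retries on fresh metadata
            self._current = new
            return True

snap1 = {"id": 1, "files": ["a.parquet"]}
catalog = Catalog(snap1)

# A writer stages new data files, then swaps the pointer to a new snapshot.
base = catalog.current()
snap2 = {"id": 2, "files": base["files"] + ["b.parquet"]}
assert catalog.commit(base, snap2)

# A writer holding a stale base fails the swap and must retry against snapshot 2.
assert not catalog.commit(base, {"id": 3, "files": ["c.parquet"]})
```

Readers never take the lock at all: they grab whatever snapshot the pointer referenced when their query started and read a consistent view from it.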

The Features That Matter

  • Hidden partitioning — partition by transform (bucket, month) without exposing partition columns to queries
  • Schema evolution — add, drop, rename, and reorder columns without rewriting data
  • Time travel — query any historical snapshot by timestamp or ID
  • ACID transactions — writes are atomic snapshots, readers never see partial state
  • Partition evolution — change the partition scheme without rewriting the table
  • Row-level operations — delete, update, and merge without full table rewrites
  • Column-level statistics — min/max per file enable aggressive pruning

The Catalog Layer

Iceberg tables need a catalog to track the current metadata pointer. The original catalogs were Hive Metastore and AWS Glue; 2026's default is the Iceberg REST Catalog spec, implemented by Snowflake Polaris, Databricks Unity, Tabular (acquired by Databricks), Nessie, and AWS S3 Tables. REST catalogs unlock multi-engine reads because any engine can federate through the same standard API.

Catalog choice is the highest-leverage decision in an Iceberg rollout. Polaris, Unity, and Nessie each have different governance, access control, and multi-tenancy models — pick the one that matches your organization, not the one your vendor pushes.

Iceberg vs the Alternatives

Feature           | Iceberg                 | Delta Lake                        | Hudi
Governance model  | Apache (vendor-neutral) | Linux Foundation (Databricks-led) | Apache (Uber-led)
Engine coverage   | Widest                  | Spark-first                       | Spark-first
Streaming upserts | Good                    | Good                              | Best
Default in 2026   | Yes                     | On Databricks only                | Niche

For the deep comparison, see Iceberg vs Delta vs Hudi.

Running Iceberg in Production

The operational concerns are compaction, snapshot expiration, and orphan file cleanup. Small files accumulate on write-heavy tables; regular compaction rewrites them into larger files. Snapshots accumulate forever unless you expire them. Orphan files (data files no snapshot references) waste storage unless you vacuum.
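In Spark these maintenance tasks run as Iceberg's built-in stored procedures (rewrite_data_files, expire_snapshots, remove_orphan_files). The bookkeeping they perform can be sketched in plain Python; the snapshot structures and file names below are illustrative, not the Iceberg API:

```python
snapshots = [
    {"id": 1, "ts": 100, "files": {"a.parquet"}},
    {"id": 2, "ts": 200, "files": {"a.parquet", "b.parquet"}},
    {"id": 3, "ts": 300, "files": {"b.parquet", "c.parquet"}},
]
in_storage = {"a.parquet", "b.parquet", "c.parquet", "tmp-failed-write.parquet"}

def expire_snapshots(snapshots, older_than):
    # Drop snapshots past the retention cutoff, but always keep the current one.
    current = snapshots[-1]
    return [s for s in snapshots if s["ts"] >= older_than or s is current]

def orphan_files(snapshots, in_storage):
    # Orphans: files sitting in object storage that no retained snapshot references
    # (e.g. leftovers from failed writes). Vacuum deletes these.
    referenced = set().union(*(s["files"] for s in snapshots))
    return in_storage - referenced

kept = expire_snapshots(snapshots, older_than=150)
print([s["id"] for s in kept])              # [2, 3]
print(orphan_files(kept, in_storage))       # {'tmp-failed-write.parquet'}
```

Note the coupling: expiring snapshots is what makes old data files deletable, so expiration and orphan cleanup have to run in that order, and time travel only reaches snapshots you have not yet expired.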

Most teams automate these with scheduled jobs. Data Workers' pipeline agent runs compaction, expiration, and vacuum on a schedule without manual scripting. See autonomous data engineering.

Migration From Hive

Most Iceberg adoption in 2026 is migration away from Hive-partitioned Parquet. Iceberg supports two migration paths: ADD FILES (wraps existing files into a new Iceberg table) and CREATE TABLE AS SELECT (rewrites into a fresh Iceberg layout). ADD FILES is fast and non-destructive; CTAS gives you optimized file sizes and partitioning but costs a full rewrite.

The typical migration plan is to ADD FILES first, validate queries against the new Iceberg table, cut readers over, and then backfill to CTAS on a schedule to pick up Iceberg-native optimizations. This lets you unlock ACID semantics immediately without paying for a full rewrite upfront.
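The cost difference between the two paths comes down to bytes rewritten. A toy sketch of that trade-off, where the file list, sizes, and target file size are all illustrative:

```python
existing = {"dt=2026-01-01/part-0.parquet": 3,   # path -> size in MB
            "dt=2026-01-01/part-1.parquet": 5}

def add_files(existing):
    # ADD FILES: register the files as-is in a new snapshot; nothing is rewritten.
    return {"files": sorted(existing), "mb_rewritten": 0}

def ctas(existing, target_mb=128):
    # CTAS: read everything and rewrite into right-sized files; full rewrite cost.
    total = sum(existing.values())
    n = max(1, -(-total // target_mb))  # ceiling division
    return {"files": [f"part-{i}.parquet" for i in range(n)], "mb_rewritten": total}

print(add_files(existing)["mb_rewritten"])  # 0 (metadata-only operation)
print(ctas(existing)["mb_rewritten"])       # 8 (pays for a full rewrite)
```

At the petabyte scale where Iceberg migrations usually happen, that zero-rewrite property is why ADD FILES comes first and the CTAS backfill is scheduled rather than done up front.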

Iceberg on the Major Engines

Snowflake reads and writes Iceberg natively via Polaris or external catalogs; Snowflake-managed Iceberg tables get the same performance as Snowflake-native tables. BigQuery reads Iceberg through BigLake and writes via BigQuery Iceberg tables. Databricks reads via Delta UniForm or external Iceberg tables through Unity Catalog. Trino and Starburst are the leading query engines for federated Iceberg deployments.

DuckDB support landed in 2024 and makes Iceberg queryable from a laptop without a cluster — a game changer for analytics engineers who want to prototype on production data. The ecosystem has effectively made Iceberg the one format every tool has to support, and newer engines like ClickHouse, Apache Doris, and StarRocks all ship with Iceberg connectors now.

The engine coverage matters because it gives you optionality. A team that starts on Spark and Iceberg can add Snowflake for BI workloads later without re-ingesting data, or spin up DuckDB for quick exploration on the same tables. That flexibility is impossible with proprietary formats and hard with Delta outside of Databricks.

Getting Started

Create a catalog (REST, Glue, or Nessie), point Spark or Trino at it, and write your first table with CREATE TABLE. Migrate existing Parquet datasets via ADD FILES. Validate with the table's history metadata table and a time-travel query (SELECT ... FOR TIMESTAMP AS OF). Book a demo to see the full lifecycle managed by agents.

Apache Iceberg is the open table format that won the standards war. Its metadata layer gives object storage the ACID semantics of a real warehouse, its REST catalog gives you true multi-engine portability, and its operations model scales to petabytes. If you are building a lakehouse in 2026, start with Iceberg.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
