Building an Open, Multi-Engine Data Lakehouse with S3 and Python

by bradhe on 2/18/25, 5:33 PM with 11 comments
by tomnicholas1 on 2/19/25, 5:27 AM

This entire stack also now exists for arrays as well as for tabular data. It's still S3 for storage, but Zarr instead of Parquet, Icechunk instead of Iceberg, and Xarray for queries in Python.

by datancoffee on 2/18/25, 9:52 PM

Python support of Iceberg seems to be the biggest unrealized opportunity right now. SQL support seems to be in good shape, with DuckDB and such, but Python support is still quite nascent.

by dogman123 on 2/18/25, 8:03 PM

I'm working on a project to do this with Iceberg and SQLMesh executed via Airflow at my job. SQLMesh seems really promising. I investigated multi-engine execution in dbt, and it seems like you need to pay a lot of $$$ for it (multi-engine execution requires multiple dbt projects); it is not included in dbt Core.

by teleforce on 2/19/25, 4:44 PM

This article is about building an open data lakehouse with the new open table format, namely Iceberg.

For building a single-engine, AWS-based data lakehouse, you can refer to this article [1], or just use Amazon SageMaker, which also supports Iceberg.

Fun Amazon AWS data storage dictionary:

S3: Data Lake

Glacier: Archival Storage

DocumentDB: NoSQL Document Database, à la MongoDB

DynamoDB: NoSQL Key-Value and Wide-Column Database

RDS: SQL Database

Timestream: Time-Series Database

Neptune: Graph Database

Redshift: Data Warehouse

SageMaker: Data Lakehouse

Islander: Data Mesh (okay kidding, just made this up)

[1] Build a Lake House Architecture on AWS:

https://aws.amazon.com/blogs/big-data/build-a-lake-house-arc...

by AdityaPophale on 2/20/25, 5:19 AM

I am getting

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

when trying to run:

aws s3 ls s3://mango-public-data/lakehouse-snapshots/peach-lake --recursive