- data warehouse: schema on write. you have to know the end form before you load it. breaks every time the upstream schema changes (which, in this world, is a lot)
- data lake: schema on read. load everything into S3 and deal with it later. the Mongo of data platforms
- data lakehouse: something in between. store everything loosely like a lake, but have in-lakehouse processes present user-friendly transforms or views like a warehouse. made possible by cheap storage (parquet on S3); reduces schema breakage by keeping both sides of the T (the transform in ETL/ELT) in the same place. see the schema-on-read sketch just after this list
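A minimal sketch of the schema-on-read idea, using DuckDB over a plain Parquet file. The file name and columns are made up for illustration:

```python
# Schema on read: files in the lake carry their own schema, and the query
# engine discovers it at query time. File/column names are hypothetical.
import duckdb
import pandas as pd

# Some process dumped a raw file into the lake -- no table was declared anywhere.
pd.DataFrame({"order_id": [1, 2, 3], "amount": [50.0, 120.0, 300.0]}) \
    .to_parquet("orders.parquet")

# DuckDB reads the schema out of the Parquet footer when the query runs.
# A warehouse would have required CREATE TABLE + a load job first (schema on write).
duckdb.sql("""
    SELECT order_id, amount
    FROM read_parquet('orders.parquet')  -- could just as well be an s3:// glob
    WHERE amount > 100
""").show()
```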
I am working on some blogs / videos that will hopefully help clarify the differences. I'm working on a Delta Lake vs Parquet blog post right now, and I gave a "5 Reasons Parquet Files Are Better Than CSV" talk last year: https://youtu.be/9LYYOdIwQXg
Most of the content I've seen in this area is really high-level. I'm trying to write posts that are a bit more concrete, with code snippets, high-level benchmarks, etc. Hopefully this will help.
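For example, here's a rough sketch of the kind of CSV vs Parquet comparison I mean. The dataset is made up and the numbers you'd see are illustrative, not benchmark results:

```python
# Rough sketch: compare file size and single-column read time, CSV vs Parquet.
# Requires pandas + pyarrow; data is synthetic, numbers are illustrative only.
import os
import time

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "order_id": np.arange(n),
    "amount": np.random.rand(n) * 100,
    "region": np.random.choice(["us-east", "us-west", "eu"], n),
})

df.to_csv("orders.csv", index=False)
df.to_parquet("orders.parquet")  # columnar, typed, compressed

print("csv bytes:    ", os.path.getsize("orders.csv"))
print("parquet bytes:", os.path.getsize("orders.parquet"))

# Column pruning: Parquet lets the reader pull just the column it needs.
t0 = time.perf_counter()
pd.read_parquet("orders.parquet", columns=["amount"])
print("parquet, one column:", time.perf_counter() - t0)

t0 = time.perf_counter()
pd.read_csv("orders.csv", usecols=["amount"])
print("csv, one column:    ", time.perf_counter() - t0)
```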
"Lakehouse" usually means a data lake (bunch of files in object storage with some arbitrary structure) that has an open source "table format" making it act like a database. E.g. using Iceberg or Delta Lake to handle deletes, transactions, concurrency control on top of parquet (the "file format").
The advantage is that various query engines will make it quack like a database, but you get a completely open interop layer that lets any combination of query engines (or just SDKs that implement the table format, or whatever) coexist. In addition, you can feel good about "owning" your data and not being overly locked in to Snowflake or Databricks.
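A rough sketch of that interop, reusing the hypothetical table above and assuming the polars and deltalake packages are installed:

```python
# Two unrelated readers hit the same Delta table on disk/object storage,
# with no server in between -- both just implement the open table format.
# The path is hypothetical (the table written in the earlier sketch).
import polars as pl
from deltalake import DeltaTable

path = "/tmp/orders_delta"

df_polars = pl.read_delta(path)            # engine 1: Polars reads the table natively
df_pandas = DeltaTable(path).to_pandas()   # engine 2: the deltalake SDK hands it to pandas

print(df_polars.shape, df_pandas.shape)
# Spark, Trino, DuckDB, etc. could point at the same path as well --
# no export step, and the storage layer stays vendor-neutral.
```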
What are some good resources that can help educate folks on these differences?