- data warehouse: schema on write. you have to know the end form before you load it. breaks every time the upstream schema changes (which, in this world, is a lot)
- data lake: schema on read. load everything into S3 and deal with it later. the Mongo of data platforms
- data lakehouse: something in between. store everything loosely like a lake, but have in-lakehouse processes present user-friendly transforms or views like a warehouse. made possible by cheap storage (parquet on S3); reduces schema breakage by keeping both sides of the T (the transform in ETL/ELT) in the same place. see the schema-on-read sketch just after this list
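A minimal sketch of the schema-on-read idea, using DuckDB over a plain Parquet file. The file name and columns are made up for illustration:

```python
# Schema on read: files in the lake carry their own schema, and the query
# engine discovers it at query time. File/column names are hypothetical.
import duckdb
import pandas as pd

# Some process dumped a raw file into the lake -- no table was declared anywhere.
pd.DataFrame({"order_id": [1, 2, 3], "amount": [50.0, 120.0, 300.0]}) \
    .to_parquet("orders.parquet")

# DuckDB reads the schema out of the Parquet footer when the query runs.
# A warehouse would have required CREATE TABLE + a load job first (schema on write).
duckdb.sql("""
    SELECT order_id, amount
    FROM read_parquet('orders.parquet')  -- could just as well be an s3:// glob
    WHERE amount > 100
""").show()
```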
I am working on some blogs / videos that will hopefully help clarify the differences. I'm working on a Delta Lake vs Parquet blog post right now, and I gave a "5 Reasons Parquet Files Are Better Than CSV" talk last year: https://youtu.be/9LYYOdIwQXg
Most of the content I've seen in this area is really high-level. I'm trying to write posts that are a bit more concrete, with code snippets, high-level benchmarks, etc. Hopefully this will help.
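For example, here's a rough sketch of the kind of CSV vs Parquet comparison I mean. The dataset is made up and the numbers you'd see are illustrative, not benchmark results:

```python
# Rough sketch: compare file size and single-column read time, CSV vs Parquet.
# Requires pandas + pyarrow; data is synthetic, numbers are illustrative only.
import os
import time

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "order_id": np.arange(n),
    "amount": np.random.rand(n) * 100,
    "region": np.random.choice(["us-east", "us-west", "eu"], n),
})

df.to_csv("orders.csv", index=False)
df.to_parquet("orders.parquet")  # columnar, typed, compressed

print("csv bytes:    ", os.path.getsize("orders.csv"))
print("parquet bytes:", os.path.getsize("orders.parquet"))

# Column pruning: Parquet lets the reader pull just the column it needs.
t0 = time.perf_counter()
pd.read_parquet("orders.parquet", columns=["amount"])
print("parquet, one column:", time.perf_counter() - t0)

t0 = time.perf_counter()
pd.read_csv("orders.csv", usecols=["amount"])
print("csv, one column:    ", time.perf_counter() - t0)
```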
"Lakehouse" usually means a data lake (bunch of files in object storage with some arbitrary structure) that has an open source "table format" making it act like a database. E.g. using Iceberg or Delta Lake to handle deletes, transactions, concurrency control on top of parquet (the "file format").
The advantage is that various query engines will make it quack like a database, but you get a completely open interop layer that lets any combination of query engines (or just SDKs that implement the table format, or whatever) coexist. In addition, you can feel good about "owning" your data and not being overly locked in to Snowflake or Databricks.
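A rough sketch of that interop, reusing the hypothetical table above and assuming the polars and deltalake packages are installed:

```python
# Two unrelated readers hit the same Delta table on disk/object storage,
# with no server in between -- both just implement the open table format.
# The path is hypothetical (the table written in the earlier sketch).
import polars as pl
from deltalake import DeltaTable

path = "/tmp/orders_delta"

df_polars = pl.read_delta(path)            # engine 1: Polars reads the table natively
df_pandas = DeltaTable(path).to_pandas()   # engine 2: the deltalake SDK hands it to pandas

print(df_polars.shape, df_pandas.shape)
# Spark, Trino, DuckDB, etc. could point at the same path as well --
# no export step, and the storage layer stays vendor-neutral.
```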
What are some good resources that can help educate folks on these differences?