Data Warehouse. A SQL engine that is much bigger than the databases of yesteryear. Its size and scalability let you do things that the DBAs of an MS SQL DB wouldn't necessarily allow, but basically think "wow, that is a big database!"
Data Lake. Now your data is stored as files in blob storage (think CSV files on your computer) - and the blob storage is HUGE (hundreds of terabytes or larger). You also store unstructured data, too. So: pictures, JSONs, raw HTML. Whatever.
Data Lakehouse... do you mean Delta Lake? This is a made-up word from Databricks. Take your Data Lake, slap a SQL engine on top of it, and add in some marketing slang ;)
Data Mesh. THIS is more about organizations than infrastructure. Imagine a Data Warehouse or Data Lake. Now find 5 key stakeholders around your company. Let each of them have a playground in your warehouse. You drop the data into one zone, then they copy it, transform it, and share it in their zone - but they have to play by your rules. Bam. Data Mesh.
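To make the "slap a SQL engine on top of it" part of the Lakehouse entry a bit more concrete, here is a toy sketch. Everything in it is an assumption for illustration: the "lake" is just a temp directory of CSV files, the `orders_*.csv` names and the `query_lake` helper are made up, and SQLite stands in for the SQL engine. A real lakehouse engine reads the files in place rather than copying them into a database first.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical "lake": a directory of raw CSV files.
lake = Path(tempfile.mkdtemp())
(lake / "orders_2024.csv").write_text("order_id,amount\n1,10.5\n2,4.0\n")
(lake / "orders_2025.csv").write_text("order_id,amount\n3,20.0\n")


def query_lake(lake_dir: Path, sql: str) -> list[tuple]:
    """Load every orders_*.csv file into an in-memory table, then run SQL on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    for path in sorted(lake_dir.glob("orders_*.csv")):
        with path.open(newline="") as f:
            rows = [(int(r["order_id"]), float(r["amount"])) for r in csv.DictReader(f)]
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


total = query_lake(lake, "SELECT COUNT(*), SUM(amount) FROM orders")
print(total)
```

The point is only the shape: the files stay dumb, and the SQL layer gives you one table-like view over all of them.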
The Delta Lake is a marketing term from Databricks to, in part, market their Delta file format and all the clusters you’ll be spinning up.
Delta files are actually an amazing format though. At their most basic, they took Parquet files (a columnar file format), added a transaction log on top, and let you stream off them really easily. That takes a lot of complexity out of your pipelines: you don't need Kafka for everything, and you don't need to figure out when new rows get added (or run a whole other set of jobs around that).
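A rough way to see why that helps: if every write also leaves a numbered entry in a commit log, a reader can just remember the last version it saw and pick up from there. The sketch below is a toy under stated assumptions, not Delta's actual format or API: the `_log` directory, the JSON data files, and the `commit` / `read_since` helpers are all hypothetical, and real Delta tables store data as Parquet.

```python
import json
import tempfile
from pathlib import Path


def commit(table: Path, rows: list[dict]) -> int:
    """Write rows as a new data file and record it in a numbered log entry."""
    log_dir = table / "_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))  # next version number
    data_file = table / f"part-{version:05d}.json"
    data_file.write_text(json.dumps(rows))
    (log_dir / f"{version:05d}.json").write_text(json.dumps({"add": data_file.name}))
    return version


def read_since(table: Path, last_seen: int) -> tuple[list[dict], int]:
    """Return rows from commits newer than last_seen, plus the new high-water mark."""
    rows: list[dict] = []
    latest = last_seen
    for entry in sorted((table / "_log").glob("*.json")):
        version = int(entry.stem)
        if version <= last_seen:
            continue  # already processed by this reader
        added = json.loads(entry.read_text())["add"]
        rows.extend(json.loads((table / added).read_text()))
        latest = version
    return rows, latest


table = Path(tempfile.mkdtemp()) / "events"
commit(table, [{"id": 1}, {"id": 2}])
first_batch, mark = read_since(table, -1)    # a new reader starts before version 0
commit(table, [{"id": 3}])
second_batch, mark = read_since(table, mark)  # only the rows added since last read
print(first_batch, second_batch, mark)
```

The downstream job never has to diff the table or ask "what changed?"; it only tracks one integer.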
But using Delta files really can change the way you develop pipelines (and ML pipelines), so I forgive them for inventing a new term.
As someone who is familiar with what a Lakehouse architecture is, I remain confused about what Delta Lake is, mostly because I find it difficult to separate the Databricks Delta Lake marketing from the virtues of the Delta Lake architecture (Delta files, etc.). It's frustrating, and I've given up...
I am, however, keeping up to date with Apache Iceberg because it's much easier to follow and it seems to have a lot of advantages over the Delta Lake external table format (Delta files?).
Iceberg seems to be better, especially in handling schema evolution and drift. They both seem to use Parquet and Avro below the surface and generally have the same design, but am I missing anything by dismissing and ignoring Delta Lake?
With that out of the way, the Data Lakehouse name has a marketing motivation, but the concept is “let’s not have a DW and a Lake as separate structures based on different technical stacks - let’s have one stack that can do both”. That is a very valid proposition, because information needs tend to span more than a DW or a Lake can easily handle by itself. Also, having one common stack makes it much easier to manage all the data and generally be productive.
If you look back even at Inmon’s Corporate Information Factory, you’ll see mentions of unstructured data coming in, and the expectation that all data, structured or unstructured, should become useful information for some business purpose. This was back in the 90s: the need was already there, but the technology was not.
The Lakehouse as proposed by Databricks is essentially a combination of technologies (including but not limited to the Delta Lake format) to deliver that unified design. It’s a platform where you can process all kinds of data, apply structure where needed, manage quality, do DataOps and generally deliver information, regardless of how structured (or not) the original data was.
So, to put it simply, a Lakehouse is a platform (or technical stack, if you want to build your own) where you can implement a DW and a Data Lake, and mix the capabilities of both to deliver data products more efficiently. It’s more than just the union of DW and Lake, because the combined capabilities allow you to do things that neither could do by itself. For instance, you can parse a large set of documents for sentiment and context to generate metrics on some business topic. Environmental, Social and Governance (ESG) is one example area where the results are BI-like metrics for reports and dashboards, but the sources are usually text documents. Social media processing for marketing purposes is another.
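As a very small sketch of that "documents in, BI-style metric out" flow: the snippet below scores a few made-up text documents with a naive keyword count and averages the scores per company. The word lists, the sample documents, and the `score` / `sentiment_by_company` helpers are all assumptions for illustration; a real pipeline would use a proper NLP model and run on the platform's own engine.

```python
from collections import defaultdict

# Hypothetical keyword lists standing in for a real sentiment model.
POSITIVE = {"reduced", "improved", "renewable"}
NEGATIVE = {"spill", "fine", "violation"}

# Hypothetical unstructured inputs, as they might sit in the lake.
documents = [
    {"company": "Acme", "text": "Acme reduced emissions and improved renewable sourcing"},
    {"company": "Acme", "text": "Regulator issued a fine after a spill"},
    {"company": "Globex", "text": "Globex improved water usage"},
]


def score(text: str) -> int:
    """Count positive keywords minus negative keywords in the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_company(docs: list[dict]) -> dict[str, float]:
    """Turn raw documents into a structured metric: average score per company."""
    scores = defaultdict(list)
    for doc in docs:
        scores[doc["company"]].append(score(doc["text"]))
    return {company: sum(s) / len(s) for company, s in scores.items()}


metrics = sentiment_by_company(documents)
print(metrics)
```

The output is a plain table-shaped result (company, average score) that could land in a DW-style table and feed a dashboard, even though the inputs were free text.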