As someone who knows nothing about this stuff, I'm looking at the "Data Mart" wiki page: https://en.wikipedia.org/wiki/Data_mart. Ok, so the entire diagram here is labelled "Data Warehouse", and within that there's a "Data Warehouse" block which seems to be solely comprised of a "Data Vault". Do you need a special data key to get into the data vault in the data warehouse? Okay, naturally the data marts are divided into normal marts and strategic marts - seems smart. But all the arrows between everything are labelled "ETL". Seems redundant. What does it mean anyway? Ok apparently it's just... moving data.
Now I look at https://en.wikipedia.org/wiki/Online_analytical_processing. What's that? First sentence: "is an approach to answer multi-dimensional analytical (MDA)". I click through to https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA "is a data analysis process that groups data into two categories: data dimensions and measurements". What the fuck? Who wrote this? Alright, back on the OLAP wiki page... "The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space." Ah yes, the intersections... why not keep math out of it if you have no idea how to talk about it? Also, there's no actual mention of why this is considered "online" in the first place. I feel like I'm in a nightmare where the pandas documentation was rewritten in MBA-speak.
As someone who just arrived in data world, I feel your pain.
You didn't even mention the data lake and the data warehouse are set for merger into the data lakehouse. Not to mention where data mesh and data fabric fit into all of this.
It's hard for me to say why this all seems so much more confusing than the software dev world. My guess is because data is a thing that a business accumulates and processes, often as a side channel to its actual work. There's an inherent meta-ness to it, and both the business and tech people have had a hand in shaping the approaches. So it's kind of a mess, and for whatever reason, even more susceptible to buzzwordery than the rest of tech.
Thank you, as a data engineer, that's exactly how I feel about the various concepts surrounding data engineering. Or at the very least, about the way they're explained.
I’ve been working in this field for 28 years, and it can be quite confusing for a newcomer. Let me see if I can help.
First, the name Data Warehouse is overloaded. It can refer to the entire architecture (often called “Data Warehousing architecture”) or to the central data store that (ideally) contains all the corporate data in a format suitable for generating datasets/schemas for specific needs (the data marts, if that need is BI/reporting/dashboards/data slicing and dicing). The other common component of a DW architecture is the staging area, where you land data from the source systems. So, datawise, a DW architecture has 3 layers:
- Staging: where you land your data from the sources. You may have some temporary storage of data (some choose to just keep everything, just in case) and a few control tables, but it will usually look very much like the databases of each source;
- Data Warehouse: Bill Inmon defines it as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”. Essentially, reconcile as much as you can of the distinct domains of each source system (ideally, full reconciliation, but it’s almost impossible to get to 100%), keep as much history as you can afford and then apply denormalization (a little or a lot, depending on your preferred data modelling approach) to make it able to support BI tools. If you denormalize a lot, you may end up not needing data marts, but you’ll be applying area-specific needs to what should be a common basis for all areas. Common data modelling approaches for a Data Warehouse are normalised (3NF), detailed star schemas (look up Ralph Kimball if you don’t know him) or, a bit more recently, Data Vault;
- Data Mart: an analytical database to support a specific business need - sales reporting, marketing campaigns, financial reporting, employee satisfaction, all these applications could have their own data marts. Each data mart should be fed from the Data Warehouse, applying heavy denormalization and calculating specific metrics and reporting structures (i.e., dimensions) to meet that use case’s needs. The most popular data modelling technique for data marts would be dimensional modelling (a.k.a. star schemas). The names come from the use of two main types of tables - facts and dimensions. A dimension could be a business entity, like departments, customers, products, employees, etc.; a fact is a metric (like sales volume, sales $$$, satisfaction, stock) over time and a number of the aforementioned dimensions. When drawn, the fact table sits in the middle and the dimensions around it, looking a bit like a star, hence the name (see the sketch after this list).
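If it helps to see it concretely, here's a toy star schema in pandas. All the table and column names are invented for illustration, not taken from any particular tool:

```python
import pandas as pd

# Dimension tables: descriptive attributes of business entities (hypothetical names)
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Widget", "Gadget", "Gizmo"],
    "category": ["Hardware", "Hardware", "Software"],
})
dim_date = pd.DataFrame({
    "date_id": [20240101, 20240102],
    "year": [2024, 2024],
    "month": [1, 1],
})

# Fact table: one row per event, foreign keys to each dimension plus numeric measures
fact_sales = pd.DataFrame({
    "date_id": [20240101, 20240101, 20240102],
    "product_id": [1, 2, 1],
    "units_sold": [10, 5, 7],
    "sales_amount": [100.0, 250.0, 70.0],
})
```

The fact table only holds keys and measures; everything descriptive lives in the dimensions, which is what makes the heavy denormalization and the "star" shape work.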
Analytical processing is nothing more than wading through this data searching for answers to business questions, interesting patterns, outliers or just confirmation that everything is as it should be. It usually means applying filters on the dimension tables, joining with the fact tables, aggregating over hierarchies and possibly time, and watching how the resulting time series of metrics behave - the basic job of a data analyst. BI tools expect data to be structured as stars, and will help the analyst do this slicing and dicing over the data, generating the required database queries, possibly doing some part of the computation and then the visualisation rendering. The names “multidimensional” and “hypercube” come from the data being structured across several “dimensions”, as I explained above. Some BI tools will even have their own compute/data storage engine, optimised for handling this kind of data. Usually this is called an “OLAP engine” or a “multidimensional database”. It’s database-like functionality, optimised for filtering, aggregating and generally wading through a large chunk of data. When loaded into a specialised database like this, a “star” is usually referred to as a “cube” or “hypercube”.
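Continuing the toy tables above, the slicing and dicing a BI tool does under the hood boils down to something like this (the particular filter, grouping and metric are arbitrary examples):

```python
# Join the fact table to its dimensions to get the full "star" in one frame
star = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
)

# "Slice" on a dimension attribute, then aggregate a measure over other dimensions:
# e.g. monthly sales amount for Hardware products only
hardware_by_month = (
    star[star["category"] == "Hardware"]
    .groupby(["year", "month"], as_index=False)["sales_amount"]
    .sum()
)
print(hardware_by_month)
```

A BI tool would generate the equivalent SQL (or push it into its own OLAP engine) and render the result as a chart or pivot table, but the shape of the work is the same: filter dimensions, join to facts, aggregate measures.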
And finally, about the “online” mention. All the analysis above is supposed to be ad hoc, interactive and close to instant, so that the analyst’s train of thought (think of it as each question generating additional questions to the data) is not interrupted. The term “online analytical processing” (OLAP) was coined to refer to this, in contrast to the pre-existing term “online transactional processing” (OLTP), which is what most modern systems do - send a large number of small transactions to be processed by a database online (as opposed to batch). OLAP sends a moderate number of very complex queries to be processed by the database online (as opposed to batch).
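To make that contrast concrete, here's a rough sketch using SQLite (table and column names are invented). The INSERT is the shape of OLTP work: huge numbers of tiny transactions, each touching a row or two. The GROUP BY is the shape of OLAP work: a handful of heavy queries scanning and aggregating lots of rows, still expected to come back interactively:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, order_date TEXT)"
)

# OLTP-style: one small transaction among millions, processed as it arrives
conn.execute(
    "INSERT INTO orders (customer_id, amount, order_date) VALUES (?, ?, ?)",
    (42, 19.99, "2024-01-01"),
)
conn.commit()

# OLAP-style: a complex, scan-heavy aggregation, answered interactively rather than in a batch run
monthly_totals = conn.execute(
    "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS total "
    "FROM orders GROUP BY month ORDER BY month"
).fetchall()
print(monthly_totals)
```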