As someone who knows nothing about this stuff, I'm looking at the "Data Mart" wiki page: https://en.wikipedia.org/wiki/Data_mart. Ok, so the entire diagram here is labelled "Data Warehouse", and within that there's a "Data Warehouse" block which seems to be solely comprised of a "Data Vault". Do you need a special data key to get into the data vault in the data warehouse? Okay, naturally the data marts are divided into normal marts and strategic marts - seems smart. But all the arrows between everything are labelled "ETL". Seems redundant. What does it mean anyway? Ok apparently it's just... moving data.
Now I look at https://en.wikipedia.org/wiki/Online_analytical_processing. What's that? First sentence: "is an approach to answer multi-dimensional analytical (MDA)". I click through to https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA "is a data analysis process that groups data into two categories: data dimensions and measurements". What the fuck? Who wrote this? Alright, back on the OLAP wiki page... "The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space." Ah yes, the intersections... why not keep math out of it if you have no idea how to talk about it? Also, there's no actual mention of why this is considered "online" in the first place. I feel like I'm in a nightmare where the pandas documentation was rewritten in MBA-speak.
As someone who just arrived in data world, I feel your pain.
You didn't even mention the data lake and the data warehouse are set for merger into the data lakehouse. Not to mention where data mesh and data fabric fit into all of this.
It's hard for me to say why this all seems so much more confusing than the software dev world. My guess is because data is a thing that a business accumulates and processes, often as a side channel to its actual work. There's an inherent meta-ness to it, and both the business and tech people have had a hand in shaping the approaches. So it's kind of a mess, and for whatever reason, even more susceptible to buzzwordery than the rest of tech.
Thank you, as a data engineer, that's exactly how I feel about the various concepts surrounding data engineering. Or at the very least, about the way they're explained.
I’ve been working in this field for 28 years, and it can be quite confusing for a newcomer. Let me see if I can help.
First, the name Data Warehouse is overloaded. It can refer to the entire architecture (often called “Data Warehousing architecture”) or to the central data store that (ideally) contains all the corporate data in a format suitable for generating datasets/schemas for specific needs (the data marts, if that need is BI/reporting/dashboards/data slicing and dicing). The other common component of a DW architecture is the staging area, where you land data from the source systems. So, datawise, a DW architecture has 3 layers:
- Staging: where you land your data from the sources. You may have some temporary storage of data (some choose to just keep everything, just in case) and a few control tables, but it will usually look very much like the databases of each source;
- Data Warehouse: Bill Inmon defines it as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”. Essentially, reconcile as much as you can of the distinct domains of each source system (ideally, full reconciliation, but it’s almost impossible to get to 100%), keep as much history as you can afford and then apply denormalization (a little or a lot, depending on your preferred data modelling approach) to make it able to support BI tools. If you denormalize a lot, you may end up not needing data marts, but you’ll be applying area-specific needs to what should be a common basis for all areas. Common data modelling approaches for a Data Warehouse are normalised (3NF), detailed star schemas (look up Ralph Kimball if you don’t know him) or, a bit more recently, Data Vault;
- Data Mart: an analytical database to support a specific business need - sales reporting, marketing campaigns, financial reporting, employee satisfaction, all these applications could have their own data marts. Each data mart should be fed from the Data Warehouse, applying heavy denormalization and calculating specific metrics and reporting structures (i.e., dimensions) to meet that use case’s needs. The most popular data modelling technique for data marts would be dimensional modelling (a.k.a. star schemas). The names come from the use of two main types of tables - facts and dimensions. A dimension could be a business entity, like departments, customers, products, employees, etc.; a fact is a metric (like sales volume, sales $$$, satisfaction, stock) over time and a number of the aforementioned dimensions. When drawn, the fact table sits in the middle and the dimensions around it, looking a bit like a star, hence the name (see the sketch after this list).
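If it helps to see it concretely, here's a toy star schema in pandas. All the table and column names are invented for illustration, not taken from any particular tool:

```python
import pandas as pd

# Dimension tables: descriptive attributes of business entities (hypothetical names)
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Widget", "Gadget", "Gizmo"],
    "category": ["Hardware", "Hardware", "Software"],
})
dim_date = pd.DataFrame({
    "date_id": [20240101, 20240102],
    "year": [2024, 2024],
    "month": [1, 1],
})

# Fact table: one row per event, foreign keys to each dimension plus numeric measures
fact_sales = pd.DataFrame({
    "date_id": [20240101, 20240101, 20240102],
    "product_id": [1, 2, 1],
    "units_sold": [10, 5, 7],
    "sales_amount": [100.0, 250.0, 70.0],
})
```

The fact table only holds keys and measures; everything descriptive lives in the dimensions, which is what makes the heavy denormalization and the "star" shape work.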
Analytical processing is nothing more than wading through this data searching for answers to business questions, interesting patterns, outliers or just confirmation that everything is as it should be. It usually means applying filters on the dimension tables, joining with the fact tables, aggregating over hierarchies and possibly time, and watching how the resulting time series of metrics behave - the basic job of a data analyst. BI tools expect data to be structured as stars, and will help the analyst do this slicing and dicing over the data, generating the required database queries, possibly doing some part of the computation and then the visualisation rendering. The names “multidimensional” and “hypercube” come from the data being structured across several “dimensions”, as I explained above. Some BI tools will even have their own compute/data storage engine, optimised for handling this kind of data. Usually this is called an “OLAP engine” or a “multidimensional database”. It’s database-like functionality, optimised for filtering, aggregating and generally wading through a large chunk of data. When loaded into a specialised database like this, a “star” is usually referred to as a “cube” or “hypercube”.
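Continuing the toy tables above, the slicing and dicing a BI tool does under the hood boils down to something like this (the particular filter, grouping and metric are arbitrary examples):

```python
# Join the fact table to its dimensions to get the full "star" in one frame
star = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
)

# "Slice" on a dimension attribute, then aggregate a measure over other dimensions:
# e.g. monthly sales amount for Hardware products only
hardware_by_month = (
    star[star["category"] == "Hardware"]
    .groupby(["year", "month"], as_index=False)["sales_amount"]
    .sum()
)
print(hardware_by_month)
```

A BI tool would generate the equivalent SQL (or push it into its own OLAP engine) and render the result as a chart or pivot table, but the shape of the work is the same: filter dimensions, join to facts, aggregate measures.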
And finally, about the “online” mention. All the analysis above is supposed to be ad hoc, interactive and close to instant, so that the analyst’s train of thought (think of it as each question generating additional questions to the data) is not interrupted. The term “online analytical processing” (OLAP) was coined to refer to this, in contrast to the pre-existing term “online transactional processing” (OLTP), which is what most modern systems do - send a large number of small transactions to be processed by a database online (as opposed to batch). OLAP sends a moderate number of very complex queries to be processed by the database online (as opposed to batch).
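To make that contrast concrete, here's a rough sketch using SQLite (table and column names are invented). The INSERT is the shape of OLTP work: huge numbers of tiny transactions, each touching a row or two. The GROUP BY is the shape of OLAP work: a handful of heavy queries scanning and aggregating lots of rows, still expected to come back interactively:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, order_date TEXT)"
)

# OLTP-style: one small transaction among millions, processed as it arrives
conn.execute(
    "INSERT INTO orders (customer_id, amount, order_date) VALUES (?, ?, ?)",
    (42, 19.99, "2024-01-01"),
)
conn.commit()

# OLAP-style: a complex, scan-heavy aggregation, answered interactively rather than in a batch run
monthly_totals = conn.execute(
    "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS total "
    "FROM orders GROUP BY month ORDER BY month"
).fetchall()
print(monthly_totals)
```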