The evolution of the data engineer role (airbyte.com)
167 points by Arimbr on Oct 24, 2022 | 117 comments



IMO data engineering is already a specialized form of software engineering. However, what people interpret as DEs being slow to adopt best practices from traditional software engineering is more about the unique difficulties of working with data (especially at scale) and less about a lack of awareness of, or desire to use, those practices.

Speaking from my DE experience at Spotify and previously in startup land, the biggest challenge is the slow and distant feedback loop. The vast majority of data pipelines don't run on your machine and don't behave like they do on a local machine. They run as massively distributed processes and their state is opaque to the developer.

Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far. And integration testing really needs to work at scale with easily recyclable infrastructure (and data) to not be a massive drag on developer productivity. Even getting the correct kind of data to be fed into a test can be very difficult if the ops/infra of the org isn't designed for it.

The best data tooling isn't going to look exactly like traditional SWE tooling. Tools that vastly reduce the feedback loop of developing (and debugging) distributed pipelines running in the cloud, and that also provide means of validating the output on meaningful data, are where tooling should be going. Traditional SWE best practices will really only take off once that kind of developer experience is realized.


> Validating the correctness of a large scale data pipeline can be incredibly difficult as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices here like unit testing only go so far.

I'm glad to see someone calling this out because the comments here are a sea of "data engineering needs more unit tests." Reliably getting data into a database is rarely where I've experienced issues. That's the easy part.

This is the biggest opportunity in this space, IMHO, since validation and data completeness/accuracy is where I spend the bulk of my work. Something that can analyze datasets and provide some sort of ongoing monitoring for confidence on the completeness and accuracy of the data would be great. These tools seem to exist mainly in the network security realm, but I'm sure they could be generalized to the DE space. When I can't leverage a second system for validation, I will generally run some rudimentary statistics to check to see if the volume and types of data I'm getting is similar to what is expected.
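
To make the "rudimentary statistics" bit concrete, it usually amounts to a handful of assertions against a trailing baseline; a rough pandas sketch (paths, column names, and thresholds all made up):

    import pandas as pd

    # Compare today's batch against a trailing baseline before letting it through.
    batch = pd.read_parquet("s3://bucket/orders/dt=2022-10-24/")  # made-up path
    baseline_rows = 1_250_000   # e.g. the trailing 7-day average row count
    tolerance = 0.25            # flag swings of more than 25%

    problems = []
    if abs(len(batch) - baseline_rows) / baseline_rows > tolerance:
        problems.append(f"row count {len(batch)} outside expected range")
    if batch["order_id"].isna().mean() > 0.01:
        problems.append("more than 1% null order_ids")
    if not pd.api.types.is_numeric_dtype(batch["amount"]):
        problems.append("amount column is no longer numeric")

    if problems:
        raise ValueError("; ".join(problems))  # or page someone / quarantine the batch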


There is a huge round of "data observability" startups that address exactly this. As a category it was overfunded prior to the VC squeeze. Some of them are actually good.

They all have various strengths and weaknesses with respect to anomaly detection, schema change alerts, rules-based approaches, sampled diffs on PRs, incident management, tracking lineage for impact analysis, and providing usage/performance monitoring.

Datafold, Metaplane, Validio, Monte Carlo, Bigeye

Great Expectations has always been an open source standby as well and is being turned into a product.
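
For anyone who hasn't tried it, the classic pandas-flavored API is only a few lines (the file and columns here are invented; the expectation names are the library's own):

    import great_expectations as ge

    df = ge.read_csv("orders.csv")  # a pandas DataFrame with expectation methods attached
    result = df.expect_column_values_to_not_be_null("order_id")
    print(result.success)
    df.expect_column_values_to_be_between("amount", min_value=0, max_value=100_000)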


Thanks for the recommendations, I'm going to check some of them out.


Engineers demanding unit tests for data is a perfect filter to weed out the SWEs who aren't really DEs. Ask about experience with data quality and data testing when you interview candidates and you'll distinguish the people who will solve a problem with a simple relational join in 1 hour (DEs) from those who will unknowingly try to build a shitty implementation of a database engine to solve the problem in one month (SWEs trying to solve data problems with C++ or Java).


Unit testing is a means to an end: how do we verify that code is correct the first time, and how do we set ourselves up to evolve the code safely and prevent regressions in the future?

Strong typing can reduce the practical need for some unit testing. Systems written in loosely-typed languages like Python and JavaScript often see real-world robustness improvements from paranoid unit testing to validate sane behavior when fed wrongly-typed arguments. Those particular unit tests may not be needed in a more strongly typed language like Java or TypeScript or Rust.
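
As a made-up illustration, this is the kind of paranoid test I mean; a type checker makes the whole thing redundant:

    import pytest

    def normalize_price(price: float) -> float:
        """Round a price to two decimals, rejecting non-numbers explicitly."""
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise TypeError("price must be a number")
        return round(price, 2)

    def test_rejects_wrongly_typed_arguments():
        # In a more strongly typed language this entire test is unnecessary.
        with pytest.raises(TypeError):
            normalize_price("19.99")
        with pytest.raises(TypeError):
            normalize_price(None)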

Similarly, SQL is a better way to solve certain data problems, and may not need certain checks.

Nevertheless, my experience as a software developer has taught me that in addition to the code I write which implements the functionality, I need to write code which proves I got it right. This is in addition to QA spot checking that the code functions as expected in context (in dev, in prod, etc.) Doing both automated testing and QA gets the code to what I consider an acceptable level of robustness, despite the human tendency to write incorrect code the first time.

There are plenty of software developers who disagree about that and eschew unit testing in particular and error checking in general — and we tend to disagree about how best to achieve high development velocity. I expect there will always be such a bifurcation within the field of data engineering as well.


If you rely on some sort of deductive correctness - i.e. my inputs are correct and my code is correct, therefore my outputs are also correct - you're only going to cover a tiny fraction of real-world problems.

Data engineering is typically closely aligned with the business, and its processes are inherently fuzzy. Things are 'correct' as long as no people/quality checks are complaining. There is no deductive reasoning. No true axioms. No 'correctness'. You can only measure non-quality by how many complaints you have received, but not the actual quality, since it's not a closed deductive system.

Correctness is also defined by somebody downstream from you. What one team considers correct, the other complains about. You don't want to start throwing out good data for one team just because somebody else complained. But many people do. Or typically people coming from SWE into DE tend to, before they learn.


I've worked with medium-sized ETL, and not only does it have unique challenges, it's a sub-___domain that seems to reward quick and dirty and "it works" over strong validation.

The key problem is that the more you validate incoming data, the more you can demonstrate correctness, but also the more often incoming data will be rejected, and you will be paged out of hours :)


I also manage a medium sized set of ETL pipelines (approx 40 pipelines across 13k-ish lines of Python) and have a very similar experience.

I've never been in a SWE role before, but am related to and have known a number of them, and have a general sense of what being a SWE entails. That disclaimer out of the way, it's my gut feeling that a DE typically does more "hacky" kind of coding than a SWE. Whereas SWEs have much more clearly established standards for how to do certain things.

My first modules were a hot nasty mess. I've been refactoring and refining them over the past 1.5 years so they're more effective, efficient, and easier to maintain. But they've always just worked, and that has been good enough for my employer.

I have one 1600 line module solely dedicated to validating a set of invoices from a single source. It took me months of trial and error to get that monster working reliably.


Oddly, this sounds like the difference between inductive and deductive systems.


This is actually a great observation. Data pipelines are often written in various languages, running on heterogeneous systems, with different time alignment schemes. I always found it tricky to "fully trust" a piece of the result. Hmm, any best practices from your side?


Without getting into the weeds of it, I'd say smooth out the rough edges in your development experience and make it behave as similarly to prod as possible. If there's less friction, there's less incentive to cut corners and make hacks, IMO.

Some pain points:

- Does it take forever to spin up infra to run a single test?

- Is grabbing test data a manual process? This can be a huge pain especially if the test data is binary like avro or parquet. Test inputs and results should be human friendly

- Does setting up a testing environment require filling out tons of yaml files and manual steps?

- Things built at the wrong level of abstraction! This always irks me. Keep your abstractions clean between which tools in your data stack do what. When people start inlining task-specific logic at the DAG level in Airflow, or letting individual tasks figure out triggering or scheduling decisions, things just become confusing (rough sketch below).
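
Rough sketch of what I mean by keeping the DAG level thin (module and task names invented; Airflow 2-style imports): the DAG only wires tasks together, and the actual logic lives in an importable, testable module.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # All task-specific logic lives in a plain, unit-testable module (hypothetical)...
    from my_pipelines.orders import extract_orders, enrich_orders, load_orders

    # ...and the DAG only declares wiring and scheduling.
    with DAG(
        "orders_daily",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_orders)
        enrich = PythonOperator(task_id="enrich", python_callable=enrich_orders)
        load = PythonOperator(task_id="load", python_callable=load_orders)
        extract >> enrich >> load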

Right now my workflow allows me to run a prod job (google cloud dataflow) from my local machine. It consumes prod data and writes to a test-prefixed path. With unit tests on the scala code + successful run of the dataflow job + validation and metrics thrown on the prod job I can feel pretty comfortable with the correctness of the pipeline.


Not OP, but a Data Engineer with 4 years of experience in the space - I think the key is to first build the feedback loop - i.e. anything that helps you answer how you know the data pipeline is flowing and that the data is correct - then get sign-off from both the producers and consumers of the data. Actually getting the data flowing is usually pretty easy after both parties agree about what that actually means.


Great article.

> data engineers have been increasingly adopting software engineering best practices

I think the data engineering field is starting to adopt some software engineering best practices, but it's still really early days. I am the author of popular Spark testing libraries (spark-fast-tests, chispa) and they definitely have a large userbase, but could also grow a lot.

> The way organizations structure data teams has changed over the years. Now we see a shift towards decentralized data teams, self-serve data platforms, and ways to store data beyond the data warehouse – such as the data lake, data lakehouse, or the previously mentioned data mesh – to better serve the needs of each data consumer.

I think the Lakehouse architecture is the real future of data engineering, see the paper: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

Disclosure: I am on the Delta Lake team, but joined because I believe in the Lakehouse architecture vision.

It will take a long time for folks to understand all the differences between data lakes, Lakehouses, data warehouses, etc. Over time, I think mass adoption of the Lakehouse architecture is inevitable (benefits of open file formats, no lock in, separating compute from storage, cost management, scalability, etc.).


I am a data engineer, and I STILL don't understand the differences between the following terms:

1. Data Warehouse

2. Datalake

3. Data Lakehouse

4. Data Mesh

Can someone please clearly explain the differences between these concepts?


Let me help you out:

Data Warehouse. SQL engine that is much bigger than the databases of yesteryear. Their size and scalability give you capabilities that the DBAs of an MS SQL DB wouldn't necessarily allow; think "wow, that is a big database!"

Data Lake. Now your database is stored in blob storage (which is like CSV files on your computer) - and the blob storage is HUGE (like hundreds of Terabytes or larger). Also, you store unstructured data, too. So: pictures, JSONs, raw HTML. Whatever

Data Lakehouse... do you mean Delta Lake? This is a made-up word from Databricks. Take your Data lake, and slap a SQL engine on top of it, and add in some marketing slang ;)

Data Mesh. THIS is more about organizations than infrastructure. Imagine a Data Warehouse or Datalake. Now find 5 key stakeholders around your company. Let each of them have a playground in your warehouse. You drop the data into one zone, then they copy it, transform it, and share it in their zone - but they have to play by your rules. Bam. Data Mesh


The Delta Lake is a marketing term from Databricks to, in part, market their Delta file format and all the clusters you’ll be spinning up.

Delta files are actually an amazing format though. At their most basic, they took Parquet files (a columnar file format) and let you stream off them really easily. Which takes a lot of complexity out of your pipelines - you don't need Kafka for everything, don't need to figure out when new rows get added (or a whole other set of jobs around that).

But using Delta files really can change the way you develop pipelines (and ML pipelines), so I forgive them for inventing a new term.


As someone that is familiar with what a Lake House Arch is, I remain confused with what Delta Lake is mostly bc I find it difficult to differentiate what is Databricks Delta Lake marketing and the virtues of the Delta Lake Arch (Delta files, etc). It's frustrating, and I've given up...

I am however keeping up to date with Apache Iceberg bc it's much easier to follow and it seems to have a lot of advantages over the Delta Lake external table format (delta files?).

Iceberg seems to be better especially in handling schema evolution and drift. They both seem to use parquet and avro below the surface and generally have the same design, but am I missing anything by dismissing and ignoring Delta Lake?


Disclaimer: I work for Databricks.

With that out of the way, the Data Lakehouse name has a Marketing motivation, but the concept is “let’s not have a DW and a Lake as separate structures based on different technical stacks - let’s have one stack that can do both”. That is a very valid proposition, because the information needs tend to span what a DW or a Lake can easily do by themselves. Also, having one common stack makes it much easier to manage all data and generally be productive.

If you look back even at Inmon’s Corporate Information Factory, you’ll see mentions of unstructured data coming in, and the expectation that all data, structured or unstructured, should become useful information for some business purpose. This was back in the 90s; the need was already there, but the technology was not.

The Lakehouse as proposed by Databricks is essentially a combination of technologies (including but not limited to Delta Lake format) to deliver that unified design. It’s a platform where you can process all kinds of data, apply structuring where needed, manage quality, do DataOps and generally deliver information, regardless of the structure or lack of it in the original data.

So, to put it simply, a Lakehouse is a platform (or technical stack, if you want to build your own) where you can implement a DW, a Data Lake and mix capabilities of both to deliver data products more efficiently. It’s more than just the union of DW and Lake, because the combined capabilities allow you to do things that neither could by itself. For instance, parse a large set of documents for sentiment and context to generate metrics on some business topic. Environmental, Social and Governance - ESG - is one example area where the results are BI-like metrics for reports and dashboards, but the sources are usually text documents. Social media processing for Marketing purposes is another.


Thank you, excellent explanation.


Marketing and hype. Especially data mesh. I gave up listening to talks/reading papers about it – could never wrap my head around the "mess"

Warehouse and Lake also feel mostly like distinctions without difference, though if you really want, I think this is the technical difference:

- warehouse: structured data for analytics, outputted from ETL jobs

- lake: a dump for unstructured data, to be cleaned up later: EL, then possibly T

Moreover, people want to push this even further, with "data river" - streaming(-like) data that is "continuously" transformed (isn't that just an ETL data pipeline?)


1. Data Warehouse

Recommend reading up on Kimball vs Inmon.

Kimball -- centralised storage in a "dimensional model". Uses a star/snowflake schema where the central FACT table (containing the raw transactional data) can be filtered and aggregated based on choices made in dimension tables (edges of the star/snowflake).

    demographic  account_types      product_types
       \        /                    /
        \      /                    / 
        customer                  product
          \                      /
           \                    /
       FACT table (sale $, order quantity, order ID, product ID)

customer, account_types etc are dimensions to filter your low-level transactional data. The schema looks like a snowflake when you add enough dimensions, hence the name.

The FACT table makes "measures" available to the user. Example: Count of Orders. These are based on the values in the FACT table (your big table of IDs that link to dimensions and low-level transactional data).

You can then slice and dice your count of orders by fields in the dimensions.

You could then add Sum of Sale ($) as an additional measure. "Abstract" measures like Average Sale ($) per Order can also be added in the OLAP backend engine.

End users will often be using Excel or Tableau to create their own dashboards / graphs / reports. This pattern makes sense in that case --> user can explore the heavily structured business data according to all the pre-existing business rules.
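
"Slice and dice" in code terms is just a join from the FACT table out to a dimension plus an aggregation; a toy pandas sketch (tables and columns invented):

    import pandas as pd

    fact_sales = pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "customer_id": [10, 10, 20, 30],
        "sale_amount": [9.99, 24.50, 5.00, 100.00],
    })
    dim_customer = pd.DataFrame({
        "customer_id": [10, 20, 30],
        "account_type": ["retail", "retail", "wholesale"],
    })

    # "Count of Orders" and "Sum of Sale ($)" sliced by a dimension attribute
    report = (
        fact_sales.merge(dim_customer, on="customer_id")
                  .groupby("account_type")
                  .agg(count_of_orders=("order_id", "count"),
                       sum_of_sales=("sale_amount", "sum"))
    )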

Pros:

- Great for enterprise businesses with existing application databases

- Highly structured, with transaction support (ACID compliance)

- Ease of use for end business user (create a new pivot table in Excel)

- Easy to query (basically a bunch of SQL queries)

- Encapsulates all your business rules in one place -- a.k.a. single source of truth.

Cons

- Massive start up cost (have to work out the schema before you even write any code)

- Slow to change (imagine if the raw transaction amounts suddenly changed to £ after a certain date!)

- Massive nightly ETL jobs (these break fairly often)

- Usually proprietary tooling / storage (think MS SQL Server)

---

2. Data Lake

Throw everything into an S3 bucket. Database table? Throw it into the S3 bucket. Image data? Throw it into the S3 bucket. Kitchen sink? Throw it into the S3 bucket.

Process your data when you're ready to process it. Read in your data from S3, process it, write back to S3 as an "output file" for downstream consumption.
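
In code, that loop is often nothing fancier than this (paths invented; reading straight from S3 assumes s3fs or similar is installed):

    import pandas as pd

    # Read raw data straight out of the lake...
    orders = pd.read_parquet("s3://my-data-lake/raw/orders/2022-10-24/")

    # ...apply whatever processing you're ready to do...
    cleaned = orders.dropna(subset=["order_id"]).assign(
        amount_gbp=lambda df: df["amount"] * 0.89  # made-up FX rate
    )

    # ...and write the output back for downstream consumption.
    cleaned.to_parquet("s3://my-data-lake/processed/orders/2022-10-24/output.parquet")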

Pros:

- Easy to set up

- Fairly simple and standardised i/o (S3 apis work with pandas and pyspark dataframes etc)

- Can store data remotely until ready to process it

- Highly flexible as mostly unstructured (create new S3 keys -- a.k.a. directories -- on the fly )

- Cheap storage

Cons:

- Doesn't scale -- turns into a "data swamp"

- Not always ACID compliant (looking at you Digital Ocean)

- Very easy to duplicate data

---

3. Data Lakehouse

Essentially a data lake with some additional bits.

EDIT: aims to mitigate the "data swamp" problem, improve flexibility / performance when changing how some-random-column X is calculated, and improve governance and auditability.

A. Delta Lake Storage Format a.k.a. Delta Tables

https://delta.io

EDIT: apparently there are a few more players and not just Delta Lake (Apache Iceberg). But there are still questions around how open the open source for these things really is.

Versioned files acting like versioned tables. Writing to a file will create a new version of the file, with previous versions stored for a set number of updates. Appending to the file creates a new version of the file in the same way (e.g. add a new order streamed in from the ordering application).

Every file -- a.k.a. delta table -- becomes ACID compliant. You can rollback the table to last week and replay e.g. because change X caused bug Y to happen.

AWS does allow you to do this, but it was a right ol' pain in the arse whenever I had to deal with massively partitioned parquet files. Delta Lake makes versioning the outputs much easier and it is much easier to roll back.
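
The time-travel bit in PySpark looks roughly like this (path invented; assumes a Spark session with the Delta Lake package available, e.g. a Databricks notebook):

    # Read the table as of an earlier version (or timestamp) to inspect or replay it.
    last_week = (
        spark.read.format("delta")
             .option("versionAsOf", 42)  # or .option("timestampAsOf", "2022-10-17")
             .load("s3://my-data-lake/silver/orders")
    )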

B. Data Storage Layout

Enforce a schema based on processing stages to get some performance & data governance benefits.

Example processing stage schema: DATA IN -> EXTRACT -> TRANSFORM -> AGGREGATE -> REPORTABLE

Or the "medallion" schema: Bronze -> Silver -> Gold.

Write out the data at each processing stage to a delta lake table/file. You can now query 5x data sources instead of 2x. The table's rarity indicates the degree of "data enrichment" you have performed -- i.e. how useful have you made the data. Want to update the codebase for the AGGREGATE stage? Just rerun from the TRANSFORM table (rather than run it all from scratch). This also acts as a caching layer. In a Data Warehouse, the entire query needs to be run from scratch each time you change a field. Here, you could just deliver the REPORTABLE tables as artefacts whenever you change them.
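
In other words, something along these lines (a PySpark-ish sketch with invented paths, assuming a Spark session with Delta available); you restart from the persisted TRANSFORM output rather than from DATA IN:

    # Re-run only the AGGREGATE stage: read the persisted TRANSFORM output...
    transformed = spark.read.format("delta").load("s3://lake/transform/orders")

    # ...apply the updated aggregation logic...
    aggregated = transformed.groupBy("region", "order_date").sum("amount")

    # ...and overwrite the AGGREGATE table that downstream REPORTABLE builds read from.
    aggregated.write.format("delta").mode("overwrite").save("s3://lake/aggregate/orders_by_region")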

C. "Metadata" Tracking

See AWS Glue Data Catalog.

Index files that match a specific S3 key pattern and/or file format and/or AWS S3 tag etc. throughout your S3 bucket. Store the results in a publicly accessible table. Now you can perform SQL queries against the metadata of your data. Want to find that file you were working on last week? Run a query based on last modified time. Want to find files that contain a specific column name? Run a query based on column names.

EDIT: also helps with data governance and auditability as all of your files/tables are tracked in a centralised ___location.

Pros:

- transactional versioning -- ACID compliance and the ability to rollback data over time (I accidentally deleted an entire column of data / calculated VAT wrong yesterday)

- processing-stage schema storage layout acts as a caching layer (only process from the stage where you need to)

- no need for humans to remember the specific path to the files they were working on as files are all indexed

- less chance of creating a "data swamp"

- changes become easier to audit as you can track the changes between versions

Cons:

- Delta lake table format is only really available with Apache Spark / Databricks processing engines (mostly, for now)

- Requires enforcement of the processing-stage schema (your data scientists will just ignore you when you request they start using it)

- More setup cost than a simple data lake

- Basically a move back towards proprietary tooling (some FOSS libs are starting to pop up for delta tables, plus git-based data versioning is a thing)

---

4. Data Mesh

geoduck14's answer on this was pretty good. Basically, have a data infrastructure team, and then ___domain-specific teams that spring up as needed (like an infra team looking after your k8s clusters, and application teams that use the clusters). The ___domain-specific data teams use the data platform provided by the data infrastructure team.

Previously worked somewhere in a "product" team which basically performed this function. They just didn't call it a "data mesh".


Very thorough answer on the DW and Data Lake points. On the Lakehouse, though, as I wrote above, it’s not just the data storage format (Delta, Iceberg, whatever). The storage is important, because it creates the possibility of using cloud storage as a basis for table-like structures, with ACID (or a subset of it) capabilities. Beyond that, you still need management, performance, security, SQL access and probably some more that I forget.

The goal for a Lakehouse is to create a platform/stack that can effectively do everything a DW does, everything a Data Lake does (so, all your pros above), get rid of most of the cons and also allow mixing and matching of the functionalities. Acknowledging this need to handle all kinds of data is not new; having the technology to deliver on it is, and that’s what the Lakehouse is.


Just seen this response. Thanks for the comment.

I was on the fence about including mentions of querying mechanisms like AWS Athena and the like, but you’re right it’s all part of the same horse (or house).

Was 5am by the time I finished writing this though… so I kind of had to stop (hence the very short data mesh section!).


data fabric?


You mean Lake 'Shanty' Architecture (think DataSwamp vs DataLake) am I right?

But in all seriousness, I totally agree with your opinion on LakeHouse Architecture and am especially excited about Apache Iceberg (external table format) and the support and attention it's getting.

Although I don't think that selecting any of these data technologies/philosophies comes down to making a mutually exclusive decision. In my opinion, they either build on or complement each other quite nicely.

For those that are interested, here are my descriptions of each...

Data Lake Arch - all of your data is stored on blob-storage (S3, etc) in a way that is partitioned thoughtfully and easily accessible, along with a meta-index/catalogue of what data is there, and where it is.

Lake House Arch - similar to a DataLake but data is structured and mutable, and hopefully allows for transactions/atomic-ops, schema evolution/drift, time-travel/rollback, so on... Ideally all of the properties that you usually assume to get with any sort of OLAP (maybe even OLTP) DB table. But the most important property in my opinion is that the table is accessible through any compatible compute/query engine/layer. Separating storage and compute has revolutionized the Data Warehouse as we know it, and this is the next iteration of this movement in my opinion.

Data Mesh/Grid Arch - designing how the data moves from a source all the way through each and every target while storing/capturing this information in an accessible catalogue/meta-database even as things transpire and change. As a result it provides data lineage and provenance, potentially labeling/tagging, inventory, data-dictionary-like information etc... This one is the most ambiguous and maybe most difficult to describe and probably design/implement, and to be honest I've never seen a real-life working example. I do think this functionality is a critical missing piece of the data stack, whether the solution is a Data Mesh/Grid or something else.

Data Engineers have their work cut out on this one, mostly bc this is where their paths cross with those of Application/Service Developers and Software Engineers. In my opinion, developers are usually creating services/applications that are glorified CRUD wrappers around some kind of operational/transactional data store like MySQL, Postgres, Mongo, etc. Analytics, reporting, retention, volume, etc are usually an afterthought and not their problem. Until someone hooks the operational data store up to their SQL IDE or Tableau/Looker and takes down prod. Then along comes the data engineer to come up with yet another ETL/ELT to get the data out of the operational data store and into a data warehouse so that reports and analytics can be run without taking prod down again.

Data Warehouse (modern) - Massive Parallel Processing (MPP) over detached/separated columnar (for now) data. Some Data Warehouses are already somewhat compatible with Data Lakes since they can use their MPP compute to index and access external tables. Some are already planning to be even more Lake House compatible by not only leveraging their own MPP compute against externally managed tables (eg), but also managing external tables in the first place. That includes managing schemas and running all of the DDLs (CREATE, ALTER, DROP, etc) as well as DQLs (SELECT) and DMLs (MERGE, INSERT, UPDATE, DELETE, ...). Querying data across native DB tables, external tables (potentially from multiple Lake Houses, Data Lakes) all becomes possible with a join in a SQL statement. Additionally this allows for all kinds of governance related functionality as well. Masking, row/column level security, logging, auditing, so on.

As you might be able to tell from this post (and my post history), I'm a big fan of Snowflake. I'm excited for Snowflake-managed Iceberg tables that can then be consumed with a different compute/query engine. Snowflake (or another modern DW) could prepare the data (ETL/calc/crunch/etc) and then manage (DDL & DML) it in an Iceberg table. Then something like DuckDB could consume the Iceberg table schema and listen for table changes (oplog?), and then read/query the data, performing last-mile analytics (pagination, order, filter, aggs, etc).

DuckDB doesn't support Apache Iceberg, but it can read parquet files which are used internally in Iceberg. Obviously supporting external tables is far more complex than just reading a parquet file, but I don't see why this isn't in their future. DuckDB guys, I know you're out there reading this :)

https://iceberg.apache.org/

https://www.snowflake.com/guides/olap-vs-oltp

https://www.snowflake.com/blog/iceberg-tables-powering-open-...

Finally one of my favorite articles:

https://engineering.linkedin.com/distributed-systems/log-wha...


Great write-up. I would add that I actually have seen something like a "Data Mesh" architecture, at a bank of all places. The key was a very stable, solid infrastructure and dev platform, as well as a custom Python library that worked across that Platform which was capable of ELT across all supported datastores and would properly log/annotate/catalog the data flows. Such a thing is really only possible when the platform is actually very stable and devs are somewhat forced to use the library.


I'm going to use "Lake Shanty" in the future. Powerful phrase to describe what happens when you run aground on the shore of a data swamp.


> It will take a long time for folks to understand all the differences between data lakes, Lakehouses, data warehouses, etc.

What are some good resources that can help educate folks on these differences?


Short version:

- data warehouse: schema on write. you have to know the end form before you load it. breaks every time upstream changes (a lot, in this world)

- data lake: schema on read. load everything into S3 and deal with it later. Mongo for data platforms

- data lakehouse: something in between. store everything loosely like a lake, but have in-lakehouse processes present user-friendly transforms or views like a warehouse. Made possible by cheap storage (parquet on S3), reduces schema breakage by keeping both sides of the T in the same place


Materialised views for cloud storage?


I am working on some blogs / videos that will hopefully help clarify the differences. I'm working on a Delta Lake vs Parquet blog post right now and gave a 5 Reasons Parquet files are better than CSV talk last year: https://youtu.be/9LYYOdIwQXg

Most of the content that I've seen in this area is really high-level. I'm trying to write posts that are a bit more concrete with some code snippets / high level benchmarks, etc. Hopefully this will help.


(OP's coworker) We actually published a guide on data lakes/lakehouses last month! https://airbyte.com/blog/data-lake-lakehouse-guide-powered-b...

covering:

- What’s a Data Lake and Why Do You Need One?

- What’s the Differences between a Data Lake, Data Warehouse, and Data Lakehouse

- Components of a Data Lake

- Data Lake Trends in the Market

- How to Turn Your Data Lake into a Lakehouse


"Lakehouse" usually means a data lake (bunch of files in object storage with some arbitrary structure) that has an open source "table format" making it act like a database. E.g. using Iceberg or Delta Lake to handle deletes, transactions, concurrency control on top of parquet (the "file format").

The advantage is that various query engines will make it quack like a database, but you have a completely open interop layer that will let any combination of query engines (or just SDKs that implement the table format, or whatever) coexist. And in addition, you can feel good about "owning" your data and not being overtly locked in to Snowflake or Databricks.


In some sense, Data engineering today is where software engineering was a decade ago:

- Infrastructure as code is not the norm. Most tools are UI-focused. It's the equivalent of setting up your infra via the AWS UI.

- Prod/Staging/Dev environments are not the norm

- Version Control is not a first class concept

- DRY and component re-use is exceedingly difficult (how many times did you walk into a meeting where 3 people had 3 different definitions of the same metric?)

- API Interfaces are rarely explicitly defined, and fickle when they are (the hot name for this nowadays is "data contracts")

- unit/integration/acceptance testing is not as nearly as ubiquitous as it is in software

On the bright side, I think this means DE doesn't need to re-invent the wheel on a lot of these issues. We can borrow a lot from software engineering.


My DE team has all of these, and I've never worked on a team without them. I speak as someone whose official title has been Data Engineer since 2015 and I've consulted for lots of F500 companies.

Unit testing is the only thing we tend to skip, mainly because it's more reliable to allow for fluidity in the data that's being ingested. Which is really easy now that so many databases can support automatic schema detection. External APIs can change without notice, so it's better to just design for that, then use the time you would spend on unit tests to build alerts around automated data validation.
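
By "alerts around automated data validation" I mean checks that run against the data after each load rather than against the code; something shaped like this (column names, thresholds, and the alert hook are all invented):

    import logging

    def send_alert(message: str) -> None:
        # Stand-in for whatever alerting hook you use (Slack, PagerDuty, email...)
        logging.getLogger("data_quality").error(message)

    def validate_load(df, expected_min_rows=100_000):
        """Post-load sanity checks; alert instead of failing a unit-test suite."""
        issues = []
        if len(df) < expected_min_rows:
            issues.append(f"only {len(df)} rows loaded")
        if df["customer_id"].isna().any():
            issues.append("null customer_id values present")
        if df["order_id"].duplicated().any():
            issues.append("duplicate order_ids found")
        for issue in issues:
            send_alert(issue)
        return not issues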


> Infrastructure as code is not the norm. Most tools are UI-focused. It's the equivalent of setting up your infra via the AWS UI.

> Version Control is not a first class concept

Of course, I may have worked in all of the wrong places, but all but one of the places I've worked at for the past ten years had source control for data pipelines, or the ability to set things up via config/source-controlled code as opposed to UIs.

> - Prod/Staging/Dev environments are not the norm

Fairly true, though in some cases, staging/dev has a bit more footprint/investment required than for backend or frontend development.

> DRY and component re-use is exceedingly difficult (how many times did you walk into a meeting where 3 people had 3 different definitions of the same metric?)

That's a hard one and I agree that's where a lot of opportunity is. There are several efforts to get at a more semantic layer / metric catalog where the people who care about the metrics can agree on the definition, but that's more of an organizational issue, not a data engineering issue.

Proper data modeling to ensure you can more easily reuse the metric as needed is also core here.

> - API Interfaces are rarely explicitly defined, and fickle when they are (the hot name for this nowadays is "data contracts")

That's another hard issue. The way I see it, it's still going to be a mix between nicely defined contracts and much looser logging that the DE still has to try to shape into something useful, sometimes even successfully.

> - unit/integration/acceptance testing is not as nearly as ubiquitous as it is in software

I take a slight issue with ubiquitous. The amount of software (from paid vendors no less) I have interacted with which does not have proper acceptance/integration testing is just plain sad.


You're talking about analytics not data engineering.

But yes, Data Analysis still needs more of this, though the smarter folks are getting on the Analytics Engineering / DataOps trains.


> The declarative concept is highly tied to the trend of moving away from data pipelines and embracing data products –

Of course an Airbyte article would say this, because they are selling these tools, but my experience has been the opposite. People buy these tools because they claim to make it easier for non-software people to build pipelines. But the problem is that these tools seem to end up being far more complicated and less reliable than pipelines built in code.

There's a reason that this ___domain is saturated with so. many. tools. None of them do a great job. And when a company invariably hits the limits of one, they start shopping for a replacement, which will have its own set of limitations. Lather-rinse-repeat.

I built a solid career over the past 8 or so years of replacing these "no code" pipeline tools with code once companies hit the ceilings of these tools. You can get surprisingly far in the data world with Airflow + a large scale database, but all of the major cloud providers have great tool offerings in this space. Plus, for platforms that these tools don't interface with, you're going to have to write code anyway.


Oh, declarative doesn't necessarily mean no-code. Airbyte data integration connectors are built with an SDK in Python, Java, and a low-code SDK that was just released...

You can then build custom connectors on top of these and many users actually need to modify an existing connector, but would rather start from a template than from scratch.

Airbyte also provides a CLI and YAML configuration language that you can use to declare sources, destinations and connections without the UI: https://github.com/airbytehq/airbyte/blob/master/octavia-cli...

I agree with you that code is here to stay and power users need to see the code and modify it. That's why Airbyte code is open-source.


> I built a solid career over the past 8 or so years of replacing these "no code" pipeline tools with code once companies hit the ceilings of these tools.

I'm sure you earn a nice living doing this, but surely this is not a convincing argument against using off-the-shelf data products. It will always come down to the cost (including ongoing maintenance) for the business. Bespoke in-house software is always the most flexible route, but rarely the cheaper one.


I'm a software dev who's been bumping up against the data engineering field lately, and I've been dismayed as to how many popular tools shunt you towards unmaintainable, unevolvable system design.

- A predilection for SQL, yielding "get the answer right once" big-ball-of-SQL solutions which are infeasible to debug or modify without causing regressions.

- Poor support for unit testing.

- Poor support for version control.

- Frameworks over libraries (because the vendors want to lock you in).

> data engineers have been increasingly adopting software engineering best practices

We can only hope. I think it's more likely that in the near term data engineers will get better and better at prototyping within low-code frameworks, and that transitioning from the prototype to an evolvable system will get harder.


I'm also a software engineer, though I've had the unofficial title "data engineer" applied to me for quite some time now.

The more I work with tools like Spark, the more dissatisfied I become with the data engineering world. Spark is a hot mess of fiddling with configuration - I've lost more productivity to dealing with memory limit, executor count, and other configuration than I think is reasonable.

Pandas is another one. It was good enough to make quick processing concise that it got significant uptake and became de facto. The API is a pain, though, and processing is slow. Now, couple Pandas and Spark in your day-to-day job and you get what I see from my data science colleagues: "I'll hack together some processing in Pandas until my machine can't handle any more data, at which point I'll throw it into Spark and provision a bunch of nodes to do that work for me." I don't mean that to sound pejorative, as they're generally just trying to be productive, but there's so little attention paid to real efficiency in the frameworks and infrastructure that we're blowing through compute, memory, and person-hours unnecessarily (IMHO).


Spark is the bane of my existence.

Sure, you can spin out multiple executors and have it distribute work amongst them…

Provided you figure out the 3 million configs necessary to do this in a reasonable way, sort out your dependency, networking and storage issues, only for the runtime to decide that actually, it’d quite like to run your entire workload on only 1 of your machines, or something equally frustrating.

Given how frustrating it is to operate on, I’m shocked it has gained as much popularity as it has. I sort of wonder if that’s a bit by design from vested commercial interests- make it simultaneously exceedingly popular and so convoluted to run that you are basically obliged to pay for the likes of databricks just to make your life not awful.

Of course, once you’ve done this, you’ve now bought into a whole suite of other issues, but that’s a different discussion…


> Pandas is another one.

But at least if I write a transform in pandas it's straightforward to unit test it: create a DataFrame with some dummy data, send it through the function which wraps the transform, test that what comes out is what's expected.
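
i.e. something like this (transform and columns invented for illustration):

    import pandas as pd
    import pandas.testing as pdt

    def add_order_totals(df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(total=df["quantity"] * df["unit_price"])

    def test_add_order_totals():
        dummy = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.5]})
        result = add_order_totals(dummy)
        expected = dummy.assign(total=[10.0, 4.5])
        pdt.assert_frame_equal(result, expected)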

Validating a transform done in SQL is not nearly as straightforward. For starters it needs to be an integration test not a unit test (because you need a database). And that's assuming there's even a way to hook unit tests into your framework.

I'm not a huge fan of Pandas — it's way too prone to silent failure. I've written wrappers around e.g. read_csv which are there to defeat all the magic. But at least I can do that with Python code instead of being stuck with the big-ball-of-SQL (e.g. some complicated view SELECT statement that does a bunch of JOINs, CASE tests and casts).
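
For what it's worth, the wrapper is nothing clever; mostly just refusing to let read_csv guess (the column spec here is illustrative):

    import pandas as pd

    def strict_read_csv(path: str, dtypes: dict) -> pd.DataFrame:
        """read_csv with the magic turned off: no type guessing, no silent NA coercion."""
        return pd.read_csv(
            path,
            dtype=dtypes,            # every column's type is declared up front
            usecols=list(dtypes),    # only the declared columns; error if one is missing
            keep_default_na=False,   # don't silently turn "NA"/"N/A" strings into NaN
            na_values=[""],          # only truly empty cells count as missing
        )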


> A predilection for SQL, yielding "get the answer right once" big-ball-of-SQL solutions which are infeasible to debug or modify without causing regressions.

Yea, thankfully some frameworks have Python / Scala APIs that let you abstract "SQL logic" into programmatic functions that can be chained and reused independently to avoid the big-ball-of-SQL problem. The best ones also allow for SQL because that's the best way to express some logic.
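
In PySpark, for instance, the pattern looks like this (the functions and the orders_df input are invented); each step is a plain function you can test on its own instead of one giant SELECT:

    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F

    def with_order_total(df: DataFrame) -> DataFrame:
        return df.withColumn("total", F.col("quantity") * F.col("unit_price"))

    def only_completed(df: DataFrame) -> DataFrame:
        return df.filter(F.col("status") == "completed")

    # DataFrame.transform chains the pieces without a big ball of SQL.
    result = orders_df.transform(only_completed).transform(with_order_total)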

> Poor support for unit testing.

I've written pandas, PySpark, Scala Spark, and Dask testing libs. Not sure which framework you're referring to.

> Poor support for version control.

The execution platform should make it easy to package up code in JAR files / Python wheel files and attach them to your cluster, so you can version control the business logic. If not, yea, that's a huge limitation.

> Frameworks over libraries (because the vendors want to lock you in)

Not sure what you're referring to, but interested in learning more.


They were talking about the "modern data stack" no doubt.

The trend has been to shift as much work as possible to the current generation of Data Warehouses, which abstract the programming model that Spark on columnar storage provided behind a SQL-only interface, reducing the space where you'd use Spark.

It makes writing data pipelines very accessible using dbt (which outcompeted Dataform, though the latter is still kicking), but then you don't have the richer programming facilities, stricter type systems, tooling and practices of Python or Scala programming. You're in the world of SQL, set back a decade or two in testing, checking, and the culture of using them, and with few tools to organize your code.

That is, if the team has resisted the siren songs of the myriad cloud low-code platforms for this or that, which offer even fewer facilities to keep the data pipelines under control, whether we call control any of: access, versioning, monitoring, data quality, anything really.


> stricter type systems ... the practices of Python or Scala

I do understand what you are talking about. But I really think you and the OP are both complaining about the wrong problem.

SQL doesn't require bad practices, doesn't inherently harm composability (in the way the OP was describing), and doesn't inherently harm verification. Instead, it has stronger support for many of those than the languages you want to replace it with.

The problems you are talking about are very real. But they do not come from the language. (SQL does bring a few problems by itself, but they are much more subtle than those.)


At least BigQuery does a fair bit of typechecking, and gives error messages in a way that's on par with application programming (e.g. not letting you pass a timestamp to a DATE function and stating that there's no matching signature).

But a tool that doesn't "require" bad practices but doesn't require good practices either makes your work harder in the long run.

Tooling is poor. The best IDE-similes you got until recently were of the type that connects to a live environment but doesn't tie into your codebase, and encourages you to put your code directly in the database rather than in version control: the problems of developing with a REPL, with little in the way of mitigating them. I'm talking, of course, of the problem of having view and function definitions live in the database with no tools to statically navigate the code.

Testing used to be completely hand rolled if anyone bothered with it at all.

That was until now, that data pipeline orchestration tools exist and let you navigate the pipeline as a dependency graph, a marked improvement, but until dbt's Python version is ready for production, we're talking here of a graph of Jinja templates and YAML definitions, with modest support for unit testing.

Dataform is a bit better but virtually unknown and was greatly hindered by the Google acquisition.

Functions have always been clunky and still are.

RDDs and then, to a lesser extent, Dataframes offered a much stronger programming model, but they were still subject to a lack of programming discipline from data engineers in many shops. The results of that, however, are on a different scale than with undisciplined SQL programming, and it's downright hard to be disciplined when using SQL.

The trend to move from ETL to ELT I feel shouldn't have been unquestioningly transitioned to untyped Dataframes and then SQL.


Could you elaborate on SQL's problems?


The language itself has some issues, like no attention at all paid to modularity and reusability; the old-language distinction between functions and data; lack of expressivity in the type system (well, not compared with Python and Scala, but just ADTs would already bring a huge gain); and the complex limitations on symbol-literal vs. evaluated use that force people into metaprogramming every time they need to decide on a table at runtime.

The first one would limit the use of widespread best-practices, but in practice it's not the bottleneck, because every SQL-based tooling already creates strictly more constraining issues.


Those critiques make sense to me, thanks!


Let me offer the more blunt materialist analysis: Data engineers are being deskilled into data analysts and too blinded by shiny cloud advertisements to notice.

(In this view though, "lack of tests" or whatever is the least concern - until someone figures out how to spin up another expensive cloud tool selling "testable queries".)


The "data engineer" became a distinct role to bring over Software Engineering practices to data processing; such as those practices are, they were a marked improvement over their absence.

Building a bridge from one shore to the other with application programming languages and data processing tools that worked much closer to other forms of programming was a huge part of that push.

Of course, the big data tools were intricate machines that were easy to learn and very hard to master, and data engineers had to be pretty sophisticated.

So, it became cheaper to move much of that apparatus to data warehouses and, as you said, commoditize the building of data pipelines that way.

Software is as widespread as it is today because in every generation the highly skilled priestly classes that were needed to get the job done were displaced by people with less training enabled by new tools or hardware; else it'd be all rocket simulations done by PhD physicists still.

But the technical debt will be hefty from this shift.


At the end of the day, it's about providing value to businesses. If the same value can be provided with less intensive skillsets and more efficiently, this is a good thing.


FYI

> write data pipelines then using dbt (which outcompeted Dataform, though the latter is still kicking), but then you don't have the richer programming facilities, stricter type systems, tooling and the practices of Python or Scala programming, you're in the world of SQL...

Recently announced and limited to only a handful of data platforms, but dbt now supports python models.

https://docs.getdbt.com/docs/building-a-dbt-project/building...


> The trend has been to shift as much work possible to the current generation of Data Warehouses, that abstract the programming model that Spark on columnar storage provided with only a SQL interface, reducing the space where you'd use Spark.

I feel like there are some data professionals that only want to use SQL. Other data professionals only want to use Python. I feel like the trend is to provide users with interfaces that let them be productive. I could be misreading the trend of course.


It's very unclear to me that anyone is more productive under these new tooling stacks. I'm certain they're not more productive commensurately with new costs and long-term risks.


Let me see if I can at least justify some of these things:

- SQL became the de facto standard for data manipulation a few decades ago, and I’ve yet to see a worthy contender. While the language itself shows its age, the concept of it being (almost) totally declarative of “what” data to get and “what” to do with it, instead of “how”, is unbeatable. A better syntax for achieving this is certainly possible, but you won’t beat decades of SQL code being generated easily. About “get the answer right once”, for analysis it’s often all that’s needed, and good SW maintenance doesn’t apply that hard (it just won’t be needed again). For actual repeatable data engineering, I’d suggest either using generated SQL or writing in some abstraction (like PySpark). But don’t underestimate the productivity of SQL and the ability of modern parsers;

- Unit testing: while I generally agree, after 28 years, I have to say that unit testing is not as useful in data engineering. Essentially, it doesn’t matter how much time and effort you put into designing test cases, you’ll never beat the chaotic creativity of real users making real systems ingest all kinds of wrong data in patterns you’d never imagine. Now multiply that by having to cross-reference data from systems that were designed independently, and think of all the possible data errors that might come out of that. You won’t; you won’t cover even 20% of the ways things can go wrong. So, instead of “testing for all ways things go wrong”, a much better pattern is “write code that can capture unexpected errors and make reasonable decisions on whether to continue or abort - with some nice messaging and data debugging, please” (see the sketch at the end of this comment);

- Poor support for version control: I 100% agree with you. DataOps is still not nearly as popular as it should be;

- Frameworks over libraries: let me tell you one thing - data is coupled. Uncoupled data has less value than coupled data, because it can’t answer more complex questions. Having different libraries that represent, manipulate and expose data in different ways will just make things much harder for the DE trying to generate the data. And the focus is handling the data, not the code. So, while you may have a bit of a point, I’d suggest that you’re looking at this from the wrong perspective;

- Low-code: it’s one of those things that come and go. When I started, I’d use Oracle’s PL/SQL and C. Then came the first generation of visual ETL tools (Informatica, Datastage, etc) and they looked great with the graphic flows. Until you actually had to use them, and the cost of all the clicking just to get to a point where you could actually add some logic showed. Then came the Hadoop era, and data engineering split into the DW people (who prefer visual, low-code to this day) and the Hadoop people, who went back to coding. This split stays to this day, with each side not recognising what the other does as essentially the same thing. Personally, I prefer code, but I don’t care what others prefer. The visual vs. coding discussion is, in my opinion, the wrong problem. The real issue is being able to create logic that’s clear enough that something as dumb as a computer will unambiguously understand. STEM learning environments for kids show that this is totally possible in a visual way - the thing is, most companies choose the visual tool in hopes that a person without the right mindset will generate good logic with it. That’s what fails. The rest is a matter of preferred UX - relevant, but not core or one-sided.
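
(Re the unit-testing point above: a rough sketch of what I mean by capturing unexpected errors and deciding whether to continue or abort; all names invented.)

    import logging

    log = logging.getLogger("pipeline")

    class RecoverableRecordError(Exception):
        """Raised by transform() for the kinds of bad data we've decided to tolerate."""

    def ingest_batch(records, transform):
        """Process what we can, quarantine the weird, abort loudly on the truly unexpected."""
        good, quarantined = [], []
        for record in records:
            try:
                good.append(transform(record))
            except RecoverableRecordError as err:
                quarantined.append((record, str(err)))  # continue, but keep the evidence
            except Exception:
                log.exception("unrecoverable error on record %r", record)
                raise  # abort with enough context to debug
        return good, quarantined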


Please stick to plain text - formatting gimmicks are distracting, not the convention here, and tend to start an arms race of copycats.

I've put hyphens there now. Your comment is otherwise fine of course.


Out of curiosity, could you give or point to an example of "formatting gimmicks"?

I tried searching the FAQ but only found the formatdoc guide: https://news.ycombinator.com/formatdoc


OT...

I used one of the Unicode bullets instead of a hyphen (◉). I understand what dang is getting at here — it's in the same spirit as stripping emojis (because otherwise we'd be drowning in them). I'm a past graphic designer so I'm accustomed to working with the medium to achieve maximum communicative impact, but I don't mind operating within constraints.

The convention of using hyphens to start a bulleted list isn't officially documented AFAIK. Having to separate bulleted list items with double newlines is a little weird, but it's fine.


Thanks for the thoughtful response.

You don't have to use hyphens for that. Asterisks are fine (see https://news.ycombinator.com/formatdoc for how to get a *) and no doubt there are other ways within the plaintext convention.


It’s fascinating that the conclusion / forecast is that tools will abstract engineering problems and DE will move closer to the business, while over the last 20 years the exact opposite has happened: the toolset has actually become harder (not easier) to use but orders of magnitude more powerful, and DE has moved closer to engineering, to the point where a good data engineer basically is a specialized software engineer.

The absolute pinnacle of “easy to use” was probably the Informatica / Oracle stack of the late 90’s and early 00’s. It just wasn’t powerful or scalable enough to meet the needs of the Big Data shift

Of course I guess this makes sense given the author works for a company with a vested interest in reversing that trend.


I think those tools were easy to use within the zeitgeist of the time. Even advanced versions of those tools would struggle against the data needs of today, which have become incredibly bespoke. My skill set extends all the way from my actual industry (finance) to the boundary of software development. I also have data, big data and cluster usage skills (Slurm etc). I don’t use everything every day and obviously I cannot be a specialist in most of this stuff (I concentrate on finance more than anything else) considering the incredible range, but this is just the past 2 years for me.

I cannot imagine a less specialized future looking around today where some nice tool does 80% of my work. Not because the work I do is difficult to automate. But because the work I do won’t match the work other industries may do (beyond existing generalizations of pandas, regression toolkits and other low level stuff). There’s no point building a full automation suite just for my single work profile which itself will differ from other areas of finance.


DEs at my company actually spend all their time making easy-to-use interfaces on top of these tools (GitHub workflows, IaC/CI-CD solutions, Dev/SecOps features, etc.) so that DW/BI/DS developers, or even analysts in some cases, can create new pipelines quite easily.


That sounds like infra ops, not data engineering. It's a common theme, though, for DEs to slip into maintaining infra and analysts taking on the role of engineering data pipelines.


As someone who knows nothing about this stuff, I'm looking at the "Data Mart" wiki page: https://en.wikipedia.org/wiki/Data_mart. Ok, so the entire diagram here is labelled "Data Warehouse", and within that there's a "Data Warehouse" block which seems to be solely comprised of a "Data Vault". Do you need a special data key to get into the data vault in the data warehouse? Okay, naturally the data marts are divided into normal marts and strategic marts - seems smart. But all the arrows between everything are labelled "ETL". Seems redundant. What does it mean anyway? Ok apparently it's just... moving data.

Now I look at https://en.wikipedia.org/wiki/Online_analytical_processing. What's that? First sentence: "is an approach to answer multi-dimensional analytical (MDA)". I click through to https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA "is a data analysis process that groups data into two categories: data dimensions and measurements". What the fuck? Who wrote this? Alright, back on the OLAP wiki page... "The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space." Ah yes, the intersections... why not keep math out of it if you have no idea how to talk about it? Also, there's no actual mention of why this is considered "online" in the first place. I feel like I'm in a nightmare where the pandas documentation was rewritten in MBA-speak.


As someone who just arrived in data world, I feel your pain.

You didn't even mention the data lake and the data warehouse are set for merger into the data lakehouse. Not to mention where data mesh and data fabric fit into all of this.

It's hard for me to say why this all seems so much more confusing than the software dev world. My guess is because data is a thing that a business accumulates and processes, often as a side channel to its actual work. There's an inherent meta-ness to it, and both the business and tech people have had a hand in shaping the approaches. So it's kind of a mess, and for whatever reason, even more susceptible to buzzwordery than the rest of tech.


It's a difficult sphere of knowledge to penetrate. All of that is perfectly coherent to me, FWIW.

From first principles, I can highly recommend Ralph Kimball's primer, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling.[1]

[1]https://www.amazon.com/gp/product/B00DRZX6XS


Thank you. As a data engineer, that's exactly how I feel about the various concepts surrounding data engineering, or at the very least the way they're being explained.


I’ve been working in this field for 28 years, and it can be quite confusing for a newcomer. Let me see if I can help.

First, the name Data Warehouse is overloaded. It can refer to the entire architecture (often called “Data Warehousing architecture”) or the central data store that (ideally) contains all the corporate data in a format suitable to generate datasets/schemas for specific needs (the data marts, if that need is BI/reporting/dashboards/data slicing and dicing). The other common component of a DW architecture is the staging area, where you land data from the source systems. So, datawise, a DW architecture has 3 layers:

- Staging: where you land your data from the sources. You may have some temporary storage (some choose to just keep everything, just in case) of data, and a few control tables, but it will usually look very much like the databases of each source;

- Data Warehouse: Bill Inmon defines it as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”. Essentially, reconcile as much as you can of the distinct domains of each source system (ideally, full reconciliation, but it’s almost impossible to get to 100%), keep as much history as you can afford and then apply (a little or a lot, depending on your preferred data modelling approach) denormalization to make it able to support BI tools. If you denormalize a lot, you may end up not needing data marts, but you’ll be applying area-specific needs to what should be a common basis for all areas. Common data modelling approaches for a Data Warehouse are normalised (3NF), detailed star schemas (search for Ralph Kimball if you don’t know) or, a bit more recently, Data Vault;

- Data Mart: an analytical database to support a specific business need - sales reporting, marketing campaigns, financial reporting, employee satisfaction; all these applications could have their own data marts. Each data mart should be fed from the Data Warehouse, applying heavy denormalization and calculating specific metrics and reporting structures (i.e., dimensions) to meet that use case’s needs. The most popular data modelling technique for data marts would be dimensional modelling (AKA star schemas). The names come from the use of two main types of tables - facts and dimensions. A dimension could be a business entity, like departments, customers, products, employees, etc.; a fact is a metric (like sales volume, sales $$$, satisfaction, stock) over time and a number of the aforementioned dimensions. When drawn, the fact is in the middle and the dimensions around it, looking a bit like a star, hence the name (see the sketch below).
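
To make the fact/dimension split concrete, here's a minimal sketch of a star schema in SQL (all table and column names are made up for illustration):

    -- Dimension tables: the business entities you filter and group by.
    CREATE TABLE dim_date    (date_key INT PRIMARY KEY, calendar_date DATE, month INT, year INT);
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_key INT PRIMARY KEY, store_name TEXT, region TEXT);

    -- Fact table: one row per sale, with foreign keys to each dimension
    -- plus the numeric measures (metrics) themselves.
    CREATE TABLE fact_sales (
      date_key     INT REFERENCES dim_date(date_key),
      product_key  INT REFERENCES dim_product(product_key),
      store_key    INT REFERENCES dim_store(store_key),
      units_sold   INT,
      sales_amount DECIMAL(12,2)
    );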

Analytical processing is nothing more than wading through this data searching for answers to business questions, interesting patterns, outliers or just confirmation that everything is as it should be. It usually means applying filters on the dimension tables, joining with the fact tables, aggregating over hierarchies and possibly time, and watching how the resulting time series of metrics behave - the basic job of a data analyst. BI tools expect data to be structured as stars, and will help the analyst do this slicing and dicing over the data, generating the required database queries, possibly doing some part of the computation and then the visualisation rendering. The names “multidimensional” and “hypercube” come from the data being structured across several “dimensions”, as I explained above. Some BI tools will even have their own compute/data storage engine, optimised for handling this kind of data. Usually this is called an “OLAP engine” or a “multidimensional database”. It’s database-like functionality, optimised for filtering, aggregating and generally wading through a large chunk of data. When loaded into a specialised database like this, a “star” is usually referred to as a “cube” or “hypercube”.
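
To illustrate the slicing and dicing, a typical analytical query against the hypothetical star above (names assumed from that sketch) filters on dimensions, joins to the fact and aggregates over time:

    -- Monthly sales $$$ by region for one product category.
    SELECT d.year, d.month, s.region, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date    d ON f.date_key    = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store   s ON f.store_key   = s.store_key
    WHERE p.category = 'Electronics'
    GROUP BY d.year, d.month, s.region
    ORDER BY d.year, d.month;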

And finally, about the “online” mention. All the analysis above is supposed to be ad hoc, interactive and close to instant, so that the analyst’s train of thought (think of it as each question generating additional questions to the data) is not interrupted. The term “online analytical processing” (OLAP) was coined to refer to this, in contrast to the pre-existing term “online transactional processing” (OLTP), which is what most modern systems do - send a large number of small transactions to be processed by a database online (as opposed to batch). OLAP sends a moderate number of very complex queries to be processed by the database online (as opposed to batch).

I hope that made it clearer.


For those of you who are genuinely curious why this field has so many similarly-named roles, here's a sincere, non-snarky, non-ironic explanation:

A Data Analyst is distinct from a Systems Analyst or a Business Analyst. They may perform both systems and business analysis tasks, but their distinction comes from their understanding of statistics and how they apply that to other forms of analysis.

A ML specialist is not a Data Scientist. Did you successfully build and deploy an ML model to production? Great! That's still uncommon, despite the hype. However, that would land you in the former position. You can claim the latter once you've walked that model through the scientific method, complete with hypothesis verification and democratization of your methodology.

A BI Engineer and a Data Engineer are going to overlap a lot, but the former is going to lean more towards report development, where the latter will spend more time with ELTs/ETLs. As a data engineer, most of the report development that I do is to report on the state of data pipelines. BI BI, I like to call it.

A Big Data Engineer or Specialist is a subset of data engineers and architects angled towards the problems of big data. This distinction actually matters now, because I'm encountering data professionals these days who have never worked outside the cloud or with small enterprise datasets (unthinkable only half a decade ago).

It doesn't help that a lack of understanding often leads to mislabeled positions, but anybody who has spent time in this field gets used to the subtle differences quickly.


This strikes me as incredibly rosy; I want to live in this world, but I don't. The world I live in:

- Data Analyst: someone who knows some SQL but not enough programming, so we can pay < 6 figures

- ML specialist: someone who figured out DS is a race to the bottom and ML in a title gets you paid more. Spends most of their time installing pytorch in various places

- BI Engineer: Data Analyst but paid a bit more

- Data Engineer: Airflow babysitter

- Big Data Engineer: middle-aged Scala user, Hadoop babysitter


In my experience, I have started to believe ML Engineer is short for "YAML Engineer".


Out of all the snark in this thread, this is the only bit to elicit a chuckle from me. Thank you.


What about Analytics Engineer, the hypiest-of-the-hyped right now?


BI engineer that knows dbt


A data pipeline person who does their work in SQL, rather than batch/stream processing tools.

It's not hype, it's just a role built around a different data architecture. It's less powerful than the old big data toolkit, but it's also probably perfectly suitable for many businesses.
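
For a concrete feel: in the dbt world, a "model" is typically just a SELECT statement saved as a file, with dbt handling materialization, dependencies and tests around it. A minimal, hypothetical example (table and column names invented):

    -- models/orders_enriched.sql
    -- {{ ref(...) }} is resolved by dbt to the upstream model's table/view.
    SELECT
        o.order_id,
        o.ordered_at,
        c.customer_segment,
        o.amount_usd
    FROM {{ ref('stg_orders') }}    AS o
    JOIN {{ ref('stg_customers') }} AS c
      ON o.customer_id = c.customer_id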


Business Analyst, Big Data Specialist, Data Mining Engineer, Data Scientist, Data Engineer.

Why is this field so prone to hype and repeating the same things with a new coat of paint? I mean, whatever happened to OLAP, data cubes, Big Data, and every other super big next thing of the past 2 decades?

Methinks the problem with Business Intelligence solving problems is the first part of the term and not the second.


I think the really interesting point is slapping the title of engineer/scientist onto anything and everything, regardless of the accreditation actually handed out. Soon coming up: "cafeteria engineer", "janitorial engineer"...


The difference of course being that other types of engineers have to take a PE. The idea of requiring a PE to have that title is protectionism, no different than limiting the number of graduating doctors to keep salaries high. No one will ask a software engineer to build a bridge - relax, your protection racket is safe. Software engineer is a title conferred on someone who builds systems. It is fitting. And, if we're being honest, the average "job threat level" of a so-called "real" engineer is about the same as a software engineer these days anyway. With the exception of some niche jobs, every engineer I know is just a CADIA/SW/etc jockey and the real work is gatekept by greybeards.

No one will call someone a cafeteria engineer or janitorial engineer. The premise is ridiculous. There is a title called "operations engineer" that uses math to optimize processes. Does this one bother you too?


I assume you're a microservice plumber :)


I like that title better to be honest.


Woah, woah, woah. Cool it, buddy.

That’s already begun.


Because data "is the new oil" and those titles are considered cost jobs not revenue jobs.

source: https://www.kalzumeus.com/2011/10/28/dont-call-yourself-a-pr...


> Why is this field so prone to hype and repeating the same things with a new coat of paint.

Money and Marketing. It's no different from how Hadoop was a big deal around 2010, or how Functional Programming became the new thing from 2015 onwards.

Personally I think this is a failure of regulatory agencies.


I dunno, I have to first put my data somewhere though. But where... In a warehouse? Silo? Data lake? Lake house? (I really despise that last one, who could coin that phrase with a straight face?)


Data warehouse: bundles compute & storage and comes at a comparatively high price point. Great option for certain workflows. Not as great for scaling & non-SQL workflows.

Data lake: typically refers to Parquet files / CSV files in some storage system (cloud or HDFS). Data lakes are better for non-SQL workflows compared to data warehouses, but have a number of disadvantages.

Lakehouse storage formats: Based on open-source file formats, these solve a number of data lake limitations. Options include Delta Lake, Iceberg, and Hudi. Lakehouse storage formats offer a ton of advantages and basically no downsides compared to plain Parquet tables, for example.

Lakehouse architecture: An architectural paradigm for storing data in a way such that it's easily accessible for SQL-based and non-SQL-based workflows; see the paper: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
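
As a rough sketch of what that difference looks like in practice (engine choice, paths and table names here are assumptions, not prescriptions): a data lake query often reads files directly, while a lakehouse table adds a transactional table layer on top of the same open files.

    -- Data lake: query raw Parquet files directly (DuckDB-style syntax).
    SELECT country, COUNT(*) AS events
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY country;

    -- Lakehouse: the same data managed as a Delta Lake table (Spark SQL),
    -- gaining ACID transactions, schema enforcement and time travel.
    CREATE TABLE events USING DELTA LOCATION 's3://my-bucket/delta/events';
    SELECT country, COUNT(*) AS events FROM events GROUP BY country;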

There are a variety of tradeoffs to be weighed when selecting the optimal solution for your particular needs.


If this is satire it's brilliant. I don't doubt it's factual, but the last sentence is a slayer.


Can you explain why you find the above explanation amusing? I honestly don't see the absurdity of it, although my livelihood may depend on me not seeing it :)


It seems like you are just being negative towards a reply that was meant to genuinely clarify confusing terminology.

If not, please elaborate on why you doubt it's factual.


IMHO,

"data lake" = collection of all data sources: HDFS, S3, SQL DB, etc

"lake house" = tool to query and combine results from all such sources: Dremio


But what lake house is complete without a boat?

That's why my company is looking for investors who are interested in being at the forefront of the data revolution, using our data rowboat that will allow you to proactively leverage your data synergies to break down organizational data silos and use analytics to address your core competencies in order to leverage a strategic advantage to become the platform of choice in a holistic environment.

Tell me if this sounds familiar: your company has tons of data but it is spread out all over the place and you can't seem to get good info; you end up hounding engineers to get your reports and provide you information so you can look like you are making data-driven decisions. Maybe you've implemented a data lake but now have no idea how to use it. We've got you covered with our patent-pending data rowboat solution.

This will allow you to impress everyone else in the mid-level staff meetings by allowing you to say you are doing something around the "data revolution" in your org. The best part is that every implementation will come with a team of our in-house consultants that will allow the project to drag on forever, so that you always have something to report on in staff meetings and look good to your higher-ups.

Now, you may be an engineer looking to revolutionize your career and get involved in the next step of the glorious October data revolution. Well, we've got you covered: for a very reasonable price you can enroll in our "data rowboat boot camp", where you will spend hours locked in a room while someone who barely speaks English reads documentation to you.

But act quickly, otherwise you'll end up as one of the data kulaks as the new data rowboat revolution proceeds into a glorious future with our 5-year plan.


Brb, running to trademark every nautical data metaphor I can get my hands on.

What happens when your data rowboat runs ashore? Introducing Data Tugboat™, your single pane of glass solution for shoring up undocumented ETLs and reeling your data lineage safely into harbor.


Need to run ML on your data? Try our DeepSea data drilling rigs, delivered in containers!


Sir, I'm sorry, but a rowboat just won't scale, my needs are too vast. What I'm proposing is the next level of data extraction. You've heard of data mining? Well meet the aquatic equivalent, the Data Trawler. To find out more, contact our solution consultants today!


'Tis a field riddled with yachts... but where are all the customers' yachts?


I live in the world of data lakes and elaborate pipelines. Now and again I get to use a traditional star schema data warehouse and … it is an absolute pleasure to use in contrast to modern data access patterns.


> Titles and responsibilities will also morph, potentially deeming the “data engineer” term obsolete in favor of more specialized and specific titles.

"analytics engineer" is mentioned but also just had its first conference at dbt's conference. all the talks are already up https://coalesce.getdbt.com/agenda/keynote-the-end-of-the-ro...


Just to clarify, last week was dbt’s first _in person_ conference. Third overall.


And in my opinion, it was an incredible time, and had great content.


I won't be surprised if DE ends up just falling under the "software engineering" umbrella as the jobs grow closer together. With hybrid OLAP/OLTP databases becoming more popular, the skillset delta is definitely smaller than it used to be. Data Engineers are higher leverage assets to an organization than they ever have been before.


I think it's mostly already there, but your big, enterprise houses were late getting the memo. About 12 years ago, I switched to a DE role/title and held it ever since. I worked in a variety of startups doing DE - moving data from over here to over there, with a variety of tools from orchestration frameworks to homegrown code in a variety of languages.

About six years ago, I walked into a local hospital to interview for a DE role and it was very clear that their definition of DE was different than mine. The whole dept worked in nothing but SQL. I thought I was good with SQL, but they absolutely crucified me on SQL and data architecture theories. I ended up getting kicked over to a software engineering role, doing DE in another capacity, which made more sense for me.

Only now I'm hearing that they're migrating to other tools like dbt and requiring their DEs to learn programming languages.


Well my understanding is that a Data Engineer is basically just a DevOps engineer but instead of building infra to run applications they build infra to process, sanitize and normalize data.


Author here - Of course, data engineering involves building infra and being knowledgeable about DevOps practices, but that’s not the only area data engineers should be familiar with. There are many, many more! In my personal experience, sometimes we end up not using DevOps best practices because we are spread too thin. That’s why I believe in specialization within data engineering and the emergence of titles like “data reliability engineer”.


Imho that is absolutely not doing the role justice. For some people that may hold true, but I would expect a data engineer to know everything about distributed systems, database indexes, how different databases work and why you pick them, partitioning, replication, transactions/locking. These are topics a software engineer is typically familiar with. A DevOps engineer wouldn't be.


Or to denormalize data, a distinction the data engineer would be the most familiar with, in terms of both why and how.


I don't think that's true, in general, just because I think software engineers will largely be focused on creating end user experiences and custom systems, whereas data engineers will be focused on analytics, reporting, ML, and business operations.

There's definitely some overlap, where companies are producing fundamentally data driven user experiences, such as user-visible analytics or live recommendations. But that's niche.


Yeah, maybe this will happen. Where I work (FAANG), I know that DEs get lower compensation than SWEs and SREs.


Funny, in my FAANG it can be the reverse :)


Good readable historical overview, IMO.

The referenced "The Rise of the Data Engineer" article proved quite prescient.

The present overview seems less good on future predictions (it's hard!). Also, some mention of "data science" could have been apropos – though it's difficult to find consensus on what DS actually entails.

Overall, I think a good framework to navigate this space is to think of 2 overarching disciplines:

1) Engineering: scaling data & software.

2) Science: getting actionable insights from data combined with subject matter/___domain expertise

Usually, #1 is a platform for #2 – but getting #1 good seems just as, if not more, important, and is perhaps harder.

EDIT: formatting


I would describe myself as a dataframe engineer.


Have you seen what is possible with Elixir and its Broadway library? You can set up a fault tolerant, concurrent worker pool utilizing all of the necessary feedback mechanisms involved with message processing.


A full history of DE should include some of the original low-code tools (Cognos, Informatica, SSIS). To some extent, the failure of these tools to adapt to the evolution of the DE role has led to our modern data stack.


Agreed. This is the first thing I thought about - the evolution from reporting systems to ETL code to Hadoop to Spark, etc.


Informatica and SSIS, in practice, in my experience, are not really low-code tools.

I'd call them... GUI-based coding tools?



