Maintainable ETL Practices at Stitchfix (stitchfix.com)
99 points by rahimiali on June 1, 2019 | 19 comments



Huge +1 to leveraging SQL as much as possible during ETL.

We recently moved our data warehouse to Snowflake. Along the way, we adopted an ELT paradigm instead of ETL, where you load your raw data into the warehouse and then perform any transformations in the warehouse. A tool called dbt (https://www.getdbt.com/) has made this a joy. You write SQL select queries in SQL files, and dbt creates a DAG based on this SQL so it can efficiently create tables/views in parallel. dbt also comes with some nifty testing facilities to validate your data once it is transformed.
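For anyone who hasn't seen it, a rough sketch of what a dbt model file looks like (the model and column names here are made up, not from the article): each model is just a SELECT, and dbt builds the DAG from the ref() calls.

    -- models/orders_enriched.sql (hypothetical model name)
    -- dbt infers dependencies from ref() and materializes the result
    -- as a table or view, depending on the model's configuration.
    select
        o.order_id,
        o.ordered_at,
        c.customer_id,
        c.region
    from {{ ref('stg_orders') }} as o
    join {{ ref('stg_customers') }} as c
        on o.customer_id = c.customer_id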

This is a great article, thank you for writing it!


I am not a big fan of the term ELT.

Just because you are using a database as your workspace doesn't mean you load your data before transformation.

Most complex data warehouses have multiple layers of transformation, and several workspaces along the way.

There is always a final step where you load the clean data into your clean schema, which is used for querying the data.

ETL

    E = Fetch the data you want
    T = Make all necessary changes
    L = Create a clean copy

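In rough SQL terms (schema and table names invented for illustration), that reads something like:

    -- E: fetch the data you want into a staging workspace
    create table staging.orders_raw as
    select * from source.orders;

    -- T: make all necessary changes in a work schema
    create table work.orders as
    select order_id, cast(amount as decimal(12,2)) as amount
    from staging.orders_raw
    where order_id is not null;

    -- L: create a clean copy in the schema used for querying
    insert into clean.orders
    select * from work.orders;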

ETL can include the destruction of raw data between the T & L. ELT assumes it is retained. This is a legacy from days when storage was expensive and it often wasn't tenable to retain everything.


We held onto intermediate data when I worked in ETL 15 years ago. It's normal to save data at intermediate stages for unit testing and lookup tables. I agree with GP that this doesn't need a new term.


Multiple layers. You have:

1. The raw data layer, where data comes directly from business logs

2. The common data layer, where one copy of data serves multiple purposes

3. The application layer, where the final result is presented
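A small sketch of those layers in SQL (schema, table, and column names are invented): each layer builds on the one below it.

    -- 1. raw layer: data landed as-is from business logs
    --    e.g. raw.app_events(user_id, event_type, event_ts)

    -- 2. common data layer: one cleaned copy serving multiple purposes
    create view common.events as
    select user_id, lower(event_type) as event_type, event_ts
    from raw.app_events
    where user_id is not null;

    -- 3. application layer: the final result presented to one consumer
    create view app.daily_active_users as
    select cast(event_ts as date) as day, count(distinct user_id) as dau
    from common.events
    group by 1;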


DBT is fantastic; I very strongly recommend it.


DBT has been a game changer for us, such an awesome tool.


In discussions like this, I often wonder how there can be enough data to require a "big data" pipeline with things like Spark and Presto.

Stitch Fix seems to be one of those online services that send you sets of clothes that they think fit your style. That seems like a really narrow, low-data kind of industry. How much data can they possibly have? And why so big on the backend? In January 2018 they had 100 engineers. Presumably they're even larger now. Just for a service that sends out clothing.

Maybe I'm lacking in imagination or insight into what it takes to run a company like this. On the other hand, a single PostgreSQL instance can run complex ad-hoc queries, with CTEs and everything, on a single node involving millions, even billions, of rows.


Their service is high-touch. Stylists speak with clients constantly, so they record this information.

They also have mobile apps, they run product experiments, they source and sell clothing and manage inventory, they build and iterate on algorithmic approaches to recommend and design clothing (many of which help stylists and never reach the screen of an external client).

You can skim through their Algorithms blog for some more detail. I find them impressive in how they scale the impact of relatively few stylists to about 3M users.

https://www.kleinerperkins.com/files/INTERNET_TRENDS_REPORT_...


Spark is not a competitor to PostgreSQL.

It is a distributed compute engine that has a lot of capabilities, is rock solid, allows you to blend SQL with Python/R/Scala and can support ML use cases as your needs grow. You can easily store all of your data in PostgreSQL and run Spark on top.


At what point does it become “big data”? 10s of billions, 100s of billions? I would love to get a single PostgreSQL server to index all of my data in a timely manner. Heck, I’d be happy with a sharded scheme, if it worked. But, I don’t have time to learn the low-level details of a RDBMS and my system IO throughput to make that work. Spark works, it’s easy, and it’s relatively cheap. So, why make it more complicated?


A transactional schema usually isn't ideal for running analytics on. So you create a data warehouse that transforms your data into a schema that's more amenable to reporting. They probably also have a ton of marketing data that they integrate into the data warehouse to run analysis on.
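For a sense of what "more amenable to reporting" means in practice, here is a minimal star-schema-style sketch (all names invented): a wide fact table plus a dimension, so reports become one join and an aggregate instead of a walk through normalized transactional tables.

    -- dimension table: one row per customer
    create table dw.dim_customer (
        customer_key bigint,
        region       varchar(50)
    );

    -- fact table: one row per order item, keyed to dimensions
    create table dw.fct_order_items (
        order_item_id bigint,
        customer_key  bigint,        -- references dw.dim_customer
        order_date    date,
        quantity      integer,
        revenue       decimal(12,2)
    );

    -- a typical report is then a simple join + aggregate
    select c.region, sum(f.revenue) as revenue
    from dw.fct_order_items f
    join dw.dim_customer c on c.customer_key = f.customer_key
    group by c.region;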


I used to work for SF and managed one of the biggest tables in the DW. I won't say what it's used for, but to give you a clue how big it is: we added a few billion more rows to that table each day. I don't think a single PostgreSQL instance had any chance to survive that.


Too large a chunk of my life was burnt on the ETL from hell. Other than SQL, I can't see any other tool that would have helped.

I found the easiest way was just to dump tables of data into long varchars - even the supposedly numeric stuff, because often enough it wasn't - then scrub & filter it from there. (NB, if you take anything from this post, make it that).
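A rough sketch of that approach (table and column names invented, PostgreSQL-flavoured regex matching): land everything as text, then cast only the rows that actually parse.

    -- land every column as text, warts and all
    create table staging.payments_raw (
        payment_id varchar(100),
        amount     varchar(100),   -- "numeric" in theory, not in practice
        paid_on    varchar(100)
    );

    -- scrub & filter from there: keep only rows that parse cleanly
    create table work.payments as
    select
        cast(payment_id as bigint)    as payment_id,
        cast(amount as decimal(12,2)) as amount,
        cast(paid_on as date)         as paid_on
    from staging.payments_raw
    where payment_id ~ '^[0-9]+$'
      and amount     ~ '^-?[0-9]+(\.[0-9]+)?$'
      and paid_on    ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';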

There were multiple challenges, but one that was most painful was the sheer filth mixed into the data. Too often even a human couldn't work out what the heck was supposed to be in some field. Often we could, but it ate a lot of time.

It doesn't sound like your ETLs are anything like mine; you start with pretty clean data. You are lucky.


Yikes, I’m sorry. What were your application developers doing to persist this messy data?


It was an upgrade from a legacy system to its successor. I never used the legacy system so I don't know what checks they did. I suspect little to none; I guess they pretty much exposed the raw 'tables' to the user to insert whatever they wanted.

I wrote 'tables' because it wasn't a relational DB although the successor system was.

The amount of data was pretty minimal too, not much more than a dozen GB, but it ate up at least 4 man-years to convert. I still cringe thinking about it. Some parts were an interesting challenge but mostly just soul destroying grind.


Well, in the case of a system migration for consumer data, in many cases the developers may have moved on or even retired. In some cases operational data can have lived through pre-RDBMS systems, migrations to RDBMS, migrations to different system software from different vendors, ad-hoc imports of data from other systems, bulk zero-day currency conversions (to the Euro), different approaches to configuring obscure product codes, bugs, systems that didn't understand leap years, developers and systems who didn't understand normalisation or timezones or decimal data types or rounding errors or character sets, crashes, 'manual' rollbacks, and hacks to reset large batch jobs.

Of course the value to the business of much of this data is highly questionable, but this isn't always an argument you win.


“The primary benefit of SQL is that it’s understood by all data professionals: data scientists, data engineers, analytics engineers, data analysts, DB admins, and many business analysts.”

No, the primary benefit of using SQL is not SQL itself but the “free features” you get with the RDBMS you’re using: enforcement of primary keys, automatic data type conversions, free transactions and rollbacks! ... and the sheer gift that is set-based operations on large chunks of data. It is just plain stupid to code the transformation or insertion of millions of rows iteratively rather than to rely on the implicit rollback power of an INSERT statement.

Rule #1 in my projects: avoid getting sexy in code; use the RDBMS for all transformations post staging. Always stage to the RDBMS “as is” (data warts and all), then clean up with SQL and stored procedures. This way you can run SQL queries against the data in its original form and figure out the best cleanup and transformations needed.
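A minimal sketch of that set-based point (table names invented): one INSERT ... SELECT moves millions of rows, and if anything fails, the statement rolls back as a whole instead of leaving you half-loaded partway through a loop.

    -- set-based: transform and load in one atomic statement;
    -- a failure rolls the whole INSERT back, nothing partial to clean up.
    insert into clean.customers (customer_id, email, signed_up_on)
    select
        cast(customer_id as bigint),
        lower(trim(email)),
        cast(signed_up_on as date)
    from staging.customers_raw
    where customer_id is not null;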


One comment: in some databases CTEs are materialized as optimization fences, so indexes on the underlying tables can go unused (PostgreSQL before version 12 behaves this way). You may want to be careful with them in production code, and it's something to consider if you may need to port code between databases that treat CTEs differently.
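For the PostgreSQL-before-12 case, a small sketch of the difference (table and column names invented):

    -- On PostgreSQL < 12 the CTE is an optimization fence: the outer
    -- filter isn't pushed into it, so an index on orders(order_id)
    -- may go unused.
    with recent_orders as (
        select * from orders where ordered_at > current_date - 30
    )
    select * from recent_orders where order_id = 42;

    -- A plain subquery (or NOT MATERIALIZED on PostgreSQL 12+) lets the
    -- planner push the predicate down and use the index.
    select *
    from (
        select * from orders where ordered_at > current_date - 30
    ) recent_orders
    where order_id = 42;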



