We've also been running airflow for the past 2-3 years at a similar scale (~5000 dags, 100k+ task executions daily) for our data platform. We weren't aware of a great alternative when we started. Our DAGs are all config-driven, populating a few different templates (e.g. ingestion = ingest > validate > publish > scrub PII > publish), so we really don't need all the flexibility that airflow provides. We have had SO many headaches operating airflow over the years, and each time we invest in fixing an issue I feel more and more entrenched. We've hit scaling issues at the k8s level, scheduling overhead in airflow, random race conditions deep in the airflow code, etc. Considering we have a pretty simple DAG structure, I wish we had gone with a simpler, more robust/scalable solution (even if just rolling our own scheduler) for our specific needs.
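For readers who haven't seen the pattern, here's a rough sketch of what config-driven DAG templating can look like in Airflow 2.x; the dataset names, config shape, and steps are hypothetical, not the commenter's actual setup.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical config; in practice this would be loaded from YAML/JSON.
    DATASETS = {
        "orders": {"schedule": "@hourly"},
        "customers": {"schedule": "@daily"},
    }

    for name, cfg in DATASETS.items():
        with DAG(
            dag_id=f"ingest_{name}",
            start_date=datetime(2022, 1, 1),
            schedule_interval=cfg["schedule"],
            catchup=False,
        ) as dag:
            # Same template for every dataset: ingest > validate > publish
            ingest = BashOperator(task_id="ingest", bash_command=f"echo ingest {name}")
            validate = BashOperator(task_id="validate", bash_command=f"echo validate {name}")
            publish = BashOperator(task_id="publish", bash_command=f"echo publish {name}")
            ingest >> validate >> publish

        # Airflow only picks up DAG objects at module scope, so register each one
        globals()[f"ingest_{name}_dag"] = dag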
Upgrades have been an absolute nightmare and so disruptive. The scalability improvements in airflow 2 were a boon for our runtimes since before we would often have 5-15 minutes of overhead between task scheduling, but man it was a bear of an upgrade. We've since tried multiple times to upgrade past the 2.0 release and hit issues every time, so we are just done with it. We'll stay at 2.0 until we eventually move off airflow altogether.
I stood up a prefect deployment for a hackathon and I found that it solved a ton of the issues with airflow (sane deployment options, not the insane file-based polling that airflow does). That was about a year ago; I haven't heard a lot about it lately, and I wonder if anyone has had success with it at scale.
If your team is comfortable writing in pure python and you're familiar with the concept of a makefile, you might find Luigi a much lighter and less opinionated alternative for workflows.
Luigi doesn't force you into using a central orchestrator for executing and tracking the workflows. Tracking and updating task state is handled through open-ended methods that are left to the programmer to fill in.
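A minimal sketch of what that looks like (task and file names here are hypothetical): each task declares its own upstream dependencies and infers its "done" state from the existence of its output target, make-style, and you can run the whole graph with the local scheduler, no central daemon required.

    import datetime

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            # Luigi treats the existence of this target as "task already done"
            return luigi.LocalTarget(f"raw_{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("raw rows go here\n")

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            # Dependencies are declared per task, not in a central DAG file
            return Extract(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"clean_{self.date}.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())

    if __name__ == "__main__":
        # No central orchestrator: the local scheduler resolves and runs the graph
        luigi.build([Transform(date=datetime.date.today())], local_scheduler=True)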
It's probably geared toward more expert programmers who work close to the metal and don't care about GUIs as much as about a high degree of control and flexibility.
It's one of those frameworks where the code that is not written is sort of a killer feature in itself. But definitely not for everyone.
Really interesting to see a bioinformatics tool be proposed. I've worked in bioinformatics for over 20 years, written several workflow systems for execution on compute clusters, used several other people's, and been underwhelmed by most. I was hoping that Airflow might be better, since it was written by real software engineers rather than people who do systems design as a means to their ends, but Airflow was completely underwhelming.
The other orchestrator besides Toil to check out is Cromwell, but that uses WDL instead of Python for defining the DAG, and WDL is not a super powerful language, even if it meets the needs of 99% of uses and does exactly the right sort of environment containment.
I'm also hugely underwhelmed by k8s and Mesos and all those "cloud" allocation schemes. I think that a big, dynamically sized Slurm cluster would probably serve a lot of people far better.
I did a proof of concept in luigi pretty early on and really liked it. Our main concerns were that we would have needed to bolt on a lot of extra functionality to make it easy to re-run workflows or specific steps in the workflows when necessary (manual intervention is unavoidable IME). The fact that airflow also had a functional UI out of the box made it hard to justify luigi when we were just getting off the ground.
Very similar experience to yours. Adopted Airflow about 3 years ago. Was aware of Prefect but it seemed a bit immature at the time. Checked back in on it recently and they were approaching alpha for what looked like a pretty substantial rewrite (now in beta). Maybe once the dust has settled from that I'll give it another look.
The creator of Prefect was an early major Airflow committer. Anyone know what motivated the substantial rewrite of Prefect? I had assumed the original version of Prefect was already supposed to fix some design issues in Airflow?
I'm a heavy Prefect user and was also very confused about the initial rewrite, even after reading several summaries. My best advice is to just try using 2.0 (Orion). Here's how I'd summarize the difference:
Prefect 1.0 feels like a second-gen Airflow: less boilerplate, easy dynamic DAGs, better execution defaults, great local dev, etc. It's more sane, but you still feel the impedance mismatch of working with an orchestrator.
Prefect 2.0 is a first-principles rewrite that removes most of the friction from interacting with an orchestrator in the first place. Finally, your code can breathe.
Yes, the original Prefect stack was written to address issues in Airflow.
In Prefect 1.x the DAG was built using task decorators inside a flow context, which was pretty cool and worked well, but they moved to generating the DAG directly from code execution in Orion.
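To make that concrete, here's a minimal 2.x (Orion) style sketch with made-up function names: the flow is just a decorated function, and the graph is discovered as the Python runs rather than being declared inside a "with Flow(...)" block.

    from prefect import flow, task

    @task(retries=2)
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    @flow
    def etl():
        # No Flow context manager: the graph is built as this function executes
        load(extract())

    if __name__ == "__main__":
        etl()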
Prefect is very cleanly written, well designed, and flexible. IMHO it is a platform that will be the next big thing in this area.
How do I know? I deployed Prefect as a static config gathering system across 4000 servers, both Linux and Windows. No other software stack came close, because one of the core concepts of Prefect is 'expect to fail'. Things like Ansible Tower die really quickly on large clusters due to the normal number of failures and the incorrect assumption that most things will work (as you can assume with a small cluster).
I wish I got to use it in my current work but there is no use case. Yet.
I had many thousands of machines. I needed to collect disk size, RAM, software inventory, and some custom config, if present. Some machines are Linux, some Windows.
With prefect I created a task 'collect machine details for windows', another 'collect machine details for Linux', another 'collect software inventory'.
I have a list of machines in a database so I create a task to get them. That task is an sqlalchemy query so I can pass the task a filter.
I get a list of linux machines and pass that to a task to run.
I get a list of windows machines and pass that to a task.
Note that the above don't depend on each other.
I have a task that filters good results from bad.
I have another task that writes a list to a database.
Other tasks have credentials.
Another task writes errors to an error table; the machines that failed get filtered from the results and fed into this task.
I plumb the above together in a Prefect flow, and it builds a DAG and runs it.
Everything that can be run in parallel does so, everything that has some other input waits for the input.
Tasks that fail can be retried by Prefect automatically. Intermediate results are cached. And I get a nice GUI for everything; I can even schedule it from the GUI.
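Roughly, that flow might look like the sketch below (written in Prefect 2.x style with hypothetical task bodies; the real collection steps were remote queries against each host and a sqlalchemy query against the machine database rather than these stubs):

    from prefect import flow, task

    @task(retries=2)
    def get_machines(os_name):
        # Stand-in for the sqlalchemy query against the machine database, filtered by OS
        inventory = {"linux": ["lnx-01", "lnx-02"], "windows": ["win-01"]}
        return inventory[os_name]

    @task
    def collect_details(host):
        # Stand-in for the real per-host collection; failures come back as data
        try:
            return {"host": host, "ram_gb": 64, "disk_gb": 500}
        except Exception as exc:
            return {"host": host, "error": str(exc)}

    @task
    def merge(a, b):
        return list(a) + list(b)

    @task
    def write_results(rows):
        print("good:", [r for r in rows if "error" not in r])

    @task
    def write_errors(rows):
        print("failed:", [r for r in rows if "error" in r])

    @flow
    def collect_inventory():
        # The two machine-list queries don't depend on each other, so they run concurrently
        linux = get_machines.submit("linux")
        windows = get_machines.submit("windows")

        # Fan out one collection task per host; anything parallelizable runs in parallel
        linux_details = collect_details.map(linux)
        windows_details = collect_details.map(windows)

        everything = merge(linux_details, windows_details)
        write_results(everything)
        write_errors(everything)

    if __name__ == "__main__":
        collect_inventory()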
It's a good question. I believe airflow was probably the right choice at the time we started. We were a small team, and deploying airflow was a major shortcut that more or less handled orchestration so we could focus on other problems. With the aid of hindsight, we would have been better off spinning off our own scheduler some time in the first year of the project. Like I mentioned in my OP, we have a set of well-defined workflows that are just templatized for different jobs. A custom-built orchestration system that could perform those steps in sequence and trigger downstream workflows would not be that complicated. But this is how software engineering goes, sometimes you take on tech debt and it can be hard to know when it's time to pay it off. We did eventually get to a stable steady state, but with lots of hair pulling along the way.
Can dbt run arbitrary code? If it can, it's not well advertised in the documentation. Every time I've looked into dbt, I found that it's mostly a scheduled SQL runner.
The primary reason we run Airflow is that it can execute Python code natively, or other programs via Bash. It's very rare that a DAG I write is entirely SQL-based.
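For context, a minimal sketch of that mix in an Airflow 2.x DAG (dag_id, task names, and commands are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def transform():
        # Arbitrary Python runs here, not just SQL
        print("crunching numbers in plain Python")

    with DAG(
        dag_id="python_plus_bash",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        crunch = PythonOperator(task_id="transform", python_callable=transform)
        export = BashOperator(task_id="export", bash_command="echo 'any other program via bash'")
        crunch >> export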
You're right. I think the strength of dbt is in the T part of ELT. I wrote ELT to draw a distinction in principle from the traditional ETL. (E)xtract and (L)oad make up the data ingestion phase, which would probably be better served by Dagster, where you could use Python.
(T)ransform is decoupled and would be handled with set-based operations managed by dbt.