We've also been running airflow for the past 2-3 years at a similar scale (~5000 dags, 100k+ task executions daily) for our data platform. We weren't aware of a great alternative when we started. Our DAGs are all config-driven, populating a few different templates (e.g. ingestion = ingest > validate > publish > scrub PII > publish), so we really don't need all the flexibility that airflow provides. We have had SO many headaches operating airflow over the years, and each time we invest in fixing an issue I feel more and more entrenched. We've hit scaling issues at the k8s level, scheduling overhead in airflow, random race conditions deep in the airflow code, etc. Considering we have a pretty simple DAG structure, I wish we had gone with a simpler, more robust/scalable solution (even if just rolling our own scheduler) for our specific needs.
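For readers who haven't seen the pattern, here's a rough sketch of what config-driven DAG templating can look like in Airflow 2.x; the dataset names, config shape, and steps are hypothetical, not the commenter's actual setup.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical config; in practice this would be loaded from YAML/JSON.
    DATASETS = {
        "orders": {"schedule": "@hourly"},
        "customers": {"schedule": "@daily"},
    }

    for name, cfg in DATASETS.items():
        with DAG(
            dag_id=f"ingest_{name}",
            start_date=datetime(2022, 1, 1),
            schedule_interval=cfg["schedule"],
            catchup=False,
        ) as dag:
            # Same template for every dataset: ingest > validate > publish
            ingest = BashOperator(task_id="ingest", bash_command=f"echo ingest {name}")
            validate = BashOperator(task_id="validate", bash_command=f"echo validate {name}")
            publish = BashOperator(task_id="publish", bash_command=f"echo publish {name}")
            ingest >> validate >> publish

        # Airflow only picks up DAG objects at module scope, so register each one
        globals()[f"ingest_{name}_dag"] = dag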
Upgrades have been an absolute nightmare and so disruptive. The scalability improvements in airflow 2 were a boon for our runtimes since before we would often have 5-15 minutes of overhead between task scheduling, but man it was a bear of an upgrade. We've since tried multiple times to upgrade past the 2.0 release and hit issues every time, so we are just done with it. We'll stay at 2.0 until we eventually move off airflow altogether.
I stood up a prefect deployment for a hackathon and I found that it solved a ton of the issues with airflow (sane deployment options, not the insane file-based polling that airflow does). That was about a year ago; I haven't heard a lot about it lately, and I wonder if anyone has had success with it at scale.
If your team is comfortable writing in pure python and you're familiar with the concept of a makefile, you might find Luigi a much lighter and less opinionated alternative for workflows.
Luigi doesn't force you into using a central orchestrator for executing and tracking the workflows. Tracking and updating task state is handled through open-ended methods that are left to the programmer to fill in.
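A minimal sketch of what that looks like (task and file names here are hypothetical): each task declares its own upstream dependencies and infers its "done" state from the existence of its output target, make-style, and you can run the whole graph with the local scheduler, no central daemon required.

    import datetime

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            # Luigi treats the existence of this target as "task already done"
            return luigi.LocalTarget(f"raw_{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("raw rows go here\n")

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            # Dependencies are declared per task, not in a central DAG file
            return Extract(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"clean_{self.date}.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())

    if __name__ == "__main__":
        # No central orchestrator: the local scheduler resolves and runs the graph
        luigi.build([Transform(date=datetime.date.today())], local_scheduler=True)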
It's probably geared toward more expert programmers who work close to the metal and don't care about GUIs as much as about a high degree of control and flexibility.
It's one of those frameworks where the code that is not written is sort of a killer feature in itself. But definitely not for everyone.
Really interesting to see a bioinformatics tool be proposed. I've worked in bioinformatics for over 20 years, written several workflow systems for execution on compute clusters, used several other people's, and been underwhelmed by most. I was hoping that Airflow might be better, since it was written by real software engineers rather than people who do systems design as a means to their ends, but Airflow was completely underwhelming.
The other orchestrator besides Toil to check out is Cromwell, but that uses WDL instead of Python for defining the DAG, and WDL is not a super powerful language, even if it meets the needs of 99% of uses and does exactly the right sort of environment containment.
I'm also hugely underwhelmed by k8s and Mesos and all those "cloud" allocation schemes. I think that a big, dynamically sized Slurm cluster would probably serve a lot of people far better.
I did a proof of concept in luigi pretty early on and really liked it. Our main concerns were that we would have needed to bolt on a lot of extra functionality to make it easy to re-run workflows or specific steps in the workflows when necessary (manual intervention is unavoidable IME). The fact that airflow also had a functional UI out of the box made it hard to justify luigi when we were just getting off the ground.
Very similar experience to yours. Adopted Airflow about 3 years ago. Was aware of Prefect but it seemed a bit immature at the time. Checked back in on it recently and they were approaching alpha for what looked like a pretty substantial rewrite (now in beta). Maybe once the dust has settled from that I'll give it another look.
The creator of Prefect was an early major Airflow committer. Anyone know what motivated the substantial rewrite of Prefect? I had assumed the original version of Prefect was already supposed to fix some design issues in Airflow?
I'm a heavy Prefect user and was also very confused about the initial rewrite, even after reading several summaries. My best advice is to just try using 2.0 (Orion). Here's how I'd summarize the difference:
Prefect 1.0 feels like a second-gen Airflow: less boilerplate, easy dynamic DAGs, better execution defaults, great local dev, etc. It's more sane, but you still feel the impedance mismatch of working with an orchestrator.
Prefect 2.0 is a first-principles rewrite that removes most of the friction from interacting with an orchestrator in the first place. Finally, your code can breathe.
Yes, the original Prefect stack was written to address issues in Airflow.
In Prefect 1.x the DAG was built using task decorators inside a flow context, which was pretty cool and worked well, but they moved to generating the DAG directly from code execution in Orion.
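To make that concrete, here's a minimal 2.x (Orion) style sketch with made-up function names: the flow is just a decorated function, and the graph is discovered as the Python runs rather than being declared inside a "with Flow(...)" block.

    from prefect import flow, task

    @task(retries=2)
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    @flow
    def etl():
        # No Flow context manager: the graph is built as this function executes
        load(extract())

    if __name__ == "__main__":
        etl()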
Prefect is very cleanly written, well designed, and flexible. IMHO it is a platform that will be the next big thing in this area.
How do I know? I deployed Prefect as a static config gathering system across 4000 servers, both Linux and Windows. No other software stack came close, because one of the core concepts of Prefect is 'expect to fail'. Things like Ansible Tower die really quickly on large clusters due to the normal number of failures and the incorrect assumption that most things will work (as you can assume with a small cluster).
I wish I got to use it in my current work but there is no use case. Yet.
I had many thousands of machines. I needed to collect disk size, RAM, software inventory, and some custom config, if present. Some machines are Linux, some Windows.
With prefect I created a task 'collect machine details for windows', another 'collect machine details for Linux', another 'collect software inventory'.
I have a list of machines in a database so I create a task to get them. That task is an sqlalchemy query so I can pass the task a filter.
I get a list of linux machines and pass that to a task to run.
I get a list of windows machines and pass that to a task.
Note that the above don't depend on each other.
I have a task that filters good results from bad.
I have another task that writes a list to a database.
Other tasks have credentials.
Another task writes errors to an error table; the machines that failed get filtered from the results and fed into this task.
I plumb the above together in a Prefect flow, and it builds a DAG and runs it.
Everything that can be run in parallel does so, everything that has some other input waits for the input.
Tasks that fail can be retried by Prefect automatically. Intermediate results are cached. And I get a nice GUI for everything; I can even schedule it from the GUI.
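Roughly, that flow might look like the sketch below (written in Prefect 2.x style with hypothetical task bodies; the real collection steps were remote queries against each host and a sqlalchemy query against the machine database rather than these stubs):

    from prefect import flow, task

    @task(retries=2)
    def get_machines(os_name):
        # Stand-in for the sqlalchemy query against the machine database, filtered by OS
        inventory = {"linux": ["lnx-01", "lnx-02"], "windows": ["win-01"]}
        return inventory[os_name]

    @task
    def collect_details(host):
        # Stand-in for the real per-host collection; failures come back as data
        try:
            return {"host": host, "ram_gb": 64, "disk_gb": 500}
        except Exception as exc:
            return {"host": host, "error": str(exc)}

    @task
    def merge(a, b):
        return list(a) + list(b)

    @task
    def write_results(rows):
        print("good:", [r for r in rows if "error" not in r])

    @task
    def write_errors(rows):
        print("failed:", [r for r in rows if "error" in r])

    @flow
    def collect_inventory():
        # The two machine-list queries don't depend on each other, so they run concurrently
        linux = get_machines.submit("linux")
        windows = get_machines.submit("windows")

        # Fan out one collection task per host; anything parallelizable runs in parallel
        linux_details = collect_details.map(linux)
        windows_details = collect_details.map(windows)

        everything = merge(linux_details, windows_details)
        write_results(everything)
        write_errors(everything)

    if __name__ == "__main__":
        collect_inventory()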
It's a good question. I believe airflow was probably the right choice at the time we started. We were a small team, and deploying airflow was a major shortcut that more or less handled orchestration so we could focus on other problems. With the aid of hindsight, we would have been better off spinning off our own scheduler some time in the first year of the project. Like I mentioned in my OP, we have a set of well-defined workflows that are just templatized for different jobs. A custom-built orchestration system that could perform those steps in sequence and trigger downstream workflows would not be that complicated. But this is how software engineering goes, sometimes you take on tech debt and it can be hard to know when it's time to pay it off. We did eventually get to a stable steady state, but with lots of hair pulling along the way.
Can dbt run arbitrary code? If it can, it's not well advertised in the documentation. Every time I've looked into dbt, I found that it's mostly a scheduled SQL runner.
The primary reason we run Airflow is that it can execute Python code natively, or other programs via Bash. It's very rare that a DAG I write is entirely SQL-based.
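For context, a minimal sketch of that mix in an Airflow 2.x DAG (dag_id, task names, and commands are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def transform():
        # Arbitrary Python runs here, not just SQL
        print("crunching numbers in plain Python")

    with DAG(
        dag_id="python_plus_bash",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        crunch = PythonOperator(task_id="transform", python_callable=transform)
        export = BashOperator(task_id="export", bash_command="echo 'any other program via bash'")
        crunch >> export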
You're right. I think the strength of dbt is in the T part of ELT. I wrote ELT to draw a distinction in principle from the traditional ETL. (E)xtract and (L)oad make up the data ingestion phase, which would probably be better served by Dagster, where you could use Python.
(T)ransform is decoupled and would be handled with set-based operations managed by dbt.