What specific strategies does Hatchet employ to guarantee fault tolerance and enable durable execution? How does it handle partial failures in multi-step workflows?
Each task in Hatchet is backed by a workflow [1]. Workflows are sets of predefined steps, and every workflow run is persisted in PostgreSQL. If a worker dies or crashes midway through (i.e. stops heartbeating to the engine), we reassign its tasks (assuming they have retries left). We also track timeouts in the database, which means that if we miss a timeout, we simply pick it up and retry a little later. Like I mentioned in the post, we avoid some classes of faults just by relying on PostgreSQL and persisting each workflow run - you don't need distributed locks with expirations in Redis, for example, or have to worry about data loss if Redis OOMs. Our `ticker` service is basically its own worker that is assigned a lease for each step run.
We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.
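To make that concrete, here's a rough sketch of what the requeue-and-replay logic looks like conceptually (the table and column names are illustrative, not our actual schema):

```python
# Illustrative sketch only: the table/column names are made up, not Hatchet's schema.
import psycopg2

conn = psycopg2.connect("dbname=hatchet")

def requeue_stalled_step_runs():
    """Find RUNNING step runs whose timeout has passed and which still have
    retries left, and put them back on the queue with their original input."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE step_runs
            SET status = 'PENDING_ASSIGNMENT',
                retry_count = retry_count + 1,
                worker_id = NULL
            WHERE status = 'RUNNING'
              AND timeout_at < now()
              AND retry_count < max_retries
            RETURNING id, input
            """
        )
        # Each returned row gets re-dispatched to a healthy worker with the
        # same persisted input, so the step is simply replayed.
        return cur.fetchall()
```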
To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach: it makes things much simpler when you're running a single-step workflow or already know the execution path up front. You also avoid a class of workflow-versioning problems, since we can gracefully drain runs on an older workflow version even if it has a different execution path. And it's more natural to debug by looking at a DAG execution than by stepping through procedural logic.
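For a sense of what that looks like, here's a minimal sketch in the style of our Python SDK's decorators - treat the exact decorator arguments as illustrative rather than gospel:

```python
# Rough sketch of a DAG declared ahead of time, loosely based on the Hatchet
# Python SDK's decorator style; exact names/arguments may differ.
from hatchet_sdk import Hatchet

hatchet = Hatchet()

@hatchet.workflow(on_events=["docs:ingest"])
class IngestPipeline:
    @hatchet.step()
    def load_docs(self, context):
        return {"docs": ["a.txt", "b.txt"]}

    # The execution path (process runs after load_docs) is known up front,
    # so the engine can schedule and render the whole DAG ahead of time.
    @hatchet.step(parents=["load_docs"], timeout="60s")
    def process(self, context):
        docs = context.step_output("load_docs")["docs"]
        return {"processed": len(docs)}
```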
The clear tradeoff is that you can't try/catch the execution of a single task or collect a bunch of futures that you wait on later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept, which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.
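To give a sense of the shape we're imagining (again, purely hypothetical - none of these names exist today):

```python
# Hypothetical sketch of the not-yet-built procedural API; none of these
# names (workflow.run, WorkflowRunFailed) exist in Hatchet today.
import asyncio

async def handle_doc(doc_id: str):
    try:
        # Durably run a workflow and await its result, with a normal try/except.
        return await workflow.run("process-doc", input={"id": doc_id})
    except WorkflowRunFailed:
        return await workflow.run("process-doc-fallback", input={"id": doc_id})

async def handle_batch(doc_ids):
    # ...and gather a bunch of runs as ordinary awaitables.
    return await asyncio.gather(*(handle_doc(d) for d in doc_ids))
```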
There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].
I think the answer is no but just to be sure: are you able to trigger step executions programmatically from within a step, even if you can't await their results?
Related, but separate: can you trigger a variable number of task executions from one step? If the answer to the previous question is yes, then this would of course be trivial; if not, I'm wondering if you could, e.g., have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.
For instance, some of the examples involve a `load_docs` step, but all loaded docs seem to be passed to the next step execution in the DAG together, unless I'm just misunderstanding something. How could we tweak such an example to have a separate task execution per document loaded? The benefits of durable execution and of being able to resume an intensive workflow without repeating work are lessened if you can't naturally/easily control the size of the unit of work for task executions.
You can execute a new workflow programmatically - see [1] for an example. People have triggered, say, 50 child workflows from a parent step. As you've identified, the difficult part is the "collect" or "gather" step; we've had people hack around it by waiting for all the steps from a second workflow (and falling back to the list-events method to get status), but this isn't an approach I'd recommend and it's not well documented. And there's no circuit breaker.
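Roughly, the fan-out side looks something like this today (client method names are approximate - see the linked example for the real API):

```python
# Illustrative only: triggering N workflow runs from a parent step, then
# "gathering" by polling. Client method names are approximate, not a spec.
from hatchet_sdk import Hatchet

hatchet = Hatchet()

@hatchet.workflow(on_events=["docs:fanout"])
class FanOut:
    @hatchet.step()
    def load_docs(self, context):
        return {"docs": ["a.txt", "b.txt", "c.txt"]}

    @hatchet.step(parents=["load_docs"])
    def spawn_children(self, context):
        docs = context.step_output("load_docs")["docs"]
        # One child workflow run per document.
        runs = [
            hatchet.client.admin.run_workflow("ProcessDoc", {"doc": d})
            for d in docs
        ]
        # The awkward part is the gather: today that means polling run status
        # (or listing events) until every child finishes, which is the hack
        # described above, and there's no circuit breaker.
        return {"child_runs": runs}
```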
> I'm wondering if you could, e.g., have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.
Yeah, we were having a conversation about exactly this yesterday - there's probably a simple decorator we could add so that if a step returns an array and a child step depends on that parent, the child fans out when a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution, we'd prefer to support that.
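Something like this is the shape we have in mind (purely hypothetical - the `fanout` option doesn't exist today):

```python
# Purely hypothetical: the `fanout` argument and context.step_input() below
# don't exist in Hatchet today; this is just the shape of the idea.
from hatchet_sdk import Hatchet

hatchet = Hatchet()

@hatchet.workflow(on_events=["docs:ingest"])
class FanoutPipeline:
    @hatchet.step()
    def load_docs(self, context):
        return {"docs": ["a.txt", "b.txt", "c.txt"]}

    # If the parent returns an array under the `fanout` key, the dependent
    # step would run once per item, while the UI still renders a single DAG.
    @hatchet.step(parents=["load_docs"], fanout="docs")
    def process_doc(self, context):
        doc = context.step_input()  # a single item from the parent's list
        return {"processed": doc}
```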
The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide for each workflow execution. This is similar to AWS X-Ray.
As I mentioned, we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way that activities (and child workflows) are durable on Temporal.
What happens if a worker goes silent for longer than the heartbeat duration, then a new worker is spawned, then the original worker “comes back to life”? For example, because there was a network partition, or because the first worker’s host machine was sleeping, or even just that the first worker process was CPU starved?
The heartbeat interval (5s) is not the same as the inactivity threshold (60s): we only reassign after a worker has been silent for 60 seconds, which gives some buffer for unstable networks. Once someone asks, we'll expose these options and make them configurable.
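In pseudocode, the liveness check is something like this (only the two durations are real; the rest is illustrative):

```python
# Illustrative sketch of the liveness check; only the two durations come from
# the comment above, the rest is made up for clarity.
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=5)     # how often workers heartbeat
INACTIVITY_THRESHOLD = timedelta(seconds=60)  # how long before we reassign

def is_worker_inactive(last_heartbeat_at: datetime) -> bool:
    """A worker can miss several heartbeats (flaky network, GC pause, brief
    CPU starvation) without losing its work; only after a full minute of
    silence do we treat it as dead and reassign its step runs."""
    return datetime.now(timezone.utc) - last_heartbeat_at > INACTIVITY_THRESHOLD
```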
We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.
There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.
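On the worker side, the only real defense is cooperative cancellation - long-running step code has to check for the cancel signal between units of work instead of blocking the thread. A generic sketch (not our actual SDK API):

```python
# Generic cooperative-cancellation sketch; the cancel flag here is a plain
# threading.Event, not Hatchet's actual context API.
import threading

cancel_requested = threading.Event()  # set when a cancellation signal arrives

def crunch_numbers(chunks):
    results = []
    for chunk in chunks:
        # Check between units of work so a CPU-bound loop can actually stop;
        # a step that never checks will simply ignore the cancellation signal.
        if cancel_requested.is_set():
            raise RuntimeError("step cancelled")
        results.append(sum(chunk))  # stand-in for the real computation
    return results
```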
Cancellation signals are tricky: you of course can't be sure the remote end ever receives them. This turns into the Two Generals problem.
Yes, you need monitoring for this case. I work on scientific workloads which can completely consume CPU resources. This failure scenario is quite real.
Not all tasks are idempotent, but it sounds like a prudent user should try to design things that way, since your system has “at least once” execution of tasks, as opposed to “at most once.” Despite any marketing claims, “exactly once” is not generally possible.
Good docs on this point are important, as is configurability for cases when “at most once” is preferable.
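For what it's worth, the usual pattern is to key side effects on something stable like the workflow run id, so a retried step becomes a no-op - a sketch, with an illustrative schema:

```python
# Illustrative idempotency pattern: dedupe side effects on the run id so a
# retried ("at least once") step is a no-op the second time. Schema is made up.
import psycopg2

conn = psycopg2.connect("dbname=app")

def record_payment(workflow_run_id: str, amount_cents: int):
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO payments (idempotency_key, amount_cents)
            VALUES (%s, %s)
            ON CONFLICT (idempotency_key) DO NOTHING
            """,
            (workflow_run_id, amount_cents),
        )
```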