
I'm building a stream processing framework using DuckDB.

https://github.com/turbolytics/sql-flow

The goal is to create a stream processing framework that supports SQL jobs. Apache Flink supports this but is very heavyweight overall. I work with cost-constrained companies that just can't run Flink but still want access to high-performance streaming primitives.

We are taking a slightly different approach from competitors by building cloud-native tools for engineers. SQLFlow is built on DuckDB, a native Kafka library, and Arrow. This allows SQLFlow to handle 70k+ events/second with low memory overhead (~250MiB).
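For a rough idea of the core pattern (this is just an illustrative sketch, not SQLFlow's actual code; the table and column names are made up): DuckDB can run SQL directly over an Arrow batch, so records never get copied into a heavyweight runtime.

    import duckdb
    import pyarrow as pa

    # Pretend this batch just came off Kafka (hypothetical records).
    events = pa.Table.from_pylist([
        {"user_id": "a", "amount": 3},
        {"user_id": "a", "amount": 5},
        {"user_id": "b", "amount": 1},
    ])

    con = duckdb.connect()          # in-memory DuckDB
    con.register("events", events)  # view over the Arrow table, no copy into a row store
    out = con.execute(
        "SELECT user_id, sum(amount) AS total FROM events GROUP BY user_id"
    ).fetch_arrow_table()           # results come back as Arrow too
    print(out)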

Would love your questions, thoughts, feedback, or feature ideas! Thank you




Interesting project. I'm working on a side project that is sort of adjacent, but the use cases I'm targeting are fundamentally different from yours.

Professionally, I have some projects where we're doing custom CDC from Kafka -> DBMS. In that context, we're just using Spring Boot apps. How is the troubleshooting/recovery story for SQLFlow?


Very Cool! Which use cases are you targeting?

I'm surprised at the relatively sparse state of streaming. I feel like there are:

- Flink

- Spark

- ~Benthos

- Arroyo

- Then mostly custom / bespoke frameworks

Maybe there just isn't that much money in it? But I think there's still lots of opportunity to improve the Dev/UX over the JVM and the enterprise solutions.

-----

I would say recovery is achieved through Kafka consumer groups right now, which results in at-least-once processing semantics.
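Roughly the shape of it (illustrative sketch using confluent-kafka, not SQLFlow's internals; the broker, topic, and group names are made up): offsets are only committed after a message is processed, so a crash before the commit means the consumer group re-delivers those messages on restart.

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # assumed local broker
        "group.id": "sqlflow-pipeline",         # hypothetical consumer group
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["input-topic"])         # hypothetical topic

    def process(payload):
        ...  # stand-in for running the SQL pipeline over the message

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process(msg.value())
        consumer.commit(asynchronous=False)     # commit only after processing succeeds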

We also support websocket input for the Bluesky firehose, but that is completely ephemeral. I have a story in the backlog to write ahead to disk, which should allow tolerating a process crash/failure.

The tumbling windows store state using DuckDB. The end user can configure a disk-backed DuckDB database, which achieves durability as well.
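To illustrate the idea (hypothetical schema and file name, not SQLFlow's actual layout): window state lives in an ordinary DuckDB table, and pointing the connection at a file instead of an in-memory database makes that state survive a restart.

    import duckdb

    con = duckdb.connect("windows.db")  # disk-backed file -> state survives restarts
    con.execute("""
        CREATE TABLE IF NOT EXISTS window_counts (
            window_start TIMESTAMP,
            user_id      VARCHAR,
            n            BIGINT
        )
    """)

    # Fold one incoming event into its 60-second tumbling window.
    con.execute("""
        INSERT INTO window_counts
        SELECT time_bucket(INTERVAL '60 seconds', now()::TIMESTAMP), ?, 1
    """, ["user-a"])

    # When a window closes, read it out and flush it to the sink.
    closed = con.execute("""
        SELECT window_start, user_id, sum(n) AS n
        FROM window_counts
        WHERE window_start < time_bucket(INTERVAL '60 seconds', now()::TIMESTAMP)
        GROUP BY window_start, user_id
    """).fetchall()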

Troubleshooting was actually pretty funky; I didn't realize how tricky it could be with stateful stream processing. I certainly don't have a good story for it yet, but what I did was add a SQL debug HTTP handler. When this is enabled, it exposes the DuckDB execution context over HTTP. This is how I debugged the tumbling window logic during development: I would start SQLFlow, query to make sure the windows were empty, send some known messages, query the windows to make sure they were aggregated correctly, wait for the windows to close and flush, query the tumbling window state, etc.
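Something in the spirit of this (purely illustrative, not the actual handler): a tiny HTTP endpoint that runs ad-hoc queries against the same DuckDB database the pipeline writes to.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    import duckdb

    con = duckdb.connect("windows.db")  # same database the pipeline writes to

    class DebugHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /debug?sql=SELECT+*+FROM+window_counts
            qs = parse_qs(urlparse(self.path).query)
            sql = qs.get("sql", ["SELECT 42"])[0]
            rows = con.execute(sql).fetchall()
            body = json.dumps(rows, default=str).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", 8080), DebugHandler).serve_forever()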

For troubleshooting, it also exposes Prometheus metrics oriented towards stream processing: number of messages, processing duration, successes/failures, etc.
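The instrumentation looks roughly like this (prometheus_client; the metric names here are made up, not the actual ones):

    from prometheus_client import Counter, Histogram, start_http_server

    MESSAGES = Counter("pipeline_messages_total", "Messages consumed", ["status"])
    DURATION = Histogram("pipeline_processing_seconds", "Per-batch processing time")

    start_http_server(9090)  # scrape endpoint at :9090/metrics

    def run_sql(batch):
        ...  # stand-in for executing the pipeline SQL over the batch

    def handle(batch):
        with DURATION.time():
            try:
                run_sql(batch)
                MESSAGES.labels(status="success").inc(len(batch))
            except Exception:
                MESSAGES.labels(status="failure").inc(len(batch))
                raise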

SQLFlow also ships with a dev framework; this allows you to execute your SQLFlow pipeline against a JSON input file to make sure the SQL processing logic is correct. I wanted the ability to decouple testing the logic from actually having to stream data.
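The actual dev command is in the repo; the shape of the idea is just "run the pipeline's SQL against a local JSON file instead of a Kafka topic," something like this (hypothetical fixture path and expected output):

    import duckdb

    con = duckdb.connect()
    result = con.execute("""
        SELECT user_id, count(*) AS n
        FROM read_json_auto('fixtures/events.json')  -- hypothetical fixture path
        GROUP BY user_id
        ORDER BY user_id
    """).fetchall()
    assert result == [("a", 2), ("b", 1)]  # expected output for that fixture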

I would love to know more about what you're hacking on!


This may be outside your use case for SQLFlow, but in my professional work we have a lot of problems with malformed messages or weird data in the target DB. So only ~95% of messages succeed in effecting their updates. That leads to a lot of troubleshooting: what happened, what was the message, what was the data at the time we processed the message, what did we expect to happen, etc.



