
I'm building a stream processing framework using DuckDB.

https://github.com/turbolytics/sql-flow

The goal is to create a stream processing framework that supports SQL jobs. Apache Flink supports this but is very heavyweight overall. I work with cost-constrained companies that just can't run Flink but still want access to high-performance streaming primitives.

We are taking a slightly different approach from competitors by building cloud-native tools for engineers. SQLFlow is built on DuckDB, a native Kafka library, and Arrow. This allows SQLFlow to handle 70k+ events/second with low memory overhead (~250MiB).
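For a rough idea of the core pattern (this is just an illustrative sketch, not SQLFlow's actual code; the table and column names are made up): DuckDB can run SQL directly over an Arrow batch, so records never get copied into a heavyweight runtime.

    import duckdb
    import pyarrow as pa

    # Pretend this batch just came off Kafka (hypothetical records).
    events = pa.Table.from_pylist([
        {"user_id": "a", "amount": 3},
        {"user_id": "a", "amount": 5},
        {"user_id": "b", "amount": 1},
    ])

    con = duckdb.connect()          # in-memory DuckDB
    con.register("events", events)  # view over the Arrow table, no copy into a row store
    out = con.execute(
        "SELECT user_id, sum(amount) AS total FROM events GROUP BY user_id"
    ).fetch_arrow_table()           # results come back as Arrow too
    print(out)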

Would love your questions, thoughts, feedback, or feature ideas! Thank you




Interesting project. I'm working on a side project that is sort of adjacent, but the use cases I'm targeting are fundamentally different from yours.

Professionally, I have some projects where we're doing custom CDC from Kafka -> DBMS. In that context, we're just using Spring Boot apps. How is the troubleshooting/recovery story for SQLFlow?


Very Cool! Which use cases are you targeting?

I'm surprised at the relatively sparse state of streaming. I feel like there are:

- Flink

- Spark

- ~Benthos

- Arroyo

- Then mostly custom / bespoke frameworks

Maybe there just isn't that much money in it? But I think there's still lots of opportunity to improve the Dev/UX over the JVM and the enterprise solutions.

-----

I would say recovery is achieved through Kafka consumer groups right now, which results in at-least-once processing semantics.
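Roughly the shape of it (illustrative sketch using confluent-kafka, not SQLFlow's internals; the broker, topic, and group names are made up): offsets are only committed after a message is processed, so a crash before the commit means the consumer group re-delivers those messages on restart.

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # assumed local broker
        "group.id": "sqlflow-pipeline",         # hypothetical consumer group
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["input-topic"])         # hypothetical topic

    def process(payload):
        ...  # stand-in for running the SQL pipeline over the message

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process(msg.value())
        consumer.commit(asynchronous=False)     # commit only after processing succeeds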

We also support websocket input for the Bluesky firehose, but that is completely ephemeral. I have a story in the backlog to write ahead to disk, which should allow tolerating a process crash/failure.

The tumbling windows store state using DuckDB. The end user can configure a disk-backed DuckDB database, which achieves durability as well.
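To illustrate the idea (hypothetical schema and file name, not SQLFlow's actual layout): window state lives in an ordinary DuckDB table, and pointing the connection at a file instead of an in-memory database makes that state survive a restart.

    import duckdb

    con = duckdb.connect("windows.db")  # disk-backed file -> state survives restarts
    con.execute("""
        CREATE TABLE IF NOT EXISTS window_counts (
            window_start TIMESTAMP,
            user_id      VARCHAR,
            n            BIGINT
        )
    """)

    # Fold one incoming event into its 60-second tumbling window.
    con.execute("""
        INSERT INTO window_counts
        SELECT time_bucket(INTERVAL '60 seconds', now()::TIMESTAMP), ?, 1
    """, ["user-a"])

    # When a window closes, read it out and flush it to the sink.
    closed = con.execute("""
        SELECT window_start, user_id, sum(n) AS n
        FROM window_counts
        WHERE window_start < time_bucket(INTERVAL '60 seconds', now()::TIMESTAMP)
        GROUP BY window_start, user_id
    """).fetchall()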

Troubleshooting was actually pretty funky; I didn't realize how tricky it could be with stateful stream processing. I certainly don't have a good story for it yet, but what I did was add a SQL debug HTTP handler. When this is enabled, it exposes the DuckDB execution context over HTTP. This is how I debugged the tumbling window logic during development: I would start SQLFlow, query to make sure the windows were empty, send some known messages, query the windows to make sure they were aggregated correctly, wait for the windows to close and flush, query the tumbling window state, etc.
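Something in the spirit of this (purely illustrative, not the actual handler): a tiny HTTP endpoint that runs ad-hoc queries against the same DuckDB database the pipeline writes to.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    import duckdb

    con = duckdb.connect("windows.db")  # same database the pipeline writes to

    class DebugHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /debug?sql=SELECT+*+FROM+window_counts
            qs = parse_qs(urlparse(self.path).query)
            sql = qs.get("sql", ["SELECT 42"])[0]
            rows = con.execute(sql).fetchall()
            body = json.dumps(rows, default=str).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", 8080), DebugHandler).serve_forever()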

For troubleshooting, it also exposes Prometheus metrics oriented towards stream processing: number of messages, processing duration, successes/failures, etc.
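The instrumentation looks roughly like this (prometheus_client; the metric names here are made up, not the actual ones):

    from prometheus_client import Counter, Histogram, start_http_server

    MESSAGES = Counter("pipeline_messages_total", "Messages consumed", ["status"])
    DURATION = Histogram("pipeline_processing_seconds", "Per-batch processing time")

    start_http_server(9090)  # scrape endpoint at :9090/metrics

    def run_sql(batch):
        ...  # stand-in for executing the pipeline SQL over the batch

    def handle(batch):
        with DURATION.time():
            try:
                run_sql(batch)
                MESSAGES.labels(status="success").inc(len(batch))
            except Exception:
                MESSAGES.labels(status="failure").inc(len(batch))
                raise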

SQLFlow also ships with a dev framework; this allows you to execute your SQLFlow pipeline against a JSON input file to make sure the SQL processing logic is correct. I wanted the ability to decouple testing the logic from actually having to stream data.
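The actual dev command is in the repo; the shape of the idea is just "run the pipeline's SQL against a local JSON file instead of a Kafka topic," something like this (hypothetical fixture path and expected output):

    import duckdb

    con = duckdb.connect()
    result = con.execute("""
        SELECT user_id, count(*) AS n
        FROM read_json_auto('fixtures/events.json')  -- hypothetical fixture path
        GROUP BY user_id
        ORDER BY user_id
    """).fetchall()
    assert result == [("a", 2), ("b", 1)]  # expected output for that fixture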

I would love to know more about what you're hacking on!


This may be outside your use case for SQLFlow, but in my professional work we have a lot of problems with malformed messages or weird data in the target DB. So only ~95% of messages succeed in effecting their updates. That leads to a lot of troubleshooting: what happened, what was the message, what was the data at the time we processed the message, what did we expect to happen, etc.



