This will synchronously block until ‘chat.ask’ returns, though. Be prepared to pay for it: the memory of your whole app, tens to low hundreds of MB, held alive doing nothing (other than handling new chunks) until whatever streaming API this uses under the hood finishes streaming.
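A minimal sketch of the pattern being described, assuming a RubyLLM-style API (the setup and chunk fields are illustrative, not confirmed from the article):

    chat = RubyLLM.chat

    # ask does not return until the provider closes the stream; the
    # calling thread, and everything the request holds alive, waits
    # the whole time.
    chat.ask("Summarize this document") do |chunk|
      print chunk.content  # runs per chunk, still on the blocked caller
    end

    # control returns to the request handler only here, after the
    # full stream has ended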
Yes, it does. Ruby has a global interpreter lock (GIL, technically the GVL) that prevents multiple threads from executing Ruby code at the same time. So Puma does have threads; they just can’t run Ruby code simultaneously. They can hide IO, though, because the lock is released while a thread is blocked on an IO call.
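A quick illustration of that IO hiding, using only the standard library (URLs are placeholders): blocking IO releases the lock, so the waits overlap even though only one thread runs Ruby at any instant.

    require "net/http"

    urls = %w[https://example.com https://example.org https://example.net]

    threads = urls.map do |url|
      # the lock is released while each thread waits on its socket
      Thread.new { Net::HTTP.get(URI(url)) }
    end

    # total wall time is roughly one request's latency, not three
    bodies = threads.map(&:value)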
Concurrency support is missing both from the language syntax and, as a concept, from this particular library. This is by design, so as not to distract from beautiful code. Your request will make zero progress and hold memory while waiting for the LLM answer. Other threads might make progress on other requests, but in real-world deployments that will be a handful (<10), as the sketch below shows. This server will handle tens of requests per second where something written in JS or Go handles many thousands.
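To put numbers on “a handful”: a typical Puma config (all figures below are assumed, not from the article) caps concurrent streams at workers × threads, no matter how idle each one is.

    # config/puma.rb: sketch with assumed numbers
    workers 4      # separate processes, the only way past the GVL for CPU work
    threads 5, 5   # threads per process; each blocked LLM stream pins one

    # 4 workers x 5 threads = 20 concurrent streams, total.
    # If a stream takes ~30 s, that is 20 / 30 ≈ 0.7 streaming requests/sec
    # sustained, while an event-loop server parks thousands of idle streams
    # on a single thread for kilobytes each.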
It’s amazing how the Ruby community argues against its own docs and won’t acknowledge the design choices the language’s creators have made.