
When you call a CUDA method, it is launched asynchronously: the function queues the work for execution on the GPU and returns immediately.

So if you need to wait for an op to finish, you need to `synchronize` as shown above.

It's called `get_current_stream` because the queue mentioned above is actually called a stream in CUDA.

If you want to run many independent ops concurrently, you can use several streams.

Benchmarking is one use case for synchronize. Another would be if you, say, run two independent ops in different streams and need to combine their results.

Btw, if you work with PyTorch: when ops run on the GPU, they are launched in the background, and if you want to benchmark torch models on the GPU, it also provides a sync API.
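For example, a rough timing sketch with PyTorch (assuming a CUDA device is available; the matmul is just a stand-in for whatever op you want to measure):

  import time
  import torch

  x = torch.randn(4096, 4096, device="cuda")
  y = x @ x                       # warm-up launch, so setup cost doesn't skew the timing

  torch.cuda.synchronize()        # drain anything already queued before starting the clock
  start = time.perf_counter()
  for _ in range(10):
      y = x @ x                   # each call only enqueues a kernel and returns immediately
  torch.cuda.synchronize()        # wait for the queued kernels to actually finish
  print(f"10 matmuls: {time.perf_counter() - start:.4f}s")

Without the second synchronize you would mostly be measuring how fast Python can enqueue work, not how long the GPU takes to run it.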




I’ve always thought it was weird that GPU stuff in Python doesn’t use asyncio, and mostly assumed it was because Python-on-GPU predates asyncio. I was hoping a new lib like this might right that wrong, but it doesn’t. Maybe for interop reasons?

Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize?


The reason is that the usage is completely different from coroutine-based async. With GPUs you want to queue _as many async operations as possible_ and only then synchronize. That is, you would have a program like this (pseudocode):

  b = foo(a)
  c = bar(b)
  d = baz(c)
  synchronize()
With coroutines/async await, something like this

  b = await foo(a)
  c = await bar(b)
  d = await baz(c)
would synchronize after every step, which is much less efficient.
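In PyTorch terms the first pattern looks roughly like this (foo/bar/baz stand in for arbitrary GPU ops; the current CUDA stream already guarantees they run in order on the device):

  import torch

  a = torch.randn(1024, 1024, device="cuda")

  # Each line only enqueues a kernel on the current CUDA stream and returns;
  # the stream preserves the ordering b -> c -> d on the GPU.
  b = a @ a        # "foo"
  c = b.relu()     # "bar"
  d = c.sum()      # "baz"

  torch.cuda.synchronize()   # single sync point at the end
  print(d.item())            # (.item() would also force a sync when copying to the CPU)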


Pretty sure you want to do it the first way in all cases (not just with GPUs)!


It really depends on whether you're dealing with an async stream or a single async result as the input to the next function. If a is an access token needed to access resource b, you cannot fetch a and b at the same time; you have to serialize your operations.


Well, you can and should create multiple coroutines/tasks and then gather them. If you replace CUDA with network calls, it's exactly the same problem. Nothing to do with asyncio.
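For independent work that would look roughly like this sketch (fetch_user/fetch_orders are hypothetical stand-ins for real network calls):

  import asyncio

  async def fetch_user(uid):
      await asyncio.sleep(0.1)      # pretend network latency
      return {"id": uid}

  async def fetch_orders(uid):
      await asyncio.sleep(0.1)
      return ["order-1", "order-2"]

  async def main():
      # Start both coroutines, then wait for both at once -- one "sync point"
      # instead of awaiting each call back to back.
      user, orders = await asyncio.gather(fetch_user(42), fetch_orders(42))
      print(user, orders)

  asyncio.run(main())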


No, that's a different scenario. In the one I gave there is explicitly a dependency between requests. If you use gather, the network requests are executed in parallel. If you have dependencies, they're sequential by nature, because later ones depend on the values of earlier ones.

The 'trick' with CUDA is that you declare all this using buffers as inputs/outputs rather than values, and that ordering is enforced automatically through CUDA's stream mechanism. Marrying that with the coroutine mechanism just doesn't really make sense.
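A hedged PyTorch sketch of that stream-based ordering for the "two independent ops, then combine" case (the dependency is declared on-device via streams rather than with host-side waits):

  import torch

  default = torch.cuda.current_stream()
  s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
  x = torch.randn(2048, 2048, device="cuda")

  s1.wait_stream(default)          # don't start before x is actually ready
  s2.wait_stream(default)

  with torch.cuda.stream(s1):
      a = x @ x                    # independent op, enqueued on s1

  with torch.cuda.stream(s2):
      b = x.relu()                 # independent op, enqueued on s2

  # The combining op runs on the default stream, but only after s1 and s2 finish.
  default.wait_stream(s1)
  default.wait_stream(s2)
  c = a.sum() + b.sum()

  torch.cuda.synchronize()         # only needed because we read the result on the host
  print(c.item())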


You might have to look at specific lib implementations, but I'd guess that most GPU calls from Python actually happen in C++ land, and internally a lib might use synchronize calls where needed.


Thank you kindly!



