I'm no GPU programmer, but it seems easy to use even for someone like me. I pulled together a quick demo of using the GPU vs the CPU, based on what I could find (https://gist.github.com/victorb/452a55dbcf59b3cbf84efd8c3097...), which gave these results (after downloading 2.6 GB of dependencies, of course):
Creating 100 random matrices of size 5000x5000 on CPU...
Adding matrices using CPU...
CPU matrix addition completed in 0.6541 seconds
CPU result matrix shape: (5000, 5000)
Creating 100 random matrices of size 5000x5000 on GPU...
Adding matrices using GPU...
GPU matrix addition completed in 0.1480 seconds
GPU result matrix shape: (5000, 5000)
Definitely worth digging into more, as the API is really simple to use, at least for basic things like these. CUDA programming seems like a big chore without something higher level like this.
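For the curious, the core of that kind of comparison looks roughly like this with NumPy and CuPy (a sketch, not the exact gist):

import time
import numpy as np
import cupy as cp

n, size = 100, 5000  # matches the run above; ~20 GB per set in float64, so shrink to taste
cpu_matrices = [np.random.rand(size, size) for _ in range(n)]
gpu_matrices = [cp.random.rand(size, size) for _ in range(n)]

start = time.time()
cpu_result = sum(cpu_matrices)  # element-wise sum of all the matrices on the CPU
print(f"CPU matrix addition completed in {time.time() - start:.4f} seconds")

start = time.time()
gpu_result = sum(gpu_matrices)  # the same reduction, queued on the GPU
cp.cuda.get_current_stream().synchronize()  # wait for the GPU before stopping the clock
print(f"GPU matrix addition completed in {time.time() - start:.4f} seconds")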
CuPy has been available for years and has always worked great. The article is about the next wave of Python-oriented JIT toolchains, which will allow writing actual GPU kernels in a Pythonic style instead of calling an existing precompiled GEMM implementation in CuPy (as in that snippet), or instead of JIT-compiling CUDA C++ kernels from Python source, which has also been available for years: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
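For reference, the raw-kernel route from that link looks roughly like this (a minimal sketch; CuPy hands the CUDA C++ source to NVRTC and compiles it at runtime):

import cupy as cp

# CUDA C++ source, compiled at runtime the first time the kernel is used
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void add(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + y[i];
}
''', 'add')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)
add_kernel(((n + 255) // 256,), (256,), (x, y, out, cp.int32(n)))  # (grid, block, args)

The kernel body is still CUDA C++ here; the new toolchains the article is about aim to let you write that part in Python too.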
The mistake you seem to be making is confusing the existing product (which has been available for many years) with the upcoming new features for that product just announced at GTC, which are not addressed at all on the page for the existing product, but are addressed in the article about the GTC announcement.
> The mistake you seem to be making is confusing the existing product
i'm not making any such mistake - i'm just able to actually read and comprehend what i'm reading rather than perform hype:
> Over the last year, NVIDIA made CUDA Core, which Jones said is a “Pythonic reimagining of the CUDA runtime to be naturally and natively Python.”
so the article is about cuda-core, not whatever you think it's about - so i'm responding directly to what the article is about.
> CUDA Core has the execution flow of Python, which is fully in process and leans heavily into JIT compilation.
this is bullshit/hype about Python's new JIT which womp womp womp isn't all that great (yet). this has absolutely nothing to do with any other JIT e.g., the cutile kernel driver JIT (which also has absolutely nothing to do with what you think it does).
> i'm just able to actually read and comprehend what i'm reading rather than perform hype:
The evidence of that is lacking.
> so the article is about cuda-core, not whatever you think it's about
cuda.core (a relatively new, rapidly developing library whose entire API is experimental) is one of several things (NVMath is another) mentioned in the article, but the newer and as-yet-unreleased piece mentioned in the article and the GTC announcement, and a key part of the “Native Python” in the headline, is the CuTile model [0]:
“The new programming model, called CuTile interface, is being developed first for Pythonic CUDA with an extension for C++ CUDA coming later.”
> this is bullshit/hype about Python's new JIT
No, as is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code.
[0] The article has only a fairly vague qualitative description of what CuTile is, but (without having to watch the whole talk from GTC) one could look at this tweet for a preview of what Python code using the model is expected to look like when it is released: https://x.com/blelbach/status/1902113767066103949?t=uihk0M8V...
> No, as is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code.
my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes
> Support Python 3.13
> Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above)
> Add optional dependencies on CUDA NVRTC and nvJitLink wheels
do you understand what "bindings" and "optional dependencies on..." mean? it means there's nothing happening in this library and these are... just bindings to existing libraries. specifically that means you cannot jit python using this thing (except via the python 3.13 jit interpreter) and can only do what you've always already been able to do with e.g. cupy (compile and run C/C++ CUDA code).
> my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes
Those aren't the release notes for the native python thing being announced. CuTile has not been publicly released yet. Based on what the devs are saying on Twitter it probably won't be released before the SciPy 2025 conference in July.
JIT as an adjective means just-in-time, as opposed to AOT, ahead-of-time. What Nvidia discussed at GTC was a software stack that will enable you to generate new CUDA kernels dynamically at runtime using Python API calls. It is a just-in-time (runtime, dynamic) compiler system rather than an ahead-of-time (pre-runtime, static) compiler.
cuTile is basically Nvidia's competitor to Triton (no, not that Triton, OpenAI's Triton). It takes your Python code and generates kernels at runtime. CUTLASS has a new Python interface that does the same thing.
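cuTile's Python syntax isn't public yet, so as an analogy only, here is roughly what the "write Python, get a GPU kernel generated at runtime" model looks like in OpenAI's Triton today (a sketch, not cuTile code):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # each program instance handles one BLOCK-sized tile of the vectors
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)  # JIT-compiled to a GPU kernel on first call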
Isn't the main announcement of the article CuTile? Which has not been released yet.
Also, the cuda-core JIT stuff has nothing to do with Python's new JIT; it's referring to integrating nvJitLink with Python, which you can see an example of in cuda_core/examples/jit_lto_fractal.py
In case someone is looking for some performance examples & testimonials: even on an RTX 3090 vs a 64-core AMD Epyc/Threadripper, and even a couple of years ago, CuPy was a blast. I have a couple of recorded sessions with roughly identical slides/numbers:
- San Francisco Python meetup in 2023: https://youtu.be/L9ELuU3GeNc?si=TOp8lARr7rP4cYaw
- Yerevan PyData meetup in 2022: https://youtu.be/OxAKSVuW2Yk?si=5s_G0hm7FvFHXx0u
Among the more remarkable results:
- 1000x sorting speedup switching from NumPy to CuPy.
- 50x performance improvements switching from Pandas to CuDF on the New York Taxi Rides queries.
- 20x GEMM speedup switching from NumPy to CuPy.
CuGraph is also definitely worth checking out. At that time, Intel wasn't in as bad of a position as they are now and was trying to push Modin, but the difference in performance and quality of implementation was mind-boggling.
there is no release of cutile (yet). so the only substantive thing that the article can be describing is cuda-core - which it does describe and is a recent/new addition to the ecosystem.
man i can't fathom glazing a random blog this hard just because it's tangentially related to some other thing (NV GPUs) that clearly people only vaguely understand.
Curious what the timing would be if it included the memory transfer time, e.g.
import time
import numpy as np
import cupy as cp

matrices = [np.random.rand(5000, 5000) for _ in range(100)]  # host-side, matching the demo above

time_start = time.time()
cp_matrices = [cp.asarray(m) for m in matrices]  # host-to-device copies, now inside the timed region
gpu_result = sum(cp_matrices)                    # stand-in for the gist's add_matrices()
cp.cuda.get_current_stream().synchronize()       # wait for the queued GPU work to finish
time_end = time.time()
I don’t mean to call you or your pseudocode out specifically, but I see this sort of thing all the time, and I just want to put it out there:
PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap.
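With CuPy, for example, that looks something like this (a sketch, sizes arbitrary):

import cupy as cp

x = cp.random.rand(5000, 5000)
y = cp.random.rand(5000, 5000)

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()         # timestamped on the GPU stream, not the host clock
z = x + y
end.record()
end.synchronize()      # wait until the end event has actually happened
print(cp.cuda.get_elapsed_time(start, end), "ms")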
If I have a mostly-CPU code and I want to time the scenario "I have just a couple subroutines that I am willing to offload to the GPU," what's wrong with sprinkling my code with normal old Python timing calls?
If I don’t care what part of the CUDA ecosystem is taking time (from my point of view it is a black-box that does GEMMs) so why not measure “time until my normal code is running again?”
You can create metrics for whatever you want! Go ahead!
But CUDA is not a black-box math accelerator. You can stupidly treat it as such, but that doesn't make it one. It's an entire ecosystem with drivers and contexts and lifecycles. If everything you're doing is synchronous and/or you don't mind your metrics including totally unrelated costs, then time.time() is fine, sure. But if that's the case, you've got bigger problems.
Sure, it’s easy to say “there are bigger problems.” There are always bigger problems.
But there are like 50 years' worth of Fortran numerical codes out there, and lots of them just use RCIs (reverse communication interfaces)… if I want to try CUDA in some existing library, I guess I will need the vector back before I can go back into the RCI.
You're arguing with people who have no idea what they're talking about on a forum that is a circular "increase in acceleration" of a personality trait that gets co-opted into arguing incorrectly about everything - a trait that everyone else knows is defective.
I think it does?: (the comment is in the original source)
print("Adding matrices using GPU...")
start_time = time.time()
gpu_result = add_matrices(gpu_matrices)
cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
elapsed_time = time.time() - start_time
I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know?
When you call a CUDA method, it is launched asynchronously. That is, the function queues it up for execution on the GPU and returns.
So if you need to wait for an op to finish, you need to `synchronize` as shown above.
`get_current_stream`, because the queue mentioned above is actually called a stream in CUDA.
If you want to run many independent ops concurrently, you can use several streams.
Benchmarking is one use case for synchronize. Another would be if you let's say run two independent ops in different streams and need to combine their results.
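A rough sketch of that last case with CuPy streams (arbitrary ops, just to show the shape of it):

import cupy as cp

a = cp.random.rand(4096, 4096)
b = cp.random.rand(4096, 4096)

s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()
with s1:
    x = a @ a       # queued on stream 1
with s2:
    y = b @ b       # queued on stream 2, can overlap with stream 1
s1.synchronize()
s2.synchronize()    # both results are ready now
z = x + y           # safe to combine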
Btw, if you work with PyTorch: when ops are run on the GPU, they are launched in the background. If you want to bench torch models on the GPU, it also provides a sync API.
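E.g. a minimal timing sketch with torch (model and sizes made up):

import time
import torch

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

torch.cuda.synchronize()   # make sure setup/warm-up work is done
start = time.time()
y = model(x)               # launched asynchronously on the GPU
torch.cuda.synchronize()   # wait for the GPU before reading the clock
print(f"forward pass: {time.time() - start:.4f}s")

(torch.cuda.Event(enable_timing=True) is the event-based alternative mentioned upthread.)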
I've always thought it was weird that GPU stuff in Python doesn't use asyncio, and mostly assumed it was because Python-on-GPU predates asyncio. I was hoping a new lib like this might right that wrong, but it doesn't. Maybe for interop reasons?
Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize?
The reason is that the usage is completely different from coroutine-based async. With GPUs you want to queue _as many async operations as possible_ and only then synchronize. That is, you would have a program like this (pseudocode):
b = foo(a)
c = bar(b)
d = baz(c)
synchronize()
With coroutines/async await, something like this
b = await foo(a)
c = await bar(b)
d = await baz(c)
would synchronize after every step, being much more inefficient.
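In CuPy terms the first pattern is something like this (foo/bar/baz swapped for arbitrary array ops):

import cupy as cp

a = cp.random.rand(2048, 2048)

# each call only enqueues work on the current stream and returns immediately;
# the stream itself guarantees b -> c -> d run in order on the device
b = cp.matmul(a, a)
c = cp.tanh(b)
d = c.sum()

cp.cuda.get_current_stream().synchronize()  # one sync at the very end
print(float(d))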
It really depends on whether you're dealing with an async stream or a single async result as the input to the next function. If a is an access token needed to access resource b, you cannot access a and b at the same time. You have to serialize your operations.
Well, you can and should create multiple coroutines/tasks and then gather them. If you replace CUDA with network calls, it's exactly the same problem. Nothing to do with asyncio.
No, that's a different scenario. In the one I gave there's explicitly a dependency between requests. If you use gather, the network requests would be executed in parallel. If you have dependencies they're sequential by nature because later ones depend on values of former ones.
The 'trick' for CUDA is that you declare all this using buffers as inputs/outputs rather than values and that there's automatic ordering enforcement through CUDA's stream mechanism. Marrying that with the coroutine mechanism just doesn't really make sense.
Might have to look at specific lib implementations, but I'd guess that most GPU calls from Python are actually happening in C++ land, and internally a lib might be using synchronize calls where needed.