What kind of performance is achievable with some of the features that Vcc enables (true function calls, function pointers, goto), and what are some of the limitations?
On GPU, function calls are much more expensive than on CPU. Usually it seems to be worth inlining as much as possible. Implementing a stack for recursion on GPUs is also likely to have performance implications. The whole point of using a GPU is to obtain good performance, but I can see the argument for having these features in order to port CPU code to GPU code incrementally.
For function pointers, how does that work? Multiple different implementations of the function are needed to support different devices and the host, which limits what a single pointer can do.
EDIT: To answer my second question, neither function nor data pointers are portable between host and device since Vulkan doesn't support unified addressing.
Strictly speaking, a stack is only needed for reentrant calls. Function pointers can be implemented via defunctionalization in a whole-program context (i.e. there is a single 'eval' function taking a variant record as an argument, and each case in the variant record marks some function implementation to dispatch to and the corresponding arguments).
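To make the defunctionalization idea concrete, here is a minimal C++ sketch (the names Fn, Closure and eval are mine, purely for illustration, not anything Vcc actually emits): every function that could be called indirectly gets a tag, and a single dispatcher switches on it. A whole-program compiler targeting SPIR-V could perform the equivalent transformation automatically.

    #include <cstdio>

    // Each function that could be the target of a function pointer gets a tag.
    enum class Fn { Square, AddConst };

    // A "closure" is just the tag plus any captured data (a variant record).
    struct Closure {
        Fn tag;
        int captured; // used by AddConst, ignored by Square
    };

    // The single dispatcher: indirect calls become a switch over known targets.
    static int eval(Closure c, int x) {
        switch (c.tag) {
            case Fn::Square:   return x * x;
            case Fn::AddConst: return x + c.captured;
        }
        return 0; // unreachable
    }

    int main() {
        Closure fs[] = { {Fn::Square, 0}, {Fn::AddConst, 42} };
        for (Closure c : fs)
            std::printf("%d\n", eval(c, 3)); // prints 9, then 45
    }

No real call stack or indirect branch is needed; the cost is that every possible target has to be known at compile time.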
I have long sought a CUDA or HIP compiler that targets SPIR-V or DXIL, so that we can compile all those neural network kernels for almost all compute devices.
The requirements are:
1. extends C++17, so that template metaprogramming works and cub or cutlass/cute can be used
2. AOT
3. and no shitshow!
The only thing that comes close is Circle[1].
- OpenCL is a no-go, as it is purely C. It is also a shitshow, especially on Android devices: the vendor drivers are the main source of problems, and JIT compilation adds the rest.
- Vulkan+GLSL is a no-go. The degree of shittiness is on par with OpenCL, due to the drivers and the JIT compiler.
- Slang[2] has the potential, but its metaprogramming is not as strong as C++'s, so existing libraries cannot be used.
The above conclusions are drawn from my work on the OpenCL EP for onnxruntime. It is purely a nightmare to work with those drivers and JIT compilers. Hopefully Vcc can take compute shaders more seriously.
At a Vulkanised 2023 discussion round, Khronos admitted that they aren't going to improve GLSL any further and will, ironically, rely on Microsoft's HLSL work as the main shading language to go alongside Vulkan.
Maybe something else was discussed at Vulkanised 2024, but I doubt it.
There was some SYCL work to target Vulkan, but it seems to have been a paper attempt and fizzled out.
At the time I was developing the EP, the tooling was not as good as it is now. I had imagined a pipeline where HLSL compiles down to DXIL, goes through SPIRV-Cross, and then targets a wide variety of mobile devices via the OpenCL runtime. But those tools are more focused on the graphics side and cannot work with the kernel execution model, not to mention the structured control-flow requirements, so it was definitely not going to work. OpenGL does not fit the imagined pipeline either, because IIRC it cannot consume SPIR-V bytecode. Vulkan was so niche that it was discarded very early on. The final result with the OpenCL runtime and the CL language worked, but the drivers were a mess [facepalm]
> At a Vulkanised 2023 discussion round, Khronos admitted that they aren't going to improve GLSL any further and will, ironically, rely on Microsoft's HLSL work as the main shading language to go alongside Vulkan.
That sounds intriguing, but I haven't been able to find any references to it (I guess it was discussed in the panel, but the video of it is private). Do you have any references or more information about it?
It's still early days for Vcc; I outline the caveats on the landing page. While I'm confident the control-flow bits and whatnot will work robustly, there's a big open question when it comes to the fate of standard libraries: the likes of libstdc++ were not designed for this use case.
We'll be working hard on it all the way to Vulkanised; if you have some applications you can get up and running by then, feel free to get in touch.
I think the driver ecosystem for Vulkan is rather high-quality, but that's more my (biased!) opinion than something I have hard data on. The Mesa/NIR-based drivers in particular are very nice to work with!
Thoes "existing libraries" does not necessary mean stdc++, but some parallel primitive, and are essential to performance portability. For example, cub for scan and reduction, cutlass for dense linear algebra[1].
> I think the driver ecosystem for Vulkan is rather high-quality
Sorry, I meant OpenGL. At the time of evaluation, the market share of Vulkan on Android devices was too small, so it was ruled out at a very early stage. I'd assume the state of things has changed a lot since then.
It is really good to see more projects take a shot at compiling C++ to GPUs natively.
[1] cutlass itself is not portable, but the recently added cute is quite portable, as I evaluated. It provides a unified abstraction for hierarchical layout decomposition, along with copy and GEMM primitives.
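To give a flavour of what that abstraction looks like, here is a tiny plain-C++ sketch of the core idea, a (shape, stride) pair mapping logical coordinates to storage offsets. This is my own simplification for illustration, not cute's actual API; cute generalizes it to nested (hierarchical) shapes and strides.

    #include <cstdio>

    // A toy 2-D (shape, stride) layout: maps a logical coordinate to an offset.
    struct Layout2D {
        int shape[2];   // extent of each mode
        int stride[2];  // elements skipped per step in each mode
        int operator()(int i, int j) const { return i * stride[0] + j * stride[1]; }
    };

    int main() {
        // The same 4x8 logical tile, stored row-major or column-major:
        // only the strides change, code indexing through the layout does not.
        Layout2D row_major{{4, 8}, {8, 1}};
        Layout2D col_major{{4, 8}, {1, 4}};
        std::printf("row-major (2,3) -> %d\n", row_major(2, 3)); // 19
        std::printf("col-major (2,3) -> %d\n", col_major(2, 3)); // 14
    }

Copy and GEMM primitives are then written once against layouts like this and retargeted by swapping in hardware-specific shapes and strides.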
Edit: Never mind, I think I misunderstood the purpose of this project. I thought it was a CUDA competitor, but it seems like it is just a shading-language compiler for graphics.
See https://github.com/google/clspv for an OpenCL implementation on Vulkan Compute. There are plenty of quirks involved because the two standards use different varieties of SPIR-V ("kernels" vs. "shaders") and provide different guarantees (Vulkan Compute doesn't care much about numerical accuracy). The Mesa folks are also looking into this as part of their RustiCL (a modern OpenCL implementation) and Zink (implementing OpenGL and perhaps OpenCL itself on Vulkan) projects.
It compiles CUDA/HIP C++ to SPIR-V that can run on top of OpenCL or Level Zero. (It does require OpenCL's compute-flavored SPIR-V, rather than the graphics-flavored SPIR-V seen in OpenGL or Vulkan. I also think it requires some OpenCL extensions that are currently exclusive to Intel NEO, but those should, on paper, be coming to Mesa's rusticl implementation too.)
Not exactly. Both CL and GLSL can be AOT-compiled, but then the runtime is limited to some newer version and the market coverage becomes niche; those vendors are so lazy about updating drivers and fixing compiler bugs...
Intel's modern compilers (icx, icpx) are clang-based. There is an open-source version [1], and the closed-source version is built atop it with extra closed-source special sauce.
LLVM is a big part of it. GCC is more opaque (and monolithic) in comparison. Having an intermediate language and a modular architecture makes it easier to build specialized compilers.
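As a toy illustration of that modularity: with a handful of LLVM headers you can emit IR for a new frontend or a specialized compiler in a few dozen lines. The sketch below is written from memory against the C++ API, so treat the exact headers and signatures as approximate and check them against your LLVM version.

    // Emit LLVM IR for: int add(int a, int b) { return a + b; }
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Verifier.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module mod("demo", ctx);
        llvm::IRBuilder<> b(ctx);

        // declare i32 @add(i32, i32)
        auto *i32 = b.getInt32Ty();
        auto *fnTy = llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
        auto *fn = llvm::Function::Create(fnTy, llvm::Function::ExternalLinkage, "add", mod);

        // entry: %sum = add i32 %a, %b ; ret i32 %sum
        auto *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
        b.SetInsertPoint(entry);
        llvm::Value *sum = b.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
        b.CreateRet(sum);

        llvm::verifyFunction(*fn, &llvm::errs());
        mod.print(llvm::outs(), nullptr); // dump the textual IR
    }

Everything past this point (optimization pipeline, target selection, code generation) comes from reusable libraries, which is exactly what specialized GPU-oriented compilers build on.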
Another selling point is the Clang CFE, which amortizes the complexity of C++ parsing and semantic analysis, much to the chagrin of the Edison Design Group.
Meanwhile, EDG are the ones coming up with a usable reflection proposal for C++26, and since Apple and Google stepped away, not many of those profiting from Clang are that eager to contribute to the upstream frontend.
I was surprised to see they were even still active, much less leading the way on reflection. Hopefully they will manage to stick around.
Anyway, such a pessimistic view of Clang/LLVM is unwarranted IMO. I have yet to see any metrics that imply their abandonment. Also, considering that Google is likely closer to 300 million LOC than 200 million, they really don't have a choice. Likewise for Apple unless they’ve given up on WebKit and LLVM for Swift.
Sometimes one implementation's red box is better than another implementation's green box.
Clang supports the majority of C++20. The biggest blocker on coroutines is a Windows ABI issue, and concepts will be ready after a few DRs are polished off. Basically, other than some minor CTAD and non-type template parameter enhancements, the biggest feature lagging behind is modules. IMO, I'm not ready for modules; my dependencies are not ready for modules, and neither is my IDE nor my build system. So, for my interests, C++20 support in Clang is more than sufficient.
> Apple and Google are perfectly fine with C++17 for their purposes, ...
Well, yes, and they will likely be using C++17 for quite a bit longer. In particular, not only is C++20 a massive update to the language, but the rate of development in the llvm-project has far exceeded the rate at which maintenance capacity can be increased. Not to mention a myriad of other issues, notably the architectural problems in Clang.
Also Microsoft DirectX Shader Compiler[1] for HLSL. In fact, effort is ongoing to upstream HLSL support into Clang.[2][3][4]
To answer your other question—why LLVM instead of GCC:
- First-class support for non-Unix/Linux platforms. LLVM can be built on Windows with Visual Studio without ever needing a single GNU tool besides git[5]. Clang even has an MSVC-compatible interface that allows MSVC developers to switch to Clang without changing their command-line invocations[6].
- Written in C++ from the ground up, with a modular, first-class SSA IR-based interface.
- Permissive Apache 2.0 licence. As much as this might exasperate the open-source community, it allows for significantly faster iteration; things tend to be upstreamed when private/corporate developers realise it is hard to maintain separate forks.
All this allows LLVM to have a pretty mature infrastructure; some very large companies have contributed to its development.
> Written in C++ from the ground up, with a modular, first-class SSA IR-based interface.
LLVM IR is a great accomplishment, but the design of Clang causes tremendous pain. In particular, the lack of any higher-level IR(s) in Clang results in considerable complexity, poorer diagnostics, and missed performance optimizations. Moreover, while initiatives like ClangIR[1] exist, the fact is that a transition to MLIR will be long and messy, to say the least. Indeed, such a transition would likely involve the complete duplication of Clang internals while downstream implementations migrate. In comparison, an implementation like GCC is free to define and modify new IRs as needed; e.g., GIMPLE has three major forms, IIRC.
Long overdue. Please give CUDA some competition. It's always been possible to do general purpose compute in Vulkan, but the API was high-friction due to the shader language, and the glue between shader and CPU. It sounds like Vcc is an attempt to address this.
> the API was high-friction due to the shader language, and the glue between shader and CPU
Direct3D 11 compute shaders share these things with Vulkan, yet D3D11 is relatively easy to use. For example, see this library, which implements ML-targeted compute shaders for C# with minimal friction: https://github.com/Const-me/Cgml The backend, implemented in C++, is rather simple; it just binds resources and dispatches these shaders.
I think the main usability issue with Vulkan is API design. Vulkan was only designed with AAA game engines in mind. The developers of these game engines have borderline unlimited budgets, and their requirements are very different from ordinary folks who want to leverage GPU hardware.
I doubt Vulkan, a low-level graphics API with compute shaders, could ever directly compete with CUDA, a pure play GPGPU API. Although it isn't hard to imagine situations in which Vulkan compute shaders are sufficient to replace and/or avoid CUDA.
I wonder why people haven't coalesced around HLSL; I vastly prefer defining vertices as structs and having cbuffers over declaring every individual uniform or vertex variable with its specific ___location. It makes much more sense to me to define structs with semantics than to say ___location(X) for each variable or address uniforms by name/string. I often see GLSL-to-HLSL transpilers, but not vice versa, which is weird because HLSL is the more ergonomic language to me.
I would even advise authoring the shaders directly in SPIR-V, with a basic SSA checker and a text-to-binary translator, and doing so very conservatively to avoid any driver incompatibilities (i.e. nothing fancy).
I wonder if there is an open-source HLSL-to-SPIR-V compiler written in simple, plain C... but I am ready to work a bit more to avoid depending on a complex HLSL compiler.
It's worth noting that SPIR-V isn't compiled for the target ISA; it's not even guaranteed to be in a particular optimized form. So when the GPU driver loads it, it may or may not decide to spend a lot of time running optimization passes before translating it to the actual machine code the GPU needs.
In contrast, Metal shaders can be pre-compiled to the actual ARM binary code the GPU runs.
And DirectX shader bytecode, DXIL, is (poorly) defined to be a low-level IR that LLVM spits out right before it would be translated to machine code, rather than a high-level IR like SPIR-V is. That is, it is guaranteed to be in an optimized form already, so drivers do not expect to run any optimizations at load time.
SPIR-V seems a bit of a mess here, because you don't really know what the target GPU is going to do. How much time the driver spends optimizing SPIR-V at load time varies between mobile and non-mobile GPUs, depending basically on what each GPU manufacturer felt behaved best, given how the people making games actually distribute optimized or unoptimized SPIR-V.
Valve even maintains a fancy database of (SPIRV, GPU driver) pairings which map to _actual_ precompiled shaders for all games distributed on their platform, so that they aren't affected by this.
Which is really how the experience with portable Khronos APIs goes: most people who claim they are portable never really did any serious cross-platform, cross-device programming with them.
At the end of the day there are so many "if this, if that, load this extension, load that extension, do this workaround, do that workaround" caveats that it might as well be a different API for all practical purposes.
> I wonder why people haven't coalesced around HLSL
I mean, people kind of have, because Unity uses it (and Unreal too I think).
> I vastly prefer defining vertices as structs
I agree, and note that WGSL lets you do this.
> It makes much more sense to me to define structs with semantics over saying ___location(X) for each var or addressing uniforms by name/string.
I feel like semantics are a straitjacket, as though the compiler is forcing me to choose from a predefined list of names instead of letting me choose my own. Having to use TEXCOORDn for things that aren't texture coordinates is weird.
> I often see GLSL to HLSL transpilers, but not vice-versa, which is weird because HLSL is the more ergonomic language to me
Microsoft and Khronos support this now [1]. Microsoft's dxc can compile HLSL to SPIR-V, which can then be decompiled to GLSL with SPIRV-Cross.
In general, I prefer GLSL to HLSL because GLSL doesn't use semantics, and because GLSL has looser implicit conversions between scalars and vectors (this continually bites me when writing HLSL). But I highly suspect people just prefer what they started with and perfectly understand people who have the opposite opinion. In any case, WGSL feels like the best of both worlds and a few newer engines like Bevy are adopting it.
They have: HLSL has practically won the shading-language wars in the games industry. Even on consoles, the PlayStation and Switch shading languages are heavily inspired by HLSL.
Also, as I noted in another comment, Khronos is basically leaving it to Microsoft's HLSL to be the de facto shading language for Vulkan, as they have neither the monetary resources nor anyone else interested in improving GLSL as a language.
However, given the MSL and HLSL 2021 improvements, alongside SYCL and CUDA, eventually C++ will take that spot, and I doubt this is an area where any of the wannabe C++ replacements can do better.
Note that WGSL is closer to Rust than to C++ (syntactically speaking, anyway), and it's pretty much guaranteed to get traction, as it'll be the only way to talk to the GPU using a modern API on the Web.
S̶o̶u̶n̶d̶s̶ ̶c̶o̶o̶l̶,̶ ̶b̶u̶t̶ ̶t̶h̶i̶s̶ ̶r̶e̶q̶u̶i̶r̶e̶s̶ ̶y̶e̶t̶ ̶a̶n̶o̶t̶h̶e̶r̶ ̶l̶a̶n̶g̶u̶a̶g̶e̶ ̶t̶o̶ ̶l̶e̶a̶r̶n̶[0]. As someone who only has limited knowledge in this space, could someone tell me how comparable is the compute functionality of rust-gpu[1], where I can just write rust?
I often see folks complain about or express confusion over shading languages, but I think the problem is often incorrectly framed as the syntax of the language, when in reality the confusion is about the underlying GPU concepts.
Like, people (naturally) want to say "I know how to write X, and therefore if X can run on the GPU then I can just write X for the GPU and be productive!", but in reality programming for a CPU and a GPU are just... very, very different.
Heck, even if you stick with modern compute APIs like CUDA, which are as close as you can get to treating the GPU as a general compute device, the way you write code for the GPU is still going to be very, very different from how most people write code for the CPU.
The point of Vcc/Shady is to address some of these: in particular, recursion is now possible in Vulkan, and control flow is no longer limited. A lot of those are just historical language limitations and can be eliminated with enough effort.
The elephant in the room is SIMT and the subgroup/workgroup considerations, which don't really require any changes to syntax but do need the programmer to have a good mental model. But I don't think that's incompatible with raising the bar on expressiveness or host/device code interoperability!
"I often see folks complain/express confusion about shading languages.. "
The confusion already starts with the name. Historically, shading languages once had the purpose of computing shading, but beyond that there are no shades, and a more correct and helpful name would simply be "GPU language". And yes, it is very different from writing something for a CPU.
As others have commented, Shady is an IR that happens to have a Clang front-end bolted on; that combination is what I call Vcc. I really don't expect people to start writing shaders in the IR directly, even though it has a textual form!
It's very comparable to rust-gpu. I actually know a really nice person who does key work on that (you know who you are!), and they're facing very similar challenges; we get to exchange ideas regularly. I am confident it's in good hands.
IDK that shady is that far from literally any other C-like shader language. Any skill you have in that area is going to transfer (unless you only know Rust, lol)
No, the entire project is about using existing languages (C/C++) on GPUs; the project you linked is their internal IR (intermediate representation), which is what they compile the C/C++ down to before it is turned into a Vulkan compute shader.