
If you want fat vector units you should be on a GPU.



Yes, in principle. A dedicated GPU will always be "fatter" in SIMD than a CPU.

But a CPU has latency benefits. Most notably, it takes only a few clock cycles for an Intel CPU to transfer data from its RAX register -> AVX vector registers, compute the solution, and transfer it back to the "scalar world".

That whole round trip completes within single-digit nanoseconds.

In contrast, any transfer to or from GPU memory takes hundreds of nanoseconds to microseconds... on the order of 100x to 10,000x slower than moving data between scalar code and AVX code on a CPU.
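To make that concrete, here's a minimal sketch (my own toy example, nothing from the parent) of that round trip: four scalars go into an AVX register, a short vector computation runs, and one lane comes back out to scalar code. Compiled with something like gcc -O2 -mavx, it's a handful of register-to-register instructions with no memory traffic beyond the stack.

    // Scalar -> vector -> scalar round trip: a 4-wide dot product.
    #include <immintrin.h>
    #include <cstdio>

    static double dot4(double a0, double a1, double a2, double a3,
                       double b0, double b1, double b2, double b3)
    {
        __m256d a = _mm256_set_pd(a3, a2, a1, a0);   // scalars into a vector register
        __m256d b = _mm256_set_pd(b3, b2, b1, b0);
        __m256d p = _mm256_mul_pd(a, b);             // 4 multiplies at once

        // Horizontal sum, then back to the "scalar world".
        __m128d lo = _mm256_castpd256_pd128(p);
        __m128d hi = _mm256_extractf128_pd(p, 1);
        __m128d s  = _mm_add_pd(lo, hi);
        s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
        return _mm_cvtsd_f64(s);
    }

    int main()
    {
        std::printf("%f\n", dot4(1, 2, 3, 4, 5, 6, 7, 8));  // 70.0
        return 0;
    }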

----------

So GPUs are good for bulk compute (big matrix multiplies, Deep Learning, etc. etc.)... but I'm definitely interested in the low-latency uses of SIMD.

For example: consider SIMD-based Bloom filters to augment a hash table (or any data structure, really); a sketch follows below. Multi-hash schemes like cuckoo hashes have also been successfully implemented with CPU-SIMD acceleration.

You can use CPU SIMD to accelerate algorithms that stay on the CPU. And CPUs certainly can benefit from fatter, and better-designed, SIMD instruction sets.
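To give a flavour of the Bloom filter case, here is a rough sketch of a register-blocked ("split block") filter in the style popularized by Impala; the hash constants and sizes are illustrative, and the 64-bit key hash is assumed to come from whatever hash function you already use. Each key maps to one 256-bit block, and eight bits inside that block are set or tested with a few AVX2 instructions (build with -mavx2).

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    static const size_t kNumBlocks = 1024;     // must be a power of two
    static __m256i blocks[kNumBlocks];         // the filter itself, zero-initialized

    // Derive 8 bit positions (one per 32-bit lane) from the low hash bits.
    static inline __m256i mask_from_hash(uint32_t h)
    {
        static const uint32_t kSalt[8] = {
            0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
            0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U };
        __m256i mult = _mm256_loadu_si256((const __m256i *)kSalt);
        __m256i hv   = _mm256_set1_epi32((int)h);
        __m256i bits = _mm256_srli_epi32(_mm256_mullo_epi32(hv, mult), 27); // 0..31
        return _mm256_sllv_epi32(_mm256_set1_epi32(1), bits);  // one bit per lane
    }

    static inline void bloom_insert(uint64_t key_hash)
    {
        size_t  idx  = (key_hash >> 32) & (kNumBlocks - 1);
        __m256i mask = mask_from_hash((uint32_t)key_hash);
        blocks[idx]  = _mm256_or_si256(blocks[idx], mask);
    }

    static inline int bloom_maybe_contains(uint64_t key_hash)
    {
        size_t  idx  = (key_hash >> 32) & (kNumBlocks - 1);
        __m256i mask = mask_from_hash((uint32_t)key_hash);
        // testc returns 1 iff every bit of mask is already set in the block.
        return _mm256_testc_si256(blocks[idx], mask);
    }

    int main()
    {
        bloom_insert(0x0123456789abcdefULL);
        return bloom_maybe_contains(0x0123456789abcdefULL) ? 0 : 1;
    }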


> it takes only a few clock cycles for an Intel CPU to transfer data from its RAX register -> AVX vector registers, compute the solution, and transfer it back to the "scalar world".

unfortunately, it takes millions of cycles for the CPU to switch to the higher-power state necessary for AVX instructions, so your latency really isn't that much better.

the radfft guy did a good writeup https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e0...


Yes. My code is mostly AVX, so regardless of down-clocking I see substantial performance boosts.

But I've found getting into GPU computing extremely difficult. Admittedly, part of that may be having bought an AMD Vega graphics card instead of Nvidia (not a fan of vendor lock-in, and AMD is building an open-source stack with ROCm/HIP).

My code does lots of different things, many small, but given a billion iterations, it can add up. If you're running a Monte Carlo simulation over millions of different data sets, fitting each iteration with Markov Chain Monte Carlo, with a model that has a few long for loops and requires (automatic) differentiation and some number of special functions... It's not hard to rack up slow execution times out of small pieces.

Part of the problem is that I need to find a GPU project that's accessible to me, so I can gain some experience, learn what's actually possible, and figure out how I could actually organize the computation flow. It's easy with a CPU: break up the Monte Carlo simulations among processors. Depending on the model being fit, as well as the method, different computations can vary dramatically in width. But optimizing is as simple as breaking up all those computations into appropriately sized AVX units (and library calls plus compiler auto-vectorization will often handle most of that!), so wider units translate directly into faster performance.
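For a sense of what I mean on the CPU side, here's a minimal sketch (toy model, not my real code): independent chains laid out as contiguous arrays, with a branch-free update, which GCC/Clang will typically auto-vectorize to AVX width on their own.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Toy Metropolis-style update across many independent chains, laid out
    // structure-of-arrays. The "model" is a standard normal log-density; the
    // proposals and log-uniforms would come from your RNG of choice.
    void update_chains(std::vector<double>& state,
                       const std::vector<double>& proposal,
                       const std::vector<double>& log_u)
    {
        const std::size_t n = state.size();
        for (std::size_t i = 0; i < n; ++i) {
            const double cur  = state[i];
            const double prop = proposal[i];
            const double log_ratio = 0.5 * (cur * cur - prop * prop);
            state[i] = (log_u[i] < log_ratio) ? prop : cur;  // blend, not a branch
        }
    }

    int main()
    {
        const std::size_t n = 1 << 20;
        std::vector<double> state(n, 0.0), prop(n, 0.1), log_u(n, -1.0);
        update_chains(state, prop, log_u);
        std::printf("%f\n", state[0]);
        return 0;
    }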

Part of the problem is I don't really know how to think about a GPU. Can I think of the Vega GPU as a processor with 64 cores, a SIMD width of 64 (64 * 64 = the advertised "4096"), and more efficient gather-load/scatter-store operations?

If there were a CPU like that, compiler support plus library calls and macros (including libraries you've written or wrapped yourself) would go a long way toward letting you quickly write well-optimized code.

I really need to dedicate the time to learn more. My 7900X processor is about 4x faster at GEMM than my 1950X, but 10x slower than the Vega graphics card, which cost less money. I see the potential.

To start, I need to find a GPU project that lets me really get experimenting and figure out what GPU programs can even look like.

Is there an Agner Fog of GPGPU?


If the MC simulations vary only in the values of the parameters (as opposed to in memory access patterns or the execution path through an if/else), then they are ideal for a GPU. You write a kernel which looks like it's doing one of those simulations, call it on arrays of parameters, and you're in business.

Concretely, I’ve used this for parameter sweeps of large systems of ODEs and SDEs.
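Something like this minimal sketch, with a toy logistic-growth ODE standing in for the real model (CUDA syntax here; HIP for the AMD/ROCm stack is nearly identical): the kernel body reads like one scalar simulation, and each thread just picks its own (r, K) out of the parameter arrays.

    #include <cstdio>

    __global__ void sweep(const float *r, const float *K, float *result,
                          int n_params, int n_steps, float dt, float x0)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_params) return;

        float x = x0;                      // Euler steps of dx/dt = r*x*(1 - x/K)
        for (int s = 0; s < n_steps; ++s)
            x += dt * r[i] * x * (1.0f - x / K[i]);
        result[i] = x;
    }

    int main()
    {
        const int n = 1 << 20;             // a million parameter sets
        float *r, *K, *out;
        cudaMallocManaged(&r,   n * sizeof(float));
        cudaMallocManaged(&K,   n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) { r[i] = 0.5f + 1e-6f * i; K[i] = 10.0f; }

        sweep<<<(n + 255) / 256, 256>>>(r, K, out, n, 1000, 0.01f, 1.0f);
        cudaDeviceSynchronize();
        std::printf("first result: %f\n", out[0]);

        cudaFree(r); cudaFree(K); cudaFree(out);
        return 0;
    }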


Did you see https://news.ycombinator.com/item?id=18249651 about GPU programming with Julia?

You still need some understanding of GPUs, but at least you can write your code in Julia rather than CUDA or OpenCL.


As a longtime user of Julia, I'm convinced that the Julia approach to GPUs is wrong, for being insufficiently Julian. I would love for GPUs to be registered as independent GPU nodes accessible through the Distributed module, to which you dispatch async compute tasks.


There was an intro to GPU programming in Julia posted on HN a day or two ago.


> Part of the problem is I don't really know how to think about a GPU. Can I think of the Vega GPU as a processor with 64 cores, a SIMD width of 64 (64 * 64 = the advertised "4096"), and more efficient gather-load/scatter-store operations?

From my limited experience, that's the wrong way of looking at things.

SIMD is good at executing "converged" instruction streams, and bad at "divergent" ones.

"Converged" is when you've got your bitmask (in AMD / OpenCL: the scalar bitmask register) executing all 64-threads as much as possible. This means all 64-threads take "if" statements together, they loop together, etc. etc.

"Diverged" is when you have a tree of if-statements, and different threads take different paths. The SIMD processor has to execute both the "Then" AND the "Else" statements. And if you have a tree of "if / else" statements, the threads have less-and-less in common.

-------

That's it. You try to group code together so that it converges as much as possible and diverges as little as possible. It helps to know that the work-group size on AMD can be anywhere from 64 to 256 (so you can keep "thread groups" of up to 256 threads working together at a time).

The OpenCL compiler will automatically translate most code into conditional moves and do its best to avoid "real" if/else branches. But as the programmer, you just have to realize that nested if-statements and nested loops can cause thread divergence.
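A toy illustration (for a branch this small the compiler will usually do the conversion itself; the cost shows up with bigger or nested branches): the first kernel can diverge, because neighbouring threads may disagree on the predicate and the hardware then runs both sides with lanes masked off; the second expresses the same result as a select.

    __global__ void divergent(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] > 0.0f)            // threads in one wavefront may disagree here...
            y[i] = x[i] * x[i];     // ...so the hardware runs this side
        else
            y[i] = -x[i];           // ...and then this side, masking lanes off
    }

    __global__ void branchless(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float a = x[i] * x[i];
        float b = -x[i];
        y[i] = (x[i] > 0.0f) ? a : b;   // both values computed, one selected
    }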

---------

Matrix math is the easiest case for SIMD, because of how regular its structure is. When you get to "real" cases like... well...

> My code does lots of different things, many small, but given a billion iterations, it can add up. If you're running a Monte Carlo simulation over millions of different data sets, fitting each iteration with Markov Chain Monte Carlo, with a model that has a few long for loops and requires (automatic) differentiation and some number of special functions... It's not hard to rack up slow execution times out of small pieces.

Okay, well that's why it's hard to program properly on a GPU: those threads are all diverging. Unless you figure out a way to "converge" those if statements, it won't run efficiently on a GPU.

Chances are, if you have a million-wide Monte Carlo, a lot of those threads are going to be converging. Can you split up the steps and "regroup" tasks as appropriate?

Let's say your million threads run into a switch statement:

* 106,038 threads will take path A.

* 348,121 threads take path B.

* 764 threads take path C.

Etc. etc.

Can you "regroup" so that all your threads in pathA start to execute together again? To best take advantage of the SIMD-architecture?

Think about it: a gang of 64 may have thread #0 take path A, thread #1 take path B, thread #2 take path A, etc. etc. So it all diverges and you lose all your SIMD.

But if you "rebatch" everything together... eventually you'll get hundreds of threads ready for PathA. At which point, you gang up your group of 64 threads and execute PathA all together again, taking full advantage of SIMD.

SIMD is an architecture that lets "similar" threads run at the same time, in groups of 64 or more (up to 256 threads at once). And they truly execute simultaneously, as long as they all take the same if/else branches and loops. The real hard part is designing your data structures to maximize this "convergence" of threads.

Something like chess would be practically impossible to SIMD on a GPU: it's just too hard to handle how divergent a typical chess analysis engine gets, because its branching depends on the precise board position. But something like Monte Carlo will surely have hundreds, or thousands, of threads taking any particular execution path, so I'd have hope for SIMD-based execution of Monte Carlo simulations.


Using AVX instructions clocks down your CPU by a lot, though. It isn't usually worth it unless you're running mostly AVX.


> So GPUs are good for bulk compute (big matrix multiplies, Deep Learning, etc. etc.)

Maybe. The sparsity of the matrix in question matters a lot. Matrices with very little sparsity, like, say, an image, do well. Others may not. It's more that GPUs do well on algorithms with predictable branching.



