GPU and CPU cores are very different in their architecture and in how they process data.
You can think of GPU cores as units you schedule a series of "vector" instructions on, over a large chunk of data. They are great at internally parallelizing this process because the flow is simple: you assume there will be little synchronization and little branching. That's why you need such fast memory on GPUs, to keep streaming data at instruction speed.
CPU cores are a lot more general purpose: that's where you're generally building the business logic, and the workload is a lot more branch oriented.
So if you have a problem that's trivially parallelizable without (much) branching logic, then throwing more GPU cores at it is easy. Problems like this are numerical in nature, like large matrix calculations (math or 3D meshes) and streaming data (hashing).
Hence why it's easier to keep adding GPU cores... it means games can process more geometry data in a frame. Adding more general purpose CPU cores is harder, because the complexity of the business logic goes up as you split up the work and have to worry about ordering/synchronization.
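For illustration, here is a minimal CUDA sketch of that kind of workload (the kernel and sizes are made up): the same branch-free arithmetic applied independently to every element, which is exactly the shape of problem that scales by just throwing more cores at it.

```
// Minimal sketch: a trivially parallel, branch-free workload. Every thread does
// the same arithmetic on its own element, so more cores = more elements per clock.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                      // only "branch": a bounds check
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));       // unified memory, for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y); // launch ~1M threads
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                    // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```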
There is not much difference in how GCN and a generic CPU process data; I cannot speak for every GPU out there, though. The only significant difference is that each GCN core can run up to 40 threads, while a CPU at this power level runs at most two (Jaguar in the 8th gen consoles is one thread per core). Having so many threads simplifies parallel programming a great deal: instead of splitting each task into several chunks, where each chunk still has multiple items that need to be looped over, you can just spin up a thread for each item and get rid of the loops. Thread switching is zero overhead and creation/destruction takes a few cycles. These are not Windows/pthread software threads.
Of course, you don't have to run the same code in every thread. If you can figure out hundreds of different tasks to do simultaneously in a game, you can put each one of them in its own thread too.
This is also why GPUs generally use very slow memory, contrary to your belief. With so many threads, latency does not matter that much, since a thread that gets stalled on a memory access is preempted in favor of one that already has data available.
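A rough sketch of that latency-hiding idea, written in CUDA terms since that's what I have handy (the index pattern and sizes are made up): launch one thread per item, far more threads than there are ALUs, and let the scheduler swap out whichever groups are stalled on the indexed load.

```
// Sketch: why high-latency memory is tolerable. Far more threads are launched
// than there are ALUs; whenever one group stalls on the data-dependent load
// below, the hardware scheduler switches to a group that already has its data.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gather(int n, const int* idx, const float* src, float* dst) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one item per thread, no loop
    if (i < n)
        dst[i] = src[idx[i]];                       // high-latency indexed load
}

int main() {
    const int n = 1 << 22;                          // ~4M items -> ~4M threads
    int* idx;  float *src, *dst;
    cudaMallocManaged(&idx, n * sizeof(int));
    cudaMallocManaged(&src, n * sizeof(float));
    cudaMallocManaged(&dst, n * sizeof(float));
    for (int i = 0; i < n; ++i) { idx[i] = (i * 2654435761u) % n; src[i] = (float)i; }

    // Deliberately oversubscribed: thousands of thread groups per scheduler.
    gather<<<(n + 255) / 256, 256>>>(n, idx, src, dst);
    cudaDeviceSynchronize();

    printf("dst[0] = %f\n", dst[0]);
    cudaFree(idx); cudaFree(src); cudaFree(dst);
    return 0;
}
```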
Current gen GPUs use GDDR5, compared to DDR4 in only a handful of new Intel chips that started shipping in the last few weeks. The GDDR5 chips run at 750 MHz, while DDR4-2133, as supported by the fastest shipping Intel CPU, runs at 266 MHz. That is an effective transfer rate of 48 GB/s vs 17 GB/s for the DDR4.
Current GPUs effectively have the fastest off-core memory of any current device. They need those transfer rates to keep all the stream processors running.
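If you want to check the arithmetic, it's just transfers per second per pin times bus width. The 128-bit GDDR5 bus below is my assumption, picked because it reproduces the 48 GB/s figure (plain host-side C++):

```
// Back-of-the-envelope bandwidth check. The 128-bit GDDR5 bus width is an
// assumption chosen to match the 48 GB/s figure quoted above.
#include <cstdio>

// bandwidth in GB/s = transfers per second per pin * bus width in bytes
double bandwidth_gb_s(double transfers_per_sec, int bus_bits) {
    return transfers_per_sec * (bus_bits / 8) / 1e9;
}

int main() {
    // GDDR5: 750 MHz command clock, data moves 4x per command clock.
    double gddr5 = bandwidth_gb_s(750e6 * 4, 128);  // ~48 GB/s on a 128-bit bus
    // DDR4-2133: 2133 MT/s on a standard 64-bit channel.
    double ddr4  = bandwidth_gb_s(2133e6, 64);      // ~17 GB/s per channel
    printf("GDDR5 ~= %.0f GB/s, DDR4-2133 ~= %.0f GB/s\n", gddr5, ddr4);
    return 0;
}
```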
"Transfer rate" is not synonymous to speed. Latency is. GDDR5 latency is greater than any DDR3 memory lest DDR4. And HBM that is the new video memory is much slower than GDDR5 (even though it's physically GDDR5 the implementation s we have run it half the clock of normal GDDR5).
Bandwidth is great, though. But if bandwidth were speed, you could also say that a container ship is faster than a supersonic jet.
GPU card makers make the trade-off between bandwidth and latency in favor of bandwidth. When you're doing mostly branch-free processing in large chunks, that's the trade-off to make. All you need is a straightforward pre-fetcher and you don't need to worry about latency.
That's not true for general purpose CPUs, which perform lots of branches that need to be predicted (so we can predict what to fetch). The data processed on CPUs also tends to be different (structures vs. vectors), with lots of pointer chasing (vtables, linked lists, hash tables, trees). That requires lower latencies, since the access patterns are a lot more random.
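A sketch of the two access patterns being contrasted (plain host-side C++, sizes made up): the array sum streams through memory, so a prefetcher hides the latency; the list walk cannot issue the next load until the current one returns.

```
// Sketch: streaming access (bandwidth-bound) vs pointer chasing (latency-bound).
#include <cstdio>
#include <vector>

struct Node { int value; Node* next; };

// Streaming: the address of the next load is known far in advance, so a simple
// prefetcher keeps the pipeline fed and bandwidth is what matters.
long long sum_array(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v) s += x;
    return s;
}

// Pointer chasing: each load depends on the previous one, so every cache miss
// costs the full memory latency and extra bandwidth barely helps.
long long sum_list(const Node* head) {
    long long s = 0;
    for (const Node* p = head; p != nullptr; p = p->next) s += p->value;
    return s;
}

int main() {
    std::vector<int> v(1 << 20, 1);
    // Toy list: laid out contiguously here for brevity; a real heap-allocated
    // list would be scattered, which is what makes the accesses random.
    std::vector<Node> nodes(1 << 20);
    for (size_t i = 0; i < nodes.size(); ++i) {
        nodes[i].value = 1;
        nodes[i].next  = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
    }
    printf("%lld %lld\n", sum_array(v), sum_list(&nodes[0]));
    return 0;
}
```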
The stated goal of HBM is taming the power consumption (and thus also the heat) of GPU systems while keeping the same (or higher) bandwidth. The name HBM stands for High Bandwidth Memory.
And while HBM has a lower clock frequency compared to GDDR5 (about 1/4), it has a much wider bus: 1024 bits vs 32 bits for GDDR5. Per transfer it can send 32x the data over the bus. 32x / 4 = 8, so the transfer rate is 8 times bigger. The recent Radeon cards with HBM now have memory transfer speeds of 256 GB/s vs the 48 GB/s for GDDR5.
Again, HBM trades latency for bandwidth. It negates some of the latency hit from running at 1/4 of the clock by putting the HBM memory on the same package as the GPU rather than out on the board.
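To make the 32x / 4 = 8 arithmetic concrete (host-side C++; the per-pin rates are just carried over from the 750 MHz GDDR5 figure earlier in the thread, so treat the absolute numbers as illustrative):

```
// Making the "32x wider bus / 4x lower clock = 8x transfer rate" ratio concrete.
#include <cstdio>

int main() {
    double gddr5_per_pin = 750e6 * 4;                // 3 GT/s per pin (assumed)
    double hbm_per_pin   = gddr5_per_pin / 4;        // roughly 1/4 the clock
    double gddr5_chip = gddr5_per_pin * (32   / 8);  // 32-bit chip interface
    double hbm_stack  = hbm_per_pin   * (1024 / 8);  // 1024-bit stack interface
    printf("GDDR5 chip: %.0f GB/s, HBM stack: %.0f GB/s, ratio: %.0fx\n",
           gddr5_chip / 1e9, hbm_stack / 1e9, hbm_stack / gddr5_chip);
    return 0;
}
```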
I think you're conflating a few different arguments. GPU workloads are not latency sensitive, so in GPU land transfer speed (bandwidth) is speed.
I am not sure you are familiar with modern GPU architecture. Both AMD's and NVIDIA's GPUs have no problems with branches. They do not do prediction and prefetch because it's pretty pointless on a single-issue architecture. I believe the ISA docs are available to the general public - you could easily familiarize yourself with them. I am also quite familiar with latency and bandwidth, so the concept of negating one with the other sounds very amateurish to me. If you could do that, then everyone would have switched to high bandwidth memory and negated all the latency :) Speed is still speed and bandwidth is still bandwidth.