ARM Launches DynamIQ: big.Little to Eight Cores Per Cluster

DiabloD3 · on March 21, 2017

I wonder when Intel is going to attempt their own version of big.little.

I think there is a market for laptop and thin-desktop x86 CPUs that are asymmetric 2 core/4 thread big + 2 core/2 thread little, and they already have a small version of their modern cores[1], so a 2 full Skylake + 2 little Goldmont cores (or 4+4) could be extremely interesting, especially on a future 115x socket.

  [1]
  * Silvermont (Bay Trail/Avoton/Rangeley) == "atomfied" Haswell
  * Airmont (Braswell/Cherry Trail) == "atomfied" Broadwell
  * Goldmont (Apollo Lake/Denverton) == "atomfied" Skylake
  * ???mont (Gemini Lake) == "atomfied" Kaby/Coffee Lake.

Symmetry · on March 21, 2017

The problem with Intel doing this is that, for marketing reasons, they tend to enable different sets of instructions on different cores, and an OS would have a really hard time scheduling threads when the Skylake cores can execute AVX instructions but the Goldmont cores can't. Normally the scheduler wants to assume that it can just move a thread from one core to another, but if the thread started off on a Skylake core and was taking an AVX-enabled code path, migrating it to a Goldmont core would cause problems.

Not that the idea is unappealing or infeasible. It just runs afoul of Intel's marketing.

microcolonel · on March 21, 2017

This is one of those interesting cases where RISC-V could eat their (and maybe ARM's) lunch on implementations like this. The wide vector instructions (i.e. not the packed SIMD, but the -V extension) are width-independent, so you could just make the vector machine narrower on the little core, and switch it to low-frequency in-order.
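To make "width-independent" concrete, here is a minimal sketch of a strip-mined loop, modeled in plain Python rather than real vector code. The function name `vla_add` and the `hw_lanes` parameter are made up for illustration; `hw_lanes` stands in for however many elements a given core's vector unit handles per step.

```python
def vla_add(a, b, hw_lanes):
    """Element-wise add written without assuming a fixed vector width.

    hw_lanes is a hypothetical stand-in for the number of elements the
    core's vector unit can process per step.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    i = 0
    n = len(a)
    while i < n:
        # Ask for the remaining elements; the "hardware" grants at most hw_lanes.
        vl = min(hw_lanes, n - i)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out


data = list(range(10))
ones = [1] * 10
on_big = vla_add(data, ones, hw_lanes=8)     # wide vector unit
on_little = vla_add(data, ones, hw_lanes=2)  # narrow vector unit
```

The same code gives the same result on both "cores"; only the number of loop iterations changes, which is why a thread could in principle be moved between them.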

Symmetry · on March 21, 2017

ARM recently added a wide vector extension called SVE, which works similarly. But yes, that way of doing vectors is really cool.

EDIT: Come to think of it, this might be something ARM has been adding deliberately, partly to be able to use heterogeneous cores.

pertymcpert · on March 22, 2017

ARM did it first with SVE. The RISC-V vector extension is just a draft while I believe SVE silicon is being developed now for exascale HPC.

microcolonel · on March 23, 2017

Cray did it first; that's why there's no patent dispute right now.

slededit · on March 21, 2017

The scheduler already knows if you've used AVX or SSE instructions. You don't think they waste time backing up all those registers when you've never used them do you?

You can set a flag (CR0.TS on x86) that causes an exception when an FPU instruction is used, and handle it by setting a flag saying you need to back up those registers.

You can also handle undefined-instruction exceptions (#UD on x86) by moving the thread to a core that can handle the specific instruction.
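A toy model of that last idea, assuming a simplified world where each instruction is tagged with the one feature it needs. The names `Core`, `UndefinedInstruction`, and `run_thread` are invented for this sketch and do not correspond to any real kernel interface.

```python
class UndefinedInstruction(Exception):
    """Stand-in for the fault a core raises on an unsupported instruction."""


class Core:
    def __init__(self, name, features):
        self.name = name
        self.features = frozenset(features)

    def execute(self, feature):
        # A real core would run the instruction; here we only check support.
        if feature not in self.features:
            raise UndefinedInstruction(feature)


def run_thread(instructions, cores, start):
    """Run a thread, migrating it whenever the current core faults."""
    core = start
    trace = []
    for feature in instructions:
        try:
            core.execute(feature)
        except UndefinedInstruction:
            # Fault handler: find any core that supports the instruction.
            target = next((c for c in cores if feature in c.features), None)
            if target is None:
                raise
            core = target
            core.execute(feature)  # retry on the new core
        trace.append(core.name)
    return trace


little = Core("little", {"base", "sse"})
big = Core("big", {"base", "sse", "avx"})
trace = run_thread(["base", "avx", "base"], [little, big], start=little)
```

Note that in this sketch the thread simply stays on the big core after the fault; deciding when it is safe to move it back is the hard part and is not modeled here.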

gpderetta · on March 21, 2017

In practice, any nontrivial program will end up touching xmm registers, as even a plain memcpy ends up expanding to vectorized loads/stores.

Apparently lazy FPU restore in Linux is deprecated and eager restore is the default [1].

[1] https://tthtlc.wordpress.com/2016/12/17/understanding-fpu-us...

brianwawok · on March 21, 2017

Couldn't you engineer the little core to take the fat instruction it can't process and instead do many small instructions to get the same end result? I am specifically thinking of AVX here, and it seems like it would work, but perhaps there are other instructions that would not be so easy to unroll.

Symmetry · on March 21, 2017

For AVX, breaking 256-bit vector instructions up into two 128-bit operations is very possible and is, in fact, what AMD is doing in Ryzen. That saves you execution and datapath silicon, but you still need the full-sized registers to hold the data, which means that adding AVX to an Atom processor would require a redesign of the back end. Other new instructions might require other back-end changes. And to interpret new instructions you certainly have to make silicon changes to the front end, which might have follow-on effects on the general layout.

But in general you could certainly design an Atom-ish core that has the full range of Intel instructions. Or just add AVX (they've already done this with Phi) and take all the other extra instructions out of the Skylake.
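The 256-to-128 split is easy to picture with a small model. This is only a sketch: it treats a 256-bit register as eight 32-bit lanes held in a Python list, and the helper names `add128` and `add256_split` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def add128(a, b):
    """Model of a 128-bit datapath: four 32-bit lanes, wrapping on overflow."""
    assert len(a) == len(b) == 4
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add256_split(a, b):
    """Model of a 256-bit add issued as two passes through the 128-bit unit.

    The full eight-lane inputs and result still have to be stored somewhere,
    which is the register-file cost mentioned above.
    """
    assert len(a) == len(b) == 8
    low = add128(a[:4], b[:4])
    high = add128(a[4:], b[4:])
    return low + high


result = add256_split([1] * 8, [2] * 8)
```

The execution unit is half width, but the eight-lane inputs and output are still whole values that the register file has to hold.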

pjmlp · on March 21, 2017

A better source of information than Anandtech, including a reference to the presentation slides.

https://community.arm.com/processors/b/blog/posts/arm-dynami...

Narishma · on March 21, 2017

I'm not sure about 'better'. Those links are full of buzzwords and marketing speak.

pjmlp · on March 21, 2017

Anandtech's article is basically a copy-paste of the ARM blog post, without the links to the relevant information, which is why the ARM site is better.

borandi · on March 22, 2017

No copy-paste there. Ultimately there's not that much information, so the analysis is going to read similar. But claiming it's basically a copy-paste is a bit much.

patrickg_zill · on March 21, 2017

So to tie this in to an application of these chips:

Take a smartphone that has this: when idle (no phone calls, and the user is either not using it or just looking at a static screen such as an e-reader or a non-interactive web page), only the low-power cores run, doing the bare minimum housekeeping tasks, updating the screen, etc.

When something happens, the low-power cores wake up the more powerful cores and hand off the task to them.

Is that the basic idea?

Intel's POV seems to be to have 1 powerful CPU with different power states, while ARM is explicitly breaking up the power levels with different CPUs on the same SoC.
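The wake-up and hand-off idea can be sketched as a simple threshold policy. This is an assumption-heavy toy, not how any shipping scheduler works: `pick_cluster` and the 0.6 threshold are invented, and load is just a number between 0 and 1.

```python
def pick_cluster(load, up_threshold=0.6):
    """Place work on the little cores until load crosses the threshold.

    Both the function and the threshold value are hypothetical.
    """
    return "big" if load >= up_threshold else "little"


# Idle, idle, user starts scrolling, still busy, back to reading.
timeline = [0.05, 0.10, 0.85, 0.90, 0.20]
placement = [pick_cluster(load) for load in timeline]
```

Real policies also have to weigh the cost of migrating and of waking the big cores, which a single threshold ignores.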

Symmetry · on March 21, 2017

I'm very curious how they work the cache hierarchies of these hybrid clusters. You really want to be tuning the latency and throughput of the cache to the consumers attached to it and that wouldn't be straightforward in this case.

robert_foss · on March 21, 2017

Will proper OS support take 3 years to land, like it did for big.LITTLE?

Probably.

Will individual manufacturers using the IP stumble and fall like Samsung did with their A15 big.LITTLE hardware implementation?

Presumably.