I wonder when Intel is going to attempt their own version of big.LITTLE.
I think there is a market for laptop and thin-desktop x86 CPUs that are asymmetric 2 core/4 thread big + 2 core/2 thread little, and they already have a small version of their modern cores[1], so a 2 full Skylake + 2 little Goldmont cores (or 4+4) could be extremely interesting, especially on a future 115x socket.
The problem with Intel doing this is that, for marketing reasons, they tend to enable different sets of instructions on different cores, and an OS would have a really hard time scheduling threads when the Skylake cores can execute AVX instructions but the Goldmont cores can't. Normally the scheduler wants to assume that it can just move a thread from one core to another, but if the thread started off on a Skylake core and was taking an AVX-enabled code path, that would cause problems.
Not that the idea is unappealing or infeasible. It just runs afoul of Intel's marketing.
This is one of those interesting cases where RISC-V could eat their (and maybe ARM's) lunch on implementations like this. The wide vector instructions (i.e. not the packed SIMD, but the -V extension) are width-independent, so you could just make the vector machine narrower on the little core, and switch it to low-frequency in-order.
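To make the width-independent point concrete, here is a minimal sketch of a strip-mined loop in plain Python. It is only a model: `hw_lanes` is a made-up parameter standing in for however many elements a core's vector unit handles per instruction, and the `min(...)` line plays roughly the role that `vsetvl` plays in the RISC-V V extension. The same loop runs unchanged on the "big" and "little" configurations and only the number of passes differs.

```python
def vla_add(a, b, hw_lanes):
    """Add two equal-length lists with a width-independent, strip-mined loop.

    hw_lanes is a hypothetical stand-in for the number of elements the
    core's vector unit processes per instruction.
    """
    assert len(a) == len(b)
    out = []
    vector_ops = 0
    i = 0
    while i < len(a):
        # Ask the "hardware" how many elements it will take this pass,
        # similar in spirit to vsetvl in the RISC-V V extension.
        vl = min(hw_lanes, len(a) - i)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        vector_ops += 1
        i += vl
    return out, vector_ops


a = list(range(10))
b = [10 * x for x in a]
big_result, big_ops = vla_add(a, b, hw_lanes=8)        # wide unit, big core
little_result, little_ops = vla_add(a, b, hw_lanes=2)  # narrow unit, little core
```

Both calls produce the same result; the little core just takes more passes to get there.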
The scheduler already knows if you've used AVX or SSE instructions. You don't think they waste time backing up all those registers when you've never used them, do you?
You can set a flag that causes an exception when an FPU instruction is used, and handle it by setting a flag saying you need to back up those registers.
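A toy model of that trap-then-mark scheme, in case it helps. Everything here is hypothetical (`LazyFpuCore`, `Task`, `FpuTrap` and friends are names I made up); on x86 the real mechanism is the CR0.TS bit and the device-not-available (#NM) exception, which this only imitates. Treat it as a sketch of the idea, not of any particular kernel.

```python
class FpuTrap(Exception):
    """Models the exception raised when FPU use is currently disabled."""


class LazyFpuCore:
    def __init__(self):
        self.fpu_enabled = False      # models a "trap on FPU use" control bit
        self.fpu_regs = [0.0] * 4     # tiny stand-in for the FPU/vector state


class Task:
    def __init__(self, name):
        self.name = name
        self.used_fpu = False         # set by the trap handler on first use
        self.saved_fpu = None


def fpu_op(core, value):
    # The "hardware": refuses to execute while the FPU is disabled.
    if not core.fpu_enabled:
        raise FpuTrap()
    core.fpu_regs[0] = value


def run_fpu_op(core, task, value):
    try:
        fpu_op(core, value)
    except FpuTrap:
        # Trap handler: remember that this task owns FPU state, enable, retry.
        task.used_fpu = True
        core.fpu_enabled = True
        fpu_op(core, value)


def switch_out(core, task):
    """Context switch away from task; returns how many registers were saved."""
    saved = 0
    if task.used_fpu:
        task.saved_fpu = list(core.fpu_regs)
        saved = len(core.fpu_regs)
    core.fpu_enabled = False          # re-arm the trap for the next task
    return saved


core = LazyFpuCore()
int_task = Task("integer-only")
fp_task = Task("uses-fpu")

saved_for_int = switch_out(core, int_task)   # never touched the FPU
run_fpu_op(core, fp_task, 1.5)               # first use traps, then succeeds
saved_for_fp = switch_out(core, fp_task)
```

The integer-only task costs nothing to switch out, while the task that touched the FPU gets its registers saved.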
You can also handle undefined-instruction exceptions (#UD, invalid opcode, on x86) by moving the thread to a core that can handle the specific instruction.
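Roughly what that trap-and-migrate idea could look like, as a small simulation. The core names, feature sets and the `run` helper are all invented for illustration, and a real scheduler would also have to decide when to move the thread back to the little core, which this ignores.

```python
class UndefinedInstruction(Exception):
    """Models the fault a core raises for an instruction it cannot decode."""


class Core:
    def __init__(self, name, features):
        self.name = name
        self.features = frozenset(features)

    def execute(self, instr):
        # instr is a (mnemonic, required_feature) pair
        _, feature = instr
        if feature not in self.features:
            raise UndefinedInstruction(instr)


def run(program, cores, start):
    """Run program starting on `start`, migrating on undefined-instruction traps.

    Returns the name of the core that executed each instruction.
    """
    current = start
    trace = []
    for instr in program:
        try:
            current.execute(instr)
        except UndefinedInstruction:
            # Trap handler: find a core that supports the feature and retry there.
            capable = [c for c in cores if instr[1] in c.features]
            if not capable:
                raise                 # genuinely illegal on every core
            current = capable[0]
            current.execute(instr)
        trace.append(current.name)
    return trace


big = Core("big", {"base", "sse", "avx"})
little = Core("little", {"base", "sse"})
program = [("add", "base"), ("addps", "sse"), ("vaddps", "avx"), ("add", "base")]
trace = run(program, [little, big], start=little)
```

The thread starts on the little core, faults on the first AVX instruction, and stays on the big core afterwards.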
Couldn't you engineer the little core to take the fat instruction it can't process and instead do many small instructions to get the same end result? I am specifically thinking of AVX here, and it seems like it would work, but perhaps there are other instructions that would not be so easy to unroll.
As for AVX, breaking up 256-bit vector instructions into two 128-bit operations is very possible and is, in fact, what AMD is doing in Ryzen. That saves you execution and datapath silicon, but you still need the full-sized registers to hold the data, which means that adding AVX to an Atom processor would require a redesign of the back end. Other new instructions might require other back-end changes. And to decode new instructions you certainly have to make silicon changes to the front end, which might have follow-on effects on the general layout.
But in general you could certainly design an Atom-ish core that has the full range of Intel instructions. Or just add AVX (they've already done this with Phi) and take all the other extra instructions out of the Skylake.
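A sketch of the splitting idea, with Python lists standing in for registers. The function names are made up, and this says nothing about how any real core sequences the two halves; it only shows that the architectural operands stay eight lanes wide while the "execution unit" only ever sees four lanes at a time.

```python
LANES_256 = 8   # eight 32-bit floats in a 256-bit register
LANES_128 = 4   # the narrower execution unit handles four at a time


def add_128(x, y):
    """The only operation the narrow execution unit can perform."""
    assert len(x) == len(y) == LANES_128
    return [p + q for p, q in zip(x, y)]


def add_256_split(x, y):
    """A 256-bit add issued as two 128-bit halves.

    The operands are still full 8-lane registers; only the datapath is narrow.
    """
    assert len(x) == len(y) == LANES_256
    lo = add_128(x[:LANES_128], y[:LANES_128])
    hi = add_128(x[LANES_128:], y[LANES_128:])
    return lo + hi


x = [float(i) for i in range(8)]
y = [1.0] * 8
result = add_256_split(x, y)
```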
No copy-paste there. Ultimately there's not that much information, so the analysis is going to read similar. But claiming it's basically a copy-paste is a bit much.
So to tie this in to an application of these chips:
A smartphone that has this: when idle (no phone calls, user is either not using it or just looking at a static screen such as an e-reader or non-interactive web page), only the low-power chips run, doing the bare minimum housekeeping tasks, updating the screen, etc.
When something happens, the low-power chips wake up the more powerful chips and hand off the task to them.
Is that the basic idea?
Intel's POV seems to be to have 1 powerful CPU with different power states, while ARM is explicitly breaking up the power levels with different CPUs on the same SoC.
I'm very curious how they work the cache hierarchies of these hybrid clusters. You really want to be tuning the latency and throughput of the cache to the consumers attached to it and that wouldn't be straightforward in this case.