
L3 caches have grown monstrously.

The new AMD Ryzen 7 5800X3D has 96MB of L3 cache. That is so monstrous that a 2048-entry L2 TLB with 4kB pages can only cover 8MB.

That's right, you run out of TLB-entries before you run out of L3 cache these days. (Or you start using hugepages damn it)
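To make the arithmetic concrete, here is a minimal Python sketch of the "TLB reach" calculation. The function name `tlb_reach` and the constants are just illustrative, and it assumes a single core with a 2048-entry TLB where every entry maps one 4 KiB page.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space that `entries` TLB entries can map at once."""
    return entries * page_size


L3_SIZE = 96 * MIB                    # L3 size quoted in the comment above
reach_4k = tlb_reach(2048, 4 * KIB)   # assumption: every entry holds a 4 KiB page

print(reach_4k // MIB)     # 8
print(reach_4k < L3_SIZE)  # True: the TLB runs out long before the L3 does
```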

----------

I think Intel's PEXT and PDEP were introduced around the 2015 era. But AMD chips now execute PEXT / PDEP quickly, so it's now feasible to use them on most people's modern systems (assuming Zen 3 or a 2015+ era Intel CPU). Obviously those instructions don't exist in the ARM / POWER9 world, but they're really fun to experiment with.

PEXT / PDEP are effectively bitwise-gather and bitwise-scatter instructions, and can be used to perform extremely fast and arbitrary bit-permutations. I played around with them to implement some relational-database operations (join, select, etc.) over bit-relations for the four color theorem. (Just a toy to amuse myself with. A 16-bit bitset of "0001_1111_0000_0000" means "(Var1 == Color4 and Var2 == Color1) or (Var2 == Color2)".)
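For anyone who hasn't used them, here is a rough software emulation of the gather/scatter behaviour in Python. This is only a sketch of the semantics, nowhere near the speed of the real instructions. The last example assumes a hypothetical layout where bit `4*j + i` of a 16-bit relation means "Var1 is color i and Var2 is color j", which is not necessarily the layout used above.

```python
def pext(value: int, mask: int) -> int:
    """Gather the bits of `value` selected by `mask` into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1           # clear that bit and move on
    return result


def pdep(value: int, mask: int) -> int:
    """Scatter the low bits of `value` into the bit positions selected by `mask`."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if (value >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


print(hex(pext(0xABCD, 0x0F0F)))  # 0xbd  (nibbles B and D packed together)
print(hex(pdep(0xBD, 0x0F0F)))    # 0xb0d (and spread back out again)

# Hypothetical relation layout: one nibble per Var2 color, one bit per Var1 color.
relation = 0b0001_1111_0000_0000
# Mask 0x1111 picks the lowest bit of each nibble, i.e. projects onto "Var1 is color 0".
print(bin(pext(relation, 0x1111)))  # 0b1100
```

Scattering a gathered value back through the same mask returns the original masked bits, so `pdep(pext(v, m), m) == v & m` holds for any `v` and `m`.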

There's probably some kind of tight relational algebra / automatic logic proving / binary decision diagram stuff that you can do with PEXT/PDEP. It really seems like an unexplored field.

----

EDIT: Oh, another big one. ARMv8 and POWER9 standardized upon the C++11 memory model of acquire-release. This was inevitable because Java and C++ standardized upon the memory model in the 00s / early 10s, so chips inevitably would be tailored for that model.




> That's right, you run out of TLB-entries before you run out of L3 cache these days.

This is more reasonable than it sounds. A TLB miss can in many cases be faster than an L3 cache hit.


It's also misleading because it has 8 cores and each of them has 2048 L2 TLB entries. Altogether they can cover 64MiB of memory with small pages.


But the 5800X3D has 96MB of L3. So even if all 8 cores are independently working on different memory addresses, you still can't cover all 96MB of L3 with the TLB.

EDIT: Well, unless you use 2MB hugepages of course.
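A quick sketch of that comparison in Python, counting how many TLB entries it takes to cover the whole L3. The helper `entries_needed` is made up for illustration, and the 2 MiB line only counts entries; how many hugepage entries a given TLB can actually hold varies by microarchitecture.

```python
KIB = 1024
MIB = 1024 * KIB


def entries_needed(region_bytes: int, page_size: int) -> int:
    """TLB entries required to map `region_bytes`, rounding up to whole pages."""
    return -(-region_bytes // page_size)


L3_SIZE = 96 * MIB
CORES = 8
ENTRIES_PER_CORE = 2048                    # L2 TLB entries per core, as quoted above

total_entries = CORES * ENTRIES_PER_CORE   # best case: no two cores map the same page
need_4k = entries_needed(L3_SIZE, 4 * KIB)
need_2m = entries_needed(L3_SIZE, 2 * MIB)

print(total_entries * 4 * KIB // MIB)  # 64
print(need_4k, total_entries)          # 24576 16384
print(need_2m)                         # 48
```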


That's another thing which is recent. Before Haswell, x86 cores had almost no huge-page TLB entries. Ivy Bridge only had 32 in 2MiB mode, compared to 64 + 512 in 4KiB mode.


Are you sure? A TLB miss means a pagewalk. Sure, the directory tree is probably in L3 cache, but repeatedly pagewalking through L3 to find a memory address is going to be slower than just fetching it from the in-core TLB.

I know that modern cores have dedicated page walking units these days, but I admit that I've never tested the speed of them.
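For reference, a pagewalk on x86-64 with 4-level paging and 4 KiB pages uses four 9-bit indices plus a 12-bit page offset, so a full miss costs up to four dependent memory reads. A small Python sketch of how the virtual address is split (the function name is hypothetical, and it ignores 5-level paging and the paging-structure caches that let real walkers skip upper levels):

```python
def split_virtual_address(va: int) -> tuple[int, int, int, int, int]:
    """Split a 48-bit x86-64 virtual address into its 4-level page walk indices."""
    offset = va & 0xFFF          # bits 0-11: byte offset inside the 4 KiB page
    pt = (va >> 12) & 0x1FF      # bits 12-20: index into the page table
    pd = (va >> 21) & 0x1FF      # bits 21-29: index into the page directory
    pdpt = (va >> 30) & 0x1FF    # bits 30-38: index into the page directory pointer table
    pml4 = (va >> 39) & 0x1FF    # bits 39-47: index into the top-level table
    return pml4, pdpt, pd, pt, offset


# Build an address from known indices so the split is easy to check by eye.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_virtual_address(va))  # (1, 2, 3, 4, 5)
```

With a 2MiB page the walk stops one level earlier, at the page directory, and the low 21 bits become the offset.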


It only takes ~200KB to store page tables for 96MB of address space. So the page table entries might mostly stay in the L1 and L2 caches.
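The estimate works out like this, as a minimal Python sketch. It assumes 8-byte page table entries (the x86-64 size), 4 KiB pages, and one contiguous region, and it counts only the leaf level; the upper levels add just a few more 4 KiB tables in that case.

```python
KIB = 1024
MIB = 1024 * KIB

PTE_BYTES = 8            # size of one x86-64 page table entry
PTES_PER_TABLE = 512     # a 4 KiB table holds 512 eight-byte entries

region = 96 * MIB
pages = region // (4 * KIB)                   # 4 KiB pages in the region
leaf_bytes = pages * PTE_BYTES                # bytes of leaf-level entries
leaf_tables = -(-pages // PTES_PER_TABLE)     # leaf tables needed, rounded up

print(pages)              # 24576
print(leaf_bytes // KIB)  # 192
print(leaf_tables)        # 48
```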


I think you made an error in your assumptions.

Each 64byte cache line could feasibly come from a different page in the worst case.

I think modern processors actually pull 128 bytes from RAM at the L3 level. If each 128-byte L3 cache line is from a different page, that's 768k pages in the 96MB L3 cache.

That being said, huge pages won't help much in this degenerate case. So your assumption might be valid for this argument actually.

So maybe it's not that much of an error.
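The worst-case numbers, sketched in Python. This assumes every cached line comes from a different 4 KiB page and that each page needs at least one 8-byte leaf entry, so the last figure is a lower bound on page-table bytes rather than a measurement of anything.

```python
KIB = 1024
MIB = 1024 * KIB


def worst_case_pages(cache_bytes: int, line_bytes: int) -> int:
    """Distinct pages touched if every cache line comes from a different page."""
    return cache_bytes // line_bytes


L3_SIZE = 96 * MIB
pages_64 = worst_case_pages(L3_SIZE, 64)     # 64-byte lines
pages_128 = worst_case_pages(L3_SIZE, 128)   # 128-byte lines, as guessed above

print(pages_64)             # 1572864
print(pages_128 // 1024)    # 768
print(pages_128 * 8 // MIB) # 6 (MiB of leaf entries, versus ~192 KiB when contiguous)
```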


My estimate is for a small number of contiguous regions. It is true that if you adversarially construct a set of cache lines, you might need a far larger amount of memory to store page tables for them. Whether you consider that an "error" or just a simplifying assumption is a matter of opinion, I suppose.


PDEP/PEXT were part of the Intel Haswell microarchitecture, launched in 2013.

And yes, they can be extremely useful for efficient join operations in some contexts, which would be challenging to implement without those instructions. They also help with selection for some codes. Not everyone needs them, but when you need them you really need them. And those use cases are frequently worth it. I use them to implement a general algebra, much like you suggest.



