
How does that interact with the cache? Does accessing the ring buffer through the second set of mapped pages end up using the same cache line, or is it a fresh request to main memory? If it's the latter, I guess that has a good chance of making your circular buffer slower, depending on how big it is, how your cache works and how much cache pressure you experience. I don't think I know enough about actual caches to say whether that's probable or not.
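
(For context, this is the trick I mean: a minimal sketch assuming Linux and memfd_create, with the name and the error handling purely illustrative.)

    #define _GNU_SOURCE          /* for memfd_create() */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* 'size' must be a multiple of the page size. */
    void *make_ring(size_t size)
    {
        int fd = memfd_create("ring", 0);   /* anonymous file backing the buffer */
        if (fd < 0 || ftruncate(fd, size) < 0)
            return NULL;

        /* Reserve 2*size of contiguous address space, then map the same file
           twice, back to back, on top of that reservation. */
        uint8_t *base = mmap(NULL, 2 * size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);

        /* base[i] and base[i + size] are the same physical byte. */
        return base;
    }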


Same cache-line. CPU caches come after virtual memory translations / TLB lookups. Memory caches work on physical addresses, not linear (virtual) addresses.

Memory access -> TLB cache lookup -> PT lookup (if TLB miss) -> L1 cache check (depending on PT flags) -> L2 cache check (depending on PT flags, if L1 misses) -> ... -> main memory fetch, to boil it down simply.

CPUs would be ridiculously slow if that wasn't the case. Also upon thinking about it a bit more, I have no idea how it'd even work if it was the other way around. (EDIT: To be clear, I meant if main memory cache was hit first followed by the MMU - someone correctly mentioned VIVT caches which aren't what I meant :D)


VIVT caches exist, though.


That's very true, though AFAIK they aren't used much in modern processors. It's usually PIPT or VIPT (I think I've seen more references to the latter), VIPT being prevalent because the address translation and the cache index lookup can proceed in parallel in the circuitry.

But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.

EDIT: Looks like some of the x86 manufacturers have figured out a VIPT that has fewer downsides and behaves more like a PIPT cache. I'd imagine "how" is more of a trade secret though. I wonder what the ARM manufacturers do, actually. Going to have to look it up :D
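
To make that concrete (just my understanding, with purely illustrative numbers): a VIPT cache can't alias as long as its set index comes only from the untranslated page-offset bits, which caps the cache size at page size times associativity:

    /* Illustrative numbers: 4 KiB pages, 64-byte lines, 8-way L1. */
    enum { PAGE_SIZE = 4096, LINE_SIZE = 64, WAYS = 8 };

    /* Alias-free VIPT: the set index must fit inside the untranslated
       page-offset bits, i.e. sets * LINE_SIZE <= PAGE_SIZE. */
    enum { MAX_SETS    = PAGE_SIZE / LINE_SIZE };        /* 64 sets      */
    enum { MAX_L1_SIZE = MAX_SETS * LINE_SIZE * WAYS };  /* 32 KiB total */

Which lines up with the 32 KiB L1 data caches you see everywhere; going bigger means more ways or extra alias-handling hardware, and I'd guess the latter is where the secret sauce is.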


Original ARM (as in Acorn RISC Machine) did VIVT. Interestingly, to allow the OS to access physical memory without aliasing, ARM1 only translated part of the (26-bit) address space; the rest of it was always physical.

Nowadays you don't see it, exactly because of the problems with aliasing. Hardware people would love to have these back, because having to keep the cache index within the untranslated (shared) address bits is what limits L1 cache size today. I hope nobody actually does it, because this is a thing that you can't really abstract away, and it interacts badly with applications that aren't aware of it.

Somewhat tangentially, this is also true for other virtual memory design choices, like page size (Apple Silicon had problems with software that assumed 4096-byte pages). And I seriously wish CPU designers wouldn't be too creative with such hard-to-abstract things. Shaving some hundred transistors isn't really worth the eternal suffering of everyone who has to provide compatibility for it. Nowadays this is generally recognised (RISC-V was quite conscious about it). Pre-AMD64 systems like Itanium and MIPS were a total wild west about it.

Another example of a hard-to-abstract thing that is still ubiquitous is incoherent TLBs. It might have been the right choice back when SMP and multithreading were uncommon (a TLB flush on a single core is cheap), but it certainly isn't anymore, with IPIs being super expensive. The problem is that it directly affects how we write applications. Memory reclamation is so expensive that it's not worth it, so nobody bothers. Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere. Which means it's hard to quantify the real-life benefit of TLB coherence, which makes it even more unlikely we'll ever get it.


Original ARM (ARM1 and ARM2) were cacheless; ARM3 was the first with a cache.

The CPU’s 26 bit address space was split into virtually mapped RAM in the bottom half, and the machine’s physical addresses in the top half. The physical address space had the RAM in the lowest addresses, with other stuff such as ROMs, memory-mapped IO, etc. higher up. The virtual memory hardware was pretty limited: it could not map a page more than once. But you could see both the virtually mapped and physically addressed versions of the same page.

RISC OS used this by placing the video memory in the lowest physical addresses in RAM, and also mapping it into the highest virtual addresses, so there were two copies of video memory next to each other in the middle of the 26 bit address space. The video hardware accessed memory using the same address space as the CPU, so it could do fast full-screen scrolling by adjusting the start address, using exactly the same trick as in the article.


Thanks for the information!

> Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere.

Curious if you could expand on this a bit; memcpy still requires that two buffers are mapped in anyway. Do you mean that avoiding maps is more important than avoiding copies? Or is there something inherent about multiple linear addresses -> same physical address that is somehow slower on modern processors?


Assume an (untrusted) application A wants to send a stream of somewhat long (several tens of KB / multiple pages each) messages to application B. A and B could establish a shared memory region for this, but that would possibly allow A to trigger a TOCTOU vulnerability in B by modifying the buffer after B started reading the message. If page capability reclamation were cheap, the OS could unmap the shared buffer from A before notifying B of the incoming message. But nowadays unmapping requires synchronizing with all CPUs that might have TLB entries for A's mapping, so memcpy is cheaper.


That still requires you to plan how you use the virtual address space, though. You can't just add more memory pages on the back of your vector if you've already started using those virtual memory locations for other stuff. So you have to know in advance how much space your vector might end up using, and carefully avoid placing anything else there. If you're willing to come up with such an estimate, and if your OS is happy to overcommit memory (and I think Linux is, by default at least), then you can just malloc() all that memory in the first place. With overcommitting no physical page will be used to back your virtual memory block until you really use it.

If your system doesn't do overcommitting, then I guess that with some mmap() trickery (using the appropriate flags) you could do more or less the same thing, reserving the virtual address space that you need and then actually backing it with memory as you need it.
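
Something like this minimal sketch, assuming Linux-style flags (MAP_NORESERVE and the exact overcommit behaviour vary between systems):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve a big contiguous range of address space without committing memory. */
    void *reserve(size_t max_bytes)
    {
        return mmap(NULL, max_bytes, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    }

    /* Later, make the first used_bytes of the reservation actually usable;
       physical pages still only get allocated when the memory is touched. */
    int commit(void *base, size_t used_bytes)
    {
        return mprotect(base, used_bytes, PROT_READ | PROT_WRITE);
    }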


The nice thing about intrusive sensors that sense things your usual senses don't, like this one, is that it's easy to saturate them while feigning innocence. I.e., set up your laptop so that it does many scans and/or associations to the local WiFi, plus some light internet activity (the usual suspects: WhatsApp, Facebook, etc). The detector triggers, the landlord shows up to check what's going on, and you show that it's just you and your partner. Do that a few times until the landlord is convinced that the sensor is malfunctioning; unless they are IT technicians themselves, which I guess won't happen often, they will have a hard time understanding what's really going on, let alone proving it.


What's the end goal? To convince the owner that the occupancy counter is malfunctioning for a couple of days and then throw a big party? Maybe you can just ask if it's okay to host a party before renting...


Whose end goal? The tenant's goal might be, as you say, to convince the landlord that the device is unreliable, and cheat on the rent agreement. If you want to cheat, as I guess some people do (otherwise there wouldn't be need for a monitor), asking if it's ok to host a party isn't a good solution.

My personal end goal is speculation. Each time a technology is discussed, it's pretty automatic for me to think about what its weaknesses and strengths are, and how its behavior can be subverted or used in unintended ways.


In the fine print it is written that "unlimited" really means "only until the airline thinks you're abusing it, at their sole discretion".


> One of the most curious humans

No doubt he is very curious, but there might be some selection bias here.


The point is that there are a lot of other things which easily become a problem if you do them by yourself instead of using known good implementations.


> We never take a software methodology, school of programming or some random internet dude's "manifesto" at face value. Rules must be broken, when necessary.

In this specific case it seems particularly necessary. I don't think I will take this manifesto at face value.


In some conditions, yes. You need a cluster of points with good reflectivity and coherence to microwaves over some time (months to years). Manmade steel and concrete structures, like bridges, houses, dams, etc., usually work very well.

You can't measure their position to the millimeter range, but with some interferometry techniques you can measure their movement to the millimeter range, relative to close points. Some variation of https://www.sciencedirect.com/science/article/pii/S092427161... was likely used in that work; I've seen it done for many other structures (and I even tried to set up a pipeline for doing it for commercial customers, but in the end we didn't manage to get anybody to fund us).
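
Very roughly, the reason millimeters are plausible: differential interferometry turns the phase difference between two acquisitions into line-of-sight displacement, something like

    d_los ≈ (λ / 4π) · Δφ

With a C-band wavelength of about 5.6 cm a full 2π fringe is about 2.8 cm, so resolving a small fraction of a fringe across a stack of acquisitions gets you into the millimeter range (for relative movement, as said, not absolute position).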

You can probably get better measurements with an onsite survey, but using satellite data has the advantage that with a handful of satellites you can map an entire country once every one or two weeks, and after throwing some computing power at it you can theoretically monitor all the bridges and houses at once and get early predictors of possible problems.

These case studies give you a hint of what can be done: https://www.sarproz.com/case-studies/ (I'm not and never have been affiliated with that product, just linking some cool pages).


At least it's documented.


The DirectX specs are much better than both the OpenGL and Vulkan specs because they also go into implementation details and are written in 'documentation language', not 'spec language':

https://microsoft.github.io/DirectX-Specs/


If you search for a 'D3D12 spec', what you actually find is that D3D12 doesn't have a specification at all. D3D12's "spec" is a document that only states the differences from D3D11. There's no complete, holistic document that describes D3D12 entirely in terms of D3D12. You have to cross-reference back and forth between the two documents and try to make sense of it.

Many of D3D12's newer features (Enhanced Barriers, which are largely a clone of Vulkan's pipeline barriers) are woefully underspecified, with no real description of their precise semantics. Just finding out whether a function is safe to call from multiple threads simultaneously is quite difficult.


I don't think that going into implementation details is what I would expect from an interface specification. The interface exists precisely to isolate the API consumer from the implementation details.

And while they're much better than nothing, those documents are certainly not a specification. They are individual documents, each covering a part of the API, with very spotty coverage (mostly focusing on new features) and an unclear relationship to one another.

For example, the precise semantics of ResourceBarrier() are nowhere to be found. You can infer something from the extended barrier documentation, something is written on the function's MSDN page (with vague references to concepts like "promoting" and "decaying"), and something else is written on other random MSDN pages (which you only discover by browsing around; there are no specific links), but at the end of the day you're left to guess the actual assumptions you can make.

*EDIT* I don't mean to say that the Vulkan or SPIR-V specification is perfect either. One still has a lot of doubts while reading them. But at least there is an attempt at writing a document that specifies the entire contract between the API implementer and the API consumer. Missing points are in general considered bugs and sometimes fixed.


> I don't think that going into implementation details is what I would expect from an interface specification.

I guess that's why Microsoft calls it an "engineering spec", but I prefer that sort of specification over the Vulkan or GL spec TBH.

> The interface exists precisely to isolate the API consumer from the implementation details.

In theory that's a good thing, but at least the GL spec was quite useless because concrete drivers still interpreted the specification differently - or were just plain buggy.

Writing GL code precisely against the spec didn't help with making that GL code run on specific drivers at all, and Khronos only worried about their spec, not about the quality of vendor drivers (while some GPU vendors didn't worry much about the quality of their GL drivers either).

The D3D engineering specs seem to be grounded much more in the real world, and the additional information that goes beyond the interface description is extremely helpful (source access would be better of course).


I wish clangd let me configure the inlay hint elision length...

