> caching (and cache-invalidation) is really difficult
That's an urban legend.
Needlessly complicated or over-abstracted general-purpose caching frameworks are difficult, but your dumbest imaginable linear LRU fast lookup is both exceptionally useful and can indeed be done in 10 LoC in a lot of cases.
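To illustrate what I mean, here's a rough Python sketch of such an LRU lookup (names are mine, and it's a toy, but it's in the ballpark of those 10 LoC):

```python
from collections import OrderedDict

def make_lru(capacity, compute):
    """Wrap `compute` with a dumb-as-rocks LRU cache of `capacity` entries."""
    cache = OrderedDict()
    def lookup(key):
        if key in cache:
            cache.move_to_end(key)       # mark as most recently used
            return cache[key]
        value = compute(key)
        cache[key] = value
        if len(cache) > capacity:
            cache.popitem(last=False)    # evict the least recently used entry
        return value
    return lookup
```

Usage is just `square = make_lru(128, lambda x: x * x)` and then calling `square(...)` as normal. Nothing general-purpose about it, and that's the point.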
It's not the algorithm that's difficult, it's the effect of adding caching to a system that wasn't built with caching in place at the start that is difficult.
This isn't an "urban legend"; it's first-hand experience from working with companies trying to add caching. No, those companies aren't even trying to write caching algorithms; they're just bolting on a caching layer and hoping the system behaves the same way.
It only takes one place that writes data (perhaps in a way that bypasses the caching layer, so the cache doesn't know the data has changed) and quickly re-reads it, and software that used to work suddenly breaks.
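To make that failure mode concrete, here's a toy sketch (hypothetical names; a dict stands in for the real database) of a read-through cache with one writer that bypasses it:

```python
db = {"report_total": 100}   # stand-in for the real data store
cache = {}

def cached_read(key):
    if key not in cache:               # read-through: fill on first miss
        cache[key] = db[key]
    return cache[key]

def direct_write(key, value):
    db[key] = value                    # bypasses the cache entirely

cached_read("report_total")           # warms the cache with 100
direct_write("report_total", 250)     # the cache never hears about this
stale = cached_read("report_total")   # still returns 100, not 250
```

Before the caching layer went in, the re-read returned the fresh value; afterwards it silently doesn't. That's the whole problem in three functions.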
Now you might look at that and go "omg, refactor it! That's horrible code, that should never ship", etc., but not everywhere is the S.V. bubble with endless supplies of the best developers to throw at problems. Code that worked and solved a business problem ended up shipping, possibly without testers and probably without code reviews.
So adding a caching layer suddenly "breaks" those reports. Now who's going to have to fix it? Not the person who wrote those reports, even though that very pattern of side-effected data changes and DB re-reads is precisely a cause of the data-layer slowness that led to wanting caching in the first place...
I am not sure you and the OP are talking about the same thing.
Bundling a caching layer without even trying to write caching algorithms -- particularly if such a layer is serving multiple purposes -- sounds to me like what the parent is calling a "general-purpose caching framework", probably overly abstracted too as these things are wont to be.
I would be super cautious incorporating this kind of black-box stuff. Even if the docs have 10 LOC examples, there can be all kinds of unexpected quirks that you would need to be aware of before doing anything.
I read the OP as talking about specific, single-purpose caching techniques -- e.g. when you need to repeatedly compute a function of arbitrary parameters, it can help a lot to simply store values for the more common parameter combinations.
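In Python, say, that single-purpose version is often just a decorator on a pure function (hypothetical example; the function and rates are invented):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def shipping_cost(weight_kg, zone):
    """Pure function of its parameters -- safe to cache by argument tuple."""
    # stand-in for an expensive computation
    rates = {"A": 1.5, "B": 2.25}
    return round(weight_kg * rates[zone], 2)
```

Repeated calls with the common parameter combinations hit the stored values instead of recomputing, and there's no invalidation question to answer because the inputs fully determine the output.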
It all depends on the purity of the underlying computation. If you're multiplying XXL numbers together, that's different from accessing a database with dynamic or variable data. Math is totally pure. But you'll have to evaluate the constraints on the purity of that database access.
Nobody's saying that one can just throw in a caching layer, touching nothing else, and it will just magically work. There's obviously some thought and due consideration required, but it is NOT "really difficult". And in a lot of cases it is in fact as simple as adding a handful of lines of code.
PS. It is an urban legend, because "cache invalidation is hard" gets repeated a lot, initially as a joke, but that doesn't stop people who aren't familiar with the subject from taking it as fact and then repeating it as such.
Voice recognition from scratch is hard. Some lock-free data structures are hard. Caching is not hard. It's knowing what the heck you are actually doing and doing it well is what's hard. By the same measure, C macros would be hard, because some idiot can do #define true false and everyone else will spend 10 months trying to understand why the hell things break now and then. Caching is hard when someone starts messing with other people's code without fully understanding it. But then anything is "hard" under those circumstances.
The difficulty of implementing voice recognition and lock-free structures does not make caching easy. Yes, you can implement caching in 10 lines of code. Anybody can do that; it's writing the *correct* 10 lines of code that is very hard.
Sure, if you're implementing Fibonacci, memoization is simple. But if you're trying to memoize results coming out of a database, things are going to get a lot more complicated, really fast.
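To spell out the contrast (toy Python; the "database" is just a dict standing in for real storage):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Pure: the answer for n never changes, so memoizing is trivially safe."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# By contrast, memoizing a database read silently freezes the first answer:
db = {"user:1": "alice"}

@lru_cache(maxsize=None)
def get_user(key):
    """NOT pure: the underlying row can change under the cache."""
    return db[key]

get_user("user:1")        # caches "alice"
db["user:1"] = "bob"      # row updated behind the cache's back
get_user("user:1")        # still returns "alice"
```

Same decorator, same 10 lines, completely different risk profile.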
It requires knowledge of what can go wrong, what will go wrong, the use cases associated with a bit of data to be cached, how the application will be deployed, what other caching will be implemented across the system, and a dozen other bits of information unique to each use-case.
Knowing what questions to ask, and of whom to ask them, requires experience.
> It's knowing what the heck you are actually doing and doing it well is what's hard.
That sort of goes without saying. If doing something well is hard, then doing that thing is hard. No one is interested in writing code that doesn't work -- we're always talking about "doing it well".
We don't all go around claiming that quantum mechanics is easy and then backing it up by demonstrating our ability to pull grossly wrong answers out of our asses.
> It only takes somewhere which writes data (perhaps in a way that bypasses the caching layer so the cache doesn't know it has changed) and re-reads it back quickly for software which used to work suddenly breaks.
So those 10 lines need to be in the wrong place?
Why expose the uncached API?
Whether it's a web app or a single system, implementing caching behind a defined API is easy. If you don't have a defined API, you probably don't have a good system. And if you don't have a good system, why are you trying to implement caching? The not-good part is probably why it's slow.
Precisely -- it's the "general-purpose" part that is extremely complicated, which is what everyone here seems to be talking about when bringing up invalidation, but having simple, specific and localized caching in a performance-critical region is not that hard, and can be very effective.
I have to agree with the parent - cache invalidation is tough. When do you do it? What triggers an invalidation? When is it OK to use known-stale data? How long should a cache be valid for? If multiple copies of a program are brought up, what effect will it have on the caches? How do updates from one instance get populated to others? What are the impacts on up- and down-stream services?
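Even the simplest answer to "how long should a cache be valid for" -- a TTL -- is already a policy decision with an invalidation hook you have to remember to call. A rough sketch (invented names, single-process only; none of the multi-instance questions above are even touched):

```python
import time

class TTLCache:
    """Time-based expiry: entries are served until ttl_seconds elapse."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}                      # key -> (value, expiry time)

    def get(self, key, compute):
        value, expiry = self.store.get(key, (None, 0.0))
        if time.monotonic() < expiry:
            return value                     # still fresh (possibly stale data!)
        value = compute()                    # missing or expired: recompute
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key):
        self.store.pop(key, None)            # explicit invalidation hook
```

Every question in the list above maps to a design choice here: the TTL value, who calls `invalidate`, and what happens when a second process has its own copy of `store`.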
I'm going to borrow from someone much more eloquent than I: How simple it is to declare a static hashtable, and yet how perilous!