Overfitting and the strong version of Goodhart’s law (sohl-dickstein.github.io)
187 points by andromaton on Nov 11, 2022 | 107 comments



I recommend reading "The Collapse of Complex Societies" by Joseph Tainter [0].

Complex societies tend to address problems by adding more and more rules and regulations, simply because they have always done so and it has been successful in the past. More importantly, though, it is typically the only tool they have. Essentially these societies are increasingly overfitting their legislation to narrow special cases until it cannot handle anything unexpected anymore. Such a society is highly fragile. I witness this firsthand every day in my own country. Living here feels like lying in a Procrustean bed.

[0] https://www.amazon.com/Collapse-Complex-Societies-Studies-Ar...


This is not new, merely ignored:

"What's the origin of the phrase 'Hard cases make bad law'?

... 'Hard cases make bad law' isn't so much a universal proverb as a legal adage. It came to light in a comment made by Judge Robert Rolf in the case of Winterbottom v Wright in 1842:

    This is one of those unfortunate cases...in which, it is, no doubt, a hardship upon the plaintiff to be without a remedy but by that consideration we ought not to be influenced. Hard cases, it has frequently been observed, are apt to introduce bad law.
The case required a judgment on whether third parties are able to sue for injury. The unusual nature of the case caused the judge to realise that, in the true sense of the expression, exceptions prove the rule and that, unfair as it might have appeared in some circumstances, the law was better drafted under the influence of the average case rather than the exceptional one.

The point was made explicitly in 1903 by V. S. Lean, in Collectanea:

    Hard cases make bad law. that is, lead to legislation for exceptions."
https://www.phrases.org.uk/meanings/hard-cases-make-bad-law....


Something like this could (should?) perhaps be handled by a governmental "remedy" service: a small insurance-like tax levied on everyone that judges could use to award a remedy without forcing the defendant to pay.


We have already moved on, and a case like this would no longer be regarded as unusual, let alone difficult. In the process, something like what you propose arose, in the form of liability insurance. The system works reasonably well, in that there are a lot fewer hard cases without creating a crippling burden in the average case.

If anything, this case, together with the way things have changed since then, demonstrates that there is often a good deal of subjectivity in what constitutes a 'hard' case.


But that only moves the problem somewhere else, because insurance creates a moral hazard. Someone who would otherwise be cautious to prevent harm has less incentive to do so, because when the harm comes the insurance pays.


Not necessarily. In the case of accidents, liability insurance presents no moral hazard. The ordinary person is not going to leave a broken railing in his house that could cause a guest to fall and injure himself just because he has homeowner's insurance.

Similarly, the ordinary person doesn't seek out automobile accidents on account of having mandatory automobile insurance.

Perhaps most strikingly, the moral hazard theory would suggest that life insurance policy holders are more likely to commit suicide, but in reality they are less likely to![1]

Sure, there are probably vanishingly rare exceptions to the above, but the moral hazard concern is wildly overblown for consumer insurance products purchased by ordinary people.

[1] https://www.munichre.com/us-life/en/perspectives/suicide-mor...


> In the case of accidents, liability insurance presents no moral hazard. The ordinary person is not going to leave a broken railing in his house that could cause a guest to fall and injure himself just because he has homeowner's insurance.

Maybe not if repairing the railing costs $15, but what if the safety repairs would cost $15,000? When not doing it could cause someone who gets hurt to render you bankrupt and homeless, you find the money. When you're insured, you may have other priorities.

> Similarly, the ordinary person doesn't seek out automobile accidents on account of having mandatory automobile insurance.

It's not about seeking them out. You don't want an accident, but you do want to read that text you just got, and you're more likely to wait until you're stationary if an at fault accident could ruin you instead of just raising your insurance premiums.

> Perhaps most strikingly, the moral hazard theory would suggest that life insurance policy holders are more likely to commit suicide, but in reality they are less likely to!

Isn't suicide an exception to nearly all life insurance policies, among other reasons to remove that very incentive?


> You don't want an accident, but you do want to read that text you just got, and you're more likely to wait until you're stationary if an at fault accident could ruin you instead of just raising your insurance premiums.

Literally no one ever has thought “self, I’m going to look at this text while I drive because I’m insured!” In the real world they’re doing it because they’re addicted and not because of some rational calculus. That might make a good scenario for a comedic skit though.

I don’t feel like looking it up on my phone, but I’d bet at worse than even odds that drunk drivers are in fact less likely to be insured, when the moral hazard theory would predict they’re more likely to be.

> Isn't suicide an exception to nearly all life insurance polices, among other reasons to remove that very incentive?

The answer is either not really or even an outright no. Individual policies usually have a 1-2 year no suicides clause and after that they pay. Group policies like employer offered ones usually have no wait period and will just pay out.


> Literally no one ever has thought “self, I’m going to look at this text while I drive because I’m insured!”

It works the other way. If you have no insurance, you think, "self, I'm not going to look at this text while I drive because I'm not insured, and if I hit someone it could cause me to lose my house."

Same reason undocumented immigrants follow the speed limit.

> I don’t feel like looking it up on my phone, but I’d bet at worse than even odds that drunk drivers are in fact less likely to be insured, when the moral hazard theory would predict they’re more likely to be.

There are obvious reasons for this to be the case independently. People with a DUI record are more likely to drive drunk, but people with a DUI record may not be able to get or afford insurance. Drunk driving and not having insurance might both be correlated with poverty. Things like that. Is it your argument that not having insurance causes you to be less likely to drive drunk, all else equal?

> The answer is either not really or even an outright no. Individual policies usually have a 1-2 year no suicides clause and after that they pay. Group policies like employer offered ones usually have no wait period and will just pay out.

But the 1-2 year clause is there specifically because of the moral hazard. Otherwise not only would anyone planning to commit suicide have the incentive to take out life insurance first, anyone who needed a quick big payout for their loved ones would have the incentive to take out a policy and then commit suicide.

And the general trend in the opposite direction is caused by both the removal of that incentive, and the same kind of confounders as in the DUI case. People with stable employer-provided insurance coverage or with the financial stability to afford premiums for >2 years are the sort of people less likely to take their own lives.


Insurance is not free, and it generally gets more expensive the more reckless you are (or seem to be). That is not to say price is the only factor in whether this is a zero-sum game, but it is arguably the most objectively quantifiable one.


The moral hazard is the difference between the insurance payout and the amount your future rates change as a result of the claim. If this amount is zero, there is no insurance. If it isn't, there is that much less incentive to avoid the harm.


Much less? Can you explain why that kicks in immediately, as soon as the moral hazard is non-zero?


"That much less" meaning equal to the difference between the amount of the harm and what they'd expect to pay in increased premiums.

If spending $X would avoid a small chance of a million dollars in damage, the X you're inclined to spend is a lot larger if the potential loss is the full million dollars than if it would increase your insurance premiums by a net present cost of $1000.


The model you present here is semi-quantitative, in that it has an example value of $1000 for insurance premiums, another of $X for liability, but "small chance" is not introduced as a variable, and neither are "lot larger" and the change in premiums as a function of change in safety spending. I suspect that if this model were completed in accordance with your premise of equality, it would imply there is no rational case for insurance.

This seems moot, however, as this is not shaping up to be a plausible model for how things actually went since Winterbottom's unfortunate accident. Road transport vehicles have become a great deal safer since then, even as the potential for them to do harm has increased enormously. For your argument to be pertinent, it would have to be likely that, in the alternative reality where Rolf's ruling remained the law and liability insurance did not come about, they would be safer than they are now.


> The model you present here is semi-quantitative, in that it has an example value of $1000 for insurance premiums, another of $X for liability, but "small chance" is not introduced as a variable, and neither are "lot larger" and the change in premiums as a function of change in safety spending.

All of the numbers are obviously made up examples because in practice they depend on what the risky behavior is and the value of the dollar etc. But we can make up more of the numbers, if you like example numbers.

A 1% chance of a million dollar liability has an expected value of -$10,000. A 1% chance of a premium increase with a net present cost of $1000 has an expected value of -$10. Therefore, the resources the party would rationally expend to prevent the harm are up to $10,000 in the first case and up to $10 in the second case. That difference is a lot.
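
(A minimal sketch of that arithmetic, with the same made-up numbers, assuming a risk-neutral party that will spend up to its expected loss on prevention:)

    # Toy expected-loss comparison; all numbers are illustrative.
    p_harm = 0.01                # 1% chance of the bad event
    uninsured_loss = 1_000_000   # full liability if uninsured
    insured_loss = 1_000         # net present cost of the premium increase

    ev_uninsured = p_harm * uninsured_loss   # $10,000 expected loss
    ev_insured = p_harm * insured_loss       # $10 expected loss

    # Rational prevention spending caps out at the expected loss,
    # so the incentive gap is the difference:
    print(ev_uninsured, ev_insured, ev_uninsured - ev_insured)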

> I suspect that if this model were completed in accordance with your premise of equality, it would imply there is no rational case for insurance.

Insurance is a net loss to the average insured, even before the moral hazard, because the sum of the premiums is necessarily more than the sum of the claims since premiums also have to cover the insurance company's overhead (or the insurance company becomes insolvent).

Its only purpose is to pool risk. Many people prefer a 100% chance of a $1050 loss to a 1% chance of a $100,000 loss. But pooling risk introduces moral hazard -- that's one of the reasons people like to be insured. "Peace of mind" = don't have to worry because the insurance will cover it.

> For your argument to be pertinent, it would have to be likely that, in the alternative reality where Rolf's ruling remained the law and liability insurance did not come about, they would be safer than they are now.

The market wants cars that are safer for their occupants because insurance can't bring you back from the dead.

The market doesn't care if cars are less safe for pedestrians or other motorists, because that cost is on the other party or the insurance company. And so we see cars getting heavier over time, as expected from that set of incentives.


> Its only purpose is to pool risk.

Precisely - there is a rational case for insurance despite your analysis.

The fact that cars have, on average, become heavier lately is at most a cherry-picked second-order effect that is far from sufficient to refute either what I wrote in my previous post, or in my original post in this thread. Despite your analysis, it is plausible, and IMHO extremely likely, that the same forces that have led to the improved situation since Winterbottom v. Wright will continue to improve the situation for pedestrians and other motorists, as they have in the past.


I'd second the recommendation.

I personally enjoyed reading it, and from my limited exposure to the topic, my impression was that although some anthropologists disagree and theory has advanced since, The Collapse of Complex Societies was a landmark work and its ideas are still taken seriously.


Meanwhile, the current Indian government has repealed hundreds of laws and acts.

https://www.indiatoday.in/mail-today/story/narendra-modi-law...


Interesting idea! Seeing as my "should read" list is already hundreds of books long, can you give us a spoiler as to what he suggests (if anything) to fix this? Or is it simply inevitable and therefore not actionable?


Here's what I can recall:

The big idea is diminishing marginal returns – increasing societal complexity pays off enormously at first, but eventually the returns level off while the overhead remains high or even continues to increase. Collapse is then seen as a natural (and not necessarily cataclysmic) response to this. It's a process of simplification.

Tainter also points out that collapse may not occur (1) when there are strong, neighboring powers, which simply absorb the foundering society, or (2) when there are strong cultural reasons, such as national identity, for which the members of the society may be willing to put up with the bureaucratic overhead indefinitely (in order to prevent being absorbed); he gives Europe as an example, I believe.


Thanks! I'm also adding this to my reading list now.


This keeps getting rediscovered in new domains.

JIT was the savior of manufacturing, until people learned that a single traffic jam that delayed a single delivery could create costs far in excess of the inventory savings.

Optimizations are critical, everywhere. And measuring optimizations is important because it is easier and cheaper and earlier than measuring end results. Measuring days of inventory, or dollars in inventory, is a pretty good proxy for supply chain efficiency, until covid hits and suddenly “efficiency” means stability rather than minimum costs.

Over and over. Branch prediction / Spectre. Mortgage-Backed Securities. Optimizations based on second- and third-order effects blow up in the real world where any abstraction is approximate.

So it’s not the efficiency that’s the problem, it’s changing the focus from maximizing the desired output to maximizing efficiency, and measuring success based on the performance of efficiency optimizations.


I feel like I have to repeat this very often: if a single traffic jam or other predictable common-cause variation results in your JIT implementation costing "far in excess of inventory savings" all you're telling me is that you have a really shitty JIT implementation.

JIT means having the buffer on hand to handle at the very least common-cause variation. But it also should come with the flexibility to handle variation of assignable causes at a reasonable cost.

Critically, you can't just press delete on all your inventory and then call it JIT. You have to adapt your processes for it, work with local authorities for improvements to infrastructure, and then always keep a larger buffer than you think you need.

JIT is not about deleting inventory. It's about reducing variation in your processes until most of your inventory truly becomes useless.

When you have a good JIT implementation, you are more flexible to take advantage of changing external conditions, not less.


You've now redefined JIT to mean "buffered with the minimum viable buffer", where minimum viable buffer is "wherever I drop the goalposts when defining common-cause variation".

It's the definitional version of working out a variable in an equation, making a mistake, and ending up looking at 0=0.

Happens all the time.


Yes, but that is literally what JIT means, though. It's a mindset of solving assignable-cause variation at its cause rather than papering over it with more inventory.


Literally, "just" in time would mean "not more (material, goods,...) than needed" in time, wouldn't it? A real world analogy to Kanban.


Kanban is indeed one way to do just in time. It's useful primarily for the more variable flows. For more regular flows, you can often rely on the average flow rate working out and schedule deliveries in advance.


JIT is mostly a name for inter-company Kanban (the factory version of Kanban, not the software one).

And because of that, yes, the amount of inventory is pretty much arbitrary.


This is a common misconception. People often talk about the kanban style JIT because it looks impressive, in some sense. There's all these things going back and forth.

But as Taiichi Ohno himself would tell you: if it looks impressive, it's wasteful. It's a lot of motion with little movement. Efficiency looks disappointingly obvious and simple.

Kanban is a necessary evil when you have been unsuccessful in driving out variation, it's not a desirable state of things.


Yes? My point was that people get carried away with maximizing optimizations because they start measuring the degree of optimization rather than the ultimate goal.

I’m not saying JIT always goes wrong, any more than neural net training always goes wrong. Just that when it does, it is often because of an overzealousness that leads to a disconnect between the local and global goals.


> This keeps getting rediscovered in new domains.

I'm not sure there's a need for a "strong version" of Goodhart's law, or I fail to understand the distinction the author is trying to make.

Goodhart is "so true" precisely because it warns about the fact that the measure is often not the goal itself, just like "the map is not the territory", and is indeed an approximation of the goal.

One finds a similar problem with the word "best": best according to which metric? That's how people have different ideas of what is the best.


See footnote 4. I would quote it except it's pretty long and full of links, and an edit, which HN "formatting" would make confusing.


I highly recommend playing the beer game with different inventory sizes, and looking at the results.

Inventory management is not a simple task, and can not be generalized like this. JIT was adopted because it reduced the number of supply chain disasters, not despite increasing it like you claim. But, of course, that reduction wasn't homogeneous and not every single place saw a gain.


I apologize for the unclarity. I was not trying to assert that JIT is fundamentally wrong and always leads to disaster.

My point was that any optimization process can go wrong when people start focusing on maximizing the optimization rather than the ultimate goal. That is how JIT goes wrong. It is also how overfitting appears in neural net training, how security flaws appear in branch prediction, etc.

I'm an MBA. I love JIT. But it's undeniable that JIT has led to disasters. I'm not blaming the approach, I'm blaming specific implementations.


A major consideration people seem to miss when discussing JIT is that it originated in a country with extremely reliable transportation. I had some chance to observe highly resilient and flexible logistical operations in Japan in the 1990s, and have since wondered just how well JIT in the USA follows that model.


The issue with JIT is not so much about proxies as it is about optimizing for the general case at the cost of worst-case performance. Things that look like an inefficiency 99.99% of the time can be an indispensable redundancy or buffer in the face of events that are difficult or impossible to account for.


Everything is a tradeoff.

Efficiency is usually a tradeoff that limits flexibility. If you don't happen to need flexibility, you can have amazing efficiency: a wheel of a rail car is so much more efficient than a foot. But only when it stays on a rail.

If we look at somewhat more difficult terrain, a foot suddenly happens to be a better deal, because it adapts to a variety of conditions. Of course, at the expense of efficiency (complexity, energy consumption, the need for much more advanced control circuitry, etc).

In business you usually have to have some safety margin for the unexpected. If you squeeze it out in the name of efficiency (and profit), all goes well until it does not, and then it fails much harder, maybe catastrophically.

(Nassim Taleb wrote a whole book about that, "Antifragile", where he gives a ton of fun real life examples to make basically the above point.)


Trade offs are also optimizations: you are optimizing for the sums of weighted advantages and disadvantages of something: https://en.wikipedia.org/wiki/Loss_function


I would argue "efficiency" is the wrong word for what we discuss. Efficiency means optimising resource usage while achieving goals. If we need flexibility to achieve our goals consistently (and you usually do) then flexibility is part of efficiency (and effectiveness and efficacy) rather than opposed to it.

Phrased differently: if you define "efficiency" to mean "optimise for a single proxy metric and not what you're actually interested in" then yes, of course efficiency and effectiveness will be opposed other than sometimes by chance. But that's a dumb definition of efficiency!


To me, efficiency means achieving a desired goal while consuming a minimum of resources (time, energy, space, …).

If the goal is defined in too narrow a scope, i.e. your 'dumb' definition of efficiency, the flexibility may be missing. Still, that particular goal may be reached efficiently.

So it’s not an issue with the definition of efficiency, but rather with scoping the problem. As the article states, it may not always be possible to scope the problem in an easily measurable way, hence optimizing for proxy targets.


It seems to me you're making the same terminology mistake, only shifted one layer back! You're expressing yourself as if the "too narrow metrics" were the goal in and of themselves. I would argue the metrics are never the goal -- the fuzzy intention they're proxying for is always the goal.


This is exactly what Taleb wrote about [1].

Systems start out very fragile. Over time they get more and more robust, as edge cases and events result in more rules and rigidity. This works until they become so robust that they can no longer cope with the change when something unexpected - i.e. a black swan event - happens.

The only way to counter this is to make a system that improves at coping with disorder/chaos as it encounters it.

[1] https://en.wikipedia.org/wiki/Antifragility

Update: reworded that last sentence a bit.


Also reminds me of min/maxing in a multi-player game with an economy.

Sometimes in a balanced game a new min/max strategy is found that quickly becomes the dominant form of wealth generation in the economy. Asset inflation typically explodes in this scenario, driving prices for everything way up.

The game itself typically becomes less fun at this point as min/maxing is now the only valid economic activity, hence intervention by the gods in the form of a patch. But while this fixes the exploit it does not fix the economic ruin. Players that made massive amounts of wealth in the event can have strangleholds on the economy and cause stagflation as material prices only drop slowly. New/poor players are hobbled with poor purchasing power.

It would be an interesting study to see how many of the effects in this article could be reproduced by gamifying these situations.


I'm not sure I agree.

The problem with overfitting and inability to extrapolate out of sample is variance. The problem with Goodhart's law is bias.

I don't think a training sample is to the full population what a proxy metric is to the objective -- not theoretically, and not practically. The training sample faithfully represents the full population (by definition, if it was randomly selected). Any difference in composition is down to sampling error, and this is known from theory.

When overfitting we are still optimising for the objective, only adapting more to individual data points than desirable. Goodhart's law implies optimising for the wrong thing entirely. We have no theoretical tools to deal with this and I suspect we never will, because it's a problem of subjective judgment.
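
(A minimal sketch of that distinction, assuming a toy polynomial regression with made-up numbers: the training sample is an unbiased draw from the population, so the objective is the "right" one, yet the high-capacity fit still generalises badly -- variance rather than a mis-specified target.)

    import numpy as np

    rng = np.random.default_rng(1)

    # "True" objective: fit y = sin(3x). The training set is a small,
    # noisy, but unbiased sample from that population.
    x_train = rng.uniform(-1, 1, 15)
    y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=15)
    x_test = np.linspace(-1, 1, 500)
    y_test = np.sin(3 * x_test)

    for degree in [2, 5, 14]:
        # Degree 14 interpolates all 15 points (numpy may warn that the
        # high-degree fit is poorly conditioned; that is rather the point).
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")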


Author of blog post here!

Overfitting can happen in many ways -- your training objective can be different at train and test time, or as you suggest the datapoints you use can be different at train and test time.

For overfitting induced by datapoints: If you include the datapoints in your problem specification, then you can say they induce bias at test time. If you treat the choice of training datapoints as a random variable, separate from the problem specification, then you can say they induce variance at test time. The difference is essentially semantic though. In general, you can freely move contributions to the error between bias and variance terms by changing which aspects of the modeling framework you define as fixed by the problem definition, and which you take to be stochastic.


My first impression is also in agreement with the parent. The blog post appears to use some terms loosely in order to make the connection between overfitting and Goodhart's law stronger. For example, calling the training sample a "proxy" and stating that it is a slightly different goal is already leading towards the pre-defined conclusion.

And the reply also leaves me with a similar impression:

> your training objective can be different at train and test time

But this is not overfitting, this is concept drift, a different and well-defined thing in ML.

> the datapoints you use can be different at train and test time

Both train and test data came from the same population. They are just different incomplete random samples.

I guess what I am getting at - overfitting happens because we know we are training a model on an incomplete representation of the whole. But that representation is not a proxy, as suggested in the article - it is not slightly different to the goal. It's an incomplete piece of the goal.


A gentle note that an incomplete piece of a goal (e.g. a loss function computed on a subset of the data) is a proxy for the full goal (e.g. the loss function on the full dataset).

Similarly, concept drift can be a source of overfitting -- the objective you care about is the one after the concept drift occurred, but the objective you trained on is the one from before the concept drift. (Here's a scholar search for papers where the two concepts co-occur: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_vis... )


I think this is a difficult concept for many without statistical training. The fact that different outcomes can be "the same" from a practical perspective.


Maybe the ultimate proxy measurement is the pursuit of human economic productivity at the expense of longevity of our planet.

Contrived example: humans need food to survive. We need food, efficiently, since there are so many of us. So we need efficient farming. Efficient means the smallest amount of land to produce the most biomass (plant, animal). For that, we need single-purpose farmland and artificial feeds, fertilizer and pesticide. To have single-purpose farmland, we purge the land of all life forms, aka killing all plants and animals. To have artificial stuff, we need energy. To have energy, we expend fossil energy. To have fossil energy, we, the earth, started capturing sun energy millions of years ago.

Meanwhile, some farmer right now is letting their cows and chickens roam on an "unindustrialized" plot of land. The cows graze on the pasture, the chickens eat the bugs, their droppings feed the grass, the stomping excites the earthworms, the earthworms aerate the land, the grass captures the sunlight, the sunlight feeds everyone, in a perfect balance. 0 GDP generated.

I think it is mentioned in The Omnivore's Dilemma [1] that, to produce 1 unit of energy in food, we are using 14 units of fossil fuel energy and 3% of the workforce. Before the industrial age (200ish years ago), the ratio was 1 unit of food energy to 2 units of sun energy, and 90% of the workforce.

Efficiency? Check. Quality of life? Check. All numbers show that we are improving.

1. https://www.goodreads.com/book/show/3109.The_Omnivore_s_Dile...


Ironically, we're not very efficient at all with our land usage today. Nearly 60% of global agricultural land is used for beef (either for pasture or for growing feed) and yet it accounts for only 2% of our calories. And about 50% of all food is wasted – simply thrown away.


I doubt it affects your central claim, but do note that a lot of pasture land is unusable for anything else.


Unfortunately, in practice, the opposite is true.

We grow food, such as grains, just to feed it to cattle, which has about 1% calorie efficiency. For example, in the EU 63% of arable land is used to grow animal feed rather than food directly for humans.


that's an optimization error of another stripe - thrown away or not, the beef gets sold and FarmCo makes money


If we went with the latter, we would need more land to feed the same number of people. There is more "nature" when we leave the forest as is rather than cutting it down to convert it into a poorly run "human land".

> There were more cereal calories per person in 2020 than in 1992. And this abundance was brought about without massive increases in the area being farmed. While industrial emissions rocketed, emissions due to land-use change fell by a quarter.

We were able to more than triple our output in that period.

https://www.economist.com/special-report/2022/11/01/a-lot-ca...


Not sure I agree. The problem with overfitting is fitting too closely to the data points at hand, but you might still be measuring the right thing, as discussed in other posts here.

The problem with Goodhart's law is, as I've always taken it, closer to the Lucas critique in economics than to the bias-variance trade-off in machine learning. Namely, when it comes to human behavior, structural relations that are very real and present in the training data may break down once you put pressure on them for control purposes.

When you use machine learning to, say, detect skin cancer, you might accidentally learn the markers put into the images to highlight the cancerous region rather than the skin properties - that's overfitting. But the skin cells themselves don't care - they won't alter their behavior whether you detect them correctly (and remove them) or not. If you use a model to find a relation between some input and a human behavior output, humans might very much start to change their behavioral responses once you start to make changes. The entire relation breaks down, even if you've measured it correctly beforehand, because people, unlike particles, have their own interests.


A note that the datapoints you train on are part of the training objective. If you are using different data at test time than you use at training time, then you are measuring the wrong thing during training, the same as if you used a different loss function at training time.

Also -- as you say, feedback loops and non-stationarity make everything more complex, and are ubiquitous in the real world! But in machine learning we also see overfitting phenomena in systems with feedback loops -- e.g. in reinforcement learning or robotics, where the system changes depending on the agent's behavior.

(blog author here)


Cool that you're responding here. Well, regarding robotics, I'm sure there's all sorts of problems when it comes to training models, but I'm not sure that Goodhart's law is one of them, unless you can give a concrete example. It's really geared towards social problems. Sure, some natural systems may also exhibit the kind of adaptive response that leads to the breakdown of structural relations (eg the cancer cells mentioned before may evolve to avoid detection by the AI), but that happens on completely different timescales.


Except that AI models, especially large deep ones, do NOT overfit like the author thinks. They exhibit what is now called "deep double descent" -- the validation error declines, then increases, and then declines again:

https://openai.com/blog/deep-double-descent/

A question I've pondered for a while is whether complex systems in the real world also exhibit double descent.

For example, transitioning an online application that currently serves thousands of users to one that can serve millions and then billions requires reorganizing all code, processes, and infrastructure, making software development harder at first, but easier down the road. Anyone who has gone through it will tell you that it's like going through "phase transitions" that require "getting over humps."

Similarly, startups that want to transition from small-scale to mid-size and then to large-scale businesses must increase operational complexity, making everything harder at first, but easier down the road. Anyone who has been with a startup that has grown from tiny two-person shop to large corporation will tell you that it's like going through "phase transitions" that require "getting over humps."

Finally, it may be that whole countries and economies that want to improve the lives of their citizens may have to go through an interim period of less efficiency, making everything harder at first, but easier down the road. It may be that human progress involves "phase transitions" that require "getting over humps."
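
For the ML half of this, here's a minimal sketch of the double-descent curve, assuming random ReLU features and a minimum-norm least-squares fit (toy data, and the exact shape depends on seed and noise); with most seeds the test error peaks near width ≈ n_train and then falls again as the model keeps growing:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 1-D regression task: a noisy sine wave.
    n_train, n_test = 40, 1000
    x_train = rng.uniform(-1, 1, n_train)
    x_test = rng.uniform(-1, 1, n_test)
    y_train = np.sin(3 * x_train) + 0.3 * rng.normal(size=n_train)
    y_test = np.sin(3 * x_test)

    def relu_features(x, W, b):
        # Fixed random ReLU features: phi_j(x) = max(0, x * W_j + b_j)
        return np.maximum(0.0, np.outer(x, W) + b)

    for width in [5, 10, 20, 40, 80, 160, 640, 2560]:
        W, b = rng.normal(size=width), rng.normal(size=width)
        # pinv gives the minimum-norm solution in the overparameterized regime.
        theta = np.linalg.pinv(relu_features(x_train, W, b)) @ y_train
        test_mse = np.mean((relu_features(x_test, W, b) @ theta - y_test) ** 2)
        print(f"width={width:5d}  test MSE={test_mse:.3f}")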


Blog post author here.

A brief note that I do discuss the deep double descent phenomenon in the blog. See the section starting with "One of the best understood causes of extreme overfitting is that the expressivity of the model being trained too closely matches the complexity of the proxy task."

I avoided using the actual term double descent, since I thought it would add unnecessary complexity. Lesson learned for next time -- I should have at least had an endnote using that terminology!


Thank you.

As you probably know, the big deal about double descent is that once sufficiently large AI models cross the so-called "interpolation threshold" in training, and get over the hump, they start generalizing better -- the opposite of overfitting. State-of-the-art performance in fact requires getting over the hump. As far as I can tell, you did not mention any of that explicitly anywhere in your post.

Also, all your plots show only the classical overfitting curve, not the actual curve we now see all the time with larger AI models like Transformers.


It's true that I don't go into detail about double descent, though I do describe how increasing capacity often reduces overfitting.

I believe the figure labeled "Figure 1" illustrates what you are suggesting (despite being labeled Figure 1, it is actually at the bottom of the blog post, so maybe easy to miss).


> It's true that I don't go into detail about double descent, though I do describe how increasing capacity often reduces overfitting.

I agree.

> I believe the figure labeled "Figure 1" illustrates what your are suggesting (despite being labeled Figure 1, it is actually at the bottom of the blog post, so maybe easy to miss).

Easy to miss, yes. I'm not sure it illustrates the phenomenon, though. That plot shows extreme overfitting (i.e., interpolation) by the 10,000 parameter model. No one really understands what actually happens after interpolation. There's in fact some anecdotal evidence that after crossing the interpolation threshold, large AI models trained with SGD gradually begin to ignore outliers and find simpler models (!) that generalize better (!). Counterintuitive, I know. This is an active area of research, with no good explanations yet, AFAIK.


(the plot shows extreme overfitting with a 10 parameter model, and interpolation with a 10,000 parameter model)


Interpolation == extreme overfitting.

Double descent phenomenon is what happens after interpolation.

--

RESPONDING TO YOUR LAST COMMENT (after reaching thread depth limit):

Think of it this way: Why and how does the model's performance continue to improve on previously unseen samples after the model has fully overfit (interpolated between) all training samples? Interpolation is not the end-point in training, but a temporary threshold after which models learn to generalize better, improving on interpolation. How is it that these models improve on interpolation?


I can't reply directly -- is there a maximum thread depth, or a maximum conversation depth?

Anyway -- I wanted to apologize for misreading -- I missed the parenthetical "interpolation" in your comment. I think we are both interpreting the plot the same way.

In terms of your comment about anecdotal evidence -- are you talking about the case where data and model size are increased jointly? If so, I agree, though I don't think that is any longer cleanly to do with double descent/overparameterization.


A lot of the suggested mitigations made a lot of sense to me. For example reducing time to prepare grant applications, or to add 1s jitter to stock trades.

Perhaps there are flaws to these specific ideas, but the thing that struck me most was that we have almost no way of implementing ideas like this in Anglo political countries right now.

Entrenched interests, and the technical detail involved makes it hard to imagine these ever happening. Nobody is ever going to raise a rabble based on making the stock market slower in this nerdy way.

It feels like these are the sort of technocratic solutions which could only come about in societies with high trust in institutions and a sense of ‘the common good’… ie maybe parts of Northern Europe but almost nowhere else.

Rebuilding that sort of trust and shared values is the biggest challenge… just wish I had better ideas for how to achieve it!


> reducing time to prepare grant applications,

Reminds me of a story about, I think, the takeover of the Avis car rental company when it was failing. The guy who took charge said that the single thing he did to turn the company around was to institute a rule that any decision, however large or small, was not to take more than three weeks. If at the end of that period a decision was not forthcoming, then a coin should be tossed to decide between the competing solutions.

Apologies if I have misremembered any, or all, of it but I think the idea works. Even if the wrong decision is made one gets to know early and can change course.


> For example reducing time to prepare grant applications, or to add 1s jitter to stock trades.

The 1s jitter idea was actually moderately successful in the game Grepolis, the successor to Tribalwars, but not perfect. Tribalwars is a browser-based MMO RTS game, where a typical game takes multiple months, and a typical attack may take anywhere between 20 minutes and 12 hours (in real time) to reach its target.

When you sent an attack in Tribalwars it'd have a timer to it, but the battle itself would happen instantaneously. A normal attack just steals whatever resources are available in the village being attacked. In order to conquer someone else's village you must have a successful attack with a "noble" in it, which is a very expensive and slow unit. Attacks can only be a certain size (limited to the population of each village), so in order to take a city you would launch many "clearing" attacks at once at the village you want to conquer, aiming for them to battle slightly earlier than your noble attack, to clear any defences beforehand.

This led to the phenomenon that the best way to defend your village was to leave it completely defenceless for the initial n-1 clearing attacks, and then slip all your defences back home before the nth (noble) attack arrived.

This started with players optimising their defence to arrive 1 second before the noble attack arrived. However, the attacking players adapted and would launch attacks, timed by travel times, such that they would all land in the same second, but maybe 100 ms apart. This resulted in players using third-party scripts to ensure their attacks, or defences, arrived within milliseconds of the desired times.

This all was really far from the point of the game, which was strategy. To remove this aspect of the game, in its successor the developers added a random 0-30s variance to either side of the attack travel time. It worked pretty well, but people would still do things like sending attacks, checking if the time was close to optimal, and then cancelling and resending with quicker units if it was not.
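
(A toy simulation of why the jitter works, with made-up timings: without it, a millisecond-precision snipe into the 100 ms gap between the last clearing attack and the noble always lands; with ±30 s of jitter on attack travel times it becomes unreliable.)

    import random

    def snipe_success_rate(jitter_s=0.0, trials=100_000):
        # Defender tries to slot troops home in the planned 100 ms gap
        # between the last clearing attack and the noble attack.
        wins = 0
        for _ in range(trials):
            clear = 3600.0 + random.uniform(-jitter_s, jitter_s)  # last clearing hit
            noble = 3600.1 + random.uniform(-jitter_s, jitter_s)  # noble hit, 100 ms later
            snipe = 3600.05                                       # defender aims mid-gap
            if clear < snipe < noble:                             # troops back between the two
                wins += 1
        return wins / trials

    print("no jitter:    ", snipe_success_rate(0.0))
    print("+/-30s jitter:", snipe_success_rate(30.0))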


+1 to this comment! The barrier between (thinking we) know what changes should happen, and realizing those changes in the real world, is complex and frustrating and political.


I think there are some similarities between Goodhart’s law and overfitting, but I don’t think the lesson or underlying mechanism or model is the same.

Crudely speaking, Goodhart’s law is a reflection of the system reacting to an intervention (ie.: dynamic feedback loops) and that just has nothing to do with overfitting.


Training a neural net is a dynamic feedback loop too. Back-propagation is the feedback phase.


Not in the same sense whatsoever. Training a neural net, backpropagation or not, doesn't affect the data. It's basically just some variation of / remix of a linear regression.


Yes, for that you need RL. An environment beats a fixed, even large, training set.


I think we're probably using different words to make the same distinction, and in any case the underlying mechanism is very different.


In case anyone hasn't played and missed the link: Universal Paperclips[0].

Beware, you may be playing for hours or more if you're a certain user type.

[0] http://www.decisionproblem.com/paperclips


You have to be careful here though, because the "solutions" can be susceptible to the same process, e.g.:

> Use a progressive tax code, so that unusual success is linked to disproportionately greater cost

A progressive tax code is an increase in complexity and is what leads to arbitrage opportunities and multinational corporations in practice paying lower rates than small and medium domestic businesses. Notice that we already have a "progressive tax code" and it hasn't worked.

A better (simpler) solution is to combine a flat tax rate (e.g. VAT) with a UBI, which produces the effective rate curve you want while being harder for megacorps to avoid because they can't change the ___location of their customers.

On the other hand, this one is likely to actually work:

> Develop as many complex, inscrutable, and diverse market trading instruments as possible, vesting on as many timescales as possible. (In nature, more complex ecosystems are more stable. Maybe there is a parallel for markets?)

Because, somewhat counterintuitively, what you want is the combination of regulatory simplicity and regulatory diversity. In other words, every place has simple rules, but every place has different rules, which prevents monoculture.

The last thing you want is complex regulation imposed centrally, as it prevents anything from out-competing it when it goes wrong until it goes very wrong.


Great article. I have two comments:

1. Procrastination seems to be a type of early stopping. I knew I had a good strategy in school!

2. Something that seems to be sorely missing in machine learning (I'm not a ML expert) are error bars. If you take the example of the figure at the end, as you increase the number of parameters in the model, your error bars become larger (at least in the overfitting regime), and they are infinite when you have more parameters than data points. Indeed, chi^2 tests are usually used in physics/astro to test for this. Of course, you need error bars on the data points to do this. So perhaps the difficulty is really in assigning meaningful uncertainties to your pictures/test scores/politicians.
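
(For point 2, a minimal sketch of the kind of reduced chi-squared check I mean, with hypothetical data and error bars:)

    import numpy as np

    # Hypothetical measurements, model predictions, and per-point error bars.
    y_obs = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
    y_model = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    sigma = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
    n_params = 2  # parameters used by the fit

    chi2 = np.sum(((y_obs - y_model) / sigma) ** 2)
    dof = len(y_obs) - n_params
    # ~1: model fits within the error bars; >>1: underfit; <<1: overfit
    # (or overestimated uncertainties).
    print("reduced chi^2:", chi2 / dof)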


> as you increase the number of parameters in the model, your error bars become larger

In large neural nets the effect is reversed. The larger the model, the better it generalises, even from the same training data.


> The larger the model, the better it generalises, even from the same training data

Do you have some references for this claim? For me, it seems counterintuitive.


It is very counterintuitive. It is also a very common observation that has taken everybody by surprise for almost 2 decades by now. At the beginning, people were very resistant to the idea, even when every experiment confirmed it.

The catch is that you need a huge amount of data to train those.

It also seems to have limits. There have been a few well-documented cases where our current huge and very well trained networks got error rates that were lower than the rate of mislabeling in the data.


Can’t provide a reference, but I can confirm that this is common knowledge. It’s why e.g. GPT-3 outperforms GPT-2.

Though as stable diffusion shows, network architecture still matters a lot!

Note that the article points out you’ll get more overfitting as your number of parameters approaches that of the training set, which is what I suspect you’ve seen. The trend does reverse later on, but only once the parameter count is orders of magnitude beyond that point, and I don’t know if that ever happens outside of ML. It’s a lot of parameters.


> Goal: An informed, thoughtful, and involved populace

> Proxy: The ease with which people can share and find ideas

Are people today not more informed, thoughtful and involved? Maybe you just don't like the information, the thoughts and the involvement. Is someone who shares and finds ideas on the internet not more involved than someone who passively watches TV? I harp on the term "involved" here, because that's the most neutral one. The Capitol rioters were quite involved.


I just watched Sam Altman's talk at Greylock and was reminded of this article whenever he used the word Humanity. I am looking forward to what the author has to say on future of AI.

https://youtu.be/WHoWGNQRXb0


Nice article! Just a weird question: which theme are you using? I liked it, so I tried to check the source code on your GitHub. Because you are using a `.github.io` ___domain I assumed you were using GitHub Pages to host the blog, however I didn't find it in your repo, which is really weird.


Yup, it's GitHub pages. It's just a private repo, so the world doesn't get to see my embarrassing edits and half written drafts. I would be happy to share a snapshot of the source code with you if you have a specific use for it though -- email me.

I'm using the minima Jekyll theme. I also used Markdeep (https://casual-effects.com/markdeep/) rather than vanilla markdown to write the post. (Markdeep is awesome. It took me a full day to figure out how to get it to play nice with Jekyll, but in the end of course it turned out to be simple. You can see how I did it by looking at the blog page source -- note the <pre class="markdeep"> block, and the "mode: 'html'" option.)


Product management 101: never choose an objective metric without setting guardrail constraints.


Shouldn't it be: never choose an objective metric without choosing several others in conflict with it?


A system can be made efficient in a choice of different parameters; there is no one efficiency. The parameters which are not optimized can suffer. This is almost self-evident. Optimization makes plain what is valued, and sacrifices what is less valued.


Unnecessary disclosure: I posted this article not only because it seems useful to me and others, but also because I'm not sure. Edit: I'm not sure in which ways it wouldn't be.


The strong version of Goodhart's law seems to occur as a function of time.

In the beginning, everything is fine and dandy, but as people optimise, it begins to turn into extremes.


+1. It doesn't require there to be a time axis -- but in practice, we almost always optimize incrementally, so it takes a while for the strong version of Goodhart's law to kick in.


Is there any application of Goodhart's law to competing in elections?

"Getting a winning number of votes" seems to be one kind of metric.

How might a system adapt?


(blog post author here)

Yes! The post actually talks about that a bit: how overfitting can result from treating "leaders that have the most support in the population" as a proxy for "leaders that act in the best interests of the population"; and some ideas for improving that with noise regularization or capacity restrictions.


I'm going to go on a rant here. Most times when I hear someone invoke Goodhart's law, it's because they are opposed to transparent and metric-driven decision-making. It's easy to find the flaws with this approach, as the article has effectively done. But what they inevitably ignore is the flaws of not using this approach. Without a transparent metrics-driven decision making process, what you end up with one of the following:

- A totally opaque use of metrics to make decisions. You as a customer-support rep are trying to answer as many customer tickets as possible with a reasonably high satisfaction rate, because you think that's what you're supposed to do. Except that management has decided that you will secretly be judged based solely on satisfaction rates, and not the volume of work done. So you keep getting passed over for promotions and raises, losing out to others who are closing far fewer tickets but with a marginally higher satisfaction rate. And you have no idea why

- A totally subjective decision-making process filled with all manner of cognitive biases. You have an accent, so you get passed over for promotions. Your buddy comes to work dressed very sharp, so he gets great performance reviews. The new guy goes bowling with your boss every Sunday, and gets fast tracked for promotions. Nobody knows or says these reasons. Everyone can point after-the-fact to some piece of data that explains why the guy with an accent would not be a good leader, and why the manager's best friend deserves a promotion. And yet, you keep noticing consistently that some people get ahead far more than their talents would suggest

- No decision making process at all. The guy next to you who spends half the work day hungover? Still around 10 years later. The guy who customers hate talking to because he's barely helpful? Clocks in 9-5 everyday with no problems at all. Both of them have abysmal work output or customer satisfaction ratings? Well, you've heard of Goodhart's law right. We can't just go around evaluating people based on these very flawed and could-be-gamed metrics

Yes, it's extremely hard finding a good measure to optimize for. And yes, there are second order effects where people will try to game any metrics you start measuring. But how does this compare against the problems involved in NOT doing this? To use the article's parlance, overfitting is bad, but underfitting is also bad. Instead of trotting out Goodhart's law every time someone suggests transparently using metrics to guide decision-making, perhaps we should be discussing how to find the right balance between underfitting and overfitting. And what combination of metrics will produce the best proxy for the goals we're trying to achieve, while avoiding the problems of overfitting to a single metric.


The article isn't a critique of teams following Goodhart's Law IRL. It's a formalization of concepts between machine learning and Goodhart's Law including a "strong" formalization where performance can diverge.

I found it quite useful to think about these concepts with respect to machine learning. In a sufficiently large system of humans interacting, there's a lot of emergent stochasticity that is tough to model.

Some examples given are of teams working together at a business, but seemed intentionally written to strike familiarity with the reader, not to make suggestions about how people should analyze metrics on a daily basis.


Blog post author here. Just for the record, I am very much in favor of metric-driven decision making. I suggest some ways we can make it more robust, and also that we should be aware that our metrics may not be measuring what we intend.


This was a great article - I'm sending it to a couple of my friends.

Might I ask what software you used for blogging? I can't seem to find the source repo it came from...


See response to alexmolas -- I'm using GitHub Pages + Jekyll + Markdeep. You don't see the source repo because it's private, but I'm happy to share a code snapshot with you if you like -- email me for it.


What about double-descent :D


Yes! In the post I talk about both under- and over-parameterization being mitigations for overfitting, though I don't use the term double descent.


TLDR: When the means become the end, things start going downhill fast.

Is this TLDR too efficient?


> Goal: Distribution of labor and resources based upon the needs of society

> Proxy: Capitalism

> Strong version of Goodhart's law leads to: Massive wealth disparities (with incomes ranging from hundreds of dollars per year to hundreds of dollars per second), with more than a billion people living in poverty

Nothing as annoying as online communists inserting their horrible political opinions everywhere. Try moving to a communist country. See how capitalism has raised billions out of poverty in only 30 years.


These communists - are they in the room with you now?


I have mentioned before - I hate Goodhart's law. It makes no sense. There is no example that is a good measure but a bad target.

The canonical example I have heard is hospital emergency rooms that started to be measured by wait times, so they refused to admit patients until staff was ready to receive them, literally having ambulances circling around the block. This was supposed to be a "good measure turned into bad target" situation, but of course it makes no sense. What really changed was the locus of evaluation from the patient that evaluates his end-to-end emergency care experience on many criteria, including, but not limited to, emergency room wait times, to a bureaucrat who just evaluates the hospital myopically on 1 metric.

It is always the myopia of singling out this measure/target that is bad, throwing off all other tradeoffs and considerations, not the actual desire to improve the measure/target.


So is measuring a programmers produced LOC a good or bad measure?


Bad, obviously. All else being equal, the better the coder, the less code he needs to write to achieve the same effect.


Do tell what is a specific example of a good measure which is not a bad target.





