This accords with our experience with large Oracle databases as well.
We've had a few panics caused by a DBA updating table statistics while trying to optimize some query. Occasionally this caused sudden, massive changes in how other, unrelated queries were executed; those queries would then fail to finish and sometimes bring down the database.
These experiences forced us to change our procedures and keep the test database closely in sync with the production data, so we'd know how newly gathered statistics would affect queries. The database is large enough that maintaining a full, recent copy of production for testing is pretty painful. Oracle has since introduced features in 11g that allow pinning query plans, though we've yet to try them.
In the simplest form, you could say: "I have two reasonable plans. Try A, and if it takes longer than time X, start B in parallel and go with whichever finishes first."
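A minimal sketch of that race in Python, where run_plan_a and run_plan_b are hypothetical callables standing in for "execute this query under plan A/B" (in a real engine the losing plan would also be cancelled server-side; threads here just illustrate the control flow):

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def race_plans(run_plan_a, run_plan_b, grace_seconds):
        # run_plan_a / run_plan_b each execute the same query
        # under a different fixed plan.
        pool = ThreadPoolExecutor(max_workers=2)
        future_a = pool.submit(run_plan_a)
        done, _ = wait([future_a], timeout=grace_seconds)
        if not done:
            # Plan A blew its time budget: launch plan B and take
            # whichever plan finishes first.
            future_b = pool.submit(run_plan_b)
            done, _ = wait([future_a, future_b], return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)  # a real DB would also kill the loser
        return done.pop().result()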
You could ramp the idea up to handle plan changes driven by updated statistics: send some fraction of queries to one plan and the rest to the other, then keep stats on how that worked out for you.
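Something like this canary routing, sketched in Python (the "plans" dict and its "current"/"candidate" keys are hypothetical callables that execute the query under a fixed plan):

    import random
    import time
    from collections import defaultdict

    latencies = defaultdict(list)  # plan name -> observed runtimes

    def run_canaried(query, plans, canary_fraction=0.05):
        # Route a small fraction of traffic to the candidate plan and
        # record latencies, so any switch is backed by evidence.
        name = "candidate" if random.random() < canary_fraction else "current"
        start = time.monotonic()
        result = plans[name](query)
        latencies[name].append(time.monotonic() - start)
        return result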
Basically, never simply flip the switch in production to try a new, unproven query plan.
Incidentally, Oracle 12c advertises that it will actually collect statistics during query execution and use them to validate the query plan. If that works, this problem should get better. But of course it comes with overhead, and it's likely to be a complex piece of code, so you tell me whether you trust them to get it right on the first try.
A cost-based optimizer also needs a way to handle the combinatorial explosion of possible plans. For MongoDB, maybe an exhaustive search or a simple strategy is feasible. But for a SQL DBMS, the search space is far too large, and you need a way to navigate it intelligently.
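To make "far too large" concrete, here's a quick count using the standard formula for the number of binary join trees over n relations with labeled leaves:

    from math import factorial

    def join_trees(n):
        # n! leaf orderings times the (n-1)th Catalan number,
        # which simplifies to (2n-2)! / (n-1)!.
        return factorial(2 * n - 2) // factorial(n - 1)

    for n in (3, 5, 10):
        print(n, join_trees(n))
    # 3 -> 12, 5 -> 1680, 10 -> 17,643,225,600

Ten tables already means tens of billions of join trees, before you even consider access paths and join algorithms for each one.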
This "try the plan out" idea has come up many times, and it's a reasonable idea, but it's nothing new and not a complete solution. You need a way to evaluate whether a plan has a reasonable chance of being the best one before the whole plan is even constructed, much less executed.
Combinatorial explosion is also one of the reasons that humans generally aren't great at coming up with non-trivial plans, either.
A good optimizer is a combination of many good ideas, so I don't mean to be discouraging; it's just that one idea is not the full story.
If you are worried about the risk of a changing plan, you are between a rock and a hard place. If you lock the plan down (or hard-code it in the application) and the data characteristics change, that could be disastrous. If you don't lock it down, you're taking a leap of faith that the optimizer will adapt gracefully.
I understand the issues you've brought up in your post, as well as the traditional methods of pruning the search space of query plans, such as restricting to left-deep join trees and using histograms to estimate selectivity.
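For anyone unfamiliar with the histogram part, here's a toy equi-width version estimating the selectivity of a range predicate (my own sketch; real engines typically use fancier structures such as equi-depth histograms):

    class EquiWidthHistogram:
        # Toy equi-width histogram for selectivity estimation;
        # a sketch, not any engine's actual structure.
        def __init__(self, values, buckets=10):
            self.lo, self.hi = min(values), max(values)
            self.n = len(values)
            self.width = (self.hi - self.lo) / buckets or 1.0
            self.counts = [0] * buckets
            for v in values:
                i = min(int((v - self.lo) / self.width), buckets - 1)
                self.counts[i] += 1

        def selectivity_lt(self, x):
            # Estimated fraction of rows with value < x: sum the whole
            # buckets below x, then interpolate into x's own bucket.
            if x <= self.lo:
                return 0.0
            if x >= self.hi:
                return 1.0
            i = int((x - self.lo) / self.width)
            partial = (x - (self.lo + i * self.width)) / self.width
            return (sum(self.counts[:i]) + self.counts[i] * partial) / self.n

The optimizer multiplies estimates like these across predicates to guess row counts, which is exactly where stale statistics bite.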
My top-level point was that there is no way humans will come up with the optimal query plan by hand, and, like you said, even if they do, the data will change.
As an aside, the reason Mongo can use such a simple but effective strategy is that it doesn't support joins, which makes finding the optimal plan much, much easier. RethinkDB does support joins, and as such, a CBO is that much more important.
I am not disputing the value of a cost based optimizer.
I'm just pointing out that if you have an existing plan that currently seems to be working, the default should be not to switch it without testing that the switch doesn't hurt anything. You need to use care.
Yup. They've been forced into this kind of complexity because the simple general solution, which should work and does work the vast majority of the time, also routinely causes pants-on-fire emergencies.