In Kevin Murphy's book "Machine Learning: A Probabilistic Perspective", he cites the following paper [1], which shows that the simple Bayesian classifier "performs surprisingly well in many domains containing clear attribute dependencies" and which reports numerous experimental trials in different domains.
I haven't fully read the paper myself (only glanced through it), but KM makes the same point as the OP, which leads me to believe it may have the answer to your question.
Naive Bayes assumes variable independence, for one thing. If the features aren't independent of each other, then you'd be better off using probabilistic graphical models.
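For concreteness, here's a minimal sketch (in Python, with toy numbers and made-up feature names) of the decision rule that assumption buys you: the likelihood factors into per-feature terms conditioned only on the class.

    import math

    # Toy conditional probability tables P(feature | class); the numbers
    # and feature names are made up purely for illustration.
    p_given_class = {
        "spam":     {"has_link": 0.70, "has_greeting": 0.20},
        "not_spam": {"has_link": 0.10, "has_greeting": 0.80},
    }
    prior = {"spam": 0.4, "not_spam": 0.6}

    def log_posterior(features, cls):
        # Naive Bayes: log P(cls) + sum_i log P(feature_i | cls).
        # The plain sum is where the independence assumption enters:
        # there are no interaction terms between features.
        score = math.log(prior[cls])
        for name, present in features.items():
            p = p_given_class[cls][name]
            score += math.log(p if present else 1.0 - p)
        return score

    x = {"has_link": True, "has_greeting": False}
    print(max(prior, key=lambda c: log_posterior(x, c)))  # -> spam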
> Naive Bayes assumes variable independence, for one thing
That's true, but every model makes assumptions that are wrong. The puzzling thing about Naive Bayes is how well it performs in practice in spite of its assumptions being so wrong. I believe there have been papers explaining this. I would look at Russell and Norvig's book for a start.
One thing that Naive Bayes sucks at is providing good probabilistic estimates. It is nearly always overconfident in its predictions (e.g. P(rain) = 0.99999999), even though its classification accuracy can be pretty good (relative to its simplicity). Logistic regression fares a lot better for probabilities.
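A rough sketch of the effect with scikit-learn, using synthetic data whose columns are deliberately duplicated so they're strongly correlated (exact numbers will vary): NB's predicted probabilities crowd toward 0 and 1, while its accuracy stays comparable to LR's.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Duplicate the columns so the independence assumption is badly violated.
    X, y = make_classification(n_samples=5000, n_features=5,
                               n_informative=5, n_redundant=0, random_state=0)
    X = np.hstack([X, X, X])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    nb = GaussianNB().fit(X_tr, y_tr)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # NB triple counts the correlated evidence, so its probabilities
    # come out far more extreme than LR's.
    print("NB mean max-prob:", nb.predict_proba(X_te).max(axis=1).mean())
    print("LR mean max-prob:", lr.predict_proba(X_te).max(axis=1).mean())
    print("NB accuracy:", nb.score(X_te, y_te))
    print("LR accuracy:", lr.score(X_te, y_te))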
> One thing that Naive Bayes sucks at is providing good probabilistic estimates.
This is true based on how it's described in textbooks, but in practice it should always be combined with a calibration algorithm (which adds a trivial O(n) cost to the process). The common choices here are Platt scaling [0] or isotonic regression [1] (the latter should itself be combined with some regularization because it can easily overfit to outliers).
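For example, scikit-learn's CalibratedClassifierCV wraps both: method="sigmoid" is Platt scaling and method="isotonic" is isotonic regression. A minimal sketch (the dataset is chosen just for illustration):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # The calibrator is fit on held-out folds and maps NB's raw
    # scores onto calibrated probabilities.
    for method in ("sigmoid", "isotonic"):
        clf = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
        clf.fit(X_tr, y_tr)
        print(method, "accuracy:", clf.score(X_te, y_te))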
Given a calibration algorithm, Naive Bayes produces probability estimates every bit as reasonable as any other algorithm's.
The calibration cost is trivial compared to the coefficient learning cost. Very roughly, calibration is O(records) whereas coefficient learning is O(records * features). So the tiny add-on cost of calibration shouldn't affect anyone's evaluation of the relative merits of algorithms. NB still retains its computational advantage.
One thing that is often discounted in theoretical discussions is that NB needs much less I/O than something like LR, typically 5-100x less (depending on how many iterations you spend updating your LR coefficients). If you're doing, for example, a MapReduce implementation, then NB has huge computational advantages: in LR each coefficient update costs you another map/reduce pass over your entire data set, whereas NB is always done in exactly one pass.
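To make the single-pass point concrete, here's a rough sketch of NB training as pure counting over a stream of (label, features) records; the record format and data are made up. In MapReduce terms, the map step emits (label, 1) and ((label, feature), 1) pairs and the reduce step sums them, so one pass over the data is enough.

    from collections import Counter, defaultdict

    def train_nb(records):
        # One pass over (label, set_of_binary_features) records:
        # everything NB needs is a set of counts.
        class_counts = Counter()
        feature_counts = defaultdict(Counter)  # class -> feature -> count
        for label, features in records:
            class_counts[label] += 1
            for f in features:
                feature_counts[label][f] += 1
        return class_counts, feature_counts

    records = [("spam", {"link", "free"}), ("ham", {"meeting"}),
               ("spam", {"free"}), ("ham", {"link", "meeting"})]
    class_counts, feature_counts = train_nb(records)
    print(class_counts, dict(feature_counts))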
So if NB + calibration gets you something close to LR for vastly less computation and I/O, why wouldn't you use it?
Having said that, if you're talking about small amounts of data that fit into RAM and you can "just load into R", then sure use LR over NB. For that matter use a Random Forest [0]. The reason NB is still around is because it offers a point in the design space where you spend almost no resources and still get something surprisingly useful (and recalibration narrows the utility gap between NB and better methods even more).
[0] And you should still consider calibrating your random forest's output.
Yeah, Naive Bayes isn't very good for probabilistic estimates, but it's often quite good at classification. I've had great luck using it to classify poorly coded medical data. I maintain this ruby gem [0] that implements Naive Bayes as a text classifier, and for a lot of tasks it's good enough.
Have you tried a Perceptron or MaxEnt model for your medical data? They're also fairly simple (MaxEnt slightly less so), but will give you solid results.
I looked into the perceptron, but it seemed like overkill and I didn't want to write a classifier from scratch in Ruby or Java. I wasn't previously aware of MaxEnt, which looks cool. Bayes is nice for the data I'm looking at because my features are pretty independent.
Geoff Hinton has a theory that Naive Bayes is just logistic regression with dropout: https://www.youtube.com/watch?v=DleXA5ADG78 He also suggests that if you divide all the parameters by a certain amount, you get accurate probability estimates.
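One way to read that claim (a hedged sketch, not Hinton's exact recipe): treat the NB log-odds as a logit that's too large in magnitude because correlated evidence gets double counted, and shrink it before applying the sigmoid. Platt scaling effectively learns this scale (plus an offset) from held-out data; the constant below is hand-picked purely for illustration.

    import numpy as np

    def shrunk_probability(nb_log_odds, temperature=5.0):
        # Divide the log-odds by a temperature before the sigmoid;
        # the temperature here is a made-up constant, whereas Platt
        # scaling would fit it on held-out data.
        return 1.0 / (1.0 + np.exp(-nb_log_odds / temperature))

    print(shrunk_probability(20.0))  # ~0.98 instead of ~1.0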
This is true, but in a lot of instances ignoring variable dependence still produces meaningful results. For example, Naive Bayes on text classification usually performs really well even though word occurrences are dependent on each other. From my experience, if you are classifying over a large number of observations (e.g. a large number of words), the algorithm performs well.
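For instance, a minimal text-classification sketch with scikit-learn (the corpus is tiny and made up, purely to show the shape of the pipeline): "machine" and "learning" clearly co-occur, and NB ignores that entirely.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["machine learning is fun",
            "deep learning and machine learning",
            "the stock market fell today",
            "markets rallied on earnings"]
    labels = ["tech", "tech", "finance", "finance"]

    # Bag-of-words counts feed a multinomial NB; word dependence is ignored.
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["machine learning for the stock market"]))
    # -> ['tech'] on this toy corpus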
No, not quite. _All_ classification algorithms use "reasoning on the average". (A.k.a. "statistics".)
Naive Bayes is a poor classifier because it ignores conditional dependence: cases where having feature A raises the odds, having feature B also raises the odds, but having features A and B together lowers the odds.
Known dependencies can be taken into account by treating the joint occurrence of A and B as another feature. The result can sometimes be significantly improved with this simple hack.
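A small sketch of the hack, with made-up feature names: give NB a dedicated A_and_B feature so it learns P(A_and_B | class) directly instead of multiplying P(A | class) by P(B | class).

    def add_interaction(features):
        # Treat the joint occurrence of A and B as its own feature.
        out = dict(features)
        out["A_and_B"] = features.get("A", False) and features.get("B", False)
        return out

    print(add_interaction({"A": True, "B": True, "C": False}))
    # {'A': True, 'B': True, 'C': False, 'A_and_B': True}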
This is an example of why feature engineering is very important, at times more so than the algorithm choice.
Learning to develop features that work well with a given algorithm is essential to practical machine learning.