
On the limitations of machine learning as in the OP, the OP is correct.

So, right, current approaches to "machine learning" as in the OP have some serious limitations. But this point is a small, tiny special case of something much larger and more important: current approaches to "machine learning" as in the OP are essentially some applied math, and applied math more broadly is commonly much more powerful and has much less severe limitations.

Really, "machine learning" as in the OP is not learning in any significantly meaningful sense at all. Really, apparently, the whole field of "machine learning" is heavily just hype from the deceptive label "machine learning". That hype is deceptive, apparently deliberately so, and unprofessional.

Broadly, machine learning as in the OP is a case of old empirical curve fitting, a field with a long history and a lot of approaches quite different from what is in the OP. Some of those approaches are, under some circumstances, much more powerful than what is in the OP.

The attention to machine learning omits a huge body of highly polished knowledge that is usually much more powerful. In a cooking analogy, you are being sold a state fair corn dog, which can be good, instead of everything in Escoffier and in

Prosper Montagné, Larousse Gastronomique: The Encyclopedia of Food, Wine, and Cookery, ISBN 0-517-50333-6, Crown Publishers, New York, 1961.

Essentially, for machine learning as in the OP: if (A) you have a LOT of training data, (B) you have a lot of testing data, (C) by gradient descent or whatever you build a model of some kind that fits the training data, and (D) the model also predicts well on the testing data, then (E) you may have found something of value.

But the test in (D) is about the only assurance of any value. And the value in (D) rests on an assumption, rarely made clear: applications of the model will, in some suitable sense, be close to the training data.
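Here is a minimal sketch of that (A)-(E) loop in Python, with synthetic data and plain least squares standing in for "gradient descent or whatever"; all the numbers are illustrative:

    # Minimal sketch of the (A)-(E) loop: fit on training data,
    # judge the model only by its error on held-out testing data.
    import numpy as np

    rng = np.random.default_rng(0)

    # (A), (B): synthetic training and testing data from the same source
    x_train = rng.uniform(0.0, 10.0, 200)
    x_test = rng.uniform(0.0, 10.0, 100)
    y_train = 3.0 * x_train + 2.0 + rng.normal(0.0, 1.0, x_train.size)
    y_test = 3.0 * x_test + 2.0 + rng.normal(0.0, 1.0, x_test.size)

    # (C): fit a model to the training data (here, ordinary least squares)
    A = np.column_stack([x_train, np.ones_like(x_train)])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

    # (D): the only real assurance of value -- error on the testing data
    pred = coef[0] * x_test + coef[1]
    print("test RMSE:", np.sqrt(np.mean((pred - y_test) ** 2)))  # (E)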

Such fitting goes back at least to

Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone, Classification and Regression Trees, ISBN 0-534-98054-6, Wadsworth & Brooks/Cole, Pacific Grove, California, 1984.

not nearly new. This work is commonly called CART, and there has long been corresponding software.

And CART builds on versions of regression analysis that go back maybe 100 years.

So, sure, in regression analysis, we are given points on an X-Y coordinate system and want to fit a straight line so that, as a function of points on the X axis, the line does well approximating the points in the X-Y plot. Being more specific would call for mathematical notation awkward for simple typing and likely not needed here.

Well, to generalize, the X axis can have several dimensions, that is, accommodate several variables. The result is multiple linear regression.
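As a sketch, with made-up data, multiple linear regression is a few lines of Python: the design matrix gets one column per variable plus a column of ones for the intercept.

    # Multiple linear regression y ~ b0 + b1*x1 + b2*x2 by least squares.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0.0, 0.5, n)

    X = np.column_stack([np.ones(n), x1, x2])  # design matrix
    b, *_ = np.linalg.lstsq(X, y, rcond=None)  # estimates of (b0, b1, b2)
    print(b)  # should be near (1, 2, -3)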

For more, there is a lot of theory with a lot of guarantees. Those can be found in short and easy form in

Alexander M. Mood, Franklin A. Graybill, and Duane C. Boes, Introduction to the Theory of Statistics, Third Edition, McGraw-Hill, New York, 1974.

with more detail but still easy form in

N. R. Draper and H. Smith, Applied Regression Analysis, John Wiley and Sons, New York, 1968.

with much more detail and carefully done in

C. Radhakrishna Rao, Linear Statistical Inference and Its Applications, Second Edition, ISBN 0-471-70823-2, John Wiley and Sons, New York, 1967.

Right, this stuff is not nearly new.

So, with some assumptions, one gets lots of guarantees on the accuracy of the fitted model.

This is all old stuff.

The work in machine learning has added some details to the old issue of overfitting, but, really, the math in old regression already takes that into consideration -- a case of overfitting will usually show up in larger estimates for the errors.
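To illustrate that point, a sketch with made-up data: the classical standard errors of the estimated coefficients, computed from the residual variance and (X'X)^-1, blow up when a degree-9 polynomial is fit where a straight line suffices.

    # Sketch: overfitting shows up as inflated coefficient standard errors.
    import numpy as np

    def coef_standard_errors(X, y):
        n, p = X.shape
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        sigma2 = resid @ resid / (n - p)       # residual variance estimate
        cov = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the estimates
        return np.sqrt(np.diag(cov))

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 1.0, 15)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, x.size)

    for degree in (1, 9):                      # modest fit vs. overfit
        X = np.vander(x, degree + 1)
        print(degree, coef_standard_errors(X, y).max())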

There is also spline fitting, fitting via Fourier analysis, autoregressive integrated moving average (ARIMA) processes,

David R. Brillinger, Time Series Analysis: Data Analysis and Theory, Expanded Edition, ISBN 0-8162-1150-7, Holden-Day, San Francisco, 1981.

and much more.
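E.g., a cubic smoothing-spline fit is a few lines with SciPy; this is just a sketch, and the smoothing parameter s here is illustrative:

    # Sketch of cubic smoothing-spline fitting with SciPy.
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 2.0 * np.pi, 50)
    y = np.sin(x) + rng.normal(0.0, 0.2, x.size)

    spline = UnivariateSpline(x, y, k=3, s=1.0)  # s controls smoothness
    print(spline(np.pi / 2))                     # near sin(pi/2) = 1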

But, let's see some examples of applied math that totally knock the socks off model fitting:

(1) Early in civilization, people noticed the stars and the ones that moved in complicated paths, the planets. Well, Ptolemy built some empirical models based on epicycles that seemed to fit the data well and have good predictive value.

But much better work came from Kepler, who discovered that, really, if one assumes the sun stays still and the earth moves around the sun, then the paths of the planets are just ellipses.

Next, Newton invented the second law of motion, the law of gravity, and calculus, and used them to explain the ellipses.

So, what Kepler and Newton did was far ahead of what Ptolemy did.

Or, all Ptolemy did was some empirical fitting, while Kepler and Newton explained what was really going on and, in particular, came up with much better predictive models.

Empirical fitting lost out badly.

Note that once Kepler assumed that the sun stands still and the earth moves around the sun, he didn't need much data to determine the ellipses. And Newton needed nearly no data at all except to check his results.

Or, Kepler and Newton had some good ideas, and Ptolemy had only empirical fitting.
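To see how little data an assumed ellipse needs: a general conic a x^2 + b x y + c y^2 + d x + e y + f = 0 has five degrees of freedom, so five observed positions pin it down. A sketch, with made-up points:

    # Five points determine a conic: solve for its coefficients from the
    # one-dimensional null space of the 5 x 6 design matrix.
    import numpy as np

    theta = np.array([0.3, 1.1, 2.0, 3.5, 5.0])
    x = 3.0 * np.cos(theta)  # made-up ellipse, semi-axes 3 and 2
    y = 2.0 * np.sin(theta)

    # Each point gives one linear equation in the six conic coefficients.
    M = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(M)
    conic = vt[-1]             # null-space vector = conic coefficients
    print(conic / conic[0])    # proportional to (1, 0, 9/4, 0, 0, -9)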

(2) The history of physical science is just awash in models derived from scientific principles that are, then, verified by fits to data.

E.g., some first-principles derivations show what the acoustic power spectrum of the 3 K background radiation should be, and the fit to the actual data from WMAP, etc., was astoundingly close.

News Flash: Commonly some real science or even just real engineering principles totally knocks the socks off empirical fitting, for much less data.

(3) E.g., here is a fun example I worked up while in a part-time job in grad school: I got some useful predictions for an enormously complicated situation out of a little applied math and nearly no data at all.

I was asked to predict what the survivability of the US SSBN fleet would be under a special scenario of global nuclear war limited to sea.

Well, there was a WWII analysis by B. Koopman that showed that in search, say, of a submarine for a surface ship, or an airplane for a submarine, etc., the encounter rates were approximately a Poisson process.

So, for all the forces in that war at sea, the number of forces surviving is, with some simplifying assumptions, a continuous-time, discrete-state-space Markov process subordinated to a Poisson process. The details of the Markov process come from a little data about detection radii and the probabilities that, at a detection, one dies, the other dies, both die, or neither dies.

That's all there was to the set up of the problem, the model.

Then, to evaluate the model, just use Monte Carlo to run off, say, 500 sample paths, average those, appeal to the strong law of large numbers, and presto, bingo, done. One can also easily put up some confidence intervals.
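A toy sketch of that computation in Python; every rate, probability, and force level below is made up for illustration, not from the original study:

    # Toy Monte Carlo: encounters as a Poisson process, a continuous-time
    # Markov chain on the force counts, 500 sample paths, and a
    # normal-theory confidence interval on mean survivors.
    import numpy as np

    rng = np.random.default_rng(4)

    RATE_PER_PAIR = 0.01              # encounter rate per opposing pair
    P_OUTCOME = [0.3, 0.3, 0.1, 0.3]  # red dies, blue dies, both, neither
    HORIZON = 30.0                    # length of the war at sea

    def sample_path(red=20, blue=20):
        t = 0.0
        while red > 0 and blue > 0:
            rate = RATE_PER_PAIR * red * blue  # total encounter rate
            t += rng.exponential(1.0 / rate)   # next Poisson encounter
            if t > HORIZON:
                break
            k = rng.choice(4, p=P_OUTCOME)
            if k == 0: red -= 1
            elif k == 1: blue -= 1
            elif k == 2: red -= 1; blue -= 1
        return red                             # red survivors

    paths = np.array([sample_path() for _ in range(500)])
    half = 1.96 * paths.std(ddof=1) / np.sqrt(paths.size)
    print(f"mean red survivors: {paths.mean():.2f} +/- {half:.2f}")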

The customers were happy.

Try to do that analysis with big data and machine learning, and you will be in deep, bubbling, smelly, reeking, flaming, black and orange, toxic sticky stuff.

So, a little applied math, some first principles of physical science, or some solid engineering data commonly totally knocks the socks off machine learning as in the OP.




There is a whole lot of difference between curve fitting and curve fitting with performance guarantees on future data under a distribution-free (limited-dependence) model.

BTW, the 'machine learning' term is a Russian coinage, and its genesis lies in non-parametric statistics. The key result that sparked it all off was Vapnik and Chervonenkis's theorem, essentially a much generalized, non-asymptotic version of Glivenko-Cantelli. The other result was Stone's, which showed that universal algorithms achieving the Bayes error in the limit not only exist; he also constructed such an algorithm. This was the first time it was established that 'learning' is possible.
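For reference, a standard textbook form of the Vapnik-Chervonenkis result (constants vary by source):

    \Pr\Bigl( \sup_{A \in \mathcal{A}} \bigl| P_n(A) - P(A) \bigr| > \varepsilon \Bigr)
        \le 4 \, S_{\mathcal{A}}(2n) \, e^{-n \varepsilon^2 / 8}

where P_n is the empirical measure on n i.i.d. samples and S_A(2n) is the shatter coefficient (growth function) of the class A.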


This is a much stricter and better-thought-out approach than the OP's; there is no need to consider deep learning alone without generalizing to all possible math models. For example, the OP could mention that the simple function x^2 cannot be well approximated by a deep network of ReLU layers with a small number of nodes, but it can be trivially approximated with a single x^2 layer.
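A sketch of that point: with few units, a ReLU network gives only a piecewise-linear approximation of x^2, while a single squaring layer is exact. Here plain piecewise-linear interpolation stands in for what a small trained ReLU net can express:

    # Piecewise-linear (ReLU-style) approximation of x^2 vs. exact squaring.
    import numpy as np

    x = np.linspace(-1.0, 1.0, 1001)
    target = x ** 2

    knots = np.linspace(-1.0, 1.0, 5)            # ~4 linear pieces
    piecewise = np.interp(x, knots, knots ** 2)  # piecewise-linear fit
    print("piecewise max error:", np.abs(piecewise - target).max())

    print("x^2 layer max error:", np.abs(x ** 2 - target).max())  # 0.0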

However, the question is how complex the "true" models of nature are. The law of gravity is simple, with a single equation and one parameter, but what if the law of human language has millions of parameters and is not really manageable by a human? 500 samples would not be enough then. This is the classic Norvig vs. Chomsky argument. Still, for many things the simple laws might exist.


Wat?



