On the limitations of machine learning as
in the OP, the OP is correct.
So, right, current approaches to "machine
learning" as in the OP have some serious
"limitations". But this point is a small,
tiny special case of something much
larger and more important: current
approaches to "machine learning" as in the
OP are essentially some applied math, and
applied math is commonly much more
powerful than machine learning as in the
OP and has much less severe limitations.
Really, "machine learning" as in the OP is
not learning in any significantly
meaningful sense at all. Really,
apparently, the whole field of "machine
learning" is heavily just hype from the
deceptive label "machine learning". That
hype is deceptive, apparently deliberately
so, and unprofessional.
Broadly, machine learning as in the OP is a
case of old empirical curve fitting, a
field with a long history and a lot of
approaches quite different from what is in
the OP. Some of those approaches are, under
some circumstances, much more powerful than
what is in the OP.
The attention to machine learning
overlooks a huge body of highly polished
knowledge that is usually much more powerful. In
a cooking analogy, you are being sold a
state fair corn dog, which can be good,
instead of everything in Escoffier,
Prosper Montagné, Larousse Gastronomique:
The Encyclopedia of Food, Wine, and
Cookery, ISBN 0-517-50333-6, Crown
Publishers, New York, 1961.
Essentially, for machine learning as in
the OP, if (A) we have a LOT of training
data, (B) a lot of testing data, (C) by
gradient descent or whatever we build a
model of some kind that fits the
training data, and (D) the model also
predicts well on the testing data, then
(E) we may have found something of value.
But the test in (D) is about the only
assurance of any value. And the value in
(D) needs an assumption, rarely made
clear: applications of the model will, in
some suitable sense, be close to the
training data.
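To make (A)-(E) concrete, here is a minimal sketch in Python; the synthetic data, the linear model, and the step size are all illustrative assumptions, not anything from the OP.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)       # noisy "truth" to be learned

    X_train, y_train = X[:800], y[:800]             # (A) a lot of training data
    X_test,  y_test  = X[800:], y[800:]             # (B) a lot of testing data

    w = np.zeros(d)                                 # (C) fit by gradient descent
    for _ in range(2000):
        grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
        w -= 0.1 * grad

    test_mse = np.mean((X_test @ w - y_test) ** 2)  # (D) predict on testing data
    print("test MSE:", test_mse)                    # (E) small => maybe of value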
Such fitting goes back at least to
Leo Breiman, Jerome H. Friedman, Richard
A. Olshen, Charles J. Stone,
Classification and Regression Trees,
ISBN 0-534-98054-6, Wadsworth &
Brooks/Cole, Pacific Grove, California,
1984.
so it is not nearly new. This work is
commonly called CART, and there has long
been corresponding software.
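For reference, a minimal sketch of such CART-style fitting using scikit-learn's DecisionTreeRegressor (one of the long-available implementations in the CART family); the data and the depth limit are made up for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

    # fit a small regression tree on 400 points, check it on the held-out 100
    tree = DecisionTreeRegressor(max_depth=4).fit(X[:400], y[:400])
    print("held-out MSE:", np.mean((tree.predict(X[400:]) - y[400:]) ** 2))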
And CART itself descends from versions of
regression analysis that go back maybe 100
years.
So, sure, in regression analysis we are
given points on an X-Y coordinate system
and want to fit a straight line so that,
as a function of the points on the X axis,
the line does well at approximating the
points of the X-Y plot. Being more specific
would call for mathematical notation
awkward for simple typing and, really,
likely not needed here.
Well, to generalize, the X axis can have
several dimensions, that is, accommodate
several variables. The result is
multiple linear regression.
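For instance, a minimal multiple linear regression sketch, ordinary least squares solved with numpy; the regressors and coefficients below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 variables
    beta_true = np.array([1.0, 2.0, -0.5, 0.3])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimize ||X b - y||^2
    print(beta_hat)                                    # close to beta_true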
For more, there is a lot, with a lot of
guarantees. Those can be found in short
and easy form in
Alexander M. Mood, Franklin A. Graybill,
and Duane C. Boes, Introduction to the
Theory of Statistics, Third Edition,
McGraw-Hill, New York, 1974.
with more detail but still easy form in
N. R. Draper and H. Smith, Applied
Regression Analysis, John Wiley and Sons,
New York, 1968.
with much more detail and carefully done
in
C. Radhakrishna Rao, Linear Statistical
Inference and Its Applications: Second
Edition, ISBN 0-471-70823-2, John Wiley
and Sons, New York, 1967.
Right, this stuff is not nearly new.
So, with some assumptions, we get lots of
guarantees on the accuracy of the fitted
model.
This is all old stuff.
The work in machine learning has added
some details to the old issue of
overfitting, but, really, the math in old
regression already takes that into
consideration -- a case of overfitting
will usually show up as larger estimates
for the errors.
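As an illustration of that claim (a sketch on invented data, degrees chosen arbitrarily): fit the same noisy quadratic with a modest polynomial and an over-parameterized one, and compare the usual OLS standard errors of the coefficients, the square roots of the diagonal of s^2 (X'X)^-1.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 30)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.2 * rng.normal(size=x.size)

    def ols_std_errors(degree):
        X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (len(y) - X.shape[1])      # residual variance estimate
        return np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

    print("degree 2 standard errors:", ols_std_errors(2))
    print("degree 9 standard errors:", ols_std_errors(9))   # much larger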
There is also spline fitting, fitting from
Fourier analysis, autoregressive
integrated moving average (ARIMA)
processes, e.g.,
David R. Brillinger, Time Series
Analysis: Data Analysis and Theory,
Expanded Edition, ISBN 0-8162-1150-7,
Holden-Day, San Francisco, 1981.
and much more.
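E.g., a minimal smoothing-spline sketch with SciPy; the data and the smoothing factor are illustrative assumptions.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 200)
    y = np.sin(x) + 0.2 * rng.normal(size=x.size)

    spline = UnivariateSpline(x, y, s=8.0)         # s trades smoothness vs. fit
    print(spline(np.array([2.5, 5.0, 7.5])))       # predictions at new points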
But, let's see some examples of applied
math that totally knocks the socks off
model fitting:
(1) Early in civilization, people noticed
the stars, and the ones that moved in
complicated paths, the planets. Well,
Ptolemy built some empirical models based
on epicycles that seemed to fit the
data well and had good predictive value.
But much better work came from Kepler, who
discovered that, really, if we assume that
the sun stays still and the earth moves
around the sun, then the paths of the
planets are just ellipses.
Next, Newton formulated the second law of
motion and the law of gravity, invented
calculus, and used them to explain the
ellipses.
So, what Kepler and Newton did was far
ahead of what Ptolemy did.
Or, all Ptolemy did was just some
empirical fitting, and Kepler and Newton
explained what was really going on and, in
particular, came up with much better
predictive models.
Empirical fitting lost out badly.
Note that once Kepler assumed that the sun
stands still and the earth moves around
the sun, he didn't actually need much data
to determine the ellipses. And Newton
needed nearly no data at all except to
check his results.
Or, Kepler and Newton had some good ideas,
and Ptolemy had only empirical fitting.
(2) The history of physical science is
just awash in models derived from
scientific principles that are, then,
verified by fits to data.
E.g., some first-principles derivations
show what the acoustic power spectrum of
the 3 K background radiation should be,
and the fit to the actual data from WMAP,
etc., was astoundingly close.
News Flash: Commonly some real science or
even just real engineering principles
totally knock the socks off empirical
fitting, for much less data.
(3) E.g., here is a fun example I worked
up while in a part-time job in grad
school: I got some useful predictions for
an enormously complicated situation out of
a little applied math and nearly no data
at all.
I was asked to predict what the
survivability of the US SSBN fleet would
be under a special scenario of global
nuclear war limited to sea.
Well, there was a WWII analysis by B.
Koopman that showed that in search, say,
a submarine searching for a surface ship
or an airplane searching for a submarine,
the encounter rates were approximately a
Poisson process.
So, for all the forces in that war at sea,
for the number of forces surviving, with
some simplifying assumptions, we have a
continuous-time, discrete-state-space
Markov process subordinated to a Poisson
process. The details of the Markov
process come from a little data about
detection radii and the probabilities
that, at a detection, one side dies, the
other dies, both die, or neither dies.
That's all there was to the setup of the
problem, the model.
Then, to evaluate the model, just use Monte
Carlo to run off, say, 500 sample paths,
average those, appeal to the strong law of
large numbers, and presto, bingo, done.
It's also easy to put up some confidence
intervals.
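For the flavor of the scheme (heavily simplified, with all numbers invented; the real model and data are not reproduced here): encounters arrive as a Poisson process whose rate depends on the current force counts, each encounter removes one side, the other, both, or neither with fixed probabilities, and 500 sample paths give the estimate plus a normal-approximation confidence interval.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 30.0                     # length of the at-sea war, days (assumed)
    rate_per_pair = 0.01         # encounters per opposing pair per day (assumed)
    p_red, p_blue, p_both = 0.3, 0.3, 0.1   # outcome probabilities at an encounter (assumed)

    def one_path(red=40, blue=40):
        t = 0.0
        while red > 0 and blue > 0:
            t += rng.exponential(1.0 / (rate_per_pair * red * blue))  # next encounter
            if t > T:
                break
            u = rng.random()
            if u < p_red:
                red -= 1                     # a Red unit dies
            elif u < p_red + p_blue:
                blue -= 1                    # a Blue unit dies
            elif u < p_red + p_blue + p_both:
                red -= 1
                blue -= 1                    # both die
            # else: neither dies
        return red + blue                    # total survivors on this sample path

    paths = np.array([one_path() for _ in range(500)])
    mean = paths.mean()
    half = 1.96 * paths.std(ddof=1) / np.sqrt(len(paths))
    print(f"expected survivors ~ {mean:.1f} +/- {half:.1f} (95% CI)")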
The customers were happy.
Try to do that analysis with big data and
machine learning, and you will be in deep,
bubbling, smelly, reeking, flaming, black
and orange, toxic sticky stuff.
So, a little applied math, some first
principles of physical science, or some
solid engineering data commonly totally
knocks the socks off machine learning as
in the OP.
There is a whole lot of difference between curve fitting and curve fitting with performance guarantees on future data under a distribution-free (limited-dependence) model.
BTW, the 'machine learning' term is a Russian coinage, and its genesis lies in non-parametric statistics. The key result that sparked it all off was Vapnik and Chervonenkis's theorem, essentially a much generalized, non-asymptotic version of Glivenko-Cantelli. The other result was Stone's, which not only showed that universal algorithms achieving the Bayes error in the limit exist but also constructed such an algorithm. This was the first time it was established that 'learning' is possible.
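For concreteness, one standard textbook statement of the Vapnik-Chervonenkis result (the constants vary by source; this is a sketch, not the form in the original paper):

    \[
      \Pr\!\Bigl(\sup_{A \in \mathcal{A}} \bigl|P_n(A) - P(A)\bigr| > \varepsilon\Bigr)
      \;\le\; 4\, S_{\mathcal{A}}(2n)\, e^{-n\varepsilon^2/8},
    \]

where P_n is the empirical measure of n i.i.d. samples and S_A is the growth (shatter) function of the class A; the classical Glivenko-Cantelli theorem is recovered when A is the class of half-lines (-infinity, t].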
This is a much stricter and better-thought-out approach than the OP's; there is no need to consider deep learning alone rather than generalizing to all possible math models. For example, the OP could mention that the simple function x^2 cannot be well approximated by a deep network of ReLU layers with a small number of nodes, but it can be approximated trivially with a single x^2 layer.
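A quick illustrative check of that example, under the simplification that a tiny one-hidden-layer ReLU network is just a piecewise-linear function with a handful of pieces (knots fixed here for simplicity; everything below is an invented toy, not a real training run):

    import numpy as np

    x = np.linspace(-1.0, 1.0, 401)
    y = x ** 2

    # (a) piecewise-linear model: bias, x, and four fixed ReLU hinge features
    knots = np.array([-0.6, -0.2, 0.2, 0.6])
    A_pl = np.column_stack([np.ones_like(x), x] +
                           [np.maximum(0.0, x - k) for k in knots])
    c_pl, *_ = np.linalg.lstsq(A_pl, y, rcond=None)
    print("max error, 4-hinge piecewise linear:", np.max(np.abs(A_pl @ c_pl - y)))

    # (b) model with a single x^2 feature: represents the target exactly
    A_sq = np.column_stack([np.ones_like(x), x, x ** 2])
    c_sq, *_ = np.linalg.lstsq(A_sq, y, rcond=None)
    print("max error, x^2 feature:", np.max(np.abs(A_sq @ c_sq - y)))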
However, the question is how complex the "true" models of nature are. The law of gravity is simple, a single equation with one parameter, but what if the "law" of human language has millions of parameters and is not really manageable by a human? 500 samples would not be enough then. This is the classic Norvig vs. Chomsky argument. Still, for many things simple laws might exist.