Long time user of Python here and recent user of Mathematica.
Some observations I have are that they're both great. Python is a nice open-source scripting language, but getting libraries to work can sometimes be a pain. Mathematica is basically "install this and everything is included." The Mathematica documentation is amazing, and it makes most things really simple to do. The iPhone-era "there's an app for that" has its equivalent here: "there's a function for that."
Graph theory works flawlessly in Mathematica. In Python, there is a module that wraps Graphviz; let me know if Python has something new, though. There are a lot of other examples. Mathematica's Import[] function can read over 150 different file types: CSV, XLS, genetic encoding files, optimization files... whatever. It is usually far easier and more consistent than finding a corresponding Python library and struggling with the install and minimal documentation. Let me be clear that Python is awesome and rocks, and I think Jupyter is moving it in the right direction. I just feel like many dismiss Mathematica as something that does calculus homework, rather than what it is today: a massive 20-million-LOC conglomeration of C, Java, and Wolfram Language that does everything from statistics, machine learning, visualization, blockchain, 3D printing, node graphs, data sets and analysis, etc., in a single consistent package. It is expensive and proprietary and certainly has its own faults, but a lot of that cash is funneled back into a great product.
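On the graph-theory point, it's worth noting that Python has had networkx for a while now: a widely used pure-Python graph library that goes well beyond a Graphviz wrapper. A minimal sketch (not an exhaustive comparison, and the toy graph here is just illustrative):

```python
import networkx as nx

# Build a small undirected graph from an edge list.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

# Standard graph-theory queries come built in.
path = nx.shortest_path(G, "a", "d")
connected = nx.is_connected(G)
degrees = dict(G.degree())

print(path)       # a shortest a->d path, e.g. ['a', 'c', 'd']
print(connected)  # True
print(degrees)
```

Drawing still typically goes through Matplotlib or Graphviz, so the "module to Graphviz" point survives for visualization, but the algorithms themselves no longer depend on it.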
While I totally hear you regarding the pain of Python modules (particularly on Windows), the point of Python "distributions" like Anaconda and Canopy is to bring the kitchen sink along, kind of like Mathematica.
The problem with Mathematica from a science point of view is that, being closed source, it doesn't let you independently verify that the calculations are happening correctly. To be replicable, science involving data needs to use open source tools.
> The problem with Mathematica from a science point of view is that, being closed source, means you can't independently ensure the calculations are happening correctly
Have there been any high-profile failures root-caused to Mathematica (or MATLAB or any similar product) getting its sums wrong? I can't find any news stories about one. Plenty of serious calculations were and are done on "closed source" HP and TI calculators too. Every serious scientific instrument with its own data capture uses a binary blob somewhere too, so if that's a problem for you then you can't even trust the raw data!
And even if you have all of the code - you still need to worry if the proprietary, closed FPU is working “correctly”.
This sounds very much like a post-hoc justification for "it's free as in beer." Do you think Wikipedia is more trustworthy than real references too? What about blogs?
Science publication is moving (very, very, very slowly) towards a model where instead of a final, polished traditional paper, the raw data along with the software tools and interpretation is published. In principle this should allow readers to completely understand and reproduce the processing of the raw data, rather than reading a few paragraphs summarizing the processing done by the authors. Using a closed source tool for processing the data limits how deeply a reader can delve into the processing that the authors did, because the functions in the proprietary package are black boxes. Jupyter has no black boxes.
> In principle this should allow readers to completely understand and reproduce the processing of the raw data, rather than reading a few paragraphs summarizing the processing done by the authors. Using a closed source tool
"Science is facing a 'reproducibility crisis' where more than two-thirds of researchers have tried and failed to reproduce another scientist's experiments"
I don't think that can be handwaved away as "OMG closed source software!". Especially since all the scientists in a given field will have access to the same software anyway. Give them open source and the issue will persist, and we both know it, because the root cause has nothing to do with the license of the software.
Reducing the barriers for doing reproduction studies will likely increase how often they happen.
Or just consider peer review. How often does a reviewer today actually review the code and data used in a study? As far as I know this is essentially never done. That would involve seeing the code, running it, maybe messing with it a bit.
Open source software, in general, tends to make things more reproducible. Sure, software licencing might not be the singular root cause, but why does that suggest we shouldn't capitalise on the improvements available?
> Open source software, in general, tends to make things more reproducible
Citation very much needed for that. Because you can very easily find that 6 months or a year later you update your dependencies and everything is now broken. I recently came back to a Python project after a year, updated my packages then realized: I simply cannot be bothered to unpick the mess that resulted just to add one trivial feature. Whereas the poster child for backwards compatibility is closed-source and proprietary.
Science is not reproducible because there are no incentives for it to be so, despite everyone paying lip service to it. It's extra work and helps those who are competing with you for grants, after all. That's a social problem, not a software one. The software is irrelevant.
Open source software is important for reproducibility for a couple of reasons. Firstly, if you record that you've done your analysis with Python 3.6.3 and Numpy 1.14.2, and it later breaks on some newer version, it's relatively easy to get the same versions you were using. Commercial software vendors are usually not keen on you downloading and running a version of their product which was superseded four years ago.
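The "record your versions" step can be as small as a provenance stamp printed alongside your results. A minimal sketch using only the standard library (the function name is mine, and the "Python 3.6.3 + Numpy 1.14.2" pins from the comment above are the kind of thing it lets you recover):

```python
import platform
import sys

def environment_stamp():
    # Record the exact interpreter version and OS next to published
    # results, so version pins can be reconstructed later.
    return "python-{}.{}.{} ({})".format(
        *sys.version_info[:3], platform.system()
    )

print(environment_stamp())
```

The same idea scales up to `pip freeze > requirements.txt` inside a virtualenv, which pins the whole dependency tree rather than just the interpreter.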
Secondly, of course, open source means that if you're not sure why two versions/functions/libraries are giving you different answers, you can go and find out. I accept that a lot of people may not have time for that, but I don't think you can fix that problem unless it's possible to dig down and follow the working.
Finally, 'reproducible by anyone with a computer' is a lot better than 'reproducible by people who buy a license for the tool I used to do it'.
> If you record that you've done your analysis with Python 3.6.3 and Numpy 1.14.2, and it later breaks on some newer version, it's relatively easy to get the same versions you were using
Better record which compiler you used too, and what flags, and every version of every library and everything else. It’s not as simple as you make out and it’s far from guaranteed that all those packages will still be available or compile on your OS.
> Commercial software vendors are usually not keen on you downloading and running a version of their product which was superseded four years ago.
I guess you must not deal with vendors much because generally they are fine with this. It’s part of the support agreement usually, just another service. Getting an “obsolete” version for whatever reason has never been a problem for me.
By “anyone with a computer” you mean “anyone who can exactly reproduce my configuration which I don’t even know myself for certain”
Because the reproducibility crisis has very little to do with the difficulty in re-running the same code. These are almost orthogonal concerns.
The typical problem paper has a small data set, on which the authors tried 20 different things, one of which achieved p<0.05 and got published. The result tells you nothing meaningful about the world (or more often, tells you something about how its authors wish the world worked). But re-running their code on their data set will not reveal the problem.
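The "tried 20 things, one achieved p<0.05" failure mode is easy to see in a quick simulation on pure noise. A sketch (the expected rate comes from the standard 1 − 0.95^20 ≈ 0.64 calculation, not from any paper in the thread):

```python
import random

random.seed(0)

def spurious_hit(n_tests=20, alpha=0.05):
    # Under the null hypothesis, each test's p-value is uniform on
    # [0, 1], so "p < alpha" fires by chance alpha of the time.
    return any(random.random() < alpha for _ in range(n_tests))

trials = 10_000
rate = sum(spurious_hit() for _ in range(trials)) / trials

# Expect roughly 1 - 0.95**20, i.e. about 0.64: most of the time,
# at least one of the 20 analyses of noise "succeeds".
print(rate)
```

And, as the comment says, re-running exactly this code reproduces the "significant" result every time; reproducibility of the computation does nothing to flag the multiple-comparisons problem.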
Now, being open source does not prevent such errors. What I think is far more important here is open issue tracking: companies like Wolfram may not be exactly eager to let you know that there are serious bugs in their products (and what they are) unless they must. Being able to fix the bug yourself can of course be a real perk for some users, but realistically, it's out of the question for most.
Here is an article from a couple of years ago about how one group of mathematicians was misled by (and had to spend some time tracking down) a bug in Mathematica's determinant evaluation: http://www.ams.org/notices/201410/rnoti-p1249.pdf
I’ll add that many closed-source developers will show you sections of the source code if you sign an NDA. I’ve done that on several occasions when I needed to see how a model was implemented. As long as you are not a competitor, it’s usually not an issue. I sign NDAs often to see IP.
> The problem with Mathematica from a science point of view is that, being closed source, means you can't independently ensure the calculations are happening correctly. To be replicable, science involving data needs to use open source tools.
Abstract: "The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."
I think your estimate of 99% is a wee bit high. At least in the field I'm familiar with, astronomy, the idea of using Excel for any serious computation or design would be met with laughter.
Well since the author is an economist, questions about "real science" can be interpreted a few ways.
For a fun look into the high standards of the field, a few years ago Piketty made the mistake of sharing his Excel files, in which all sorts of crucial adjustments were hard-coded into tables of data...
Matlab's fine, but Excel has (had?) some serious defects in its basic statistical functions; in particular, it couldn't reliably calculate a standard deviation.
I'm not sure I agree with that. If you publish your methods and make your data available, then a replicator has all they need. They'd have to reimplement the data pipeline anyway; otherwise it's not a replicated result, it's just someone running your notebook again, with potentially the same bugs in it.
I hear what you're saying and you're right that WinPython & Anaconda certainly help, but the documentation is still a long way off from Mathematica in my opinion.
One thing in Python's favor though might be depth in certain categories. The machine learning stuff in Mathematica is very nice and high level if you want neural networks, but if you need particle swarm optimization (PSO) or genetic algorithms (GA), you'll probably have to write your own or grab someone else's notebook.
It was hard for me to come around to supporting closed-source software, as I've always supported Linux for exactly this reason.
As far as ensuring accuracy of calculations, having a very large and highly technical user base over several decades helps, but I'm not sure how much this happens in practice. If a statistician publishes a paper using R, is anyone really going to check the R module's source code? I bet this is a rare occurrence.