Long time user of Python here and recent user of Mathematica.
Some observations I have are that they're both great. Python is a nice open-source scripting language, but getting libraries to work can sometimes be a pain. Mathematica is basically "install this and everything is included." The Mathematica documentation is amazing, and it makes most things really simple to do. The iPhone-era "there's an app for that" has its equivalent here: "there's a function for that."
Graph theory works flawlessly in Mathematica. In Python, there is a module that wraps Graphviz; let me know if Python has something new, though. There are a lot of other examples. Mathematica's Import[] function can read over 150 different file types: CSV, XLS, genetic encoding files, optimization files... whatever. It is usually far easier and more consistent than finding a corresponding Python library and struggling with the install and minimal documentation. Let me be clear that Python is awesome and rocks, and I think Jupyter is moving it in the right direction. I just feel like many dismiss Mathematica as something that does calculus homework, rather than what it is today: a massive 20-million-LOC conglomeration of C, Java, and Wolfram Language that does everything from statistics, machine learning, visualization, blockchain, 3D printing, node graphs, data sets and analysis, etc., in a single consistent package. It is expensive and proprietary and certainly has its own faults, but a lot of that cash is funneled back into a great product.
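On the graph-theory point, it's worth noting that Python has had networkx for a while now: a widely used pure-Python graph library that goes well beyond a Graphviz wrapper. A minimal sketch (not an exhaustive comparison, and the toy graph here is just illustrative):

```python
import networkx as nx

# Build a small undirected graph from an edge list.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

# Standard graph-theory queries come built in.
path = nx.shortest_path(G, "a", "d")
connected = nx.is_connected(G)
degrees = dict(G.degree())

print(path)       # a shortest a->d path, e.g. ['a', 'c', 'd']
print(connected)  # True
print(degrees)
```

Drawing still typically goes through Matplotlib or Graphviz, so the "module to Graphviz" point survives for visualization, but the algorithms themselves no longer depend on it.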
While I totally hear you regarding the pain of Python modules (particularly on Windows), the point of Python "distributions" like Anaconda and Canopy is to bring the kitchen sink along, kind of like Mathematica.
The problem with Mathematica from a science point of view is that, being closed source, it doesn't let you independently verify that the calculations are happening correctly. To be replicable, science involving data needs to use open source tools.
> The problem with Mathematica from a science point of view is that, being closed source, means you can't independently ensure the calculations are happening correctly
Have there been any high-profile failures root-caused to Mathematica (or MATLAB or any similar product) getting its sums wrong? I can't find any news stories about one. Plenty of serious calculations were and are done on "closed source" HP and TI calculators too. Every serious scientific instrument with its own data capture uses a binary blob somewhere too, so if that's a problem for you then you can't even trust the raw data!
And even if you have all of the code - you still need to worry if the proprietary, closed FPU is working “correctly”.
This sounds very much like a post-hoc justification for "it's free as in beer." Do you think Wikipedia is more trustworthy than real references too? What about blogs?
Science publication is moving (very, very, very slowly) towards a model where instead of a final, polished traditional paper, the raw data along with the software tools and interpretation is published. In principle this should allow readers to completely understand and reproduce the processing of the raw data, rather than reading a few paragraphs summarizing the processing done by the authors. Using a closed source tool for processing the data limits how deeply a reader can delve into the processing that the authors did, because the functions in the proprietary package are black boxes. Jupyter has no black boxes.
> In principle this should allow readers to completely understand and reproduce the processing of the raw data, rather than reading a few paragraphs summarizing the processing done by the authors. Using a closed source tool
"Science is facing a 'reproducibility crisis' where more than two-thirds of researchers have tried and failed to reproduce another scientist's experiments"
I don't think that can be handwaved away as "OMG closed source software!". Especially since all the scientists in a given field will have access to the same software anyway. Give them open source and the issue will persist, and we both know it, because the root cause has nothing to do with the license of the software.
Reducing the barriers for doing reproduction studies will likely increase how often they happen.
Or just consider peer review. How often does a reviewer today actually review the code and data used in a study? As far as I know this is essentially never done. That would involve seeing the code, running it, maybe messing with it a bit.
Open source software, in general, tends to make things more reproducible. Sure, software licencing might not be the singular root cause, but why does that suggest we shouldn't capitalise on the improvements available?
> Open source software, in general, tends to make things more reproducible
Citation very much needed for that. Because you can very easily find that 6 months or a year later you update your dependencies and everything is now broken. I recently came back to a Python project after a year, updated my packages then realized: I simply cannot be bothered to unpick the mess that resulted just to add one trivial feature. Whereas the poster child for backwards compatibility is closed-source and proprietary.
Science is not reproducible because there are no incentives for it to be so, despite everyone paying lip service to it. It's extra work and helps those who are competing with you for grants, after all. That's a social problem, not a software one. The software is irrelevant.
Open source software is important for reproducibility for a couple of reasons. Firstly, if you record that you've done your analysis with Python 3.6.3 and Numpy 1.14.2, and it later breaks on some newer version, it's relatively easy to get the same versions you were using. Commercial software vendors are usually not keen on you downloading and running a version of their product which was superseded four years ago.
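The "record your versions" step can be as small as a provenance stamp printed alongside your results. A minimal sketch using only the standard library (the function name is mine, and the "Python 3.6.3 + Numpy 1.14.2" pins from the comment above are the kind of thing it lets you recover):

```python
import platform
import sys

def environment_stamp():
    # Record the exact interpreter version and OS next to published
    # results, so version pins can be reconstructed later.
    return "python-{}.{}.{} ({})".format(
        *sys.version_info[:3], platform.system()
    )

print(environment_stamp())
```

The same idea scales up to `pip freeze > requirements.txt` inside a virtualenv, which pins the whole dependency tree rather than just the interpreter.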
Secondly, of course, open source means that if you're not sure why two versions/functions/libraries are giving you different answers, you can go and find out. I accept that a lot of people may not have time for that, but I don't think you can fix that problem unless it's possible to dig down and follow the working.
Finally, 'reproducible by anyone with a computer' is a lot better than 'reproducible by people who buy a license for the tool I used to do it'.
> If you record that you've done your analysis with Python 3.6.3 and Numpy 1.14.2, and it later breaks on some newer version, it's relatively easy to get the same versions you were using
Better record which compiler you used too, and what flags, and every version of every library and everything else. It’s not as simple as you make out and it’s far from guaranteed that all those packages will still be available or compile on your OS.
> Commercial software vendors are usually not keen on you downloading and running a version of their product which was superseded four years ago.
I guess you must not deal with vendors much because generally they are fine with this. It’s part of the support agreement usually, just another service. Getting an “obsolete” version for whatever reason has never been a problem for me.
By “anyone with a computer” you mean “anyone who can exactly reproduce my configuration which I don’t even know myself for certain”
Because the reproducibility crisis has very little to do with the difficulty in re-running the same code. These are almost orthogonal concerns.
The typical problem paper has a small data set, on which the authors tried 20 different things, one of which achieved p<0.05 and got published. The result tells you nothing meaningful about the world (or more often, tells you something about how its authors wish the world worked). But re-running their code on their data set will not reveal the problem.
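The "tried 20 things, one achieved p<0.05" failure mode is easy to see in a quick simulation on pure noise. A sketch (the expected rate comes from the standard 1 − 0.95^20 ≈ 0.64 calculation, not from any paper in the thread):

```python
import random

random.seed(0)

def spurious_hit(n_tests=20, alpha=0.05):
    # Under the null hypothesis, each test's p-value is uniform on
    # [0, 1], so "p < alpha" fires by chance alpha of the time.
    return any(random.random() < alpha for _ in range(n_tests))

trials = 10_000
rate = sum(spurious_hit() for _ in range(trials)) / trials

# Expect roughly 1 - 0.95**20, i.e. about 0.64: most of the time,
# at least one of the 20 analyses of noise "succeeds".
print(rate)
```

And, as the comment says, re-running exactly this code reproduces the "significant" result every time; reproducibility of the computation does nothing to flag the multiple-comparisons problem.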
Now, being open source does not prevent such errors. What I think is far more important here is open issue tracking: companies like Wolfram may not be exactly eager to let you know that there are serious bugs in their products (and what they are) unless they must. Being able to fix the bug yourself can of course be a real perk for some users, but realistically, it's out of the question for most.
Here is an article from a couple of years ago about how one group of mathematicians was misled by (and had to spend some time tracking down) a bug in Mathematica's determinant evaluation: http://www.ams.org/notices/201410/rnoti-p1249.pdf
I’ll add that many closed-source developers will show you sections of the source code if you sign an NDA. I’ve done that on several occasions when I needed to see how a model was implemented. As long as you are not a competitor, it’s usually not an issue. I sign NDAs often to see IP.
> The problem with Mathematica from a science point of view is that, being closed source, means you can't independently ensure the calculations are happening correctly. To be replicable, science involving data needs to use open source tools.
Abstract: "The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."
I think your estimate of 99% is a wee bit high. At least in the field I'm familiar with, astronomy, the idea of using Excel for any serious computation or design would be met with laughter.
Well since the author is an economist, questions about "real science" can be interpreted a few ways.
For a fun look into the high standards of the field, a few years ago Piketty made the mistake of sharing his Excel files, in which all sorts of crucial adjustments were hard-coded into tables of data...
Matlab's fine, but Excel has (had?) some serious defects in its basic statistical functions; in particular, it couldn't reliably calculate a standard deviation.
I'm not sure I agree with that. If you publish your methods and make your data available, then a replicator has all they need. They'd have to reimplement the data pipeline anyway; otherwise it's not a replicated result, it's just someone running your notebook again, with potentially the same bugs in it.
I hear what you're saying and you're right that WinPython & Anaconda certainly help, but the documentation is still a long way off from Mathematica in my opinion.
One thing in Python's favor though might be depth in certain categories. The machine learning stuff in Mathematica is very nice and high level if you want neural networks, but if you need particle swarm optimization (PSO) or genetic algorithms (GA), you'll probably have to write your own or grab someone else's notebook.
It was hard for me to come around to supporting closed-source software, as I've always supported Linux for exactly this reason.
As far as ensuring accuracy of calculations, having a very large and highly technical user base over several decades helps, but I'm not sure how much this happens in practice. If a statistician publishes a paper using R, is anyone really going to check the R module's source code? I bet this is a rare occurrence.