For all the talk of "best practices" and "training", the depressing truth is that guaranteeing correct software is incredibly difficult and expensive. Professional software engineering practices aren't nearly sufficient to guarantee correctness when heavy math is involved. The closest thing we have is NASA, where the entire development process is designed, and constantly refined in response to individual failures, to create organizational checks and balances with the lofty goal of making bugs all but impossible. Unfortunately, that kind of evolutionary process is only viable for multi-year projects with 9-figure budgets. It's not going to work for the vast majority of research scientists with limited organizational support.
On the positive side, such difficulty is also in the nature of science itself. Scientists already understand that rigorous peer review is the only way to come to reliable scientific conclusions over time. The only thing they need help with understanding is that the software used to come to these conclusions is as suspect as—if not more so than—the scientific data collection and reasoning itself, and therefore all software must be peer-reviewed as well. This needs to be ingrained culturally into the scientific establishment. In doing so, the scientists can begin to attack the problem from the correct perspective, rather than industry software experts coming in and feeding them a bunch of cargo cult "unit tests" and "best practices" that are no substitute for the deep reasoning in the specific ___domain in question.
I've spent a bit of time on the inside at NASA, specifically working on earth observing systems. There is a huge difference between the code quality of things that go into control systems for spacecraft (even then, meters vs. feet, really?) and the sort of analysis/theoretical code the article talks about. Spacecraft code gets real programmers and disciplined practices, while scientific code is generally spaghetti IDL/Matlab/Fortran.
There is a huge problem with even getting existing code to run on different machines. My team's work was primarily dealing with taking lots of project code (always emailed around, with versions in the file name) and rewriting it to produce data products that other people could even just view. Generally we'd just pull things like color coding out of the existing code and then write our processors from some combination of specifications and experimentation.
I'd agree that "unit tests" and trendy best practices are probably not the full answer, but the article is correct in emphasizing documentation, modularity, and source control. Source control alone would protect against bugs produced by simply running the wrong version of code.
Definitely. Obviously the software industry has a lot of know-how that would be invaluable to the science community. The critical point I was trying to make is that scientists need to understand the fundamental difficulty of software correctness before they can be expected to apply best practices effectively.
> the depressing truth is that guaranteeing correct software is incredibly difficult and expensive
There is a world of difference between the correctness of industrial programs that follow 'cargo cult best practices' and the correctness of scientific programs. This is achieved without incurring incredible expenses. That we can't go all the way by (practical) definition doesn't mean we shouldn't try to get further.
One of the main problems is convincing scientists, especially young ones, that their code sucks. Young programmers you can coach: you review their code, teach them what works and what doesn't, and they get better. Scientists who happen to write programs don't learn to become better programmers: they've got other things to worry about. There's nobody to help them, and since they're usually highly intelligent and overestimate their capabilities in things they don't want to spend time on (which is a way of justifying to yourself not spending time on it), they need all the more guidance to become good.
I'm a PhD student in Electrical Engineering. I'm currently working on a Monte Carlo-type simulation of the underwater light field for underwater optical communication (no sharks!). I'm doing the development in MATLAB and I recently put all my code up on Github (https://github.com/gallamine/Photonator) to help avoid some of these problems (lack of transparency). Even if nobody ever looks at or uses the code, I know every time I do a commit there's a chance someone MIGHT, and I think it helps me write better code.
The problem with doing science via models/simulation is that there just isn't a good way of knowing when it's "right" (well, at least in a lot of cases), so testing and verification are imperative. I can't tell you how many times I've laid awake at night wondering if my code has a bug in it that I can't find and will taint my research results.
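For what it's worth, the cheapest defense I know of is checking the simulation against a limiting case with a known analytical answer. A minimal sketch of that idea in Python/NumPy (not the actual Photonator code; the coefficient and depth are made-up illustration values) is to compare the sampled unscattered fraction against the Beer-Lambert law:

    # Minimal sketch: check Monte Carlo free-path sampling against Beer-Lambert.
    # c and z are hypothetical values chosen only for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    c = 0.5                      # total attenuation coefficient [1/m] (hypothetical)
    z = 4.0                      # receiver range [m] (hypothetical)
    n = 1_000_000                # number of photons

    # Distance to first interaction is exponential with mean 1/c, so the
    # fraction of photons reaching range z without interacting should be exp(-c*z).
    free_paths = rng.exponential(1.0 / c, size=n)
    mc_estimate = np.mean(free_paths > z)
    analytical = np.exp(-c * z)

    assert abs(mc_estimate - analytical) < 5e-3, "simulation disagrees with Beer-Lambert"

It doesn't prove the full model is right, but it catches a whole class of sampling and bookkeeping bugs, and it's the kind of test that lets you sleep at night.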
I suspect another big problem is that one student writes the code, graduates, then leaves it to future students, or worse, their professor, to figure out what they wrote. Passing on the knowledge takes a heck of a lot of time, especially when you're pressed to graduate and get a paycheck.
There's got to be a market in this somewhere. Even if it was just a volunteer service of "real" programmers who would help scientists out. I spent weeks trying to get my code running on AWS, which probably would have taken a few hours from someone who knew what they were doing. I also suspect that someone with practice could make my simulations run at twice the speed, which really adds up when you're doing hundreds of them and they take hours each.
I'm a M.S. student in mechanical engineering facing a similar situation, except I haven't put any code on Github (my advisor wants to keep it proprietary, but I probably would not bother putting it up even if he were ok with it).
I've written around 15000 lines of MATLAB for my research and only a handful of people will ever need to see it. Some is well-structured and nicely commented, but other parts are incomprehensible and were written under severe time constraints. My advisor is not much of a programmer and will not be able to figure it out, and I feel bad for leaving a pile of crappy code to the person who inevitably follows in my footsteps, but I ultimately have a choice between writing fully commented, well-tested, and well-structured code and graduating a semester late (at the cost of several thousand dollars to myself), or writing code that's "just good enough" to get results on time. This is a solo project (there is no money for a CS student to intern) and I'm not getting paid to write code unlike a professional programmer, so every second I spend improving my code beyond the bare minimum costs me time and money.
Even if I were able to tidy up and publish all of my code, most mechanical engineers would not be able to understand it because most can't write code. Those who can mostly use FORTRAN, although C is becoming more common. Nonetheless, even those who could understand my code would have little incentive to read through 15000+ lines of code.
Unfortunately, as far as research code is concerned, a lot of trust is still required on the part of the reader of the publication. I agree that the transfer of knowledge should be handled differently, but until there is a strong incentive for researchers to write good code it will continue to be bad. Especially when many research projects only require the code to demonstrate something, after which it can be put in the closet.
This concerns me. Is this kind of thinking pervasive in public academic institutions? Setting aside the copyright ownership issues that tend to accompany such discussions, would it not be better to be more open about the code in an attempt to gain peer review?
I understand your personal motivations about not publishing, but the statement about your advisor is what I'm worried about.
Yes it is. Often it's not for nefarious reasons - it happens a lot where I work because we use data from collaborators that is unpublished, and it's considered unethical to jump over them by releasing code or results based on it.
Of course, the problem is that it can sometimes take years to get large datasets published, and this means that the code gathers dust and gets forgotten in the meantime. By contrast, the papers and results aren't forgotten, because those are the things by which academic careers are measured.
I would personally support a wholesale change in culture in this area. Code and data/results/conclusions are not as separable as most scientists would like to believe, and often should be published as a unit. There has been a push in this direction for a while in the engineering sciences, but other informatics disciplines like biology lag badly in this respect.
The ethical considerations with regard to "jumping" collaborators indeed make sense.
As to the last point, perhaps it's time the scientific community took software into consideration along with the data and its resulting papers. At the least, acknowledge the problem. At best, decide where (alongside the data? with the paper in progress?) the software should be stored.
I wonder whether our priorities for research are misguided. Isn't research about extending the knowledge of humanity? Writing and passing on readable code would probably advance us further in total than everyone starting basically from scratch.
(I'm not faulting you, you just react to the incentives.)
Well, I went to a lecture by one of the most prominent scientists here in Brazil, where he explicitly said that the answer to your question is NO. Research as it stands today exists to feed the system. According to him, you:
* Publish, so you can get grants
* Use that grant so you can publish more
* Get more grants
* Get tenure somewhere in the middle.
I have to confess I was very disgusted by him saying that in front of such a large audience of scientists and graduate students.
I agree it's a problem, but I think you have to fix the incentives to make meaningful change. When people are thrown into a cut-throat competitive environment, with tenure clocks, multiple junior professors per tenure slot, requirement to bring in grants to fund your research or you get shut down, etc., it doesn't encourage people to be altruistic and sharing.
I think the problem is fundamentally one of economics. Research is good, but you have to decide how much money to allocate to it. In order to decide, you need a metric for performance. Really, only scientists are qualified to judge whether the results of other scientists are worth anything, so currently the only metric we really have is publishing in peer-reviewed journals. Ultimately, therefore, that's where the incentives end up.
When a more appropriate way of quantifying research output and its benefits is found, hopefully a beneficial change in culture will trickle down into the academic trenches.
How about trying to fix the current system by making somebody else using your software count as a "super citation"? (It could even arguably count as much as co-authorship.)
I think this is an excellent idea. If published software could be tagged via a unique identifier (like the DOI of a paper), then it could be cited by that tag just like a paper. Well written software might even get cited more than the paper it was published in.
It's not his fault that the system is set up in such a way that some random bureaucrat who's not close to the project can make a department unemployed with a wave of his hand. Aiming for the next grant is how you survive - it's not trivial for academics (or anyone) to move cities every year or so to follow where grants might land.
Until there is that job security, knowing that as long as you keep working you're not going to be randomly turfed out, this phenomenon will be a fundamental part of the academic career.
Well, yes, it is about extending the knowledge of humanity - but it doesn't happen in a vacuum, and is subject to a lot of the same constraints as any other human activity. And, as you say, there are incentives at work - if john_b's advisor had written "release usable MATLAB toolkit for $doing_whatever_john_b's_thesis_does" into his grant as a deliverable, you can bet that both john_b and the advisor would have made sure that it was in a releasable state, and also that the advisor would have had funds available to pay john_b to clean it up and get it ready to go - they would have been specifically allocated for that purpose in the grant's budget.
> graduating a semester late (at the cost of several thousand dollars to myself)
Really? Your funding isn't guaranteed?
When I did my Master's I was funded as an R.A. without my advisor/lab having to tap her particular grants. Grants and fellowships were usually seen as something "extra" for master's and Ph.D pre-quals students, not their main source of funding. I find it surprising that your school or department seems to (or is forced to) think differently.
PhD student funding is guaranteed in my department, but not funding for M.S. students. Initially I did have a RA, supplied by my advisor's start up funding (he was fairly new at the time). But when his grant applications were rejected and his start up funding ran out, I had to do a TA instead. TA-ships aren't guaranteed though, and at my school if you stay too long on a TA they take you off of it to make sure that other students have a chance at funding too. My advisor did ultimately get some grant money in, but by that point there were other students who needed it more than me.
It might have something to do with recent state budget cuts (it's a state-funded university). My department has also grown dramatically over the past few years, both in terms of faculty and students, so the graduate student funding will probably lag behind for a few years more.
There is a market, and it's called libraries. Eventually you will use a language where software carpentry and code reuse is a core feature, and tested, modular libraries for not only core algorithms, but also deployment and dev-ops stuff (like managing a compute cluster on the cloud) will have standard approaches.
This is starting to shape up on the Python side of things, but it has stagnated a little bit. People who can and do write the foundational code are oftentimes too focused on making the code work, and not at all focused on improving the quality of the ecosystem that their code is part of. Open Source is a great mechanism for many things, but polishing up the last 20% is not one of them.
"Eventually you will use a language where software carpentry and code reuse is a core feature."
Well, to some extent products like MATLAB solve this problem. For better or worse, I trust Matlab's ability to generate a (pseudo) random number, parallel process my functions, invert matrices, etc., etc.
On a broader level, thanks to the specialization of academia, chances are that the code I want to write isn't duplicated by others. Even if it is, I still have to trust them to have written it well - which is the whole problem here.
I believe I came across it at some point or another, but I haven't looked closely at it. For better or worse, it doesn't seem to be employed by others doing light field simulations underwater ... not sure why. I'd have to poke into it further. I was, however, seriously thinking about writing a plugin for Blender that would utilize my specific scattering phase functions in their volumetric ray tracing renderer ... no time though.
I want to write a "software style guide" for journalists and their editors.
Software and Code are both mass nouns in technical language.
"Code" can be in programs (aka, things that run), libraries (things that other programmers can use to make programs), or in samples to show people how to do things in their programs or libraries. Some people call short programs scripts.
When you feel you should pluralize "software", you're doing something wrong. You might want to use the word programs, you might want to use the word products, or you might want to just use it like any other mass noun, the way you would say "It turns out, thieves broke into the facility and stole some of the water": when talking about a theft of software, "It turns out, thieves broke into the facility and stole some of the software".
"he attempted to correct a code analysing weather-station data from Mexico."
This annoys me, and it is everywhere. It indicates the writer has no idea what they're writing about and presumes that it's not a process but a matter of getting the right answer. "Hold on a sec, let me get out my Little Orphan Annie's Secret Decoder Ring."
The biggest howler I saw was "The SSI unites trained software developers with scientists to help them add new lines to existing codes, allowing them to tackle extra tasks without the programs turning into monsters."
Ah, well - we /do/ come from different worlds. My initial reaction to the word "codes" is to look for the "plz send me the" somewhere before it.
That said, at least (some kinds of) EEs seem to have it better - the basic Spice simulator was released under a permissive license a really long time ago, and there are people like Fabio Somenzi who make available things like CUDD (it's also used commercially.) Mind you, these have a significant overlap with CS, where the culture is different. I would be very happy to see a good open-source EM field solver, for example.
What an excellent idea: Generate new jargon that's incompatible with the jargon used from the field you suck at.
Perhaps concerned scientists and editors should reject the bifurcation here and take on the lingo of the field that creates the tools they have to use and need to learn better, as a first step toward learning to program in a more responsible manner?
That attitude seems a bit provincial: the usage may be uncommon in industry software development, but it's not rare in some areas of computer science. For example,
"Code" also has connotations (self-contained, numerical, etc.) that make it distinct from "program" or even "library". A routine in ATLAS is a code, but Microsoft Word is not.
I think you have the chronology backwards. The use of "code" as a mass noun dates to the 1960s at the earliest (actually, I can't find a good example before the 1970s in brief searching), while the use of "code" as a singular noun to mean "implemented algorithm", and "codes" as the plural, dates back at least to the 1950s.
My chronology may still be backwards, but they should still swap over to the language of the mature field of software development, to better allow themselves to integrate its good practices.
The bifurcation is still harmful to them even if the software usage originated later than the science term.
Context matters. e.g. misusing physics terminology in a political metaphor over drinks is annoying but inconsequential. Misusing physics terminology in papers where the bulk of the work was physics, or in an article for ACM about why computer scientists are bad at physics moves from eye-roll to WTF territory.
Yes, I hear this all the time. I know that the fancy course 6 kids here on HN would poo-poo it, but it's a very common usage in scientific computing. I can imagine the origins and can hypothesize about why it persists (the festering petri dishes of programming culture that is "grad school"), but don't have a definitive answer.
John Tukey is widely credited with coining the term "software" in print in 1958, but I'll wager that "codes" actually predates that.
Scientists say "code" because "machine code" is a bit of a mouthful. They say "machine" because their computers used to be people, and an interrupt meant running excitedly into Sir George Everest's tent. If entrepreneurs had invented the machines, we'd call them "electronic clerks".
When the manual is titled "Theoria combinationis observationum erroribus minimis obnoxiae", you know you're dealing with legacy code. In that case it's a pretty cool legacy, though.
From my experience, the most used languages in scientific programming are Fortran 90, C/C++, and Matlab with a considerable number of legacy codes written in FORTRAN 77.
Also, FORTRAN 77 is most definitely not an interpreted language.
However, science should adopt modern computing's terms, so as to allow easier cross-training in modern techniques and make its use of programming go more smoothly.
Is this not simply a British English thing? I assumed it was, like "maths", since Nature is a British publication. HN users from the UK, can you confirm? gte is speaking about constructions in the article like:
"As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists."
"As recognition of these issues has grown, software experts and scientists have started exploring ways to improve the codes used in science."
Definitely not. Queen's English here, and "Maths" is a simple contraction of Mathematics, meaning "Codes" would make no lexical sense. Therefore, we say "Code", just like you.
My girlfriend is a PhD student in a pharmacology lab. I'm a software engineer working for an industry leader.
Once, she and the lab tech were having issues with their analysis program for a set of data. It was producing errors randomly for certain inputs, and the data "looked wrong" when it didn't throw an error. I came with her to the lab on a Saturday and looked through the spaghetti code for about 20 minutes. Once I understood what they were trying to do, I noticed that they had forgotten to transpose a matrix at one spot. A simple call to a transposition function fixed everything.
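To give a flavour of that failure mode, here is a hypothetical NumPy sketch (their code was MATLAB, and this is not it): a forgotten transpose throws shape errors for some inputs and silently produces wrong numbers for others, and a tiny hand-checked test case catches both.

    # Hypothetical sketch of the bug class described above, not the lab's code.
    import numpy as np

    def scores(data, weights):
        # data: n_samples x n_features, weights: n_outputs x n_features
        return data @ weights.T      # the buggy version omitted the .T

    def test_scores_against_hand_checked_case():
        data = np.array([[1.0, 2.0, 3.0]])
        weights = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 1.0]])
        # worked out by hand: [1*1 + 2*0 + 3*0, 1*0 + 2*1 + 3*1] = [1, 5]
        assert np.allclose(scores(data, weights), [[1.0, 5.0]])

    test_scores_against_hand_checked_case()
    # Without the .T, `data @ weights` raises a shape error here (1x3 @ 2x3);
    # for square inputs it would instead silently return the wrong numbers.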
If this had been an issue that wasn't throwing errors, I don't know whether they would have even found the bug. I've been trying to teach my gf a basic understanding of software development from the ground up, and she's getting a lot better. But this does appear to be a systemic problem within the scientific community. As the article notes, more and more complicated programs are needed to perform more detailed analysis than ever before. This problem isn't going to go away, so it's important that scientists realize the shortcoming and take steps to curb it.
I'm in a similar position to you (although I started out as an academic, I've worked in the software industry for ages and so end up helping my astronomer partner).
Anyway, I disagree slightly with your analysis. In my experience academics know that they suck at the "engineering" part and, to make up for it, are very diligent in making sure that the results "feel right". So I don't think what you described was luck - that's how they work.
In comparison, what drives me crazy is that if they learnt to use a few basic tools (SCM, libraries, an IDE, a simple test framework) they could save so much time and frustration.
[Related anecdote: last year I rewrote some C code written by a grad student that was taking about 24 hours to run. My Python translation finished in 15 minutes and gave the same answer each time it was run (something of a novelty, apparently).]
Not sure how your anecdote relates to the conclusion. Forgetting, or even knowing why, to transpose a matrix is not an example of a problem that can be solved by "a basic understanding of software development". Hell, I'm sure there are many decent hackers that don't know what a matrix is, let alone spot such errors within a long sequence of computations.
Bad code compiles. Good code works right. Great code is so obviously right you don't have to wonder.
*Those are the same formula, though the second one is missing some critical parentheses. I use the example because I have done exactly this and been bitten by exactly this, and now am fanatical about keeping my mathematical formulas clean and obvious.
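A hypothetical illustration of that kind of parenthesization slip (a made-up formula, not the one referred to above):

    # Same symbols, different formula: operator precedence bites.
    x, mu, sigma = 2.0, 1.0, 0.5
    z_right = (x - mu) / (2 * sigma)   # intended meaning: 1.0 / 1.0 = 1.0
    z_wrong = x - mu / 2 * sigma       # parses as x - ((mu / 2) * sigma) = 1.75
    assert z_right != z_wrong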
I suppose the tie-in is simply that we hackers think differently from scientists. It's much easier for us to visualize a complex tree of logic than for people who are not accustomed to it. The second purpose of the anecdote was to illustrate that there is a complete lack of software testing knowledge within the scientific community, or even the recognition of the need for it. All of the "testing" they do is on production data. There is no unit testing anywhere.
The problem I see with your girlfriend's program is more of a "verification" issue.
In the simulation sub-field I'm in, there is this "research development process" which includes "verification" and "validation" after the model is built.
Part of the verification is done by "third-party code reviews", in which a party unrelated to the program/project reviews the model description (a Word document) and does a line-by-line analysis of the code to check that the code matches the description.
I did that during my PhD (a Professor at INSEAD paid me to do a code review of a model).
In the case of your girlfriend's lab, they caught the error via "face validation" (the results looked wrong).
Yes, this is a huge problem. I am a software engineer working at a research institute for bioinformatics. The biggest problem I encounter in my struggle for clean, maintainable code is that management deprioritizes this quite heavily.
The researchers produce code of questionable quality that needs to go into the main branch asap. Those few of the researchers who know how to code (we do a lot of image analysis) don't know anything about keeping it maintainable. There is almost a hostile stance against doing things right when it comes to best practices.
The "works on my computer" seal of approval has taken on a whole new meaning for me. Things go from prototype to production on the strength of a single correct run on a single data set. Sometimes it's so bad I don't know if I should laugh or cry.
Since we don't have a single test, or ever take the time to set up a proper build system, my job description becomes mostly droning through repetitive tasks and bug hunting. It sucks the life right out of any self-respecting developer.
There, I needed that. Feel free to flame my little rant down into the abyss. :)
> As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists.
Just stop doing that!
Seriously, testing is not wasted effort and for any project that's large enough it's not slowing you down. For a very small and simple project testing might slow you down, for bigger things - testing makes you faster! And the same goes for documentation. And full source code should be part of every paper.
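Even something as small as pinning an analysis routine to a frozen, hand-checked result pays for itself quickly. A minimal sketch (the `analyse` function and its expected value are placeholders for illustration, not from any real paper):

    # Minimal regression-test sketch for an analysis routine.
    import numpy as np

    def analyse(samples):
        # stand-in for a real analysis step: a signal-to-noise-style ratio
        return float(np.mean(samples) / np.std(samples, ddof=1))

    def test_analyse_regression():
        fixed_input = np.array([1.0, 2.0, 3.0, 4.0])
        # 2.5 / sqrt(5/3), computed once by hand and frozen; any change that
        # alters it is either a bug or a deliberate, documented change
        assert abs(analyse(fixed_input) - 1.9364916731) < 1e-9

    test_analyse_regression()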
> Many programmers in industry are also trained to annotate their code clearly, so that others can understand its function and easily build on it.
No, you document code primarily so YOU can understand it yourself. Debugging is twice as hard as coding, so if you're just smart enough to code it, you have no hope of debugging it.
The point is that since software development is not their main goal or background, their practices tend to be ad-hoc. We know the value of testing and documentation, but they do not. People don't know to stop doing something until they know it's a bad practice. And they're not going to know it's a bad practice until they discover that fact on their own (which can be a slow process) or someone teaches them (faster, but potential cultural problems).
That they should is basically a given in the article. The question is how to make it happen.
The mindset of a scientist is that the code is a one-time thing to achieve a separate goal - data for a paper. The code isn't supposed to last, it's simply a stepping stone. For a lot of folks, whose research areas tend to move around, there isn't always the expectation that you'll get to a 2nd or 3rd paper on the same data.
Now, all of this is different if your research actually is building the model. But I'm speaking from experience on the rest. I've built plenty of software tools that I need "right now" to get a set of data.
It may not be intended to last, but it's still supposed to be correct. And, of course, there's probably gobs of software out there that was not intended to last, yet did.
The thing about scientific code is that it's often a potential dead end. The maintenance phase of the software life cycle is not as assured as it is in industry.
Writing good engineering software is not the scientist's goal so much as demonstrating that someone else with a greater tolerance for tedium (also someone better-paid) could write good engineering software.
Exactly. I'd go further: in industry, the software is typically the end product, and the quality of the software is inherently relevant. In science, the output (the prediction of the simulation, the result of the analysis, etc.) is typically the end product, and the quality of the software is relevant only insofar as it affects the quality of the output.
In practice, of course, the quality of the software often does affect the quality of the output---but time spent on software quality creates less immediate value than it does in industry.
I have all of my code on github under a CRAPL license [1]. It assumes a certain amount of good-faith from others, but I feel that if you're worrying about getting scooped, your problem isn't ambitious enough. Luckily, my adviser agrees, and is very in favor of open releases of data [2].
That will never happen, because the universities are bigger than the journals and will push back. The universities want to own the code if there's money to be made. Stanford made a small fortune from Google, for example. If journals required code review, other journals would pop up that wouldn't require it.
An often neglected force in this argument is that many practitioners of "scientific coding" take rapid iteration to its illogical and deleterious conclusion.
I'm often lightly chastised for my tendencies to write maintainable, documented, reusable code. People laugh guiltily when I ask them to try checking out an svn repository, let alone cloning a git repo. It's certain that in my field (ECE and CS) some people are very adamant about clean coding conventions, and we're definitely able to make an impact bringing people to use more high level languages and better documentation practices.
But that doesn't mean an hour goes by without seeing results reverse due to a bug buried deep in 10k lines of undocumented C or Perl or MATLAB, full of single-letter variables and negligible modularity.
Next they'll discover that when those scientists leave academia and become quants, they don't magically become any better at coding (but at least they now have access to professionals, if they recognize the need).
> This paper describes some results of what, to the authors' knowledge, is the largest N-version programming experiment ever performed. The object of this ongoing four-year study is to attempt to determine just how consistent the results of scientific computation really are, and, from this, to estimate accuracy. The experiment is being carried out in a branch of the earth sciences known as seismic data processing, where 15 or so independently developed large commercial packages that implement mathematical algorithms from the same or similar published specifications in the same programming language (Fortran) have been developed over the last 20 years. The results of processing the same input dataset, using the same user-specified parameters, for nine of these packages is reported in this paper. Finally, feedback of obvious flaws was attempted to reduce the overall disagreement. The results are deeply disturbing. Whereas scientists like to think that their code is accurate to the precision of the arithmetic used, in this study, numerical disagreement grows at around the rate of 1% in average absolute difference per 4000 lines of implemented code, and, even worse, the nature of the disagreement is nonrandom. Furthermore, the seismic data processing industry has better than average quality standards for its software development with both identifiable quality assurance functions and substantial test datasets.
Something I heard from one of my professors once: "A programmer alone has a good chance of getting a good job. A scientist alone has a good chance of getting a good job. A scientist that can program, or a programmer that can do science, is the most valuable person in the building."
I'm finishing up a degree in Computational Science where we are essentially trained in computational and mathematical techniques used in the physical sciences, and all I can think about is becoming an artist.
I don't think it's the science that adds value, I think it's the programming. The thing is, programming allows you to automate, simulate, measure and visualize complex processes. Science is all about complex processes, so if you have more powerful tools available to understand them, you will be much more valuable. Add to that, many of the physical sciences are hitting limits of physical experimentation and require simulations for further understanding.
I don't think the power of programming has truly shown itself, it should revolutionize every industry. It brings with it a different attitude towards solving problems and opens up new realms of possibilities. Social sciences are finally starting to look like real science thanks to big data and we have new knowledge industries. I'm personally most interested in how much art and education will change thanks to new powers of interactivity.
If you are interested in programming and arts, why not have a look at computer games? (Or even modern board games, where the algorithms are run by the players themselves.)
I grew up interested in computer games, and I was lucky enough to TA a game design class at my university. I love the ability to create interactions, but my interest has recently turned to designing interactions into things that aren't normally interactive. I had a lot of fun making an interactive data visualization at a recent hackathon, and I'm hoping to do more things like that.
I personally think games and science need to get much closer together, interactive learning is so powerful, and video games can make anything fun. It's definitely something to explore, but there is still a huge divide between science and entertainment and the understanding of the people in each field.
Georgia Tech has a fun Mobile Robotics Lab, and there are several other places that you could study further. (You'll gain training in the actuators, sensors, etc you'd need for your artistic work).
"People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines."
This seems at odds with the statement from the article
"There needs to be a real shift in mindset away from worrying about how to get published in Nature and towards thinking about how to reward work that will be useful to the wider community."
EDIT: What the heck is wrong with this? We have two opinions on the perceived value of programmers in scientific enterprises, one from someone who works in the field (David Gavaghan) and another from Zed Shaw. I'm highlighting that Zed's perception is not universally agreed upon.
This sentence strikes me as astonishingly hypocritical. Nature can immediately do 2 things to make the science they publish vastly more useful to the wider community:
1) Remove the paywall
2) Require publishing the code for computational papers (and the data for experimental papers)
Nature Group only cares about maintaining its status as a high impact factor journal, and scientists sheepishly submit to them. They actually love it that scientists worry about getting published in Nature.
Don't feel too bad. HN has become full of people who believe that the correct response to something they disagree with is to mod down, instead of replying with a rebuttal. It sucks, but others will mod you up.
Nobody is paid for doing research. The salary of a professor is about 1/3-1/5 of what the same person may get in industry. People who do research don't do it for the money.
Yeah, but you know who gets even more screwed, salary-wise, than the professor? I'll tell you who: their lab techs and staff scientists ("research associates"), and that most definitely includes the "programmer who knows some science".
Furthermore, at least at my institution, being the "programmer who knows some science" means that your position is entirely funded with "soft money", which means that your level of job security can be pretty low.
I agree with your point, but I'd add the modifier that if you're a top-notch developer who forms an interest in a specific sub-discipline in science (my field is genomics, but there are many others where this would be equally applicable), there is certainly huge potential to carve out a niche and make a big name for yourself in your chosen field (as long as you take the science side as seriously as the coding side).
Then you certainly could make big money, as Shaw implies and as you'd surmise from reading the linked article.
Everyone thinks they're underpaid, but the truth is, academic skills do not transfer well to industry. I refer you to the discussion in Ghostbusters:
"Personally, I liked the university. They gave us money and facilities, we didn't have to produce anything! You've never been out of college! You don't know what it's like out there! I've worked in the private sector. They expect results."
That's fair, but you know what those professors get? Tenure. ie not getting fired unless they're caught in bed with a live boy or a dead girl. And they typically get paid with hard money as stevenbedrick notes.
This is also true in business. You should always remember that you're not just a developer or a programmer but that you're solving business problems. That's the only way to truly make yourself invaluable.
This is so true. I work in a research lab and I'm trying to interest myself more in the science, and it's really helping my coding work. It's easier to improve software when you know what your users need.
That has not been my experience. I spent years working on DNA aligners then in a wetlab building software for confocal laser microscopy. In both locations, the best paid and most highly valued people were the scientists. If you, say, were a good developer with masters in stats and a strong understanding (somewhere between an undergrad and an MS) of the relevant science... you were paid 1/2 as much as you would be paid if you did computational advertising.
And yet there aren't many good developers doing science. Weird, huh?
Agreed that low salaries are a disincentive for good engineers to work in science. It's worth noting that even the scientists are paid far less than people with commensurate training and experience in industry.
As for why scientists are more highly valued: they bring in the grants that keep the wheels turning (in academic circles; industry & national labs obviously differ).
That's possible. I have a friend who works in the university aerospace lab that can read and write C and Matlab. His major contribution is being able to both understand the engineering and write code to manipulate it, and it's basically guaranteed himself a position there. He may not be valued quite as much as the PI or one of the other professors, but he's pretty solid.
I'd also cite myself, but I don't count since I'm in a robotics lab.
Sequence analysis companies and labs which don't value software engineering get what they pay for: serious or crippling inefficiencies and inability to do analysis on their data or maintain continuity. Unfortunately, many of them don't even realize what they need or how bad their inefficiencies are.
(Disclaimer: my background is in materials physics, and it may be different in other fields. But I doubt it.)
Unfortunately there is very little direct incentive for research scientists to write or publish clean, readable code:
- There are no direct rewards, in the tenure process or otherwise, for publishing code and having it used by other scientists. Occasionally code which is widely used will add a little to the prestige of an already-eminent scientist, but even then it rarely matters much.
- Time spent on anything other than direct research or publication is seen as wasted time, and actively selected against. Especially for young scientists trying to make tenure, also the group most likely to write good code. Many departments actually discourage time spent on teaching, and they're paid to do that. Why would they maintain a codebase?
- Most scientific code is written in response to specific problems, usually a body of data or a particular system to be simulated. Because of this, code is often written to the specific problem with little regard for generality, and only rarely re-used. (This leads to lots of wheel re-invention, but it's still done this way.) If you aren't going to re-use your code, why would others?
- If by some miracle a researcher produces code which is high-quality and general enough to be used by others, the competitive atmosphere may cause them to want to keep it to themselves. Not as bad a problem in some fields, but I hear biology can be especially bad here.
- Most importantly, the software is not the goal. The goal is a better understanding of some natural phenomenon, and a publication. (Or in reverse order...) Why spend more time than absolutely necessary on a single part of the process, especially one that's not in your expertise? And why spend 3x-5x the cost of a research student or postdoc to hire a software developer at competitive rates?
I went to grad school in materials science at an R1 institution which was always ranked at 2 or 3 in my field. I wrote a lot of code, mostly image-processing routines for analyzing microscope images. Despite it being essential to understanding my data, the software component of my work was always regarded by my advisor and peers as the least important, most annoying part of the process. Time spent on writing code was seen as wasted, or at best a necessary evil. And it would never be published, so why spend even more time to "make it pretty"?
I'm honestly not sure what could be done to improve this. Journals could require that code be submitted with the paper, but I really doubt they'd be motivated to directly enforce any standards, and I have no faith in scientists being embarrassed by bad code. Anything not in the paper itself is usually of secondary importance. (Seriously, if you can, check out how bad the "Supplementary Information" on some papers is.) But even making bad code available could help... I guess. And institutions could try to more directly reward time put into publishing good code, but without the journals on board it may be seen as just another form of "outreach"--i.e., time you should have been in lab.
I did publish some code, and exactly two people have contacted me about it. That does make me happy. But many, many more people have contacted me to ask about how I solved some problem in lab, or what I'm working on now that they could connect with. (And are always disappointed when I tell them I left the field, and now work in high-performance computing.) Based on the feedback of my peers... well, on what do you think I should've spent my time?
I'm working in biology now, and a good example of researchers who produce quality, documented code that other people find useful is the Knight group at UC Boulder. They write python with good docs and support, publish the algorithms they come up with in bioinformatics journals, and people cite them all the time.
Might be worth thinking about why there are incentives there and not elsewhere.
That is an excellent example and a good point. But, for what it's worth, the Knight lab doesn't really do any biology. Most of their biology is done by collaboration with other labs, and the people in the lab are almost entirely programmers or database people. There's nothing wrong with that, but it's more an example of programmers getting into biology than the other way around.
Another place where good software work is done is Broad Institute. The reasons are as follows: (a) Broad can hire the best people in bioinformatics, (b) they are a relatively large organization where focusing on the process pays off. Software ultimately is process (i.e. how you do stuff) and small labs often cannot afford to focus on the process and instead try to reach the goal (i.e. publishable results) more or less directly, regardless of the inefficiencies they may encounter.
In the past, the model of having many small labs in universities was a great idea. Today things are looking a bit different in some fields, because larger labs can afford to do more automation (by hiring programmers instead of graduate students).
My PhD was in computer science and my experience was quite similar.
I wrote probably around 3000 lines of code on 4 separate projects (mostly MATLAB, C and Java). This code was never shared with anyone, my advisors were not interested in the code, all they cared about were the results. To be honest it wasn't very good code, I would have a hard time understanding it now (although I could probably figure it out eventually).
And after I graduated I took the code with me and I am the only person who ever verified the working of the code.
This bothers me on some level, since no one can really verify and inspect the results of my publications (unless they tracked me down to ask me for the code some of which has been lost) - but it is pretty much the norm in my field.
There was an interesting discussion about this on the Theoretical Computer Science Stack Exchange a while back:
Bottom line: yes, we should probably do it (especially in areas where the research is simulation and the code encapsulates all the results), but we probably won't unless we're pushed.
I have a PhD in Comp. Sci. too, and continue working in academia.
Regarding your code, you could have just uploaded it to SourceForge or any other OpenSource repository. I know a guy (Steve Phelps) who did exactly that ( http://sourceforge.net/projects/jasa/ ) with his PhD code.
On a related note, the institute where I am working now has this "great" simulation program (homemade, in C++) for which a lot of publications have been written. However, the code is closed source and thus cannot be third-party verified.
This is wrong, and actually, a colleague of mine who just started doing her PhD found an error in the simulation program, bad enough that it makes me question the previous research.
In my opinion it must be a requirement that all software related to a publication must be made open-source before (or at the same time) the paper is published.
In the traditional research method, computer programs are part of the methods of the research. It is amazing that nowadays researchers can publish research without clearly showing the process they used to arrive at their results.
Don't forget that a particular code might have 2 or 3 papers' worth of results in it, so releasing the code after 1 paper could mean getting "scooped" on another paper.
I'm left a little cynical after a Master's in computational science, and I still can't believe that open code is not part of the repeatability doctrine. I suppose my goals are not aligned with most grad students since I have no interest in an academic career (at least not after many years in industry) but I got much more satisfaction from feedback on my blog posts than publication.
Hell, each blog post is its own little publication, and it may not be peer reviewed before it's published, but the number of links to them and Google searches prove that I have more than a few peers who appreciate my contributions.
It might also spawn a new collaboration. There are dishonest people in science, but anyone scooping your work has to weigh the risks of getting called out for it, which is more likely if your software is good and widely used.
I don't believe in the private model, so I release code when it's ready, regardless of where it fits in the publication cycle. It's pretty neat from a reproducibility perspective to submit a paper based on code that is runnable as a tutorial example shipped with a library that the reviewers stand a good chance of already having installed.
From my experience, even in a field as close to computer science as robotics, your analysis is correct.
In my opinion, publishing all code that was used for the paper should be mandatory. Everything else is an obvious violation of the confirmation requirement in the scientific method.
But just like with Open Access, I have little hope that this will be adopted on a wide scale soon. If you are a student, I believe all you can do is get permission to publish your code and do so. Maybe this will hurt tenure, but it increases your karma!
Another reason - a bug (and there are always bugs) would probably invalidate the paper, possibly causing a retraction. Retracted papers are not seen in a positive light by the science establishment.
Careers could be destroyed if people were held to account.
People in software see this as ludicrous. Of course there's bugs, just update the conclusions, and move on! But that's not how a lot of scientists think.
Just publishing the code is not enough, in any case. In order for the research to be verifiable, everything, from raw data to the final paper (and notes on how you went about the process) should be properly documented and available. Something along the lines of this: http://rr.epfl.ch/
Of course, the problem with this is that it's a large amount of work and in most cases probably doesn't have a good ROI.
I have recently started to try this approach of better documenting everything, mostly because I have found it hard to go back to work I did 6 or 7 years ago and understand it (e.g. a bunch of one-off, poorly documented data processing scripts that could, if properly done, save me some time today). I haven't published anything like this yet, but it looks promising.
I'm still kind of amazed that when scientific results are based on original software, the sourcecode isn't required by the journals for peer review. How are people supposed to check the results?
I think it is unreasonable to expect that a person will be a good programmer just because (a) they are a scientist and (b) their current project can be assisted by computers.
Is it not sensible, perhaps, to have a dedicated group of programmers (with various specialities) available as a central resource to assist the scientists with their modelling? (I am imagining a central pool whose budget would be spread over several areas.)
I personally love working on toy projects related to science. Maybe we hackers with time for that kind of thing should volunteer in some way to assist with the technical aspects of research that is directed by a scientist? I'm not sure I'd even care about getting a credit on a research paper so long as I could post pretty pictures and graphs on my blog...
Greg Wilson once commented that the subversive way to get scientists to use source control was not to pitch it as a code history tool, but rather as a nifty way to sync up code between their work machines, home machines, etc. He said he had a lot more traction with that than trying to lecture them about having code history.
I'm guessing Dropbox has introduced many a scientist to the wonders of code history/portability. I was pretty reluctant to move to Git when Dropbox worked fine.
That's probably good for arguing about good backup practices too. In the department I worked in it was common for grad students to have months of code written on their laptop that was not duplicated anywhere else. The problem is that these students were working independently and didn't really have a need to transfer the code anywhere else.
One of the main sources in the article is a study from the 2009 Workshop on Software Engineering for Computational Science and Engineering. One of the workshop's organizers has a report on the overall conference which is interesting: http://cs.ua.edu/~carver/Papers/Journal/2009/2009_CiSE.pdf
Rather than building these data analysis/visualization programs from scratch each time, my thought is that scientists should instead be writing them as modules for a data workflow application like RapidMiner.
If you haven't heard of RapidMiner, you basically edit a flowchart where each step takes inputs and produces outputs, e.g. take some data and make a histogram, or perform a clustering analysis.
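The same idea works even without RapidMiner: write each step as a small function with explicit inputs and outputs, so steps can be swapped, reused, and tested on their own. A rough Python sketch (the file name and steps here are just placeholders):

    # Sketch of a "workflow of steps with explicit inputs/outputs" in plain Python.
    import numpy as np

    def load_measurements(path):                 # source step
        return np.loadtxt(path, delimiter=",")

    def histogram(values, bins=20):              # analysis step
        counts, edges = np.histogram(values, bins=bins)
        return counts, edges

    def run_pipeline(path):
        values = load_measurements(path)         # each step only sees its
        return histogram(values)                 # declared inputs, so it can be
                                                 # reused or unit-tested alone

    # counts, edges = run_pipeline("measurements.csv")   # hypothetical file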
There are a lot of suggestions that the code and data be required to publish.
Sorry guys, but that hasn't worked so far: the economics journal _Journal of Money, Credit and Banking_, which required researchers to provide the data & software that could replicate their statistical analyses, discovered that <10% of the submitted materials were adequate for repeating the paper (see "Lessons from the JMCB Archive", Volume 38, Number 4, June 2006).
I did some work on data visualization for the astrophysics department when I was in college. I started to work with the simulation code, but found that the math was sprinkled everywhere, which made it really difficult for me to make structural changes without risking the integrity of the program.
One of the most elusive skills for self-taught programmers is how to structure code properly. A good architecture would allow ___domain experts and non-expert programmers to coexist, but that would require throwing away a lot of existing spaghetti code written by ___domain experts, which is not going to be a popular decision.
I'm a programmer who is studying physics in college, and a couple years back I had a similar experience with simulation code as you did. I didn't have any issues with the math—the program I was working on didn't have anything more conceptually advanced than multivariable calculus—but I did struggle significantly to understand the physics behind the simulation.
It didn't help that most programs use, for example, the variable 'rho' for density instead of just writing out 'density'.
On the other hand, reading game physics libraries (written by programmers, not physicists) can be just as bad. There are physics hacks all over ("it's not stable, so let's throw in an arbitrary constant") and there's code repetition where the programmer doesn't understand that two concepts are closely related.
This is why I think, rather than every data analysis/visualization program being written from scratch with its own custom UI, I/O, formatting, etc., these programs should be written as modules for a data workflow program like RapidMiner that handles all that for you.
I'm lucky to be in a lab with enough funding for this. This is rarely the case. In medical science it's easy to get large grants and pay programmers; in something like biology, the programmer is usually that guy who learned Perl, since he's already getting paid to do science.
Grad students probably wrote the majority of the tools used in my lab, and it shows: when they leave, their knowledge of the different issues and bugs in the software goes with them. Years later those issues resurface and no one has any idea of the thought process of the original author.
It's quite annoying. We have such a project right now: a basic piece of software we use for all our research. There is no current funding for someone to sit there and clean up the code. Most funding agencies want new work, not maintenance work, to be done with their money. There just aren't any incentives anywhere for this.
This happens already with mechanical and electrical engineers in my field (space physics). There are challenges, though: the engineers are paid less than they would earn in industry, and long-term employment is contingent on scientists winning new grants which call for their skills. So, the low value for expertise (in academia) is largely driven by the grant funding system.
I could see a role for staff computer scientists in areas of research where computation plays a particularly large role, but for typical grad-student data munging, the cost/benefit ratio is likely far too high.
Even when money allows for this, it's not always possible to segregate the work. I.e. a scientist may fully understand the mathematics of his work, but not understand that translating the math directly into code will cause serious performance issues. A programmer should understand the performance issues, but without a deep and intuitive understanding of the mathematics, including what assumptions are implicit therein, will often have difficulty making the translation.
In many cases, projects with sufficient technical depth require detailed ___domain knowledge that can be acquired more quickly and cheaply by hiring scientists or engineers who have it and teaching them to code (or to code better), rather than by segregating the work and dealing with the problems that result from the communication gap.
It's usually outside the scope of what you can pay for using a grant, and grants are currently the "standard" way to fund science. In this particular case, the "permanent" part also doesn't fit with grants being awarded for specific goals and for fixed periods of time.
There aren't nearly enough open-source academic projects, nor is there any sort of pervasive culture that encourages them. Despite the litany of examples showing that open source plus academia does exist and does work, I've read far too many computational-physics, computational-chemistry, or computational-anything papers that simply do not publish source code, and in my opinion there's no good excuse for it other than the usual: funding, or copyright/university IP.
There is an important factor discouraging the publication of source code: fear that there are indeed bugs and that they will be exposed. This is blatantly "security through obscurity", but I fear it's a common attitude. If there are bugs and the code is secret, then even if someone later points out that the results contradict their own findings, it's (presumably) not difficult to sweep the whole thing under the rug and let it cool down. On the other hand, if the paper is published, the code is public, and someone spots serious bugs, it's instantly a big embarrassment... (Code review as part of peer review would help, but it's very unrealistic: reviewing is already very time consuming.)
In addition, there are really no structural or institutional incentives to produce and share good-quality scientific code. Maintaining good code costs a lot of effort and, currently, yields few short-term benefits. It's often easier to produce crappy code, get the results, publish, and move on.
Typically a review paper describing the software is what is actually cited, but yes!
For instance, in my department there is a guy who maintains an astrophysical software package called Cloudy. The faq[1] describes how to cite it. (Unlike a lot of the software mentioned here, that project actually is open source, uses version control, and was migrated from the original Fortran to C++.)
Where do most programmers get this exposure to best practices like version control, unit testing, etc.? I took a few early-to-mid-level CS classes, and while there was a relatively cursory emphasis on readable code, there was barely any on the sorts of things that lead to well-maintained projects. If these are things one learns at a first internship, then it's no wonder that academics in other disciplines have no exposure to them.
The vast majority of students are never exposed to these concepts and those who are usually teach themselves. We've been teaching a class called "programming for biologists" that teaches practical skills that are needed in scientific computing - simple database use, version control, etc. - and we've seen huge demand from students and faculty members in many departments.
This is a difficult situation. Is it easier to train the ___domain experts to be competent programmers or train the competent programmers to be ___domain experts? In a research environment, I worry there's little time or interest in developing specs that can change in an instant or can't be written until the physics is understood.
We find it quite difficult trying to get programming out of people who don't know why Carbon has 4 bonds while Nitrogen has 3, for example.
My feeling is that a one-semester required course for students in "software carpentry" [1] (as developed by Greg Wilson and discussed in the article) would cure many of the most serious ills in scientific software development. Students can't know they should be using version control, debuggers, and testing if they don't even know such things exist.
I think there are multiple reasons for this problem, and only one of them is a lack of training in software management. Another problem is that science is an inherently exploratory procedure. You design an experiment, gather some data, and then go about analyzing it. You have an idea of what you'll find, but depending on what you get, you might need to then reformat/restructure the data, transform it, cut it up, etc.
The problem is that this represents one of the worst cases in software design: evolving requirements. By itself that is bad enough; I've been living it while analyzing data from a recent study. You start off with a data structure that you think represents things, but then you notice, for example, that you need to synchronize several recordings; now you have to track time. You realize some recordings need to be split down the middle to aid in synchronization; now you need to add a 'part' field. You derive some value from several data points that takes a long time to compute, so you need to create a file to hold it, and that file needs to be kept in sync with the original data. Eventually you realize that text files aren't going to cut it, so you start moving things to a database. Now you need to reconfigure your visualization program to read from the database. Then you realize that you want to add another similar derived value, but this time it's a 3x3 matrix for each data point; time to extend the database again. Etc., etc. Eventually you decide it would be best to rewrite the codebase because it's becoming impossible to work with. Unfortunately, the paper is due soon and you just need to generate a few more graphs...
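One small habit that has saved me from the "derived value in a separate file drifts out of sync" failure is to key the cached value on a hash of the raw data instead of trusting file names; a rough sketch (the function names are mine, not from any real pipeline):

    import hashlib
    import json

    def cached_derived_value(recording, compute, cache_path):
        # 'recording' is a NumPy array; 'compute' is the slow function and must
        # return something JSON-serializable (a float, or a nested list for a matrix).
        key = hashlib.sha1(recording.tobytes()).hexdigest()
        try:
            with open(cache_path) as f:
                cache = json.load(f)
        except FileNotFoundError:
            cache = {}
        if key not in cache:  # recomputed automatically whenever the raw data changes
            cache[key] = compute(recording)
            with open(cache_path, "w") as f:
                json.dump(cache, f)
        return cache[key]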
And I didn't even mention the growing directory of scripts that aren't properly organized into modules, that end up with copy-pasted code because it's not very clear how to cleanly put this into a function, or which module it should belong to.
Now, this is bad enough when you have a CS degree and have designed several software frameworks in your life. Combine it with someone who knows nothing about software architecture and you have a really big problem on your hands. My point is this: it happens to the best of us, no matter how hard you try to organize things, when you don't have the requirements available ahead of time.
The best approach I've found is to force myself to write functions that are as small as possible and do one simple thing at a time. I try to break up functions as much as possible for reuse and avoid copy-pasting code at all costs. Admittedly it's not always easy; sometimes a function that generates a particular graph just needs a certain number of lines of logic and is very difficult to modularize. Then you find that you want a similar graph but with a slightly different transformation on the Y axis... etc., etc.
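For the graph case specifically, what has worked for me is making the "slightly different transformation" a parameter rather than a second copy of the function; a minimal sketch with invented names:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_series(x, y, y_transform=lambda v: v, label=None, ax=None):
        # One small job: plot y against x, with the Y-axis transformation passed
        # in, so variants are call-site changes rather than copy-pasted functions.
        ax = ax or plt.gca()
        ax.plot(x, y_transform(np.asarray(y)), label=label)
        return ax

    # plot_series(t, signal)                        # raw values
    # plot_series(t, signal, y_transform=np.log10)  # same code path, log-scaled Y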
GarlicSim's goal is to do all the technical, tedious work involved in writing a simulation while letting the scientist write only the code that's relevant to his field of study.
It could be that they're not using the right language. If they had a ___domain-specific language on top of Common Lisp, for example, they would get much better code with less work, I think.
I wish this kind of mentoring program would be implemented here (I work in a big research center). Often we programmers end up having to integrate code written by scientists into our apps, and it's a pain. Even a quick glance over the code is often enough to spot problems.
I know of a company, made up of scientists from academia, that develops software by writing the code (or "codes", as they call it) in Microsoft Word documents and e-mailing them to each other.
Oh my god, yes. As someone who cut their teeth developing software for pharmaceutical research, I can testify that much of it is absolute crap, by a variety of metrics.
It's funny how this exact same mistake is made in the linked paper. For some reason, people outside of IT can't get it into their minds that "code" is an uncountable noun in this context.