Is “data scientist” the new “programmer”? (blogs.harvard.edu)
290 points by jdhzzz on Sept 18, 2018 | 239 comments



Something about this article strikes me as a thinly-veiled complaint about poorly designed object-oriented systems. Take, for example, this comment by the author:

>Even if the money were half of what today’s coder gets paid it might still be a better job because one is spared the tedium of looking at millions of lines of Java that do almost nothing!

What all those millions of lines of code represent is abstraction, decoupling, and modularization of logic/responsibility. This is hard-won knowledge from the field of software engineering. Granted, a lot of it is probably very poorly designed or organized. But the problem is the design and not the philosophy.

Because scientists all use the same basic rules of math, while each business has its own special rules (e.g., not all payroll software implements the same policies/axioms), it's easy for the hard logic of scientific work to live in a general-purpose library. "Normal" developers need to customize their own rules, or in other words, develop their own services, unlike the data scientists.

Now if every data scientist had to roll his/her own version of numpy, pandas, scikit-learn, tensorflow, etc., the author would probably be decrying the deluge of procedural spaghetti produced by data scientists. The data scientists' notebooks look simple because all that indirection is hidden away in the many libraries.


All of today's software is built on millions of lines of code. This isn't much of a problem when it's hidden away behind a good abstraction. Being written in another language forces the API to be documented well enough so you don't need to go deeper: "native code", "kernel code", "part of the browser".

Crappy million-line Java apps are generally crappy not due to raw line count but rather due to leaky abstractions and badly designed APIs, so you do have to navigate through a lot of that code.


I sometimes wonder if programmers instinctively overcomplicate things in the interest of collective job security. Some of the stuff I've seen in (particularly awful) Java code bases is perplexing to the point where it seems intentional.


It's more likely the natural entropy of code: it's easy to add stuff to a system in a way that makes it messier, and if the system is already a big mess, it's much harder to make non-messy additions. The bigger the mess, the harder it is to start cleaning it up.


I refer to this as the "spaghetti law of attraction". The burden of refactoring things gets higher and higher and no one wants to touch it. So they just add another try-catch block and do some side effect and get the PR merged.


It is also a question of management: it is easier to justify adding new features than keeping things nice and tidy, especially as finding good abstractions is very hard and time-consuming.


True that. Adding complexity is easy, keeping things at the right level of simplicity is hard work and requires skill. Not just for coding.


This is the real answer!


Writing clear and as-simple-as-possible code is one of the most difficult parts of programming, one of the hardest skills to learn. In fact, many programmers don't ever learn it. Some do not even realize it's a problem.

I could talk about people unprepared for the jobs they do, and companies and managers setting unrealistic deadlines, having unrealistic expectations, etc. But I think that, more often than not, the issue is that to really build good code you need a lot of experience and to plan and design really well from the start. And simply, most companies and products start as they can. If you start with a deficient codebase, trying to fix that later under heavy budget and time constraints is pretty much impossible, and it's so accepted that some people believe it's the normal way to work.

And even with proper design, clear perspective, and good programmers, you still have to be lucky that no one higher up the ladder imposes decisions that f*ck up all the good work done. There are a few more factors.

What I mean is that you might find all kinds of people, but in general, for those of us who care about what we are doing it's the exact opposite. We consciously try really hard to keep code as simple as possible.


programmers instinctively overcomplicate things in interest of collective job security

It’s more likely boredom. Cranking out yet another CRUD app is more fun if you do it in a weird way. And a new “skill” for your resume.


Don't forget that one Spring scheduled job that goes rogue and wrecks your sanity constantly.

/mylife


>But the problem is the design and not the philosophy.

If the philosophy prescribes dozens of tools for managing complexity and no tools at all for reducing it, then it is the problem.

"Abstractions, decoupling, and modularization of logic/responsibilty" are not some kind of universal good. They are only useful within specific contexts. A lot of software engineers do not understand this and routinely engage in premature abstraction. As a result they produce systems that are 10 times more complicated than they need to be for absolutely no reason.

Java definitely encourages this kind of mentality, because the language itself and its standard library are lacking in some fundamental areas. The introduction of lambdas and streams helped, a lot, but the overall mentality is still well-entrenched.


I've been hearing the same complaint for years about Java versus Cobol: it was simpler before, more efficient, etc.

Of course it was, but you were tied to one system (no application server), security was login/password, databases had no constraints, type systems were extremely limited, everybody had their own way of writing batches (no Spring), and business code was mixed with tons of technical code (no JPA).

Now, sure, if you glue some R, some SQL, etc. together you can extract insights worth millions of dollars. But all of that exists only because we have digitalized all of the processes, data collection, etc. And the rise of data scientists will continue only if more stuff gets put into the databases, thanks to you plain, regular, normal programmers...


> But all of that exists only because we have digitalized all of the processes, data collection, etc. And the rise of data scientists will continue only if more stuff gets put into the databases, thanks to you plain, regular, normal programmers...

IMHO there's far too little attention paid to how data might be valuable in an economic sense when storage strategies are being designed by database designers. I recently gave a talk at a developer conference and was really surprised at the level of pushback to adding more data elements or higher-precision data "just in case it might be useful".

The preconception that you have to be maximally efficient with storage has led to huge quantities of valuable data being lost.


The net value of nearly all data is negative. It's a liability, not an asset. Unless you are actually selling data to customers or solving a specific problem that requires collection of specific data, it's just wasting the company money. You would never hear a manufacturer say "let's stockpile aluminum in case we need it". But that line of thinking comes up all the time with respect to data.


>> The preconception that you have to be maximally efficient with storage has led to huge quantities of valuable data being lost.

I 100% agree. Many programmers are trained as if memory and CPU were finite resources. Although that's true, in many cases that reality may be safely ignored, opening up tons of opportunities (to store data, to develop faster because you don't optimize, etc.).

At a time when my phone has gigabytes of memory, I'm always surprised that some people ask me to put a limit on a text field. I understand the technicalities behind the question, but from a conceptual point of view, that's often pointless.


Personally, I am increasingly convinced that a lot of this hate comes from programmers with weak abstract thinking who simply can't do it. Instead of admitting that there is a learning curve involved, they will claim the system is bad and everyone else is bad. A compounding factor is the difficulty of dealing with a system that was written by different people who held different opinions.

Yes, there are badly designed large systems. No large system is perfect. However, there are also reasonably designed large systems, including in Java, and Java is used in such systems for a reason. It is more challenging to write a large system. Yes, it is harder when parts of the system are written in a style that was considered best practice a few years ago but has since been abandoned.

If you are spending a lot of time looking at millions of lines of Java that do almost nothing, then you likely don't really know what is what and need to read up more. At least that is my experience.


Badly designed systems aside, for most systems that currently exist or have existed in the past, there is little or no documentation that is worthy of existing. Most comments in code, and the associated documentation in manuals, fail to provide the reasoning as to why the code exists, why it is written that way, what underlying assumptions have been made, etc, etc, etc.

I am going through a process at the moment of documenting all of my local codebase. It will, in turn, be turned into a literate programming base. The problem I am finding is understanding all of the assumptions that underlie the original code. Why was it written this way or that, what is it trying to do, is the code actually doing what it is supposed to do?

There are, at present, some questions that I am having difficulty answering, and this is my own code. How much more difficult is it for someone to come in and look at a historical piece of code and follow what the original authors and designers were trying to achieve, and what the changes made over time were trying to achieve?

Documentation at the level we need to be able to adequately maintain any code base is just not done - it is very hard to do, and to do so in a way that will help future people manage and maintain that codebase. One of my projects involves restructuring the code base. However, I need to understand the history of that codebase, and that means talking with those people still living who knew the original authors and can give oral insight into why things have been done the way they are. This oral history has to be written down and the codebase documented with it. Once that information is in place, understanding why the code is written the way it is, and how we can rewrite it to be more effective, becomes achievable.

If we then put on top of the missing history and documentation all of the bad designs, well, we are facing even bigger problems. And when we put on top of that all the egos and politics involved, we get an even bigger mess.

So just reading up more doesn't actually help, because that which is needed was never written in the first place.


I meant documentation on whatever tools and libraries the project uses. Specifically, a lot of the lines that do nothing are 95% the result of framework integration.

You should not need to look at those particular lines that much.

I agree that the other issues you described here make work on large, long-running projects harder. It is a challenge, and sometimes a fight for every inch. Which is why I am increasingly sick of people who can't do it kicking those who can, and kicking the tools that make it possible. (I have a better chance of figuring out the system you just described, with only oral history to go on, than I would if it were in JavaScript or Python. Not easy, but the tools make it less hard.)


The problem isn't the abstractions, it's the sheer size of the codebase. The author - and I think most people - prefer a codebase they can grok. Nobody can grok millions of LOC, at best they can have a high level overview of what does what.

At that point you need the abstractions and practices that make code boring.


It would be far from the first HN article (or comment) that failed to make a distinction between "poorly designed object-oriented systems" and the very idea of OOP, design patterns, etc


nit: is it "about poorly-designed, object-oriented systems"?


no.


It's "Something about poorly designed, object-oriented systems."

1. Compound adjectives are hyphenated, except when the first word ends in 'ly.' 2. Two adjectives should have a comma between them, as in 'big, green systems.'


The take-down of abstraction and software engineers (using Java as an example) is similar to saying "back in the day, to find a prime number we would simply use a sieve, but today it is a tedium, what with all the pi's and e's and thetas that get in the way, and what are geometry and polynomials doing here, and what in God's name is this i, I just want to count the prime numbers, which are nice round whole numbers".

That's what happens when a topic grows from being a curiosity where dilettantes dabble into a proper field that is applied to solve problems. Granted, some of the developments can indeed be tedious and self-indulgent, but otherwise this is the natural progression. It's sad and frustrating when people who ought to know better make such statements. Is it done to provoke a critical analysis, positive trolling if you will?

About the role of data scientist, I find it both amusing and disappointing that just about anyone with a three-week MOOC, who otherwise had never dealt with statistics before, gets to work in this field. I mean, statistics is a three-year-long grueling applied maths degree, and condensing it to three weeks is silly. It is actually in this way that it is similar to the programming jobs of the '90s (I don't know how it was in the '70s, I wasn't born yet). Just about anyone who could learn Java or VisualBasic, or the self-taught cowboys who used C, ended up programming professionally. Actually it was not that bad, for coding is not as hard as it's made out to be, but only until they got sucker-punched by n-squared complexity, to say the least, on big data. Coding couldn't help them, and they realized programming was more than learning to code and using some APIs and system calls. (I was one of them in a way, when I started to code in C++ to model and simulate my mechanical engineering project, and it led me to the path of enlightenment.) So, today's data scientists who are not bona fide statistics graduates or statisticians have it coming as well, whatever the analogue is, unless they are merely "data monkeys", in which case all is well and as expected.


I'm a scientist (wet lab) by training, a programmer (back end) by profession, and a data scientist by hobby (I have a machine learning project that I'm working on), and most of "data science" is not really stats... There will be a bit of stats at the end product but really the bulk of the necessary work is data curation. Annoying stuff like making sure my data fit into the right buckets.

I did have to debug a memory leak that only showed up when I deployed my data pipeline on 22 cores.


N == 22 specifically? Or N >= 22? Interesting threshold value


My box has 24 cores. By default I deploy on 22. Actually it fails at 10 cores, but it gets 3/4 of the way through the dataset. At 22 it dies about 1/4 of the way, at 5 cores it makes it all the way through.

The error is in a string tokenizer, which I wrote as a recursive call. Usually it's fine, but I made a code change which absolutely killed it. Also, I'm writing in Julia, which does not do TCO; the back-end stuff I do is in Elixir, which does.
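Roughly the failure mode, as a Python sketch rather than my actual Julia code (the tokenizer and inputs here are made up): each token costs a stack frame, so without tail-call optimization the same function that is fine on short strings dies on long ones.

  import sys

  def tokenize_recursive(s, sep=","):
      # Each token adds a stack frame; with no tail-call optimization,
      # a long enough input exhausts the stack.
      head, _, tail = s.partition(sep)
      if tail == "":
          return [head]
      return [head] + tokenize_recursive(tail, sep)

  def tokenize_iterative(s, sep=","):
      # Same result, constant stack depth.
      return s.split(sep)

  short_s = ",".join(str(i) for i in range(100))
  long_s = ",".join(str(i) for i in range(10_000))

  print(len(tokenize_recursive(short_s)))  # 100 tokens: fine
  print(len(tokenize_iterative(long_s)))   # 10000 tokens: fine
  try:
      tokenize_recursive(long_s)           # exceeds Python's default recursion limit
  except RecursionError:
      print("blew the recursion limit of", sys.getrecursionlimit())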


Why is 22 particularly interesting?


It's not. The fact that it doesn't readily have any real-world significance is what makes it an 'interesting' (read: odd, curious) threshold value, which is why I asked OP whether it would only fail at that core number (N == 22) or whether it affected all processor counts higher than the value. I can see that my use of interesting was colloquial and not literal. My bad for any confusion this may have caused ;)


"data monkey" is even more of a fitting name than the corresponding "code monkey". Much of so called data science leans more on the side of data engineer where one fits existing solutions to your specific data. The split of data scientist and data engineer is the most unfortunate. It's like splitting programming into program design and development (opposite of devops) in a specific language. That's done too but usually the spec is functional (behavior not FP sense) and not algorithmic.

If this pattern works for anyone, great; keep running with it and find its limitations. I just believe that there will be a movement toward better structuring and more selective application.

A true data scientist would be doing research into new solutions or high level improvements. This can't happen at typical sized companies unless it's the core product and not a feature of one.

The big data, data scientist/engineer bandwagon is a little like blockchain. Everyone wants to leverage it, there are places where it is suitable but not everywhere where it's applied.


The risk I see is overusing data science in circumstances where it is just a product feature. The risk is then to overemphasize the data science part and forget the relevant context, like getting lost in the data itself.

A tendency I saw is that math graduates put everything into probability functions. That reality is composed of people who cannot be predicted is sometimes beyond their horizon. As a result, everybody believes the solution is mathematically correct and thus suited to reality, while it is quite the opposite.

EDIT: Typos, again...


A good programmer is someone who can communicate with the problem ___domain experts and provide a solution to fit their problem. Someone who understands the limitations of the computing environment and can engineer a solution that is adequate for the problem space problem.

Many who consider themselves programmers produce solution space solutions that just don't get to the core of the problem space problems. This is a function of the simple fact that many programmers never have the opportunity to see what the problem space experts are doing or actually need. This is a real shortcoming in the education of programmers.

We don't have to be subject matter experts in all fields, we just need to become competent in being able to understand the kinds of problems that are being faced by the various subject matter experts that we build systems for.

On the other side of that coin are those who are subject matter experts who think it is easy enough to become competent programmers. What they miss is the essential problem that programming is, itself, a field that requires a subject matter expert. I have come across too many systems that have been developed by the subject matter experts that were just wrong. Wrong in design, wrong in understanding the limitations of the tools being used, wrong in oh so many ways.

To build properly functional and functioning systems requires the cooperation, input and continual communication between those who are subject matter experts facing problem space problems and those who are subject matter experts in computing systems. This is a rare event and so we see the problems in every field with the computing systems that currently exist.


> Many who consider themselves programmers produce solution space solutions that just don't get to the core of the problem space problems. This is a function of the simple fact that many programmers never have the opportunity to see what the problem space experts are doing or actually need. This is a real shortcoming in the education of programmers.

I can't argue with that. I would add that another factor is how the work is structured and presented to the programmers. In many places programmers are largely disconnected from any users. The programmers are usually given a set of requirements by a third party who themselves derived it from someone other than a user/consumer. Thus, programmers may end up producing lovely programs that don't actually address the needs of users/consumers.


Good points, if a bit verbose. :)

Tangent: regarding "problem space vs solution space" issues, I find that many projects suffer needlessly from too much focus on one of these over the other. Learning to balance them isn't easy, but is critically important.


It was one of my former managers/mentors that introduced me to the concepts of problem space vs solution space. As the decades have passed since then, what I have seen is that most computing solutions that have been offered for the problems people have experienced do not really consider what the problem is that is being faced.

It takes a lot of effort to actually elucidate what the actual problem is that needs solving. Which is why I have made the comment earlier that programmers need to get out and see what the end user (client/customer/whatever you might want to call them) is actually doing and experiencing. When all you have is some design documents, functional specifications and technical specifications, the actual working environment for the solution is then missing.

We need to get out and face the complaints, observations, ire and suggestions of those who use the software we write.

Edit: as for verbosity, my mother has made the statement for many decades that, of her children, I was the one who could talk the legs off a cast iron stove. As my sons and daughters, grandsons and granddaughters, nephews and nieces have all had to learn, to shut me up, they have to talk.


wrt verbosity: Haha, I'm the same way, as in: "Sorry this [email|message|comment|...] is so long, I didn't have time to write a short one."

wrt problem space, yes! In contrast to all the focus on product development and engineering methodologies, somehow customer development generally suffers from a lack of rigor and attention. Ditto marketing -- in the sense of identifying or growing a market for the goods or services on offer.


I have never seen someone with “a 3 week mooc” getting a data science job. In fact, those jobs are being gatekept to a ridiculous degree, suddenly asking for PhDs for jobs that are barely more than a regular BI job.


A skilled software engineer who is good at math could probably take some MOOCs, build a portfolio, and do well at data science interviews that emphasize coding. Many interviews often just ask ISL-level questions (Introduction to Statistical Learning), which is studyable over a few months. On the other hand, it would be significantly more difficult for a new-to-coding statistician to become an excellent coder in a short time ... although I've seen some people do it.


> who otherwise had never dealt with statistics before.

And why is statistics required? Let's face it: most companies who need "Data Scientists" are looking for regular BI guys with fancy terms. Most of the problems are solvable using out-of-the-box functionality in Python/Keras, etc. Sure, there are places and problems which require hard mathematics and stats, but those are few and far between.


This is true in my experience. Data Scientists seem to run the gamut from "knows SQL" to "has a Ph.D. in Behavioral Psych and spent four years getting scientific results published in peer-reviewed journals".

The company that I work for has changed role titles from Data Analyst to Data Scientist specifically because people who know SQL, but don't program, won't apply to/accept jobs without that title.


Because data science and machine learning are applied statistics, and if you don't understand how it works under the hood (and not necessarily a very deep understanding, sometimes just a broad understanding is enough) you will have trouble adjusting things, debugging edge cases, or simply not know why something works and something else doesn't.

(edited for slightly better clarity)


Most companies that think they need a team of data scientists just need a SELECT with a WHERE clause and maybe GROUP BY.
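In pandas terms, on a made-up table, the whole ask often boils down to something like this:

  import pandas as pd

  # Hypothetical sales table: the request is usually just filter + aggregate.
  df = pd.DataFrame({
      "region": ["east", "west", "east", "west"],
      "year":   [2017, 2017, 2018, 2018],
      "sales":  [100, 150, 120, 130],
  })

  # SELECT region, SUM(sales) FROM sales WHERE year = 2018 GROUP BY region
  print(df[df["year"] == 2018].groupby("region")["sales"].sum())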


Did you ever have the pleasure of working with a Six Sigma Black Belt who had at most one week of statistics training? I am one, but honestly that is just enough to do some back-of-the-napkin number crunching for purely operational purposes. That you need a math PhD is maybe exaggerating at the other end of the spectrum.

That being said, the ability to talk to ___domain experts and accept their experience is one of the most important skills for a true data scientist. Without proper context all the data in the world gets you nowhere.


> statistics is a three year long grueling applied maths degree, and condensing it to three weeks is silly.

I agree. I was in a very prestigious organization and they didn't know what a statistician really does; they just hired CS machine learning PhDs. Even those people don't know what a statistician does. One person gave me ISLR when I asked for advice on getting hired at this prestigious place (I did the equivalent of that over several graduate courses in a statistics program).

Another person proudly told me that in his project he was using a GLM, stating that he knew GLMs. I asked what the link function was, and that person stated he didn't know; it's somewhere in the code...
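(For anyone who hasn't met link functions: here's a rough numpy sketch with made-up numbers of why it matters. The same linear predictor maps to very different fitted means depending on the link, so "using a GLM" without knowing the link is only half a model specification.)

  import numpy as np

  # Toy illustration: in a GLM the linear predictor eta = X @ beta is mapped
  # to the mean of the response through the inverse link function.
  eta = np.array([-2.0, 0.0, 2.0])          # hypothetical linear predictor values

  mean_logit = 1.0 / (1.0 + np.exp(-eta))   # logit link (binomial): probabilities in (0, 1)
  mean_log = np.exp(eta)                    # log link (Poisson): positive expected counts
  mean_identity = eta                       # identity link (Gaussian): ordinary regression

  print(mean_logit)     # approx [0.12 0.5  0.88]
  print(mean_log)       # approx [0.14 1.   7.39]
  print(mean_identity)  # [-2.  0.  2.]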

I've since doubled down on statistics and will be going into the biostatistics field instead of data science. It feels like there are a lot of impostors in data science, especially in startups and government organizations. I have no clue why, but there is just this culture in the tech industry that made me leave it for a better field. I've interned in the biostat field and it is much better; CNN and others have listed it as a career with high quality of life.


What I took from the post was not that a "data scientist" qualifies as a "programmer" in the modern sense, but in the sense of the kinds of things programmers did in the 1970s. And maybe he's expressing some nostalgia for those times.

I learned programming around 1982. I didn't pursue a programming career, but went to college and majored in math and physics. Today I often use programming in the way that a data scientist might, solving problems using high level tools. The data that I deal with are physical measurements. I'm not employed as a programmer.

I also work with a lot of programmers, so I get a glimpse of what they're doing, maintaining a million-line code base. And I have to admit that being thrust into that environment would have me waxing nostalgic about the good old days too. I'm happy doing what I'm doing, and happy that someone knows how to turn my stuff into production code if it ever gets to that point.

What I'm really doing is applying my ___domain knowledge in a ___domain that happens to depend heavily on computation. To answer Greenspun's question, what I'm doing is certainly more interesting -- to me. I have colleagues for whom wrestling with the monster code base, and the kinds of engineering it requires, are their source of fascination.


Yes, "data scientist" is the new "70's programmer"... write some code in one file that runs within a hosting system (mainframe, spark).

Regarding the complexity and tedium of many production code bases I think they got there because many developers don't have the ability (experience) or opportunity (iterations) to do things simply.


True, but the reverse is also applicable: academia, more often than not, does not get the chance to do anything really complex on a tight schedule; this explains what the author says:

"Consider that the “data scientist” uses compact languages such as SQL and R. An entire interesting application may fit in one file. "

I have seen horrendous, multi-page SQL queries in very large systems.


SQL is still one of the best languages for readability in my opinion.


> Consider that the “data scientist” uses compact languages such as SQL and R. An entire interesting application may fit in one file. There is an input, some processing, and an output answer.

The argument this post is making is reductive. Yes, sometimes data science is simple. Sometimes it isn't, and that's when you really need someone with the appropriate skillset.


Data scientist, programmer and software engineer are different things. They are not disjoint by any means, but this guy is conflating them in a way that's totally wrong.

Software engineers have to engineer things. They deal with production applications, distributed systems, concurrency, build systems, microservices... coding is sometimes only a small part of the job.

Data scientists nowadays do programming in interest of research, modeling and data visualization. But they are not only programmers - they are usually supposed to have an applied statistics or research background. Some also do software engineering, especially at companies serving data science/ML in their products.

A programmer is actually someone like a data analyst or business systems developer. They don't have to build systems themselves, they just write loosely structured code against existing systems. Like writing SQL queries for dashboards, or drop-in code for things like Salesforce. This is probably the closest thing to what he's describing as the "70s archetype". Minus the deep optimization stuff.


I agree with you. I've seen brilliant Data scientists struggling to understand how git branching works. But, as you say, their principal focus is applied statistics, not programming.

My role as a software engineer is to create a good enough architecture so that they can properly use the information contained in their 60 GB CSVs.

As a side note, I also noticed that clients have no issue paying a lot for _Data Science_, but for the "software guys" ? That's a whole other story, despite being of equal importance to the project.


I think you're taking this analogy too literally.


Deming, from Out of the Crisis (1986):

  People with master's degrees in statistical theory accept
  jobs in industry and government to work with computers. It is
  a vicious cycle. Statisticians do not know what statistical
  work is, and are satisfied to work with computers. People
  that hire statisticians likewise have no knowledge about
  statistical work, and somehow suppose that computers are the
  answer. Statisticians and management thus misguide each other
  and keep the vicious cycle rolling. (p. 133)
This is what today's data scientists are. Last century's statisticians, similarly hired for misguided reasons (we need them because our competitors have them!).


I'm not sure I follow. Specifically, what is meant by, "Statisticians do not know what statistical work is, and are satisfied to work with computers."?


What he meant was that they (being fresh out of school) don't know what statistical work for business is. They can crunch numbers, but they don't actually know what questions they need to be answering for their employers. And their employers don't really know what's needed either, just that they need a statistician. So they end up doing number crunching on the computers, which is all fine and good, but of relatively low value. Additionally, and many who have programmed can attest to this, computers offer an emotional satisfaction when you work with them: Oh, cool, I finished this neat Travis CI integration so now my workflow is more automated. But I spent two weeks doing that and what value have I added to the business? (Not that automation is bad, but people get distracted by the side problems and not the core problem.)

Of course things have changed, you don't have to invent your own statistical packages anymore. But see some comments elsewhere in this post: People are saying that the data scientists are the ones that know process automation better than anyone else in their offices. The ones who best understand docker and continuous integration. This leads to a question: Why are they so good at that? Is it because it's solving real problems for them and letting them be more effective? Or are they like every "data scientist" in the last couple offices I worked in: They have no real work to do because no one knows what's expected of them, so they solve interesting (to them) problems rather than business problems.

I'm not trying to knock the whole field, but it's a trend we've seen play out before. Smart companies and smart people figure out that they need X. Or they discover a technique or process that works well (see devops). They do it, they create a position called Xian or X Scientist or X Analyst. Now everyone wants to be like the successful guys and start imitating, without comprehending the value or purpose of the work or process. Lots of people take on the new title, schools offer courses in the techniques they use, but with a poor emphasis (due to time or their own lack of understanding) on the business case for it.

The current trend with data scientist is no different. There's positive value when it's understood, and negative value when it's not (best worst case: just an extra body being fed but doing no harm to the business other than the cost of their salary).


No, at least in my understanding data scientists specialize in the analysis of data rather than the development of software. You'd hire a data scientist to look for interesting patterns in data, or create machine learning models, and other data analysis tasks. These tasks may involve writing code, but it's usually specific to data analysis, often in R or Matlab or similar. A lot like how many people in the natural sciences pick up coding to enhance their capability, but the software writing is a means to an end.

I wouldn't hire a data scientist to build a web app (well, I would if he or she had the necessary knowledge and skills - the job title wouldn't be "data scientist" though). "Software developer" is much closer to "programmer".


I think the point of the article was that it used to be more common back in the 60’s and 70’s for programmers to work on data problems. From basic stuff like census tabulation or designing file systems, to creating trigonometry or t-statistic tables, to AI.

There was less specialisation, less of a divorce between programmers and users.

There also seemed to be a conflation of computing and AI back then. Lisp was considered AI. And the early computing pioneers and theorists were strongly interested in AI, logic, and mathematics.


This post states that a data scientist uses compact languages such as SQL and R.

Genuine question - do people really believe that being able to write and understand complex SQL makes you a data scientist?

I ask because, I've been writing some of the nastiest, most difficult looking SQL around for probably at least 15 years. And yet, I would NOT call myself a data scientist because I know and can work with data and use SQL. It might make me a data engineer.

What would make me a scientist is the process, method and rigor I apply to data-driven research and in practice. It's not about what tool I use or how complicated that tool is.

I often get a whiff of imposter syndrome over this because, if being "great at SQL and R" is enough to get the big bucks as a data scientist, then I'm clearly doing it wrong. But, then again, maybe I'm being too literal thinking that a scientist means something different.


I've been working as a data scientist for several years and have written some pretty gnarly looking SQL myself. I have a background in math and hard science so I have some understanding of the scientific method as well. While I respect our DBAs I wouldn't call any of them qualified to be data scientists.

While I have been able to hold my own in this job I went back to school to pursue a graduate degree (partly) because being in the field has shown me how much more there is to know. While it's easy enough to train a simple model in R there are so many ways to fool yourself and produce an invalid analysis and so many variations on otherwise-simple problems.

It seems this field has a lot of variation. A glorified report writer might get the DS title but they're not going to get the really cool jobs.

If you're interested in data science try out a kaggle competition and try to place high. The variety of methods and tricks people try to improve their entries can be illuminating, I think.


I'll preface this with: I've not looked at any Kaggle competitions, but I always assumed Kaggle competitions were on par with programming competitions in terms of how the skills transfer professionally. A great programmer is not necessarily great at programming competitions, after all.

Am I way off here?


No, there's way more to data science than competitions. But for someone who is already a data engineer more or less, I think it could be a good window into the complexity of modeling.


Nope. Kaggle just covers the modelling part, which is normally much easier than figuring out how to solve business problems using data.


Firstly, it states that "a data scientist uses compact languages such as SQL and R". It doesn't state "everyone who uses SQL is a data scientist".

That said, the term data scientist itself is a bit frustrating. It gets thrown around a lot as if it is a well-defined role, and it is anything but. In my experience, the role of a "data scientist" is about as well defined as the role of an "engineer": it has connotations about the type of work and maybe a few shared skills, but the specifics of what an "engineer" does and their skillset varies widely depending on if they are a software engineer, an electrical engineer, or a civil engineer.

So while I think that most data scientists know SQL or use SQL frequently, I don't think that all data scientists use it, nor do I think that everyone who uses SQL works in a role that would probably be considered that of a data scientist.


That covers the "data" aspect, for my work however, the "scientist" aspect is just as important. While I'm expected to use SQL and R to generate reports, I need the thought process of an epidemiologist to construct my analytic samples. I also require the scientific knowledge and background to interface with MDs and clinical PhDs, who need me to bridge the gap between data and science.


> do people really believe that being able to write and understand complex SQL makes you a data scientist?

Many data scientists use R and SQL, that does not mean that many of those who use R and or SQL are data scientists.

Many lawyers use Word. Yet I'm not a lawyer just because I use Word.


Your second sentence does not follow from your first. Just because Y's do X, doesn't mean everyone who does X is a Y.


you're being too hard on yourself and you should go apply for the big bucks. most scientists barely deserve the title


Sure, we’ve stolen the term “engineer” for long enough, let’s bother the scientists now.


Why is software engineering not valid engineering? I have worked on both software and hardware engineering and the general principles seem to be the same. You deal with complexity and simplify it by making abstractions. You make calculations to make sure your project is feasible. It's not like EE and aerospace engineering are literally the same field, but there are some principles shared in those fields, and with software engineering. Am I missing something?


When software engineers stop disclaiming all liability for their products failing, we can talk about them like we talk about engineers (sidenote: some already take responsibility).


It's foolish for anyone to take responsibility for any software written under the current prevailing industry practices.

It would be funny if real engineers were able to get away with making crumbling messes that can't hold their own weight because their middle managers don't believe in concepts like stress and strain like ours don't believe in refactoring or abstractions.


Depends on the country, as some do require a certain level for people to call themselves software engineers to start with.

As for the liability, I fully agree.


Yes, while in many countries Software Engineering is actually a degree that needs accreditation, and in some cases even a professional exam if you need to sign off projects as the responsible person, in the US apparently you can call yourself a software engineer after a three-week bootcamp.


My degree was called computer science, and run by the school of maths rather than the school of engineering, but I’m eligible to apply to be accredited as a chartered engineer (the regulated professional title for engineers in this country), though people who graduated after 2012 aren’t, due to course changes. The whole thing is a bit of a mess.

Just to add to the confusion, my degree was a BA(Mod) rather than a BSc or BEng, for obscure historical reasons.


I've interviewed a few "data scientists". Some of them were pretty arrogant. Their idea of a "close to the metal" language was Numerical Python. I don't think these guys are going to be writing the next generation of OS anytime soon.


Never. I'm a data scientist myself and know many other so-called data scientists. But coming from an engineering background, I pretty much agree with you. And interestingly, all of the people I know who claim they want to write an OS one day are data scientists. Seriously! They don't even know what unit testing is!!!


Why would data scientists aspire to write an OS? Sounds puzzling.


I was talking from the viewpoint of the OP who was asking if data scientists were replacing programmers.


The OP is explicitly about the kind of programming work where "An entire interesting application may fit in one file.", which doesn't tend to apply to OS's nowadays.


Why would we expect a statistician/data analyst to write an OS? I'd expect them to write reports, white papers, articles, and functional ___domain-specific packages for R or Python. Especially in my field of healthcare, where over 50% of the analysts use SAS, I doubt we'll see any groundbreaking innovation, at best it's incremental changes via papers or sharing code.


Why would you be expecting a data scientist to be writing an OS?


This is such a bizarre post. The reason people use a language like R is because it is easy to learn and use (and install, via RStudio) for data analysis without having to be a well-trained programmer. I can't recall ever hearing of anyone relying on R because it was computationally efficient. The point of the language is convenience — particularly with how easy it is to create attractive graphics using ggplot2's defaults.

It's a testament to the R library developers (particularly Hadley Wickham) for making APIs that do so well in streamlining data work. But I'm willing to bet a majority of R users, particularly in academia, could not load a simple delimited data file without a high-level call such as read.csv.

(By “simple”, I mean a delimited text file that could be parsed with regex or even split. I don’t expect the average person to be able to write a parser that dealt with CSV’s actual complexity)
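To make the point concrete, here's a toy example (in Python rather than R, with a made-up line of data) of how the naive split approach breaks on the first quoted field, which is why a real CSV parser earns its keep:

  import csv
  import io

  line = 'id,comment\n1,"hello, world"\n'   # one quoted field containing a comma

  # Naive split: wrong field boundaries as soon as quoting appears.
  naive = [row.split(",") for row in line.strip().splitlines()]
  print(naive[1])    # ['1', '"hello', ' world"']

  # A real CSV parser handles the quoting rules.
  parsed = list(csv.reader(io.StringIO(line)))
  print(parsed[1])   # ['1', 'hello, world']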


The fact that R has such buy-in despite being a rather awful programming language (a friend of mine worked on the next Lisp-like version of R under Ross Ihaka, and the next version is based on the fact that current R is a bit awful) is precisely because it offers such convenience to non-programmers.

In my sister company, they have data scientists, and data engineers. The data scientists write their algorithms in the language they're most comfortable with (typically JS), and the data engineers rewrite to perform efficiently in the application that's applying them.

Data scientist and programmer are two very different specialisations.


Base R is annoying, but IMO the tidyverse alone makes the language worthwhile.


It’s a great set of DSLs that show off some of the pretty decent meta programming facilities of R. I think those DSLs could be created elsewhere, but I also don’t think it's an accident that it happened in R.


IMO it's the exact opposite. "Il mondo è bello perché è vario" ("the world is beautiful because it is varied").


> ... despite being a rather awful programming language (...) it offers such convenience to non-programmers

I've heard people say similar things about MATLAB - that it's a poorly designed language, but that many people (mostly non-CS folk) use it out of convenience.

Can someone with experience using R explain what makes it so appealing to non-programmers? It seems like these two factors, "poorly designed" and "easy to use", should be at odds with each other.


Eh, it’s not as bad as people like to whinge that it is. There are indeed warts, but they’re pretty overblown. If you are comfortable with functional idioms R mostly does what you want without a great deal of fuss. If you’re predisposed to procedural idioms, then you’re going to be fighting the language.

I started learning R about the time I started reading How to Design Programs, and I found it pretty easy to transfer that model of thinking to R. And I find Clojure, Racket, and Scheme to also be somewhat comfortable after a short reacclimation period.

Some of the convenience bits have to do with most functions working on vectors without needing to explicitly iterate most of the time. Also libraries. If you want to estimate a linear regression, or make some exploratory plots, or try some rando statistical method that your graduate advisor suggests, you don’t have to worry about whether it’s already been implemented for you in R.

You can do a lot of heavy lifting by cribbing off of example code because most code is short. You just get heaps of leverage by using R.

Look, I like to do things the hard way a lot. My whole life is pretty much a string of highest friction path choices. For data science R is easy because all the work has been done for you. It's the difference between writing GUI apps against Cocoa APIs vs, I dunno, XLib or Motif.


Problem is that we are coming from completely different perspectives. When you say "programmer", you are likely referring to someone from a CS background, likely with software engineering experience, who has spent their lives working in C++, Java, Python, etc.

By that definition, I would be a non-programmer, as I come from a statistics background, and though I have lots of experience in C++ and Python, most of my experience and work is in R. But that is by choice.

If I'm trying to create an application or build a website, I wouldn't use R. But when it comes to ingesting data, transforming and cleaning data, and modeling data, R is second to none. Yes, its syntax looks ugly and bizarre if you are used to object-oriented programming, software development, etc. In the context of working with data, I have never found anything in R to be even remotely confusing or strange.

On the contrary, the next best option to R would almost certainly be Python, and the gulf between the two is massive in my opinion. Python is a great general-purpose programming language, but its data analysis capabilities, using packages like pandas and scikit-learn, feel poorly designed, bolted on, and unwieldy. R is better for virtually every aspect of data analysis than Python.

So it isn't that R is poorly designed. Quite the opposite: it's very well designed for its purpose as a data-analysis-focused programming language. It only seems poorly designed to "programmers" because programmers work on problems that R isn't meant for. But that is like complaining that a screwdriver looks poorly designed for hammering nails.


What makes R appealing is basically all statistical methods are available in it -- and it is often the initial implementation language of new methods in statistics. Often an R program involves very little programming as such other than to read in data, run some existing statistical methods on it and print or plot the results. I'm not a particular fan of the language itself (I kind of wish XLispStat hadn't died), but every time I feel like checking out Python or Julia I find things I need that haven't been implemented yet in those languages that are in R.


(I haven't used much R; I have used MATLAB.)

It's all about the availability of libraries. I did control systems in undergrad, and despite MATLAB being a shitty language, being able to describe and manipulate dynamic systems (ODEs) was very useful. Doing numerical integration by hand for the nonlinear systems was horrible, though better than Simulink (which is about as much fun as using LabVIEW or sculling H2SO4).


Can you elaborate more on the "algorithms in JS" bit? What libraries/tools are they using? Why JS versus Python or JVM languages?


Seconded. "Data scientists" and "being most comfortable writing in JS" just sounds strange to me.


I don't want to pile on but that sentence strongly reminded me of the oldish saw about "a data scientist is a programmer who lives in SF". I've never heard of anyone using JS for data science. What might it have that's anywhere comparable to the tidyverse or numpy/scipy/sklearn?


They don't use JS to investigate the data, they use SQL (Hive, Impala etc.) as their lingua franca for exploring datasets. But they then write their algorithms based on that analysis in JS.

Again, I don't know why, it struck me as odd also, but hey, whatever they're doing, it works for them.


I have no idea why they use JS - it's a Berlin start-up so maybe trendiness is involved? All I know is that their data engineers have to make those algorithms work in production.


> The data scientists write their algorithms in the language they're most comfortable with (typically JS)

This seems weird to me, I associate JS with web dev (and even more with the front end side of it)


R is cumbersome, but then you look at Matlab and Stata and try to explain those to social science graduate students who just managed to grok LaTeX, and R starts looking like a streamlined vision of the future.


Totally. I’ve taught intro R to mostly social scientists, and intro Python to mostly physical scientists. Every workshop, the R group is coding circles around the Python group with the same number of hours of instruction. And the intro R curriculum that we use was largely a ported version of the Python curriculum.

Having seen brand new scientific programmers tackle two “beginner friendly” languages, man the differences after a day or two are stark.


>The data scientists write their algorithms in the language they're most comfortable with (typically JS)

I'm sorry, what? You know data scientists who use JavaScript to implement their algorithms?


a friend of mine worked on the next Lisp-like version of R under Ross Ihaka

Did anything come of this? I seem to remember seeing a paper or article where he proposed doing this, but I’ve never seen an implementation.

Edit: paper I was thinking of is https://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-...



A trend I've been noticing (especially as ML/AI tooling becomes more accessible) is that people believe the quality of data science code and workflows is proportionate to its complexity/LOC (since complex problems require complex code, right?).

It's a toxic perspective that ignores recent and pragmatic innovations in the field.


I often see careers built more on complex, unmaintainable models that show off fancy math than on simpler ones.


Lol, reminds me of the nightmarishly complicated first NLP model I wrote. I would classify this under "resume driven development" which you see a lot on the software side too with fancy new frameworks.

Funnily enough, although the first place that let me work with Hadoop and Spark didn't need to be using Hadoop and Spark, I probably wouldn't have worked there if they didn't let me learn them, so maybe this isn't as wasteful as it seems at first glance


I think the post is referring to some idea of "glamour" or the lucrative nature of a rapidly emerging field.

Meanwhile, both demand that the employee spend all day telling a computer what to do.


In my company I am a software engineer and my colleague is a data scientist. Our current project, which we work on together, does a lot of NLU and NLP type work (think bots), and our skillsets often don't overlap and are both equally valuable to the project's success. That is, I tend to write the infrastructure and platform code that ties everything together and deal with all the software engineering type work, while my data scientist feeds in trained models and the like. Both are necessary to handle contractual requests/responses as per our scope design.


My experience is very similar to this, as a "software engineer" in a company that has a 50/50 split of software engineers and data scientists.


There was a sign on the door to the Vax Lab at the University of Maryland that said "Department of Research Simulation".


Hmmm! Funny, but I'll bet you could learn a lot from research simulations. It is, after all, a field that does seem to benefit from the big-picture review...

(I was just talking to a client about this today, at a micro-level: What is your personal research model, for important career-related, yet non-work-related projects? Does it exist, or do you just perform research as the intuition prompts? What kind of structure can be leveraged to achieve a quality outcome? So maybe that's why this seemed interesting to me.)


Haha, it seems that quite a few scientists in academia are simulating the act of performing research ;)


In my case, staying up all night playing SimCity in the computer lab was actually good preparation for a career in simulation game programming.

https://medium.com/@donhopkins/designing-user-interfaces-to-...


A data scientist is just a statistician who works in the Bay Area.


As someone who has been looking for Data Scientist jobs in the past few months, I can reliably say that the term can mean anything from a software engineer for big data systems, to a SQL guy, to a person who builds complex machine learning models.

It is just as vague as the job profile of a "programmer". In that sense, the title is right. But, in the context of the article's content, I disagree.

The job done by a data scientist in demanding roles, requires a strong grasp on undergrad level statistics. But because of the recent trends towards ML, the person also needs to have a strong grasp of linear algebra, vectorization and software engineering / undergrad algorithms.

While it is unlikely that one data scientist may need to summon the whole skill set, an interviewee will never know which subset of these skills you will be asked to demonstrate to get hired.

Modern software jobs have figured out distinct subset of skills needed to differentiate between different software roles for experienced employees. Junior level employees are barely even expected to know anything other than algorithms, data structures and high level system design (at least during interviews)

Another funny observation (anecdotal) is there seem to be more openings for "senior data scientist" (who is expected to know everything), than "junior data scientists" whom the company is willing to mentor.

As of now, I find myself scrambling to decide which skills I need to prioritize, often feeling like I am being pulled in opposite directions. Almost all of them require formal instruction (the maths) and can't be picked up like software skills through YouTube and online projects. This isn't a knock against software, just a different type of subject matter.

Companies interviewing for these roles may ask everything from leetcode algorithms questions to statistics to questions about modern ML algorithms and ___domain specific models (in NLP, Vision, finance, recommenders)

I personally find a "junior" Data Scientist's role (in expectations) to be harder than that of a junior SDE. There is a reason many of these jobs put a PhD in the preferred qualifications. It is ironic that there has been such a massive surge of people without the necessary background who do a couple of MOOCs and crown themselves data scientists. Being good at any software- and math-heavy ___domain is hard. Data Science is no exception.


Forgot who said it but it was great: a "Data scientist" is a programmer better at stats than any 'normal' programmer and better at programming than any 'normal' statistician." :P


This is the best short form definition of a data scientist I've heard yet.


There is a good comment on the original article by a user named LauraConrad. I'm excerpting it so HN readers will see it:

> I was a “Programmer” in the ’70’s, and I keep thinking how much of what my early programs did would be done by a spreadsheet now (or any time since the late ’80’s).


I thought a data scientist was closer to doing the work of a statistician than a programmer. Visualizing data, and analyzing data. Programming becomes part of it by necessity.

Data science is also a much sexier term than statistics, just like "machine learning" and "artificial intelligence" are a lot sexier than, say, "regression".

As someone funnier than me put it: "A data scientist is just a statistician with a mac".


I prefer: What's a data scientist? -> A scientist.

It doesn't capture the whole, but it's a powerful way of thinking about what the profession should really be trying to be.


The basic premise of the article is that

systems programming=irrelevant bloat and abstraction

while

data reduction=definite purpose and utility

People writing Python notebooks to do data analysis are probably fairly comparable to the scientific computing programmers of the past, but I feel like this picture tends to dismiss the computer science side of systems programming: things like GUIs, network code, processes and virtual memory, all the architectural aspects of computing.

One might prefer APL or Forth for writing one-page programs, and it's probably true that systems now are bloated relative to what they could be. Still, there is much of interest going on in a typical operating system, compiler or video game, while a typical data analysis notebook is IMO fairly dull and even basic, from a software angle.


Yeah, the author's take is myopic. What they call bloat, people from the 70s would call wondrous: ubiquitous networking with and without wires, beautiful graphical interfaces, encryption everywhere (and expanding), far more open systems than proprietary re-engineered ones, the list goes on and on.


The author is Philip Greenspun, who in the 1980s worked with the people that created all of the things you listed: https://en.wikipedia.org/wiki/Philip_Greenspun

There is nothing myopic about his perspective.


It's fair to mention that he is well known, though in fact I'm one of the old guard that remembers when he had a higher profile.

But, as with a new Paul Graham essay, surely we can critique the blog post on its merits instead of falling back on an assessment based on some kind of appeal to authority/"expertise by association". Philip Greenspun doesn't need to be treated with kid gloves as if he were the pope.

John Ousterhout made comments that touch on some similar (though not identical) distinctions in programming practices. That was years ago, and he was then a much more credible figure in software than Greenspun. All the same, his essay was heavily criticised. That's what serious intellectual discussion should involve.

https://en.wikipedia.org/wiki/Ousterhout%27s_dichotomy

http://www.tcl.tk/doc/scripting.html


If it means we now get a term that, at least for a couple of years, filters out all the garbage roles recruiters throw at me then I'm on board with adopting this terminology.


I wonder why "data engineer" isn't one of the suggested terms. Scientists do not really program science, nor do programmers research programs, as their respective fields of expertise.


It is. I was a data engineer at SpaceX this past summer.


If you don't mind, what did you do?


My current job title is "Data Engineer", before this role I was a "Process Engineer". In my opinion those two jobs are actually pretty similar.

When I was working in process engineering, I was trying to optimize the outputs from our industrial process on a day-to-day basis; in this role, broadly speaking, I try to optimize the data extracted from the same industrial process.

Mostly I'm concerned with how we can extract data out of our plant, how we represent and present that data (particularly to operators and technicians), and how we can better recognize and respond to underlying trends in the data.

Before I assumed the role (in 2011) my predecessor, who had a background as a statistician, was called a 'Process Statistician' so I assume my Manager changed the job title to reflect my background as an Engineer (Materials Engineering in my case).


What I would consider the difference between an engineer and a non-engineer (I am not an engineer) is delivering a QoS or SLA driven by measurement of tolerances and by either empirical or imputed information from existing QoS or tolerance data... not necessarily the optimization part. Everyone does optimization to some degree (possibly negative), but not everyone is an engineer.


I think of engineering as the practical counterpart to science. Science is finding patterns/uncovering truths/building and testing models; engineering is the deployment of technology to fulfill an objective.


Engineering is more than just application. In some places you need to be certified as an engineer, for better or worse, which says that there is an understanding of how to calculate and communicate product tolerances and service level guarantees that goes beyond just applying science. Basically, the bare minimum of applying science I would call hacking - which is a great thing, but there is value in the distinction between hacking a solution and engineering one.


Yes, this is my view as well: engineering is applying theoretical knowledge to achieve practical solutions.

In my world, which is industrial manufacturing, there are scientific theories - fluid dynamics, thermodynamics, kinetics, etc. - which govern the fundamentals and limits of the process.

As an engineer, you take this knowledge along with your own intuitive experience and work to ensure the reactor is operating at peak efficiency.


To be nitpicky, in the US, engineer means you graduated from an ABET accredited program in something like: Chemical engineering, mechanical engineering, civil engineering, electrical engineering, industrial engineering, computer engineering....etc.

That is not to say programming isn't a difficult job that requires a lot of analytical and creative thinking, similar to an engineer's. The difference is in getting a degree in something that has 4-5 years of calculus-based math, physics, etc. classes. There is also a rigorous 8-hour test to get a license after 4 years on the job.

I guess the broad term of building something and doing analysis fits here, but I don't see any Data Engineers in practice. What I see are Data Scientists and Data Analysts. Of course I'm arguing over semantics here, but it is important to get the distinction correct.


A Computer Engineering degree does not require ABET accreditation - it didn't when I went to UCSD... the difference between Computer Science and Computer Engineering was just 4 classes. Exactly the same calc/physics classes between the two, which were engineering level.


There are plenty of engineering programs without ABET accreditation. If you get a degree from a non accredited program, you'll likely have a different title than engineers who graduated from an ABET school and will make less. This is because there has to be some standardization in what the people who build bridges learn in school. All companies are different, but my company won't even look at an engineer without ABET accreditation. Even a Ph.D. from China, which is just really dumb. It is reality though.


Interesting. Here's a list of mid career pay by major. Fields that tend to be accredited and value accreditation are certainly present in high places on the list, but variants of mathematics are in comparable places. My guess is that "economics and mathematics" or "computer science and mathematics" are probably examples of fields that require a lot of math but aren't part of an ABET accredited degree. Doesn't look like there's much of a pattern here.

https://www.payscale.com/college-salary-report/majors-that-p...

Now, if you plan to work in structural or civil engineering or another field where accreditation is important, then yes, I would tend to agree that having an ABET accredited degree is important (as well as taking the PE exam for that field).

Fields where ABET accreditation isn't terribly important (computational finance, Google, Facebook, and so forth) often pay more than the fields where it is (mechanical, civil). My guess is that passing the PE for computer engineering wouldn't make a big difference in your compensation at a top tech company; I certainly don't know anyone at those companies who has bothered - though I am certainly willing to consider evidence to the contrary!


> To be nitpicky, in the US, engineer means you graduated from an ABET accredited program in something like: Chemical engineering, mechanical engineering, civil engineering, electrical engineering, industrial engineering, computer engineering....etc.

Do you happen to have a reference for this? At first glance, it seems to be incorrect rather than nitpicky.

Anecdotally, I know plenty of people who do not have ABET accredited degrees and have "engineer" in their title in the US.


Someone can certainly call themselves an "engineer", but it doesn't have the same consequences as holding an ABET accredited engineering degree... for example, if you've graduated from an ABET accredited engineering program, you are able to become a licensed and registered professional engineer - https://en.wikipedia.org/wiki/Regulation_and_licensure_in_en....


https://motherboard.vice.com/en_us/article/vvapy4/man-fined-...

(There are lots of other articles about that case, that one sums it up mostly in the url)


The OP says "the US". That article is one state in the US. Most states don't have restrictions on the use of Engineer in job titles. Canada does though.


Unless you have a professional engineer's license, you can't testify in court as an engineer. To get a professional engineer's license, you must graduate from an ABET accredited program.


Since we're being nitpicky, neither of those claims is true.

>To get a professional engineer's license, you must graduate from an ABET accredited program.

In some states you can take the FE and PE exams with a related non-ABET accredited degree plus work experience, and in some states you can take the exams with no degree at all, based on experience only. For example, NY lets you substitute work experience for a degree. There's a table where you can see how much experience you need based on your degree (or lack of one).

>Unless you have a professional engineers license, you can't testify in court as an engineer

Courts determine what credentials qualify someone to testify as an expert witness, not state licensing boards.

In some states, testifying in court as an engineer does qualify as practicing engineering without a license, and the state licensing board could fine you after the fact. However, in other states testifying in court doesn’t necessarily qualify as practicing engineering.


Some even have engineer in their titles with no degree at all.


I don't think your nitpick is correct. To be a licensed engineer within some fields, you do need ABET but there are many people who are "engineers" (their job title) but aren't required to be licensed.


Indeed, in the US, most engineers work under an "industrial exemption," for instance if they are employed within a company that makes a product, and not providing engineering services directly to the public. Most of the engineers at my workplace do not have licenses. On the other hand, our products get checked out by a certification lab, and the people who sign off on the test reports are in fact licensed. The work they do is phenomenally dull and bureaucratic.

Of course there are also fields where everybody pursues a license such as civil.

If I were to call myself an engineer, it would be in a field that doesn't have a discipline-specific license in my state. I don't know if it means that I don't need a license, or what. So far it has never been an issue.


Isn't it the new term for report writer?


Intern Analyst Automation Engineer.


This is total BS.

> Does the interesting 1970s “programmer” job still exist?

Sure! Go right there: https://www.linkedin.com/jobs/cobol-jobs/

And enjoy the not-bloated-at-all systems you'll find there!


How did applications get so bloated and therefore boring to look at?

That's an easy one: too many people lacked sufficient experience to make their own informed judgements and trusted consultants peddling soundbites over skilled and experienced developers who knew better.

If you have ever read a book or watched a talk by someone who advocated very short functions and minimal nesting, and you subsequently adjusted your personal programming style or corporate coding standards as a result, please do yourself a favour and go back and look at whether they offered any evidence -- anything at all -- or even just a reasonable argument that stands up to scrutiny -- to support their position.

The relatively plentiful resources in a lot of modern systems do remove one barrier that forced developers to do better, but I don't really believe that's a big factor. It's more that when you have an industry so focussed on young people, a lot of what happens is the one-eyed leading the blind, because too many people who have been around long enough to see the big picture get shipped off to management or other positions before they can pass on what they've learned widely enough to advance common practice.


Data Scientist has two terms in it: Data + Science. More often than not, people ignore the "Science" part of that equation.


Someone said any field with "Science" in the name isn't really a science. Computer science, data science, political science, social science, etc. Physics, chemistry, biology don't have science in their name.


>> Someone said any field with "Science" in the name isn't really a science. Computer science, data science, political science, social science, etc

The etc. would also include cognitive science/neuroscience, medical science, earth science, material science, agricultural science, veterinary science, geoscience, food science, etc.

And of course as we all know climate science is fake./s

Generally when I hear a field with the word "Science" in the name I think of it as a more interdisciplinary field. Take Earth science: it draws on different areas of physics (e.g. wave physics), biology (e.g. ecology) and chemistry (e.g. kinetics). Earth science is still very much science; it just doesn't fit perfectly into the more foundational fields.


> Someone said any field with "Science" in the name isn't really a science.

That's the most unscientific thing I've heard in a while ;)


Then there should be a field "Science studies" that combines both unscientific worlds.


Physics, chemistry, and biology are all part of the Natural Sciences.


Datistry?


That's an expression, not an equation. And the space between data and scientist quite obviously indicates that the two combine multiplicatively, not linearly as you have mistakenly written.


The output of a data scientist's work includes plenty of things that aren't code. Yes, the code that I write tends to be very short, but if it represented everything I had to do to get there it would be quite a bit longer.


I don't think so.

Per my observation, the most 'interesting' part of a data scientist's job is storytelling, that is, using data analysis to draft a theory to push forward a product direction. Some ML engineers work under the Data Scientist umbrella, but since the DL thing happened, they are now put under even fancier titles like AI Engineer or such.

So data scientists are really product managers/owners with analysis skills. Is this job interesting? For sure, when it follows this definition. But how interesting it stays depends only on the problem ___domain, not the title, IMHO.


True - superficially SQL may appear simple, old-fashioned and a bit verbose, but once you are expert with it (which takes at least 5 years) it is amazingly powerful. It operates at a much higher abstraction level than Java, Python, etc., so it is, I would guess, 25 times more expressive. PostgreSQL's pure-SQL CTEs give you variables and recursion, and PL/pgSQL gives you dynamic SQL for macro/meta programming. If you use immutable tables it can be purely functional. SSDs, and now even faster Optane memory, have resolved the I/O problem which handicapped RDBMSs until recently.


SQL is not more expressive than Turing Complete languages, no.


They're all Turing complete, including SQL with CASE and recursion. I meant density: one line of code, a SQL window function with a filter clause, would probably take a page of Java to achieve the same result.
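To make the density point concrete, here is a rough sketch (hypothetical table and column names, not anyone's actual example) of that kind of one-line running-average window function with a FILTER clause, together with a plain-procedural Python equivalent:

    # One analytical clause, e.g. in PostgreSQL:
    #
    #   SELECT customer, ts, amount,
    #          avg(amount) FILTER (WHERE status = 'paid')
    #              OVER (PARTITION BY customer ORDER BY ts) AS running_avg_paid
    #   FROM orders;
    #
    # Spelled out by hand in Python:
    from collections import defaultdict

    orders = [  # (customer, ts, amount, status) -- toy data
        ("alice", 1, 10.0, "paid"),
        ("alice", 2, 30.0, "refunded"),
        ("alice", 3, 20.0, "paid"),
        ("bob",   1, 50.0, "paid"),
    ]

    sums, counts = defaultdict(float), defaultdict(int)
    for customer, ts, amount, status in sorted(orders):
        if status == "paid":          # the FILTER clause
            sums[customer] += amount  # running state per PARTITION BY customer
            counts[customer] += 1
        avg = sums[customer] / counts[customer] if counts[customer] else None
        print(customer, ts, amount, avg)

In Java you would end up spelling out the grouping and the running state by hand in much the same way, just with more ceremony.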


Nope, Java has map and filter just fine. E.g.:

    words.stream()
        .map(word -> word.toUpperCase())
        .filter(word -> word.startsWith("A"))
        .forEach(System.out::println);


SQL window functions aren't rocket science (not that I've used them much, because ORMs and popular stripped-down DBs like MySQL tend not to support them very well), but they do a lot more than you think if you're comparing them to trivial map/filter operations.


Map and filter together in Java reflect what you get from a plain SELECT ... FROM, without aggregate, much less analytical (window), functions. Aggregate and analytical functions correspond to reduction operations, which Java supports, but it doesn't come with canned equivalents to the common analytical methods, just aggregates, AFAIK.


Data scientist is a misnomer except when there is a relevant Ph.D. and that was never the bar for a programmer.


I would say "data engineer" is the new programmer, in that programming is evolving away from procedural monolithic threaded code with locks everywhere, to distributed message processing pipelines whose capacity can be flexibly adjusted, etc. "Data scientist" is an actual role at some companies but most data scientists are actually struggling with the contradiction between what they learned in school and the harsh reality of what their job demands of them.

Edit: misspelling.


So data engineering would be a subset of software engineering?


The author simply has a grudge towards over-engineered code such as [FizzBuzz Enterprise Edition](https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...). Not all modern coding is like that, and you certainly cannot build complex software (at least maintainable complex software) without abstraction layers.


One thing I don't like about my source code is it "doing something interesting".

I like my code to be boring. I like my frameworks to be boring. I like my APIs to be boring. So I can focus on important things in life (or even the important things at work), and be done with it.


Hacker News, finding new ways to make your job feel trivial since 2007.


Similar story: once, when Googlers were talking about full stack engineers, Rob Pike piped up to say they weren't really full stack engineers; he said he was a full stack engineer back when he worked on the Voyager project because he had to know everything from silicon quantum mechanics up to interplanetary astrophysics.

Sure, there are parallels between modern programmers and the scientist-engineer-calculators of the past.


I would bet a lot of money on the fact that that isn't the case. (In fact, I am betting a lot of money, in the shape of my choice of career.)


Well I am "lucky" enough that I can work on a legacy application written in OpenACS. It wasn't written in the 70's but it's definitely old and outdated. So this kind of narrative that everything was better back then, simply does not convince me at all. I might be wrong of course, but the author tells anecdotes which is not a real argument, so there is that.


No. Data scientists exist at the mercy of programmers: without the tooling and the pipes, data science would not be a going concern.


Hah! At my company a decent proportion of engineers spend their lives scrambling to productionalize and operate the Lovecraftian concoctions of R and Python that our data scientists cook up on their laptops.


Where I work, the data scientists are more educated and experienced on containerization, CI tooling, unit testing, profiling tools, web service prototyping including API validation tools, caching layers, queues, GPU systems programming, etc. etc.

We are constantly thwarted by infrastructure teams that use superficial policy basically so they can whine and complain that they don’t want to have to provide support for the extremely heavily researched and tested implementation we create.

They don’t care that different technologies, database systems, whatever are chosen to solve customer use cases and that growing our business means supporting “Lovecraftian concoctions of R and Python” — they just don’t want to do their jobs (which indeed requires providing infrastructure support for crazy screwball data services that repeatedly break all the assumptions)...


If the data scientists know more about CI, unit testing, profiling, and caching than the engineers then they are better engineers and I'd wonder a bit about their math/stats chops and whether their role was just re-branded "data scientist" to keep up with trends.


It’s quite common to start out with a PhD / masters in math / stats, with deep specialization in fields like NLP, computer vision, MCMC sampling, and then to become an experienced expert in GPU computing, containerization, web service layers, etc., while working on implementations of ML models.

This was true for me anyway. The main thing I do is deep learning for computer vision and image search, but I think it’s fair to say I have significant experience with Docker, GPU architectures, various CI tooling, linux system programming, deep internals of CPython, internals of MySQL and Postgres, lots of frustrating performance tradeoffs with py4j in the pyspark world, as well as all the usual crap with pandas, sklearn, data visualization tools, and a lot more.

I’d say almost all data scientists I’ve worked with are just like me, just with maybe different specialization areas, except possibly for very young data scientists right out of undergrad.


If you don't mind me asking, where do you work? (Or even a family of companies that have this type of role.)

What you describe sounds like where I want to be, and it would help to know where I could go looking. (Deep learning / vision specialization with a math-focused CS background, but I want to learn more core software skills going forward.)


I work in a fairly mature startup that’s been around for over 10 years. It began primarily with an app, but shifted focus to other business areas. The image processing products are mostly related to offering information retrieval and search services for app users that have curated personal image collections.

I would say your description is accurate, but only accidentally. The reason we have to learn more core engineering skills is that infrastructure will not take responsibility for bringing our solutions to production, and seeks to limit the tools in our toolbox with policy.

It’s not fun when your everyday life is a constant impedance mismatch against the tools that infra will allow you to use. You can constantly see better / faster / safer / cheaper ways to solve problems, which have no downsides at all relative to the bad, slow, insecure, expensive ways infra currently makes everyone solve problems, but you just feel constantly sad that you are superficially prevented from having the autonomy to select the efficient solution according to your creativity and skill.


I feel like you are roughly describing research programmers versus system administrators or operators in academic computing environments.

I think a big difference between research programmers and production/ops people is that as researchers we often chase a transient goal. Build some complex and horrible integration to compute a result or put something in a paper. We used to call these Rube Goldberg machines rather than Lovecraftian horrors, but we mean the same thing. Something that belongs on a movie set, with some Jacobs ladders arcing in the background. In some circles, it is called the heroic demo.

In the past, we might substitute other fads for your CI tooling or API validation tools. I remember when some research programmers were all-in on enterprise junk like J2EE/managed code, SOAP/WSDL, and other stovepipe tooling. There is a lot of cargo culting of such tools. When you have furnished your lab with rapid prototyping tools and focused on crazy integration stunts, you are almost always deluding yourself to think these tools are also giving you "production" system qualities.

Building something at the hairy edge of possibility is inherently about experimentation and risk-taking. Building reliable, production operations is inherently about conservative design and risk-mitigation. There seems to be a new cargo cult of devops which believes you somehow mash these together and the conflict disappears. You don't have to have to map the negotiation onto two teams with opposing objectives, but the negotiation has to live somewhere.

Magically erasing the negotiation just means that you have chosen to default on the optimization task and jettison concern for at least one of functionality, cost, or risk. Startups commonly do this because the VC funding has mitigated the risk elsewhere: you can fail because they've also funded your competitor who may succeed...


I’m not referring to transient research prototypes, but to robust long-lived systems needed for experimentation and reproducible results tracking, and services that are directly customer facing.

We are often required to create new services and functionality because it is how our company can grow, and we have to have ease of access to experimental working space, with freedom to do things like custom compilations of ML frameworks, using programming languages that haven’t been widely used in the company yet to gain access to an important library, define complex assumption-breaking deployment constraints relative to GPU runtimes or containerized notebook servers, etc.

I think people who see how these things grow out of prototypes and wrongly conclude they were designed with transient concerns and thus aren’t robust in some way are rushing to judgment. They discount the fact that the ML expert who also wrote the web service layer, who also wrote the Jenkinsfile, who also wrote the container definition, and who also knows how to tune indices in the database, etc., really made their choices for serious, pragmatic engineering reasons that solve the business problem efficiently, and that they already anticipated and accounted for the shallow tradeoffs and caveats that IT will use as potshots to try to circumvent the responsibility to help maintain it.


We may be talking past each other. I am on the research side of academic computing/informatics and have faced these struggles my whole career, encountering some very reluctant IT divisions.

We have had to bite the bullet and use colo facilities to self-host internet-facing deployments that the overhead-funded IT groups would not touch with a ten foot pole. From these experiences, I also acquired a more nuanced perspective on the IT division perspective and constraints, and how they derive from overall organizational policy and economics. We also had funny situations where we tried to help other PIs benefit from our new-found independence, and immediately regretted it. They did not understand what self-hosting means. I think anybody trying to toss integrations over the fence to an ops team needs to have an extended tour of duty trying to operate their own solutions in production WITHOUT assistance before they form bold opinions about operations constraints.

When there are strong time-to-market constraints (which includes publishing papers in academics), you are forced to find solution points that are different than if you are planning to run something for long periods at low overhead and low accumulative risk. These solution points also have to take into account the staffing and resources for that ongoing production.

Those things like bleeding edge libraries and assumption-breaking deployment constraints are the headache for ongoing operations and maintenance. It's not enough to have an existence proof that some complex integration can be built and passes its tests. You need a plan for how all the components will be maintained, patched, and upgraded. You need contingency planning when some of those bleeding edge components are going to become deprecated. You need to consider what staff capabilities are assigned to do that maintenance work or what will happen when the institutional knowledge used to form the original integration is not on-call to reintegrate it in the face of unexpected events.


> “I think anybody trying to toss integrations over the fence to an ops team needs to have an extended tour of duty trying to operate their own solutions in production WITHOUT assistance before they form bold opinions about operations constraints.”

I think this is one of the worst possible attitudes to have. It’s a petty way to feel, desiring some type of “I’ve seen some shit” tough guy credential more than supporting the stuff needed to actually solve business problems.

If you hire people whose value add to your company is inventing completely new things, including deployment, ops, scaling, etc., that goes along with that, then it is the job of infrastructure on the other side of that fence to happily and eagerly accept whatever is tossed over the fence, to understand why developer teams made the choices they made, and to take an attitude of supporting as much as possible.

> “You need a plan for how all the components will be maintained, patched, and upgraded. You need contingency planning when some of those bleeding edge components are going to become deprecated. You need to consider what staff capabilities are assigned to do that maintenance work or what will happen when the institutional knowledge used to form the original integration is not on-call to reintegrate it in the face of unexpected events.”

Yes, of course. But all this is already what dev teams are doing. Ops / infra is not taking a hare-brained plan and adding these robustness aspects into it. Not at all. Instead they take plans from application teams and try to use policy to minimize their own maintenance burden, even when that optimization is antithetical to what the company requires at a more fundamental level.

A lot of companies languish and die because of sociological dysfunction in the policy interface between dev teams and infrastructure. The more that infrastructure has political control of that interface, the closer to death is that company.

It’s like a body that is disallowed from generating white blood cells in response to a new immune challenge. Even if the bleeding edge integrations are really hard, the alternative world where you slow them down with policy is death and attrition.


As a data scientist, I view other data scientists who need engineers to productionize their code with contempt. Perhaps (probably?) their data pipelines and systems are sufficiently more complicated than mine, but I'd feel embarrassed if I couldn't write production Python code.


They are different skills. Not saying that it's hard to learn both, but there are standardized career paths that will lead you to be good at the modeling / techniques side of "data science" without learning much about software engineering. For example, studying math in undergrad. And there are certainly lots of people capable of productionizing messy R scripts without fully understanding the statistical ideas behind them. So I think, as a team leader, you are restricting yourself to some degree if you only hire people who can do both.


Just what the data science needed more of: gatekeeping.

Data science is an extremely broad, vague buzzword encompassing a variety of jobs and skillsets, most of which have existed for decades under different names. You do work that involves putting models into production in Python, congrats. The insistence that all data scientists must also do so is silly, especially considering that there are surely many skills used by many data scientists that you are incapable of.


We don’t allow (new) Python code in production, which has something to do with it. Feature engineering is also a very different game with streaming systems and online datastores than with CSVs.
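To illustrate the difference with a toy sketch (hypothetical feature and field names): with a CSV you compute the feature in one batch pass; with a stream and an online datastore you maintain it incrementally so it is already there at serving time.

    import csv
    import io
    from collections import defaultdict, deque

    # Batch: "average purchase amount per user", one pass over a CSV.
    csv_text = "user,amount\nalice,10\nalice,20\nbob,5\n"
    totals, counts = defaultdict(float), defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["user"]] += float(row["amount"])
        counts[row["user"]] += 1
    print({u: totals[u] / counts[u] for u in totals})  # {'alice': 15.0, 'bob': 5.0}

    # Streaming: "average of the last N purchases per user", updated per event
    # and kept as online state so it can be served without rescanning history.
    WINDOW = 2
    recent = defaultdict(lambda: deque(maxlen=WINDOW))

    def on_event(user, amount):
        recent[user].append(amount)                   # update the online state
        return sum(recent[user]) / len(recent[user])  # current feature value

    for user, amount in [("alice", 10), ("alice", 20), ("alice", 30)]:
        print(user, on_event(user, amount))  # 10.0, then 15.0, then 25.0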


Ia! Ia! Prototype into production! Research Quality code! RPy2 Fhtagn!


Even the Stanford NLP Java implementations don’t always match the code they’re meant to be a translation of. I think R is a worse offender than Python.


In Python 0.1 + 0.2 is not equal to 0.3 because the result is 0.30000000000000004.

In R 0.1 + 0.2 is equal to 0.3.


Not only is this technically wrong, but Python has a decimal module for performing these kinds of calculations. Python uses floats/doubles for the native float type, which produces exactly the kind of results you see above.
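A quick sketch with nothing beyond the standard library:

    from decimal import Decimal
    import math

    print(0.1 + 0.2)                      # 0.30000000000000004 (IEEE 754 doubles)
    print(0.1 + 0.2 == 0.3)               # False -- and (0.1 + 0.2) == 0.3 is FALSE in R too
    print(math.isclose(0.1 + 0.2, 0.3))   # True -- tolerance-based, roughly what R's all.equal does

    print(Decimal("0.1") + Decimal("0.2"))                     # Decimal('0.3')
    print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True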


What is the result of 3/2? In R it is 0.5. In Python it is 1.


You mean, in Python 2, which uses integer division by default, it's 1 (as expected), and in Python 3, which uses float division by default, it's 1.5.

If R is giving you 0.5, you should find another language (I assume you meant 1.5?)
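A quick check under Python 3:

    print(3 / 2)     # 1.5 -- true ("float") division in Python 3
    print(3 // 2)    # 1   -- floor division, the Python 2 behaviour of / on two ints
    # In Python 2, `from __future__ import division` gives the Python 3 behaviour.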


In Python 3 it's 1.5.


Not on my R:

    > (0.1 + 0.2) == 0.3
    [1] FALSE


The result for 0.1 + 0.2 is 0.3. https://imgur.com/xWpx1Cg

Why do you compare point to point? I have never once in my Statistics education compared point to point. You always need to see the probability of the result if it is within 2 points.

But if you want to compare use all.equal(0.1 + 0.2, 0.3)


    > all.equal(0.1 + 0.2 + 0.000000003, 0.3)
    [1] TRUE


Try this:

    > print(.1+.2, digits=18)
    [1] 0.300000000000000044


In R 0.1 + 0.2 returns 0.3. In Python 0.1 + 0.2 returns 0.30000000000000004

Why do you compare point to point? I have never once in my Statistics education compared point to point. You always need to see the probability of the result if it is within 2 points.

But if you want to compare use all.equal(0.1 + 0.2, 0.3)


R:

    > 0.1+0.2 == 0.3
    [1] FALSE

?


Why do you compare point to point? It is meaningless. The result for 0.1 + 0.2 shows 0.3. It doesn't show you a wrong result. If you print it, it prints 0.3, nothing else.

Why do you compare point to point? I have never once in my Statistics education compared point to point. You always need to see the probability of the result if it is within 2 points.

But if you want to compare use all.equal(0.1 + 0.2, 0.3)


I thought that's what you meant by "equal". It didn't appear to me that you were merely talking about the default formatting of numbers printed on the REPL, which is utterly inconsequential to developing and porting applications.


R is not used for developing applications. https://www.r-project.org/about.html


This entire subthread is under a comment talking about "productionalizing" code developed in R. If you want to make an argument that R should not be used to make products, but only as an interactive notebook and to make nice plots, maybe make that instead of just mentioning small UI details.


At my company, I don't have a staff of engineers to productionalize things so I have to do it myself.


As a data analyst who is pushing towards data science, do you have any resources or advice on how to steer away from these "Lovecraftian concoctions" you're speaking of?



It's scary how much this comment applies to my current job. I literally just spent today discussing with the entire engineering org how to steer away from this behavior.


why?


hedge fund?


Well, say that to an actual computer scientist. They will argue that without data science/informatics there wouldn't be any computers in the first place. Steven Levy's Hackers is a showcase of this: coders/programmers/hackers emerged from that open computer science lab culture. And so on.


And programmers exist at the mercy of farmers. What's your point?


These kinds of "arguments" are so tiring. Who cares who's better than who? Focus on solving problems, not stroking egos.


Why is Java, a byte-compiled statically-typed language, "bloated", while R and SQL, interpreted scripting languages, are not? I find the point hard to follow. Execution of an R script will go through many more "layers", downloaded from many different sources, and implemented in different languages.


I don't know, the number of hoops you need to jump through to use a trendy data science tool like Hadoop, Spark etc. is way bigger than for a simple Java program. From my (limited) experience I'd say the data science (or big data) way is the bloated and convoluted one.


> How did applications get so bloated and therefore boring to look at?

I love reading code at this level. 1. You get to see into the mind of the programmer and learn new techniques. 2. Boring?! 3. A good text editor makes it attractive to look at for hours :)


In the absence of marketeers (think bootcamps) and recruiters, the alternative or correct title would have been: 'Is “statistician” the new “programmer”?'


A lot of data scientists are doing very light statistics, though. Many of them don't have formal degrees in stats at all. There's a huge range of ability represented by that job title, and "statistician" doesn't capture the lower end.


Programmer who has refreshed his high school math skills.


You can shape software out of chaos. You can shape software out of order. Both are just sides of a multifaceted field called Software Engineering.


I thought “engineer” is the new “programmer.”


It's the old new term ;)


Only in countries that don't have certification of job titles.


data scientists write code for themselves while software developers write code for other people


only if "software engineer" is the new "system administrator"


This doesn't make any sense. My job title is "software engineer" and I never do any system administration. I produce code in Python, JavaScript, C and SQL; I never do any sort of administration. Sure, I occasionally deal with Linux since our servers are Linux and so some knowledge of it is useful, and I use Unix tools pretty extensively (in OSX) since I prefer to write code this way. All the "software engineer"s I know have similar experience to mine with varying languages, so please suggest some evidence.


I agree that it doesn't make any sense, however I recently interviewed at two different companies for a software engineering role and both had requirements/expectations for sys admin experience. I will refrain from going off on a rant and just remain hopeful this is yet another short-lived trend.


No.


Is marketer the new journalist?


Is online journalist the new marketer?


I would say without any hyperbole, absolutely.


Wait. Then what are "influencers"? The feedback loop seems to be eating itself.


The human serpent of advertising.



coder -> programmer -> developer -> data scientist


Everyone is a programmer now. Data scientists, accountants, marketers, doctors, lawyers, project managers. Knowing how to write programs is just knowing how to write.



