The main point that one should know SQL as much as possible before using ORMs, I agree with fully. The point that applications should be written by quick-prototyping with an ORM, then replacing the ORM entirely with raw SQL, I could not disagree with more. Since he is using my own ORM (SQLAlchemy) as his example, I'd like to point out (as many of you know I always do) that SQLAlchemy's entire approach is one of exposing the relational model at all times. There is virtually no need to drop out of using the Query and/or select() constructs into raw SQL strings, as these constructs can represent SQL fully and more or less directly, not really any different in concept than when the relational database itself parses an incoming SQL string into a tokenized parse tree internally. The results returned are tuples. If your tuples happen to line up with the attributes in your object model, then you can tell it to get objects back. There is no sharp red line between "I'm using an ORM!" vs. "I'm using SQL!". With a mature tool like SQLAlchemy you're using both, and the tool is there to automate your work with SQL, not to replace it.
So of course, learn SQL as fully as possible. But I recommend using an ORM that allows you to make full use of your SQL knowledge at all times.
This is very important. Any time you build a bike with training wheels, you need to build in a way to take the training wheels off. Otherwise, at some point the solution is to not make faster training wheels, but discard the bike entirely.
Right, and (most) C compilers let you fall back on assembly if you absolutely must. So my point still holds, even if you do something silly like employ hyperbole on it.
Indeed. I do worry about the level of SQL knowledge that developers have in general these days. Without understanding it - and the relational model (which I'm a big fan of) - it's impossible to optimise the way you use the ORM.
OT but thanks for SQLAlchemy. It's the model ORM in my opinion. It's the only one I've ever used that feels like it respects SQL. When I started working with Python fulltime I spent days reading through the SQLAlchemy source - I learnt a lot from that.
I have no interest in the 'O' or 'M' in an ORM, but I am very interested in a language better than SQL for interacting with SQL-based pseudo-relational datastores. I've found SQLAlchemy to be a pretty good tool in this respect.
Other people on your team, might not be on the same SQL level as you. E.g. you can create very fast update/bulk/etc. operations directly in your used SQL engine, but then provide some ORM for people not familiar so that they can get access (speed might not be critical to them, just access to the db).
An example - in the game industry we often have technical artists - pretty cool bunch of folks with awesome talents - art (maya, motionbuilder, etc.) and coding (python, mel, C/C++) - these folks are never afraid to step into the dark woods, and would take SQL and learn it as if it was nothing dangerous, but they surely won't mind a good ORM. From what I've learned, they see most of these things as tools.
Any reason it would be 'wrong'? There are few things in software engineering that are universally 'right' or 'wrong', it all depends on context.
There are plenty of people who already know SQL and are comfortable using it, but still prefer to use an ORM though. It's not like ORMs are only used or liked by people who don't know SQL.
Still, 'it depends'. Yeah, there are all sorts of reasons, depending on context (and plenty of people to argue all those reasons).
Personally, I agree with OP that nobody should ever use an ORM as an excuse to not know SQL. You need to know SQL anyway.
So, once we agree on that, your question "Is there any reason to use an ORM if I know SQL" (that's basically what you mean when you phrase it in the inverse "Is there any reason it would be suboptimal to NOT use an ORM", right?) -- basically just boils down to "Is there ever any reason to use an ORM?".
A topic which is basically the equivalent editor/OS war of db-based web development on HN or reddit. Meaning it's an argument that can and does go on forever, and you can find in-depth treatments of in many other threads.
If you already have a layer of abstraction to your SQL statements, then it can be arguable. But if your are littering SQL statements everywhere, making changes to the database can be more tedious.
Also if you want to support more than one type database, I would use an ORM.
it's really not as much about the querying (though there is a lot of time-saving automation to be had there) as it is about integrating the data in your object model with the tuples being shuttled to/from the database. Like, at what point do you get sick of writing redundant "INSERT INTO <table>" over and over again? Are there really people who still don't see the time-wasting, code-cluttering repetition in that?
I'm still amazed that there are developers who would rather spend more time maintaining/debugging/optimizing ORM code in a mature app that just learning a simple INSERT statement.
who would rather spend more time maintaining/debugging/optimizing ORM code
But that's kind of the point- with a decent ORM library there isn't really much work to do.
model = MyModel.load(id)
Add a new field to your model? It'll handle it. It might even modify the table for you. With raw SQL you'd have to go and edit your UPDATE and INSERT statements each time. Seems like a lot more maintenance/potential debugging to me.
SQL is rarely about a lot more than CRUD and tables. I would think that over 95% of the SQL code ever written has been CRUD operations.
ORMs exist to simplify CRUD operations. If you are trying to do something that is a not a CRUD operation then you'd be mad to try and use an ORM to do it.
Because in my experience, the "maintaining/debugging" step is almost non-existent (isn't that the point of the ORM?), and the "optimizing" step hasn't been an issue for the traffic our sites get.
> the "maintaining/debugging" step is almost non-existent (isn't that the point of the ORM?)
Nobody can agree on the point of ORMs, which is a large part of the reason why there's so much debate over whether to use them and how. I must say, if their point is to eliminate maintenance and debugging, then they are failing miserably at it.
anyway it is usually never the case. Why would you build a complex app only to tear it appart with going back to raw SQL ?
i'm not talking about SQLAlchemy but often the main problem with ORM is performance. Especially when a orm has its own query language on top of the SQL language ( Like Hibernate or Doctrine ).
A good ORM imho is what you described , something very light but still usefull enough so one doesnt have to do data transformation from rows to objects. How does SQLAchemy fetch related objects ? is it is lazy ? or does it fetch everything when doing a join query for instance.
Anyway thank you for your great work with SQLAlchemy.
For a while I used to ask interview candidates to explain the difference between WHERE and HAVING, to see if they'd ever done anything beyond the basics. I'm still not sure if that's too hard, but people who could answer it did tend to do much better in the rest of the interview as well.
It's sad but this difference, although completely fundamental to SQL, is seen as "advanced" by most devs that I know. They would have no clue as to how to answer...
Consider these two queries (untested so they might have typos):
select order_number,sum(line_value) as order_value from order_line
group by order_number
having order_value > 100;
And:
select order_number,order_value from
( select order_number,sum(line_value) as order_value from order_line
group by order_number )
where order_value > 100;
I'd expect them to have the same explain plan and runtime characteristics. Isn't HAVING just syntactic sugar for the fact queries can be arbitrarily nested? Syntactic sugar isn't fundamental IMHO.
I don't know all of the syntax from all languages, but I do know that in Sybase and Microsoft SQL Server, the first would generate a syntax error. The HAVING and WHERE clauses cannot reference column aliases. You could rewrite it to be:
...
having sum(line_value) > 100;
Would the performance be the same? That's up to the individual DB. To say, "Yes, they would" implies that all DBMS vendors implement query plans the same way. I think that, fundamentally, they should perform the same way but again: it's not up to you or I; it's up to the people who wrote the query execution engine.
I believe you are correct, except possibly the other way around -- nested queries are just syntactic sugar for joins, group by, and having. :)
I'm actually not sure what I said is true (you can possibly do more with nested queries than you could do with just joins and group by/having?), but I DO think group by/having came before nested queries in rdbms implementations.
I can help out with this! (Finally a hacker news discussion in my line of expertise!)
While you could think of nested subqueries as syntactic sugar, you might also want to think of them as a view you specify on the fly, or a "derived table" of information.
Each RDBMS optimizes a bit differently, but depending on your system subqueries may have query plan implications as well. Sometimes they'll make your query faster, other times slower. It all depends on the RDBMS, table indexes, and the operations you are doing inside the nested query.
Personally, I'd recommend using HAVING instead of using WHERE with a nested SUM. The query optimizer may create the same execution plan in the end, but the HAVING is a bit more explicit in what you are doing.
For those familiar with SQL, HAVING indicates you are filtering your query on an aggregate value, where as a WHERE indicates you are filtering records out of consideration before they are aggregated (as someone else as pointed out in another comment).
Conceptually WHERE is filtering the table rows that get considered and HAVING is filtering the result rows - functions and all - that get returned. That's how my mental model of it all works.
I deal with pretty complex queries daily, and I completely forgot what HAVING did exactly. I guess it's because I long ago realized that if I had to use HAVING in a query, I'm doing something wrong. That's almost always a Reporting/BusIntel tool's job, not mine.
That's a great point. Part of learning SQL is learning what parts of it _not_ to use. Things like HAVING and ORDER BY are useful for ad hoc queries at a SQL command-line, but are either not useful or a waste when there's another level in the system like a BI tool.
I like asking for a 3VL truth table for the basic logical operators. This is a good way to get people talking about something that is conceptually fairly straightforward but that they might not have ever really puzzled out before.
If you really want to master non-trivial SQL learn to think about filtering, joining, and combining data via set theory. Once you can logically break down the set you're trying to get to the rest then just becomes learning/looking-up syntax.
The other big piece of advice is the tried and true incremental approach. The more complicated something is, the more likely I am to use the SQL client the way one uses a REPL and incrementally write the query:
1. Write basic SELECT
2. Add another clause (WHERE filter, GROUP BY, etc...)
3. Execute (syntax/sanity test)
4. Finish or Goto step 2
Just like everything else in programming it's amazing how much simpler things are when you just piece them together one step at a time.
I believe it's somewhat impossible to overestimate how REPL-like data munging with SQL actually is. Particularly as the complexity of your data exploration increases.
He left out an important reason to know and use SQL directly: you should always be aware how you are using your indices. This goes beyond making an index for a column in the table: you also need to know whether the query will map well to a btree index. Mobile, with its low resources, makes index optimization even more important; apple's Core Data can easily kill your app.
A class of query I love that scares off a lot of developers is a correlated sub-query, where the subquery references a value from the outer query. For example, finding all employees with at least one assignment:
SELECT *
FROM employees e
WHERE EXISTS (SELECT 1
FROM assignments a
WHERE a.employee_id = e.id)
For a while in Oracle this was a lot faster than IN/NOT IN. I'm not sure if that's still the case, or if it's true for other systems. I believe I read that in Postgres the query planner does the same thing whether you use EXISTS/NOT EXISTS or IN/NOT IN.
EDIT: This kind of query is great with Rails scopes, because you can write something like this:
class Employee
scope :with_assignments, where(<<-EOQ)
EXISTS (SELECT 1
FROM assignments a
WHERE a.employee_id = employees.id)
EOQ
end
and that is easily composeable with other scopes/conditions/etc since it doesn't force you to use any joins. Yay for mixing SQL with your ORM!
I believe I read that in Postgres the query planner does the same thing whether you use EXISTS/NOT EXISTS or IN/NOT IN.
I don't think this is true, unless it's a very recent change. Here's a post from 2009 comparing NOT IN/NOT EXISTS/LEFT JOIN WHERE IS NULL for Postgres:
I recently discovered that MySQL before 5.6 (5.5 being the latest available on RDS, of course) does not honor indexes if they should be invoked for a subquery, e.g.
select whatever from wherever where user_id in (select id from users where somethingorother like '%lol%');
Got an index on user_id? Too bad. Ignored.
If you precompute the values, though?
select whatever from wherever where user_id in (1, 2, 3);
Let's say your inner query returns n rows and wherever contains m rows. And your index is a B-tree, so you can hit it once in O(log m).
If you use a nested loop join, where you hit the index once for each inner query result, that's O(n*log m). If on the other hand you do a hash join, skipping the index but doing a full table scan of wherever, the complexity is O(m+n). So which is the faster choice depends on how many rows the inner query returns.
If you want to plan up front, before you know what m is, how would you decide which join to use?
You can use CROSS APPLY on tsql, and it allows you to use an order by in the inner query, for example, if you wanted to know the id of the employee's latest assignment.
SELECT e.*
, assignment_id = b.id
FROM employees e
CROSS APPLY (SELECT top 1 a.id
FROM assignments a
WHERE a.employee_id = e.id
ORDER BY a.assignment_date DESC
) b
In tsql, you are not allowed to have order by in a subquery (if memory serves me right)
Edit: Oh crap, I forgot this isn't SO, and the formatting went out the window.
Sure. Note also that CROSS APPLY allows you to retrieve multiple columns, unlike an in-SELECT subquery (unless you want to have that query written multiple times)
Depending on the query and the optimizer, the resulting plan for IN/NOT IN might be the same as EXISTS/NOT EXISTS. E.g. a left semi join. I know this is true for MSSQL Server, I can't speak for Oracle.
Well, adding joins to the outermost query can limit your options re how you use the scope. For instance, `joins` sets a read-only flag on the AR result, so you can't say Employee.with_assignments.destroy_all, for example, to, um, fire all your busy people. :-)
OP's query is not the same as an inner join; it will only return each employee only once while your query will return each employee for every assignment it is associated with.
I haven't used Oracle since 8, and the only reference I can find now from those days are a couple AskTom articles [0] [1]. They are quite worth reading, although what they say re performance is not quite what I (mis?)remember.
Everyone uses an ORM. You use a well known, documented, and supported ORM, or you're writing your own wether you realize it or not.
Don't believe me?
1. Do you have objects?
2. Do you have relational data?
There's the O and the R. How do you get them together? That's where the M comes in. You use a library that knows how to do the M, or you do your own M with a bunch of getters and setters, for loops and case statements.
Eventually, any little change to the database becomes a regression nightmare.
Once you find yourself saying "I know, I'll build a code generator to create these DAOs", that's when you should finally realize you should have used a real ORM. Sadly, many people still won't get it at this point and will go ahead with the code generator.
I don't know why you're being downvoted. Even when you're using an oo language, there's not always a reason to force your query results into an "Object." Frequently the only object you need is a 2 dimensional data structure which could be a list of dictionaries, or a DataTable or something like that. You don't always need a special named structure for the results of every query. ORM's always seem to be designed for people who have a one to one mapping between data structures in their program and tables in their DB. I find that this very rarely makes sense, particularly when you're talking about analytical applications.
Great, you have a list of dictionaries of dictionaries. Now the user has updated some of the items inside one of the nested dictionaries.
The user clicks save. Now what? You've got to get that change to the data in the dictionary inserted into the correct place in the relational database.
You're going to write code to do by hand what an ORM wants to do for you.
Do you use data structures? Do you use variables? Do you use anything that in any way stores the values from your database so your application can work with them?
Then you are doing XRM, where X = Objects, structs, vars...
More wild assumptions. The M in ORM stands for mapping. I do not map anything, the data comes back in tuples, and is used exactly as-is. Just because it gets stored in a variable doesn't mean there is any form of mapping going on.
I don't understand how it is possible to develop anything database-related without knowing SQL. Are these people not querying the database directly to figure out what it contains (during development, debugging, testing etc etc)?
I've never seen a GUI frontend that managed to remove the need to write custom queries while inspecting the data, but if you don't know SQL I could see someone not even realizing all the things that can be trivially done with a relational database, and just settling for manually cross-referencing tables.
Well there's the whole world of NoSQL but without understanding the relational model there's no guarantee that your choice of NoSQL is appropriate. From what I understand, many of the prominent advocates of NoSQL even recommend learning relational first.
I once interviewed a guy with a Masters in computing (of some sort, I forget) who didn't know SQL. He'd been developing for years, but lived entirely in .NET land and just used ORMs.
I could actually see how this happens these days, especially when working on a big enough team with dedicated Database engineers. In the old days, I'd definitely be writing my own stored procs and embedded SQL in code, etc. Then I worked for several years on teams with dedicated data gurus. When we'd hit some performance thing that would need tweaked, they'd write a stored procedure for it, and we'd just call it from the middleware. I've seen this type of team composition on both .NET and Java platforms, with and without the use of ORMS.
It all depends on the coverage of your server-side web engineers...some will go deeper into the JavaScript/UI, some go deeper into the data-model.
I have a PhD in computing (of some sort, I forget) and I know about 3 SQL commands. I've been developing for decades, and I never need to interact with databases. Like everyone in my field of research, I keep my data in text files.
> I have a PhD in computing (of some sort, I forget) and I know about 3 SQL commands. I've been developing for decades, and I never need to interact with databases.
If you were a developer that did interact with databases (but not a DBA or specifically a "database developer"), you could probably get by with the four that correspond to CRUD operations directly (SELECT, INSERT, UPDATE, DELETE), so knowing "about 3" isn't all that bad.
In practice, SQL access control is pretty simple-minded compared to relational algebra, and a server may not notice that your DELETE and INSERT are equivalent to an UPDATE which only modifies columns you're authorized to.
I was referring to what you could practically survive with as an application developer who didn't control the databases involved; for the reasons prodigal_erik states in their reply, DELETE + INSERT, while you can do equivalent transformations to UPDATE with them, aren't going to be a general-purpose replacement for developers of real-world applications against real-world databases that they do not control.
Sure- if using databases isn't an everyday task for you it's understandable. But this guy used them every day, without knowing what was actually going on when he called .Save()
Most of my experience is with the Django ORM. The point of the django ORM is not to replace SQL knowledge, even though that is feasible. The main point is reusability.
I had to work with SQL through PHP for a while and I found myself "composing" SQL queries in a myriad of ways. I tried to not repeat myself, but it felt like the Django ORM would have gone a lot further in cleaning up the query-building.
In conjunction with Django forms and Django Admin, maybe even the template language, the ORM makes query construction reusable.
One of the kickers is the ability to unify object construction from table columns. It's easy to convert a string or number to some Python field. It's more elaborate with Decimal, Json, or whatever you want to cook up.
This is only a valid question if you have a single engineer on the project? Likely, there's at least one or two people on the team that are gurus at optimizing the access to your datastore.
What are some good resources for methodically learning SQL? Like many of the other devs I know, I learned a hodge podge of SQL while working on other projects, but I've never had any formal or comprehensive training on it. Ideally I'm looking for a book or two, and I don't mind if they start simple as long as they're comprehensive and recent enough to be relevant to modern RDBMSs.
Then learn the pain as everyone uses an expensive view in all sorts of queries because it's easier, and they dont understand the performance characteristics of joins well enough to understand why they shouldn't do that.
An ORM isn't necessarily complex. It just needs to map a tuple to an object.
Where things get complex is when your framework starts introducing other concepts such as query generators, unit-of-work, caching, lazy-loading, etc.
One of the most critical features is query generation, which I think is the point of this article. Simple queries are pretty easy to abstract, such as loading rows by primary key or querying based off a simple index. Other queries, especially aggregate queries, get tricky fast. I argue that often it is much harder and more work to construct an appropriate query via your frameworks query generator.
Fortunately many good frameworks allow you to essentially write the exact SQL to be executed and the rest of the framework (mapping, caching, unit-of-work) "just works" with the results.
No. First, learn the internals of your system. Learn SQL. Learn NoSQL. Understand why your software is slow. Then, once you've optimized the worst case away with decent SQL and/or a good object-oriented architecture. Then and only then should you start with caching. Too many developers these days are just brainless monkeys who dump everything in memcache.
Throwing caching at poorly written database code is exactly like when parents tell their children to tidy their room and then they shove everything under the bed. You can only do that for so long, at some point you need to clean that room up.
The article has the good premise, of course you should learn SQL, but it goes over the top and makes some false presumptions for the sake of the argument:
'Think about it, though: it’s absurd that you would even need to learn any SQL at all! The very nature of an ORM is to bypass SQL.'
I think the nature of ORM is to map object oriented code to relation data. So not to bypass, but to pass between the two conveniently. Conveniences that ORMs do automatically that otherwise you do manually are type checking, sql sanitization, and merging logic (methods) with the data in the same class. Knowing SQL does not make the above tasks any easier, so does not, in any way, prompt dropping ORMs.
Also, there is the wrong use of 'begs the question' in the same paragraph :)
But the point that one needs to know SQL even if one is using an ORM is just obvious to me. It boggles and scares me that anyone thinks they can be a competent web developer without knowing SQL.
I was going to say "...if they use an rdbms, maybe they just use some NoSQL and can get away without it." But you know what, nope, not even that caveat -- if you don't know SQL and rdbms, you aren't going to be competent to know if some nosql is right the choice, or which one, either.
I agree. I am not a professional programmer. I'm more a technologist with a passing interest in programming. My desktop is Linux and I can basically hack in Bash, Perl and R to achieve certain simple outcomes. One of the things I use R for is to query a MySQL database for statistical analysis. To do this, I write long hand SQL queries. I didn't know there was another way. What is this ORM of which people speak?
The nice thing about SQL is that it doesn't take very long to learn how to use it and then it's incredibly useful any time you have a database to interact with. I think all novice programmers should take a crack at it.
More generally, I think that learning about the relational model (which, in practice, probably means learning to use SQL, even though SQL is in many ways far from ideal) of data is fundamentally mind-opening in how you understand and deal with data and makes you a better programmer in general, even when you aren't interacting with an external relational database.
I see no reason why one programmer can't learn just about everything related to their application. I expect the programmers who work for me to be experts in SQL, CSS, and everything in between.
I truly wasn't being pedantic. Pretty much every project I've worked on has required me to learn some new layer of technology that I never even knew existed. Sometimes it's just too much for one person and you have to rely on the strengths your team as a whole. The idea that a single developer can rapidly learn all of the layers/technologies affecting his/her project is a fallacy at best.
[Edit: However, in the spirit of the OP it's perfectly reasonable to expect your devs to know or be able to learn SQL.]
I dont understand how its possible for someone to be the "best python programmer you know" yet that person doesn't know enough SQL to display data from a table...?
> I dont understand how its possible for someone to be the "best python programmer you know" yet that person doesn't know enough SQL to display data from a table...?
Because knowing SQL doesn't make you a better Python programmer. It might make you a better application developer, but SQL knowledge is not a subset of Python knowledge.
So of course, learn SQL as fully as possible. But I recommend using an ORM that allows you to make full use of your SQL knowledge at all times.