Oh, man, as a seasoned DBA/Data Analyst, don't do this unless you have a really good reason to. This is premature optimization of the kind you want to avoid.
Yes, it's really neat to update everything in a single statement, and in some situations it can perform significantly better, but CASE expressions in an UPDATE statement quickly get complicated to map out in your head. It gets extremely difficult extremely quickly to tell where you have an error, and it's very, very easy to make a very costly mistake.
If you really need an atomic change you can just do this:
BEGIN;
UPDATE reward_members SET member_status = 'gold_group' WHERE member_status = 'gold';
UPDATE reward_members SET member_status = 'bronze_group' WHERE member_status = 'bronze';
UPDATE reward_members SET member_status = 'platinum_group' WHERE member_status = 'platinum';
UPDATE reward_members SET member_status = 'silver_group' WHERE member_status = 'silver';
COMMIT;
In general, however, make your queries difficult for the server and easy for you, because you make a ton more mistakes than the server ever will. Let the query planner and optimizer do the work. If performance becomes a problem on this query, you can fix it later when you can focus on just that one issue and understand the specific problem much better.
> You can imagine how many round trips this would take to the server if multiple individual UPDATE statements had been run.
If you're not returning data and you reuse the connection like you're supposed to, "round trips" cost is essentially nothing. What's expensive here is that the database server has to scan the index or row data on member_status. However, if the table is not billions of rows, it can probably fit that index (or even the row data for small tables) in memory and will cache hit on everything.
That said, a list of single UPDATE statements can also perform much better than one monolithic statement: if you're only updating a portion of the table, or the number of rows you'll actually touch is comparatively small, the individual statements can win. It all depends on exactly what you're doing with the table.
This comment is really interesting to me as I'm working on getting essentially this functionality added to Django[1].
The CASE statement is slightly different, more like this (usually with a couple of hundred IDs):
UPDATE TABLE place
SET rating = (
CASE WHEN id = 1 THEN 25
CASE WHEN id = 2 THEN 94
CASE WHEN id = 3 THEN 16
)
WHERE id IN (1, 2, 3)
Members of the Django community, myself included, have seen huge performance benefits using this technique. I had an update loop go from taking 90 seconds to 3.
As a seasoned DBA you see any huge downsides with this technique? I agree that if you're updating the whole table using a CASE statement is not a great idea but if you want to update 1000 rows with differing values and you have to issue 1000 individual queries then this can take some time (including whatever overhead Django has).
I'd probably use a temp table for that, rather than a case.
begin;
select * into temp_ratings
from (values (1,25),(2,94),(3,16)) as v(id, new_rating);
update place set rating = new_rating
from temp_ratings
where place.id = temp_ratings.id;
drop table temp_ratings;
commit;
Scales well, and almost all db engines know how to do that join. You can index the id, or use a temp table, or whatever depending on the shape and quantity of your data.
1) This is not really a temporary table in the traditional sense; you're creating a real table and then dropping it. This has problems with concurrency (you can't have two "real" tables with the same name), among other issues.
2) Temporary tables in Postgres suck because they bloat the catalogs. Creating temporary tables regularly is better avoided.
3) There's no real need to create a table of any kind. A regular VALUES would do. (Though you might have to encode some type information into the query; a sketch follows below.)
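For the curious, a minimal sketch of that plain-VALUES form (assuming Postgres and the place/rating example from upthread):
UPDATE place
SET rating = v.new_rating
FROM (VALUES (1, 25), (2, 94), (3, 16)) AS v(id, new_rating)
WHERE place.id = v.id;
-- For non-integer columns you may need casts in the first row,
-- e.g. (1::bigint, 25::numeric); that's the type information mentioned above.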
> I had an update loop go from taking 90 seconds to 3.
How is the performance with prepared statements? It seems like repeatedly executing the same statement with different values is what prepared statements are made for.
The ORM likely does use prepared statements; it's that most ORMs by default make 100 round-trip IO operations (one for each update). The code would need to be rewritten without the ORM to send all the updates as one database call. The queries aren't expensive for the database, but for the application the IO of each database call adds up quickly.
Unfortunately Django does not use prepared statements. There are some big challenges to work around before support can be added. In any case, I'd be very, very surprised if a significant portion of the 87 seconds saved by the single statement would also have been saved by using prepared statements.
I don't think using a prepared statement 100 times will have anywhere close to the same performance improvement as sending one concatenated bulk update or one update set case.
No reason why prepared statements and single-call batched-statements can't both be used together. Of course, if one method (batching statements) works well enough, there's no need for another (prepared statements) too.
In every ORM I've ever used, everything is a prepared statement by default, so I've only been comparing 100 separate database call updates vs 100 updates concatenated as a single database call and 100 updates as a single update case when, where all three were already implied prepared statements.
Prepared statements will still be noticeably slower in most deployments. Almost all drivers are synchronous, and each individual query is a round trip to the database. This tends to dwarf parse and planning times unless you have a sub-millisecond ping time to the database.
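For anyone unfamiliar, a prepared statement at the SQL level looks roughly like this in Postgres (a sketch; client drivers usually do the equivalent over the wire protocol, and the place table is just the example from upthread):
PREPARE set_rating (bigint, int) AS
UPDATE place SET rating = $2 WHERE id = $1;
-- Parsed once up front; each EXECUTE just supplies new parameters...
EXECUTE set_rating(1, 25);
EXECUTE set_rating(2, 94);
EXECUTE set_rating(3, 16);
-- ...but each EXECUTE is still its own round trip to the server.
DEALLOCATE set_rating;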
It can be a lot faster, but it can also be slower. With databases, the answer for almost everything is "it depends." Here, it really depends on how big the table is, how many rows you're updating, what the indexes are, how much RAM is available, etc.
My point is that, in a vacuum, you can't really say anything about what the performance difference will be. Yes, you can certainly find instances where the performance is better, but it may not be much.
My point is that the complexity of writing the statement is non-trivial, and unless you have a good reason to, you shouldn't use CASE. It shouldn't be your first choice, is all I'm saying. For example, the example you provided here is not even valid SQL. It should be:
UPDATE place
SET rating = (
CASE WHEN id = 1 THEN 25
WHEN id = 2 THEN 94
WHEN id = 3 THEN 16
END
)
WHERE id IN (1, 2, 3)
Keep in mind, I also said: "If performance becomes a problem on this query, you can fix it later when you can focus on just that one issue and understand the specific problem much better."
There are lots of reasons writing a query one way might be faster than another. Running in parallel might be one. If you had, say, 20 separate cases and compared that to 20 separate queries run one after another, each query waits for the previous one to finish, so there's no overlap when they run separately. I bet 20 separate queries run in parallel would have been much faster. It also depends on your indexes and on how much of the table each query had to scan to build its update set: if it had to scan most of the table for each query, that could explain why the single query was better.
I agree with others that usually it's best to write queries that are simple to understand and let the optimizer pick a good plan. Under higher load and with more data, the indexes matter more, of course (as does the amount of RAM).
I can't say the above case is more likely to be better or not, it depends on the database schema.
So I was actually wondering if I was correct after seeing some comments here. This is what my quick benchmarking says:
I made a postgres table with two columns, id and num. I filled it with 1000000 rows in sequential order (just a generate_series()), so the first row was (1,1), the second was (2,2) etc.
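(Roughly this setup, as a sketch; the exact names are assumed from the description:)
CREATE TABLE test (id int, num int);
-- one million rows where id = num = 1..1000000
INSERT INTO test (id, num)
SELECT i, i FROM generate_series(1, 1000000) AS i;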
I ran two queries to update the 'num' column of the first 1000 rows with random values. The first was as described above:
UPDATE test SET num = (CASE WHEN id = 1 THEN 123 WHEN id = 2 THEN 356 ... END) WHERE id IN ...
This took about 1.5 seconds (on my completely un-optimized postgres instance)
The second uses 1,000 individual queries in a transaction (BEGIN, COMMIT):
BEGIN
UPDATE test set num = 123 WHERE id = 1
....
COMMIT
This is still running, so it's taken over 3 minutes so far.
I'm guessing it's because even though the statements are in a single transaction, the database still issues 1,000 individual table scans, each to find a single record, and does 1,000 individual updates. Doing 1,000 updates using a single CASE statement seems to take about the same time as updating a single row the traditional way.
Edit: Query returned successfully in 21 min 39 secs.
Did you have an index on the table? If you didn't, you had to scan the whole table on each "separate query", whereas in the single query you scanned it once.
But like someone said above, it depends; the best approach will differ. Also, when dividing the work into 1000 separate queries, each has to wait for the previous one to finish, and each has to be planned separately. A more interesting comparison might be to (1) create an index on the id column, and (2) run the 1000 updates in parallel, maybe 30 at a time?
There was an index on the PK column and the value column as well.
Edit: Oh no, there was not. My bad :/ I added an index and the query time dropped to 5 seconds or so. A lot, lot faster, but still about 4x slower than the CASE version.
Parallel queries would probably perform better, but I'm looking at a more general case. Django supports a lot of backends and not all of them support parallel queries, and indeed there is no way to make parallel queries like this in Django.
If we are talking about Postgres specifics, there is some better syntax for this that lets you join on a list of Values you can construct, which I intend to add support for later.
But if using 1 case statement is generally faster than issuing N individual synchronous queries then it's a big win for applications that have to do this kind of thing:
for instance in queryset.all():
instance.field = some_function()
instance.save()
While this isn't a super common pattern it does crop up from time to time and it's annoying to work around (if you even can).
Yup! The simple UPDATEs could improve the chances of the query planner working optimally--think of cases where a single, complex, setwise update might end up not using some really helpful partial indexes, etc.
Heads up. Depending on the circumstances, multiple updates are vulnerable to a race condition known as "read skew". If a user's status changes while the multiple-update transaction is executing, their status can wind up in a bad state. Here's a demonstration:
> CREATE TABLE reward_members (id bigint, member_status text);
> INSERT INTO reward_members VALUES (0, 'silver');
> BEGIN;
> UPDATE reward_members SET member_status = 'gold_group' WHERE member_status = 'gold';
> UPDATE reward_members SET member_status = 'bronze_group' WHERE member_status = 'bronze';
-- Concurrently: > UPDATE reward_members SET member_status = 'gold' WHERE id = 0 AND member_status = 'silver';
> UPDATE reward_members SET member_status = 'platinum_group' WHERE member_status = 'platinum';
> UPDATE reward_members SET member_status = 'silver_group' WHERE member_status = 'silver';
> COMMIT;
> SELECT * FROM reward_members;
id | member_status
----+---------------
0 | gold
If the transactions executed atomically (in the instantaneous sense, not the ACID sense), the member_status of the user should be either silver_group or gold_group depending on the order the transactions executed in. Due to the race condition, the user winds up with the member_status of gold instead.
If you want to avoid this you either have to use the case statement version or use a stricter isolation level. Either repeatable read or serializable would prevent this. I've written about read skew, other related race conditions, and how to prevent them in a series of blog posts you can find here[0][1][2].
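Concretely, on the multi-statement version the fix is just asking for a higher isolation level, e.g. (a sketch):
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE reward_members SET member_status = 'gold_group' WHERE member_status = 'gold';
UPDATE reward_members SET member_status = 'bronze_group' WHERE member_status = 'bronze';
UPDATE reward_members SET member_status = 'platinum_group' WHERE member_status = 'platinum';
UPDATE reward_members SET member_status = 'silver_group' WHERE member_status = 'silver';
COMMIT;
-- If a concurrent writer commits a change to one of these rows mid-transaction,
-- this transaction now fails with a serialization error and must be retried,
-- instead of silently producing the skewed result above.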
On a related note, efficiently batching disjoint updates is an interesting problem. One of my coworkers wrote about how he solved this exact problem in Redshift[3]. He found the most efficient way was to create a staging table with what old values corresponded to what new values. Then you can join the staging table with the table you are updating.
I was looking through the thread for this solution. It's great because it's much harder to mess up and less verbose than a CASE, and it's built on common SQL features, making it portable, fast, and easy to grok (as long as both the engine and the DBA know how joins work).
As a data analyst this technique is also great for scrubbing dirty data at query time when you don't have the ability to just fix it.
Thank you very much for taking the time to explain that and including all the links.
I've had issues in the past with atomic transactions in Django still appearing to have race conditions and I think this explains why. Really appreciate it.
I suggest read committed, and increase it to serializable only if necessary. Serializable transaction isolation means your code is responsible for catching the serialization exceptions that will occur and handling them (retrying the transaction, generally). Devs just don't do that, and in most cases don't even know how to trigger the exceptions. So you end up with apps that work fine during CI and staging, but start failing under load. And it might not be possible to retrofit the correct exception handling without a massive re-architecture.
That's reasonable. I usually avoid the higher isolation levels because they can cause queries to spontaneously fail. You need to be able to put in the extra work for handling the cases where they fail.
If data correctness is an absolute must, you should go with serializable.
This isn't an issue with multiple statements in a transaction. It's just an issue with using the incorrect transaction isolation level. That will cause problems even with single-statement transactions if you're using the READ COMMITTED isolation level. One of your examples already shows that you don't need multiple statements:
S1> BEGIN;
S1> UPDATE ints SET n = n+1;
S2> BEGIN;
S2> DELETE FROM ints WHERE n = 2;
-- S2 blocks since the DELETE is trying to modify a row
-- currently being updated.
S1> COMMIT;
-- S2 unblocks.
S2> COMMIT;
S3> SELECT * FROM ints;
n
---
2
3
This is working as designed and intended. This is how the READ COMMITTED transaction isolation level works. If you want higher isolation levels like REPEATABLE READ or SERIALIZABLE, then use those. If you need snapshot isolation, then use that. READ COMMITTED is the default isolation level because it's the most usable isolation level for the vast majority of situations, but having an RDBMS that handles concurrency for you doesn't mean that you don't have to understand how that concurrency model works.
The tradeoff for these higher isolation levels is that instead of getting results that you might not necessarily expect, you'll instead get deadlocks and forced transaction rollbacks, and your application will need to account for that instead and may need to resubmit queries in certain situations. This is the fundamental tradeoff between performance and isolation.
Except mutex locking the rows is also vulnerable to read skew! For example, even if you lock the rows with SELECT ... FOR UPDATE:
> CREATE TABLE reward_members (id bigint, member_status text);
> INSERT INTO reward_members VALUES (0, 'silver');
> BEGIN;
> SELECT * FROM reward_members WHERE member_status = 'gold' FOR UPDATE;
> SELECT * FROM reward_members WHERE member_status = 'bronze' FOR UPDATE;
> SELECT * FROM reward_members WHERE member_status = 'platinum' FOR UPDATE;
-- In concurrent transaction: > UPDATE reward_members SET member_status = 'platinum' WHERE id = 0 AND member_status = 'silver';
> SELECT * FROM reward_members WHERE member_status = 'silver' FOR UPDATE;
> UPDATE reward_members SET member_status = 'gold_group' WHERE member_status = 'gold';
> UPDATE reward_members SET member_status = 'bronze_group' WHERE member_status = 'bronze';
-- In concurrent transaction: > UPDATE reward_members SET member_status = 'gold' WHERE id = 0 AND member_status = 'platinum';
> UPDATE reward_members SET member_status = 'platinum_group' WHERE member_status = 'platinum';
> UPDATE reward_members SET member_status = 'silver_group' WHERE member_status = 'silver';
> COMMIT;
> SELECT * FROM reward_members;
id | member_status
----+---------------
0 | gold
What you would need instead are "predicate locks". Unfortunately Postgres does not expose predicate locks to the user. Predicate locks lock all rows that match some predicate and prevent any row for which the predicate is false from being modified in a way that would make it true. If Postgres did expose predicate locks, you could grab a predicate lock on "membership_status = 'gold'" before running the gold UPDATE. That would prevent any row from concurrently having its membership_status changed to gold while the transaction is executing.
Stricter isolation levels are an option. I would only use them if data correctness is an absolute must. I try to avoid them because they abort any queries that would otherwise have a race condition. You then have to work around queries spontaneously failing.
These are great. I feel like no matter how many PG tips & tricks I learn, there's always something new out there.
I recently started learning PostgreSQL after ~2 yrs of relying on BaaS etc. (mostly of the NoSQL variety).
I have to say: I regret not learning it earlier. Stuff like constraints, triggers, views, etc., are game-changers, but even just being able to write queries instead of navigating a proprietary GUI was worth the few weeks it took to learn.
It was also a boon for data analysis work. Whereas before I'd take a few hours cleaning some data with JavaScript or Python, now I just write a SQL query or two. It's faster, it feels more powerful, and it's _substantially_ less fragile than one-off data cleaning scripts.
Anyway: for those that started out like I did, afraid of SQL, dive on in.
Yup: 3 parts elbow grease, 2 parts tears & sweat. I had a lot of data to work through, so I used that as an excuse. Not having to worry about performance, schemas, servers, users, etc., made learning a lot easier, I think.
I started by reading the tutorial in the PG docs[1]. It was _very_ well-written & accessible--def. recommend.
Later, I picked up the book "Mastering PostgreSQL in Application Development"[2]. It was pretty solid, and seemed like one of the only books aimed at developers (as opposed to DBAs).
The PG docs are surprisingly good. Good enough to learn from. I remember learning my first SQL back in like 1999 or 2000 by reading the PG docs in the back of my parents' minivan. Good times!
I don't mean to knock the pg docs but I honestly come away reading them with the feeling "what did I just read?". They seem dense and dry. So personally I appreciate the articles like this to help me really learn. I will skip over official documentation and go to this sort of content.
It's a personal preference I guess. I'd rather hit up stack overflow and find a common English answer to a similar problem that usually includes an example. Django suffers from the same thing... excellent docs, that just seem unapproachable day to day.
Basically, when I Google something, I'm skipping the first 2-3 official links that in fact contain the info I need, and instead clicking the one that summarizes it more plainly.
If it doesn't actually resonate with you, and you don't come away with the understanding you need, it's most certainly not a 'feature' but a 'bug'. I think some projects see dense/thick docs that keep a high barrier to entry as a feature in itself as well.
If they're not approachable, they're not 'excellent'.
I've been a Django developer for a decade and keep hearing from others how great their docs are. I barely ever use them. My biggest complaint is they give examples without any context. I can't look up a feature without already knowing 12 other things first.
I also recently began learning SQL properly, and I also regret not having done so earlier.
The PG docs/tutorial is really great to get started.
After I worked through those, I feel it was essential to work my way through pgexercises.com. It's really easy to think you get it, but actually having to puzzle through stuff is obviously quite different.
When I reached some of the more complicated subquery stuff, I decided to get 'Learning SQL' by Alan Beaulieu (recommended by pgexercises), and I just worked my way from the start all over again. Book's excellent too.
I do find that I hit a 'barrier' when all the basic parts are introduced and it becomes all about translating a goal to 'thinking in SQL', especially when there seem to be multiple valid approaches. But the combination of reading a ton, then doing the exercises, then re-reading and trying to play around with the Learning SQL db (Sequel Pro is great for Mac, as is Postico (Postgres)) has been very effective so far.
Basically I just used the pg gem for Ruby to execute a bunch of raw SQL commands and wrote tests to check to see if the output was correct. It taught me a bunch in just a few days!
ORMs are much easier when you know what is going on and can dive into a query plan. I have noticed most devs that struggle with ORMs and/or performance tend to just not understand basic RDBMS/SQL stuff: indexes, joins, how to tell the ORM to prefetch, reviewing query plans when the performance matters.
PostgreSQL, like many SQL engines, has a real cost for doing separate queries, and the query analyser does a lot of work to "understand" what you mean (and may need a lot of guidance in the form of indexes). In kdb+, a lot of these queries are simple and obvious to a q programmer, and doing the work across multiple statements creates opportunities to understand what you're looking at -- with little to no cost.
1.
fake:2017.02.01+til 28
t:select sum amount by legit:payment_date.dd from payment
t:update fake:legit from t where legit in fake.dd
select from t where amount > 2000
2.
string[2]," times "," 2 equals",string[2*2]
3.
t:desc select sum amount from payment by customer_id where payment_date.mm = 4, payment_date.dd within 10 13
/ or obviously clearer if you know the year:
/ t:select ... where payment_date within 2017.04.10 2017.04.13
d:exec distinct customer_id from t where amount > 30
select first_name, last_name, email from customer where customer_id in d
only want the top 5?
d:5#d;
4.
Tables are a data type (unlike SQL) so this is much simpler:
An excellent opportunity to see how q shares syntax between function application and indexing:
m:`gold`bronze`platinum`silver!`gold_group`bronze_group`platinum_group`silver_group
m:m!`$string[m:`gold`bronze`platinum`silver],\:"_group"; /shorter
update member_status:m member_status from `reward_members where member_status in key m
7.
awards_to_honor:select expense_amt, member_status from reward_members where member_status=`gold_group
save `:awards_to_honor.csv
> a lot of these queries are simple and obvious to a Q programmer
I'm not sure if it was you, but I most definitely saw a couple of posts like this on HN. Each time I had just one wish: for the poster to provide an explanation, preferably from the basics, of what is going on.
As a programmer, due to my hobby, I learned a lot more programming languages than average and am pretty confident in my ability to learn new ones, yet K and Q elude me completely. Every time I get interested in it, I struggle with its complete lack of entry-level documentation. I learned J, and while it's similarly impenetrable at first, there are tons of good documentation on J wiki. With Q I couldn't find anywhere near that amount of explanations.
So, I'm not against pasting Q snippets in comments, but I think they would bring much, much more value if accompanied by a detailed explanation of the syntax, the semantics, and the library functions. As it stands, even things I thought I knew, like that : means assignment, only add to the confusion, because looking at your code I see that apparently it sometimes means something else entirely...
The thing I love is that the SQL and regular programming syntax can be mixed. For instance, if I wish to apply a lambda function for each group I can do, eg:
select result: {lambda code here...} column by category from table
Since a table is implicitly a list of dictionaries under the hood, you can iterate over a table and each row comes back as a dictionary. Tables and dictionaries are first-class objects in q, so it is a lot easier to reason about certain manipulations, as they are essentially the same as if you had used regular code. One issue I've always had with SQL itself is that I learned KSQL first, and the way many things are done in regular SQL just seems strange.
How does one get this level of SQL proficiency? I am not sure how to go beyond my current level of understanding, and these queries don't make much sense to me. I don't often do more than some selects and maybe an inner join occasionally, so far.
(self plug) I made https://pgexercises.com/ a while back to help people learn SQL in a learn-by-doing exercise format. You can run the queries in the browser, so it's easy to just do an exercise when you have a few spare minutes.
The queries in the article seem trivial to me, very basic introductory level stuff.
I got to my current level of understanding by (a) writing 1000 queries as part of my college degree; we got to define our own schema, but then were given a list of query features to use, and several months to write them for our schema. This was perhaps 20% of the score for our databases class. I know of no better way than exercises of gradually increasing complexity. Even the school of experience is less complete and may have inconsistent difficulty jumps that leave you stuck.
And (b) exhaustive examination of query plans, while having mental models of what btrees and hash maps are. To understand performance and adjust your query to encourage a different plan, you need to understand what's possible and develop an intuition about what the database will choose to do.
I write fairly large SQL programs composed from successive queries in order to process large inputs (millions of rows) efficiently. It's cheaper to load the data into the database and compose operations over it than to loop in memory and iteratively perform queries.
Would you say (b) is worth properly understanding even when performance is not likely to be a concern? As in, is it like knowing a lower level abstraction that improves the higher-level understanding (in a way that, for example, knowing javascript makes using jQuery a lot easier)?
There's two classes of knowledge you acquire here: things you know you don't know, and things you don't know you don't know.
You're focused on what you know you don't know. You don't have a performance problem, so is the information worth acquiring if you don't have a problem? Maybe, maybe not.
What you might not know is ways to use the data in your database to make your application simpler or dramatically faster, not because it's a problem now, but because you don't know that there's a way it could be much faster or better. It could be reporting or some other kind of statistical output, something around import or export, or some features that could be possible that you simply didn't think were possible.
IMO relational models are usually superior to object oriented models. Much pain comes from trying to fit a relational model into an OO skin. Once tightly wrapped up in OO, a lot of the power of the relational model becomes hidden. OO turns into a straitjacket for your data.
I'd say that's more the (a) of the person I replied to. I was specifically asking about (b), "exhaustive examination of query plans, while having mental models of what btrees and hash maps are."
I do agree that understanding the relational model is hugely valuable, compared to just retrieving and converting rows to 'objects' and going from there (with, at most, some basic WHERE clause).
From my experience so far, simply moving beyond the most basic queries is already so much of a win that in most situations that are not 'web scale', it is more than enough. Especially considering the number of pretty successful companies I've seen that don't even go beyond the ORM basics.
Basically, (b) strikes me as an argument similar to, say, dropping down to assembly when in tons of cases just sticking with dog-slow Ruby is a common and legit approach to many use cases where performance isn't a serious bottleneck.
I'm not GP, but I would say it is essential to understand query plan debugging if you're doing, or plan on doing, database development at any scale--scale of data throughput, or scale of the size in code/developers of the application. It's essential in three ways: in principle, practically, and career-wise.
It's essential in principle because RDBMSs are not magic, but are asked, as an application grows, to do increasingly magical things. Understanding how, and more importantly why the DB chooses to execute your query in a given way is absolutely key to understanding how to design data flows and distribution across the non-DB parts of your application. The people who designed your RDBMS are smarter than you; the query planner is (albeit limited) insight into how they applied their intelligence rather than just how you benefit from it. Understanding that databases are collections of variously-traversed data structures underneath, chosen carefully to balance multiple use cases, can lead you to better understanding of other code you write.
It's essential practically because without consistent discipline and coordination (things that diminish, typically, as companies and codebases grow), queries and schemas tend to grow over time. Even outside of the complex reporting use cases, it's not uncommon to have really big/weird queries running on a very high utilization production database. Those queries interact with other stuff that's running in sneaky ways, causing everything from slowdowns to absolute deadlocks, and understanding exactly what is breaking them requires a thorough understanding of how the query plan works. "Add indexes and don't use the DB as a message queue" is excellent advice, and will get you very far, but past a certain size/cruft level, you need to go deeper.
It's essential career-wise because every company with more than a handful of engineers has a database guy or gal, and that person is typically considered essential and worthy of lucrative reward--even in environments where other backend engineers are considered replaceable cost centers. This is usually not the DBA, but an application developer who really deeply understands how the queries from your code hit the database, how it behaves if more than one of them run at the same time, and how to fix it if it breaks. A DBA debugging an unfamiliar/new query has to start from closer to first principles: how is it broken, how to tweak it to fix it, how to isolate it, etc. The database guy starts from the other end: is the code that generated it sane, given the data access patterns it's requesting? The querying code is a solution to whatever problem its writer was trying to solve, but is it the right solution? The query planner and database statistics/activity views are this person's bread and butter.
It is kind of what you might expect for many of us. Tough problems pushed me to learn more. I got started before the time of ORMs being popular. It forced me to learn a lot of fundamentals. Find data sets and interesting things to do with them. Teach yourself about the database internals. An RDBMS is rows and columns. Queries are planned. The planner tries to pick the most efficient plan. Example: you join on a column that isn't indexed. To find every matching row the DB has to scan the entire table. Hello O(n^2). The details a query planner gives can be overwhelming. Learn every bit of it. Find ways to query data efficiently that you would otherwise just handle in app code. Learn subqueries. Learn aggregation. Learn aggregation with subqueries. But here is the real, fundamental truth about RDBMS and SQL: it is essentially a functional data environment, like a spreadsheet. Once you adapt to the functional mindset of SQL, it can feel effortless to do things you might not have even thought possible.
> Example, you join on a column that isn't indexed. To find every matching row DB has to scan entire table. Hello O(n^2).
It's only O(n^2) if your DBMS uses nested loops to do this join. If it uses a merge join it's O(n log n) (plus you get an order by for 'free'), and if it uses a hash join it's O(n).
MySQL, last I checked, only supported various forms of nested loops, but first-class databases like PostgreSQL support the others.
Good point, and a good reason to learn how your DBMS of choice does common operations. Most query planners will explicitly tell you what yours is doing. I remember even back a long time ago MSSQL 7 had a nice GUI for its "query explain" and pgAdmin also gives a really nice graphical view of query plans, which really helps visualize when it is looping, doing hash lookups, etc. I suppose that would be part of the recommendation for learning: use a graphical query plan view if you can find a client that supports it :)
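For the text-only equivalent, EXPLAIN ANALYZE already names the join strategy. A sketch, using the place/temp_ratings tables from upthread as stand-ins:
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.id, p.rating, t.new_rating
FROM place p
JOIN temp_ratings t ON t.id = p.id;
-- The plan shows Nested Loop / Hash Join / Merge Join nodes, whether each side
-- used a Seq Scan or an Index Scan, and estimated vs. actual row counts.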
Build a 100+ table database. Try redoing the havoc some ORM did on a solo project.
Also, books are far better than any online tutorial. I'd suggest doing the database before reading the books, as that will show you all the mistakes fresh in your memory.
The best way is through experience. Building an ERP system for a business is a great way to run into problems that need these kinds of solutions. You’ll be writing triggers, indexes, stored procedures, views, etc.
Ha! I had the pre-ORM ERP experience myself, definitely a way to learn how to go beyond the basics. Probably not available to the newest programmers though.
For Postgres and other feature-rich RDBMSs, one important thing is to take some time to learn what's possible. generate_series() is at the heart of a lot of good reporting tricks, as shown in the article.
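For example, one of the classic generate_series() tricks, sketched against the payment table from the article's sample database (adjust names as needed):
-- Build a full calendar of days, then left join the data onto it so that
-- days with no payments still show up as zero instead of disappearing.
SELECT d::date AS day, coalesce(sum(p.amount), 0) AS total
FROM generate_series('2017-04-01'::date, '2017-04-30'::date, interval '1 day') AS d
LEFT JOIN payment p ON p.payment_date::date = d::date
GROUP BY d
ORDER BY d;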
minor style thought: I think it'd be easier to scroll thru an article like this (ie, a list of things) if the item headers were more distinct from the body. e.g., a heavier weight, or maybe with more vertical margin.
You should view the article on mobile then, where all the SQL is indented... and in a font 2x the size of the rest of the article, and you can only see 25% of a line.
You don't need to include aggregate functions in your SELECT clause to use them in HAVING and ORDER BY.
I suppose I may be glossing over another point, but for the subquery example I would've just done:
SELECT first_name, last_name, email
FROM customer
WHERE customer_id IN
(SELECT customer_id
FROM payment
WHERE extract(month from payment_date) = 4
AND extract(day from payment_date) BETWEEN 10 AND 13
GROUP BY customer_id
HAVING SUM(amount) > 30
ORDER BY SUM(amount) DESC
LIMIT 5);
Other people have given actual answers, so I'll go ahead and give the non-answer of... use psql instead!
Like Dimitri Fontaine says early in the Mastering PostgreSQL in Application Development book mentioned elsewhere in this thread, psql is really powerful. From the \d meta commands to inspect anything and everything in the database, to the variable substitution, nothing beats editing sql in your preferred editor and hitting \i my_query.sql in psql to run it. You can also nest additional \i my_other_query.sql commands inside the files you run with \i, which I've found really useful for repeatedly dropping / installing test tables / views, or setting up test data, before running the query I'm actually interested in. Take the time to customize it by adding a few lines to your .psqlrc file, like \x auto, and \timing and it becomes very usable very quickly.
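For example, a minimal ~/.psqlrc along those lines (a sketch; tweak to taste):
\set QUIET 1
\x auto
\timing on
\pset null '[null]'
\set QUIET 0
\x auto switches to expanded output only when rows are too wide for the terminal, \timing prints the execution time after every query, and the QUIET toggles keep the rc file itself from printing noise as it loads.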
I used psql for a number of years, then joined a new company where DataGrip was the norm. I gave DataGrip an honest try to see what I might be missing from a GUI SQL client. After a year of using it exclusively, I slowly found myself migrating back to psql and can honestly say I don't miss anything.
But it's also possible I just haven't tried the right GUI client yet. Anyway, I hope you find the PostgreSQL client that works for you!
Postico is fantastic. I’ve been part of a few discussions on github with new features they are adding in and the core team is super helpful and open to making the product better. I use it daily.
Postico is nice but gives me anxiety. I don't like the metaphor of browser-style back/forward, where you're left hoping the software remembers the state of a window when you go back to it.
My money is on DBeaver (free). The killer feature is that you can have all the queries you're working with laid out in the same place and execute whichever one you want, and you can also select rows of results and generate an `INSERT` statement from them.
We moved from pgadmin to DBeaver at work and have been pretty satisfied. It's based off Eclipse, so benefits from the windowing/panes, and being able to use any eclipse plugins. My db usage world changed the day I got vim support and auto complete in my query editor.
I definitely meant that as "being based off Eclipse it comes with these advantages, but caveat emptor otherwise". Honestly I use Eclipse for all my Java dev at work, and compared with the OSX implementation of PGAdmin III, it's well worth the shortcomings.