
What's the diff when running 100 updates in a single transaction? That's typically where I see a lot of the benefit with multiple updates.



So I was actually wondering if I was correct after seeing some comments here. This is what my quick benchmarking says:

I made a postgres table with two columns, id and num. I filled it with 1000000 rows in sequential order (just a generate_series()), so the first row was (1,1), the second was (2,2) etc.
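Roughly, the setup looked like this (a sketch; exact types and names from memory):

    CREATE TABLE test (id integer, num integer);
    INSERT INTO test (id, num) SELECT i, i FROM generate_series(1, 1000000) AS i;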

I ran two queries to update the 'num' column of the first 1,000 rows with random values. The first was the approach described above:

    UPDATE test SET num = (CASE WHEN id = 1 THEN 123 WHEN id = 2 THEN 356 ... END) WHERE id IN (...)
This took about 1.5 seconds (on my completely un-optimized postgres instance).

The second uses 1,000 individual queries in a transaction (BEGIN, COMMIT):

    BEGIN;
    UPDATE test SET num = 123 WHERE id = 1;
    ....
    COMMIT;
This is still running, so it's taken over 3 minutes so far.

I'm guessing it's because even though the statements are in a single transaction, the database still does 1,000 individual table scans (one per statement, each for a single record) and 1,000 individual updates. Doing 1,000 updates with a single CASE statement seems to take about the same time as updating a single row the traditional way.

Edit: Query returned successfully in 21 min 39 secs.


Did you have an index on the table? If you didn't, you had to scan the whole table on each "separate query", whereas in the single query you scanned it once.

But like someone said above, it depends; the best approach will differ case by case. Also, when dividing the work into 1,000 separate queries, each one has to wait for the previous one to finish, and each has to be planned and optimized separately. A more interesting comparison might be to (1) create an index on the id column, then (2) run the 1,000 updates in parallel, maybe 30 at a time?
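For (1), that's just a one-liner (index name arbitrary):

    CREATE INDEX test_id_idx ON test (id);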


There was an index on the PK column and the value column as well.

Edit: Oh no, there was not. My bad :/ I added an index and the query time dropped to 5 seconds or so. A lot, lot faster, but 4x as slow as the CASE version.

Parallel queries would probably perform better, but I'm looking at a more general case. Django supports a lot of backends and not all of them support parallel queries, and indeed there is no way to make parallel queries like this in Django.

If we are talking about Postgres specifics, there is nicer syntax for this that lets you join against a VALUES list you construct, which I intend to add support for later.
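Roughly, the form I mean looks like this, reusing the test table from the benchmark above (the VALUES list would be generated, of course):

    UPDATE test AS t
    SET num = v.num
    FROM (VALUES (1, 123), (2, 356)) AS v(id, num)
    WHERE t.id = v.id;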

But if using one CASE statement is generally faster than issuing N individual synchronous queries, then it's a big win for applications that have to do this kind of thing:

    for instance in queryset.all():
      instance.field = some_function()
      instance.save()
While this isn't a super common pattern, it does crop up from time to time, and it's annoying to work around (if you even can).


Thanks for the details.

4-5 seconds sort of makes sense, depending on hardware. 3 minutes made no sense.



