First advice is horrible. One shouldn't do in application things that are meant ...

vidarh · on April 3, 2013

It is not horrible advice once your system hits hard limits of the database system. Depending on your database system you can hit those fairly quickly.

It is often far cheaper to scale an application over many boxes by extracting data from your canonical database into a set of in-memory read-only search structures, for example, and delta-index and merge changes regularly.

It is similarly often far cheaper to sort and group large dataset outside the database because sorting and grouping are simple to parallelise over multiple machines working on in memory subsets and doing cheap merges at the end.

If your system can run at reasonable speed in your RDBMS, sure, do that rather than reinvent the wheel.

But when you find yourself maintaining complex trees of replicas, it is often worth testing if you can do better with specialised middleware that can selectively throw out guarantees your RDBMS can't because it would violate guarantees it is meant to provide and that can otherwise make use of specialized characteristics of your data.

E.g. you don't see people running large search engines out of RDBMS's. For a simple reason: while many RDBMS's provide full text search, you can do it far faster when you realize that your full text index is "always" going to be catching up, and so once you exceed the threshold where a single RDBMS doesn't serve your needs anymore (and often before that) you can save massive amount of resources by building small, frequent deltas of changes, distributing them to however many app servers you need, and gradually merging the deltas into larger chunks to keep the number manageable.

There are a lot of scenarios like that where moving the logic out of your data store makes sense.

mfenniak · on April 3, 2013

I think the advice in point #1 only makes sense with a caveat: When performance is an issue, if you can calculate or process it at the application layer _without adding load to the database layer_, then take it out of the database layer.

For example, order-by w/ limit/offset? If you do that at the application layer, you've increased the I/O usage on the database server, and clearly this advice doesn't make sense.

On the other hand, group-by (assuming no HAVING)? If you do that at the application layer, you've reduced the load on the database server, increased the load on the network, and you've probably made a justifiable performance vs. scalability trade-off. If you measure it and back it up with data.

I think the advice here in the article is too broad, but I can see a kernel of wisdom in it that is non-intuitive.

markokocic · on April 3, 2013

Well, without a caveat the advice is plain wrong in 99% cases, since that caveat changes the advice completely.

primitur · on April 3, 2013

I agree with you - I think this bit of advice addresses the effect, not the cause, of an ignored problem: design, design, design.

Design your database for your application - if you have some major hassle with your database after App 1.0 is developed, its because you missed this very important step.

Don't treat the .db like its an architectural sandbox, adding/tweaking/removing things 'to make it better' after the fact. Careful .db architecture means, once your app reaches a functional state, your .db should be already pretty much immutable.

Maybe the solution for those who can't escape this flaw is to simply build the App first, then the .db, so that there is little chance for the .db to be screwed up in the first place, who knows ..

np422 · on April 3, 2013

If your performance bottleneck is at the database layer it's a difficult problem to solve, scaling relational databases requires all kinds of witchcraft and magic such as read-only replicas, sharding, result-caching ...

But it is very simple to add another application server and update your load-balancers configuration.

Some large (internet) applications have moved operations such as joins to the application and only use the database for simple storage.

I think it's a reasonable advice.

hackerboos · on April 3, 2013

I agree. Why would you sort in memory? RDBMS have been heavily optimised to do these things.

vidarh · on April 3, 2013

You would sort in memory when your dataset and/or number of clients is large enough that you would need to distribute your dataset over many servers and the overhead of replication with your preferred RDBMS is too high. E.g. large sorts can be reasonably well parallelized (split the dataset into chunks, sort each chunk, and zipper merge the sorted subsets on a machine collecting the result), and similarly caching the dataset on however many app servers you need in order to do sorts often makes it worthwhile even for much smaller datasets.

If you can run everything on one machine, or your RDBMS handles sharing and replication efficiently and easily, then by all means try that first. But you can often beat the pants of RDBMS' for specialized scenarios.

ccheever · on April 3, 2013

i think this is actually reasonable advice because in most systems, the application layer is pretty stateless and can easy be scaled out to more machines whereas the db is usually a central bottleneck.

stiff · on April 3, 2013

I think too that this was his point, it only might make sense in very specific scenarios though, for day-to-day web-dev this is terrible advice as others have said.

rombdn · on April 3, 2013

I agree - on the majority of entreprise system (ERP) I have worked this was the case. I can confirm that GROUP BY are huge performance killers and sometimes the simplest solution is to avoid them by slightly changing the business logic if possible.