Follow this advice with caution. Dropping foreign keys is effectively giving up part of the C in ACID. It should be done with very, very open eyes to the downsides. I'm not sure the author is selling the "when" side of this very well. Migrations are "hard" because the database is forcing you to handle correctness criteria that are easy to ignore. "Lock contention" is the database stopping your sloppy, ill-thought-out application code from doing the Wrong Thing. Yes, of course, you take a minor performance hit in exchange for consistency, but how many of us actually work on applications where that trade-off is the wrong one? Developers are frustratingly superstitious about scale and their need for it.
The data will very likely outlive the application that created it. It's almost guaranteed to outlive your tenure on the team. Viciously guarding its correctness solves all the problems that not guarding it causes.
Whenever anyone talks about “the application,” I immediately ask “what about all the other applications?” I promise you that you will find customer service and accounting and biz dev have also built stuff that uses the database to get their jobs done (probably not with the same ORM or even the same language) unless you have taken draconian measures to prevent them.
Interesting that this whole thread makes no distinction between read and write access, as those are dramatically different use cases. Read access is by far the more necessary, and is usually solved relatively easily by saving snapshots to a data warehouse. This is no panacea, as data can still easily be misinterpreted, or replicated and used out of context in violation of the production data lifecycle expected by the team that owns the source of truth, but in many cases it's trivial and minimizes coordination overhead.
Write access on the other hand is a different story. IMO restricting write access of a database to a single application is table stakes for scaling complex applications. Sure there are other approaches such as writing all your logic in the database via constraints and stored procedures and making the DBA a god-like figure, but these approaches have fallen out of favor as they've proven less scalable compared to wrapping a DB with a service that has exclusive write access. The latter arrangement allows many constraints to be enforced in a horizontally scalable and more legible layer, while still leveraging the DB to prevent races with a better menu of tradeoffs.
Of course this requires thoughtful service and interface design by competent technical ___domain experts, which is easier said than done, but the alternative allows the overall system cohesion to degrade to where no one understands the system well enough to make any changes without risking major incidents. At that point, the agency of system builders and maintainers is replaced by care and feeding of the unknowable system to not upset the status quo, accompanied with increasingly byzantine hacks and workarounds to enable any business changes.
The no-foreign-keys advice smacks of 2005/2010, when the lack of support in some frameworks (Rails included) prompted developers to dismiss them as unnecessary. This piece of advice at least quantifies the terms of the tradeoff.
Regarding the correct "saving snapshots to a data warehouse", if there are one million web apps in the world, how many of them have the scale to noticeably benefit from either doing without foreign keys or from a data warehouse? I've been using many of them like everybody else but they are totally irrelevant to the long tail of apps and developers.
By the way, a good DBA can work miracles for the performance of most databases in that long tail. A few days of work are worth the cost, especially if the team pays attention and learns the lesson. The best remark I got about a DB of mine was "not bad for being only a developer". The DBA's version was better.
If I recall correctly, "no foreign keys" hit its stride back in early PHP days when MySQL didn't support them properly. Rather than cop to the fact that MySQL just implemented them badly, MySQL AB went on a dev PR run telling folks that foreign keys weren't actually useful and just slowed a system down.
Once MySQL implemented them less horribly, the PR push finally started to die down. I will never forgive them for that, and decades later where Oracle controls MySQL (and arguably is doing a better job), I still hold a grudge against MySQL that I have to actively suppress when the contract demands it.
Nobody with a Computer Science or especially a Software Engineering degree should have fallen for it, but I know a number of good developers with a more varied background. Some of them inevitably became team leaders etc. By varied backgrounds I mean Philosophy, Agricultural Sciences, Graphic Design. Some of them know very well how a database works; some admit to never having learned SQL, let alone the theory and rationale behind some of these technologies. Give them some blog posts that get the job done without foreign keys and they won't even know what a foreign key is until it's 2015 and people start moving to PostgreSQL.
My degree is in literature. (I'm that dev who also writes docs, makes presentations, and serves as tech lead.) I still know that foreign key constraints should be the default. I even often have to remind those with CS and CE degrees that DBs aren't just dumb bit buckets and that database schemas are critical business logic.
Bad programmers worry about the code. Good programmers worry about data structures and their relationships. – Linus Torvalds
> Regarding the correct "saving snapshots to a data warehouse", if there are one million web apps in the world, how many of them have the scale to noticeably benefit from either doing without foreign keys or from a data warehouse?
Yes I agree. This advice was assuming you have scale that necessitates multiple services, at which point you'll want to be able to query against data from multiple sources. I'm with you that the vast majority of teams (finger in the air: less than 20 full-time engineers working on a typical web app) are probably better off with a monolith and single database. At this point a read replica is a low effort way to safely provide access to a wide range of stakeholders.
Indeed - I immediately thought of a hypothetical system (similar to many systems I’ve worked on) where SELECTS are 100x more common than INSERTS, which are 100x more common than DELETES.
The 5x speed up on delete performance is a useless optimization.
+1. When I started in the industry, it was common for more experienced developers to drill the “data outlives the application that generated it” principle into you. Somewhere in the transition to NoSQL and back we lost this.
The _data_ yes, the _database_ no. The data will get migrated to different solutions at different times. I've lost track of how often I've been porting MySQL to SQL Server, SQL Server to Postgres, Postgres to Mongo, Mongo to cloud etc.
This really depends on the size of your data set. If you have a large database (TBs+) it's likely to be very long-lived due to the effort/hardware resources a migration requires - especially if you want to improve the schema when migrating.
If you have an established business or startup, data will outlive the application. However, you need a product that lives long enough for either data or application to matter.
In the startup world, that means making decisions that help you ship now, at the expense of debt/costs down the road.
IME, Foreign Keys do not slow down development velocity, after you factor out the upfront time investment it takes to understand them, normal forms, and other RDBMS concepts. Other posters in this thread have argued that they take CPU cycles during program execution - which is factually correct.
It wasn't the transition to NoSQL. It was the transition to web APIs. Now the application that can access the database has a universal shim in place that deals with the messy business of taking requests and applying them to the data store.
Now there's typically three applications that need DB access:
1. The application it was built for.
2. The read-only reporting and visualization tools.
My assumption (with modern applications!) is that nothing but the role directly owning the data will access the data. The development and DBA teams will likely have a role they can assume after performing a carefully-audited breakglass procedure to use in an emergency (rare) or to fulfill audit tasks. At least in my org this is a well-known problem with legacy applications sharing databases. Limit access to the database to a single role, used by a single application, and you avoid so many issues.
When your data has outlived several generations of front-end technologies (mine started with Powerbuilder), you will find that ownership of the data does not control the evolution of how it will be used with what replaces past clients.
One particularly interesting Oracle problem is:
ORA-00060: deadlock detected while waiting for resource
Tom Kyte's book, Expert One-on-One Oracle, describes the primary culprit:
"Oracle considers deadlocks to be so rare, so unusual, that it creates a trace file on the server each and every time one does occur... The number one cause of deadlocks in the Oracle database, in my experience, is un-indexed foreign
keys."
For another perspective, add to this a default setting in every SQLite database:
$ sqlite3 verynew.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
COMMIT;
Foreign keys can cause interesting problems, and SQLite specifically prefers to avoid them.
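If you do want enforcement in SQLite, it has to be turned on per connection (the OFF default shown in the dump is kept for backwards compatibility); a minimal sketch:
sqlite> PRAGMA foreign_keys = ON;   -- must be re-issued on every new connection
sqlite> PRAGMA foreign_keys;
1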
Yes, if you have downstream services that need to make use of your data, you would provide an API. That way you can manage the number of connections, how the queries are formed, any sort of encoding/decoding, authorization, caching, etc. Definitely not limited to read-only.
I would argue that if you have completely different contexts / business lines / concerns (customer service, accounting, biz dev) all directly accessing the same database, you have far, far larger architectural concerns than FKs will ever hope to address.
I strongly believe the opposite. Your business IS your database. How do you handle access to the data without access to the data?
It is inevitable that 2 "organizations" will at some point need access to the same data. You could just have each of them decide how they interact with it. It doesn't matter how they do it because the database itself makes sure invariants are kept true (such as FKs).
Have a single application clearly owning the data. All access to the data is done through that application. Why would you _want_ multiple applications deciding how to handle data in a common store? What about access authorization to sensitive data?
Because that single application already exists, in my case it's postgres.
Pgweb is one of my interfaces, admin dashboard for free, and I am sure the changes I make there are as valid as changes through any other interface.
I can implement a website as a SSR app that talks directly to the db. Maybe tomorrow I decide I need to work on a web scraper that will use python, instead of adding more API endpoints to allow the scraper to talk to the database, I just talk to the database...
The database is that "single application", no need to write custom endpoints for every operation when SQL is good enough.
It's all a spectrum of course, today you can even go as far as to use something like pg_graphql and you don't even need to write a REST api yourself.
Edit: I forgot to answer the last section, but postgres can totally handle authorization with RLS for instance, allowing users to only see their own data, or maybe data marked as public, etc.
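A minimal sketch of that RLS setup, assuming a hypothetical documents table with owner and is_public columns:
-- Each row is owned by a database role; rows marked public are visible to everyone.
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY documents_read ON documents
    FOR SELECT
    USING (owner = current_user OR is_public);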
> Because that single application already exists, in my case it's postgres.
This is exactly what postgres was designed for! Tens of thousands of hours of work over decades to solve the problem of relational database management. That's why we call it an RDBMS!
Which isn't to say you shouldn't make your own API ever. There are a lot of situations where you don't want things to connect directly to postgres.
But you shouldn't be afraid of having multiple systems connect to postgres. It has incredibly mature and robust features to accommodate that use case. It's the expected use case.
If you operate at the scale where you consider the performance implications of foreign keys, you probably are not ok with anyone accessing the database or running just any query against it.
It is not realistic that you can trust everyone who needs to access the data with access to the database as they might easily cause problems with poorly written queries.
Additionally you may want invariants maintained, that the database cannot maintain but an application in front of it can.
Also for historical data or analytical queries postgres is not ideal either, so you probably want to move the data into some OLAP database or datalake.
> I can implement a website as a SSR app that talks directly to the db. Maybe tomorrow I decide I need to work on a web scraper that will use python, instead of adding more API endpoints to allow the scraper to talk to the database, I just talk to the database...
If the website and the scraper are just parts of the same application, it makes sense to do this, but if they are genuinely different applications, I would use different databases here.
> If you operate at the scale where you consider the performance implications of foreign keys, you probably are not ok with anyone accessing the database or running just any query against it.
I used to work on an application where ALL database accesses were via stored procedures. Genuinely the best dev experience, to me, so far.
> I used to work on an application where ALL database accesses were via stored procedures. Genuinely the best dev experience, to me, so far.
I found such an environment to be simply terrible.
In general, I think stored procs certainly have their place, but if ALL access is through stored procs, you better have a schema that’s basically set in stone otherwise dev will turn into a nightmare.
Large ERP systems do that sort of thing as a matter of course and have for decades now. It does require careful planning and design. I mean AR / AP / scheduling / manufacturing / inventory and so on.
The main downside of splitting everything into isolated databases is that it makes it approximately impossible to generate reports that require joining across databases. Not without writing new and relatively complex application code to do what used to require a simple SQL query to accomplish anyway.
Of course if you have the sort of business with scalability problems that require abandoning or restructuring your database on a regular basis, then placing that kind of data in a shared database is probably not such a great idea.
It should also be said that common web APIs as a programming technique are much harder to use and implement reliably than just about any system of queries or stored procedures against a conventional database, due to the data marshalling and extra error-handling code required. The need to page is perverse, for example.
That does not mean that sort of tight coupling is appropriate in many cases, but it is (typically) much easier to implement. Web APIs could use standard support for two phase commit and internally paged queries that preserve some semblance of consistency. The problem is that stateless architecture makes that sort of thing virtually impossible. Who knows which rows will disappear when you query for page two because the positions of all of your records have just shifted? Or which parts of a distributed transaction will still be there if anything goes wrong?
Different business units often want to aggregate, filter, join and transform data in different ways - sometimes in ways that are hard to anticipate.
In general, the closer to the persistence layer you can perform those transformations, the better they will scale. If you pull the transform into the app layer, you need to move and serialize more data. If you pull the transformation into a constellation of apps, you need to move and serialize a constellation of data.
> probably not with the same ORM or even the same language
I mean, they shouldn't? Like you've just identified a bug: another application can access your database. If another department needs your data, they should request an endpoint that you control. You should be using an "application database"[1] not an "integration database"[2].
They are you. You is they. It’s one company with one goal — keep companying.
The idea that every “department” should access every other department’s data through some bespoke interface that the latter department maintains might work at some corporate behemoth, but at almost all other scales is absurd.
This would be a total waste of effort when you need to be building a product and iterating. I hate articles like this because they do a poor job contextualizing the tradeoffs and when it might be appropriate to do the weird exceptional thing.
IMHO if you have a performance-critical case where foreign keys are in the way, load THAT data into an in-memory DB on a recurring basis and serve time-sensitive requests from there.
Foreign keys are slow on delete, not on read. If you have a popular table, say, users, and all other tables refer to it, then deleting a user locks the database for time proportional to the number of foreign keys - good old linear scaling.
It doesn't lock the database, it only locks the rows of the tables that have foreign keys to the popular tables and are referencing the user you are deleting.
Rightly so, because when deleting the user the database needs to do work to keep the referential integrity. Either it nulls the user_id, deletes the rows, or it throws an error.
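Those three outcomes are exactly the referential actions chosen when the constraint is declared; a sketch with hypothetical orders/users tables:
-- Choose what happens to referencing rows when a user is deleted.
ALTER TABLE orders
    ADD CONSTRAINT orders_user_fk
    FOREIGN KEY (user_id) REFERENCES users (id)
    ON DELETE SET NULL;   -- or ON DELETE CASCADE; the default (NO ACTION / RESTRICT) raises an error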
Hi! The author here. I appreciate your reply here and in a few other threads for this post too. I think they are very well thought out. I also like how it spawned more branches of good discussion.
I totally agree that this decision isn't to be taken lightly. That said, I figured I'd also take this opportunity to clarify a few things:
- My post aims to prompt developers to critically assess whether the constraints they employ are truly essential or just conventionally used without question. They should weigh these constraints against the considerations mentioned in the post.
- While I recognize that lock contention is the database's way of managing application code, I propose that if foreign keys aren't necessary from the outset, then lock contention may only serve to impede performance. If you can rewrite your application code, say with `FOR UPDATE SKIP LOCKED` (sketched after this list), great!
- I am also not recommending dropping all FK constraints, but challenging each one to see whether it is necessary; and if it is, that's also good.
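A minimal sketch of the `FOR UPDATE SKIP LOCKED` pattern mentioned in the second point, assuming a hypothetical jobs table drained by competing workers:
-- Each worker claims one pending job without waiting on rows other workers already hold locked.
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;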
I am fine with the concept of foreign keys, but if you have a perfectly designed database, it should be really hard to insert a row in any table by definition.
What would solve this is defaulting to raising foreign key constraint errors at the end of the transaction rather than when rows are inserted. I know postgres has an option for it; it should be the default.
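That option is deferrable constraints; a minimal sketch with hypothetical invoices/customers tables:
-- The FK is only checked at COMMIT, so rows can be inserted in any order within the transaction.
ALTER TABLE invoices
    ADD CONSTRAINT invoices_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    DEFERRABLE INITIALLY DEFERRED;

BEGIN;
INSERT INTO invoices (id, customer_id) VALUES (1, 42);  -- customer 42 does not exist yet
INSERT INTO customers (id) VALUES (42);                 -- now it does
COMMIT;                                                 -- the check runs (and passes) here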
If you do not particularly care about performance or have a great deal of headroom then database enforcement of referential integrity is great. Alternatively you could just write test cases to check for it and not pay the severe performance penalty.
The other major downside of database enforcement of referential integrity is the common need to drop and re-create foreign keys during database schema upgrades and data conversions.
You’re still going to pay the cost of maintaining referential integrity — you’re just doing it on the app side. You can do it faster by being not-correct — eg you don’t need a lock if you ignore race conditions — but it’s not like the database is arbitrarily slow at doing one of its basic fundamental jobs.
Of course, you can just skip the validation altogether and cross your fingers and hope you’re correct, but it’s the same reasoning as removing array bounds checking from your app code; you’ve eked out some more performance and it’s great until it’s catastrophically not so great.
Your reasoning should really be inverted. Be correct first, and maintain excessive validation as you can, and rip it out where performance matters. With OLTP workloads, your data’s correctness is generally much more valuable than the additional hardware you might have to throw at it.
I’m also not sure why dropping/creating foreign keys is a big deal for migrations, other than time spent
It is quite common for modern databases to have multiversion concurrency so that writers do not block readers. If you do not, your transactions should either be awfully short, you should be prepared to wait, or you should implement dirty reads (which are quite common in any case).
You either end up reinventing foreign keys, your support volume will scale faster than your data, or user experience will suffer.
There may be situations where foreign keys become too much overhead, but it's worth fighting to keep them as long as possible. Data integrity only becomes more important at scale. Every orphaned record is a support ticket, lost sale, etc.
Orphaned detail records are usually inconsequential, like uncollected garbage. References to anything with an optional relationship should use outer joins as a matter of course. If you delete something that really needs to be there you have a problem, which is one of the reasons not to delete rows like that, ever, but rather to mark them as inactive or deleted instead.
Typically you look for orphan rows - the sort of thing ON DELETE CASCADE was invented to prevent. Another thing to check for are records that need to exist but should have references cleared when something else is deleted, e.g. ON DELETE SET NULL. And the third thing is ON DELETE RESTRICT.
You can check for the first two of those things after the fact, and they are relatively benign. In many cases it will make no difference to application queries, especially with the judicious use of outer joins, which should be used for all optional relationships anyway.
If you need ON DELETE RESTRICT application code should probably check anyway, because otherwise you have unexpected delete failures with no application level visibility as to what went wrong. That can be tested for, and pretty much has to be before code that deletes rows subject to delete restrictions is released into production.
As far as race conditions go, they should be eliminated through the use of database transactions. Another alternative is never to delete rows that are referred to elsewhere and just set a deleted flag or something. That is mildly annoying to check for however. Clearing an active flag is simpler because you usually want rows like that to stay around anyway, just not be used in new transactions.
This effectively means you are building an embedded database in your application and using the networked database for storage. There are a few reasons to do this and a million reasons not to.
I once worked somewhere that used rails in lieu of foreign keys. The result was a brittle nightly delete_orphaned_records script as well as obscure user-visible bugs. My team started adding foreign keys to our records and unsurprisingly caught bugs in our application code that otherwise would have been missed. Personally, I think the default should always be to use foreign keys, and only if you have a genuine scalability problem might you consider dropping them or putting them in a different database with lighter data integrity guarantees.
This anecdote is not very surprising to me. Foreign keys are on a scale of DB constraints. I opt to add a maximum of constraints where I can. Adding a date that should be in the future - I'm adding a constraint! Did you mess up your time zone conversions & are trying to add a date in the past - the constraint catches it. At some point the constraints are strong enough where with high confidence the following statement can be said: "if the data exists in the database at all, it is correct & complete." This is the "database as fortress" principle, that only correct data should ever be allowed into the database to begin with.
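For example, the future-date rule described above can go straight into the schema (a sketch, assuming a hypothetical bookings table):
-- Rejected at write time if a timezone bug produces a date in the past.
ALTER TABLE bookings
    ADD CONSTRAINT bookings_starts_in_future
    CHECK (starts_at > now());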
I agree and makes sense. Starting out with them and then challenging the setup is a good practice. I like the idea of moving to a lighter data integrity setup too (ofc when possible)
I worked on a couple projects in the early aughts like this (php / mysql). Even did it myself for a project, and fortunately learned my lesson early enough to fix that very project and rely on the database to handle data integrity.
Always important to remember that while learning from others is important, they're just as human as you are.
Foreign keys are often the best (and only) documentation of the data model, and the only one guaranteed not to decay, so this is an important consideration when rethinking them.
That said, I quite like this post and the way the author is thinking critically about software decisions most of us take for granted. Even if you decide that foreign keys are always best, this type of thinking is a great way to improve your understanding.
I agree - and I also appreciate the author's objectivity and focus on making the correct technical decision for the problem. Dogma is the bane of good architecture.
I completely agree with the idea that foreign keys aren't always recommended. For example, one of the reasons that MongoDB was so great at edge writes was because of the lack of referential integrity and the "document" as a row concept - denormalize the data into a document and store everything you need there.
That works great for high volume edge writes that you want to operationalize later, asynchronously. Sure, the data can be messy and may have errors - but many use cases tolerate this well.
The nice thing about modern postgres is that you get the best of both worlds, RDBMS stuff when you need it, and NoSQL stuff if you don't.
The author's point, which I think is an excellent one, is that just because it's in a database, it doesn't necessarily make sense to go full 4th normal form on everything.
I was recently working on a system that split data across multiple database instances (unnecessarily) and that means referential integrity is lacking and sometimes a huge problem.
To expound, I was tracking down something called a "slotid", and the absence of foreign keys does not mean anything. That data could very well live in some other table on the other DB instance. It turned out though that "slot-id" instead referred to a start time of day as counted by 'minutes-from-midnight'. Thus "slot-id=375" was just "6:15am"... Yup, 3 hours of my life I will not get back to realize that data did not refer to anything at all. If everything that could have had foreign keys did, it would have been a super quick investigation to realize the data was not a reference to anything at all.
Unless your project is small and for learning, or won't be around for very long, Yes. You absolutely do or will end up in a technical debt world of hurt. I have worked on systems without them that are 15+ years old and the vast, vast majority of fixes and refactoring were self inflicted wounds like this. We added FK's and indexes where necessary and wound up faster and safer.
That sounds awful. My first instinct would be to see where an FK constraint would fail, and then ask the business what it should be. There is no guarantee that every business case from way back needed an FK, or that anyone in the business knows which FK value should be there. It sounds absolutely miserable.
In my experience, adding proper constraints will not only ensure code can never write invalid data to disk, it will also help you realize when business rules are lacking or fail to consider edge cases.
The Salesforce CRM application somewhat "famously" does not use native DB foreign keys to model most relationships, and has a custom relational integrity and indexing layer.
But interestingly, this also lets our internal data modelers choose to make a relationship e.g. leave dangling records or have other update/delete patterns that are appropriate for the data and data volume. Obviously there are cleanups and tradeoffs required here. It's certainly not a model I'd suggest people start with, but there really are times where database native FKs aren't the answer.
Foreign keys allow for pushing a really vital piece of business logic down to the database itself, referential integrity.
This can be done in application logic, but that risks bugs allowing broken references into your data. It's more foolproof when these checks are enforced directly at the db, making sure data is valid before it's stored or updated.
Not to mention that most SQL databases have complex consistency rules. Just checking that the target row exists (or not) before adding (or deleting) a record may not result in a correct database state in the face of concurrent transactions.
You better very carefully read your database's concurrency guarantees and ensure that all transactions are running with the right consistency level otherwise you will have a bad day.
On top of that the JOIN method of following foreign keys often leads to missing rows if expected keys don't exist which can be very hard to debug.
There was an interesting topic on r/experienceddevs [1], where the dev team was arguing with some DBAs on adding another column. The DBA insisted on using FKs over enum / application level logic, etc. The comments there present some excellent arguments on top of everything you said.
True, a foreign key by itself is just a business requirement. A foreign key constraint is what the database does to enforce that. There's rarely any confusion about the two concepts, though.
I do consulting for a lot of different companies. The older ones who did not religiously use PK/FK constraints have databases that are nightmares to maintain. The data will start to rot surprisingly quickly, devs will react by writing weird code to compensate, and your life will not be fun.
I worked in different banks for the better part of a decade, lots of internal applications, each with their own DB.
They all had PK/FK/unique/check constraints, and they were still a clusterfuck to understand and maintain. Lots of outright stupid and dangerous code to work around the issues.
Still, if there were no constraints, it would have been even worse, as some 'intelligent' people temporarily disabled constraints to update data, resulting in data integrity hell.
> Personally, it took me quite a few years to make up my mind about whether foreign keys are good or evil, and for the past 3 years I'm in the unchanging strong opinion that foreign keys should not be used. Main reasons are:
> * FKs are in your way to shard your database. Your app is accustomed to rely on FK to maintain integrity, instead of doing it on its own. It may even rely on FK to cascade deletes (shudder). When eventually you want to shard or extract data out, you need to change & test the app to an unknown extent.
> * FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.
> * FKs don't work well with online schema migrations.
> FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.
This is not a valid argument at all and I'm concerned anyone would think it is.
If you have a foreign key, it means you have a dependency that needs to be updated or deleted. If that's the case, you will have an overhead anyway, the only question being whether it's at the DB level or at the application level.
I don't think there are many cases where there's any advantage to self-manage them at the application level.
> FKs don't work well with online schema migrations
This seems to be related only to the specific project that the issue is about if you read about the detailed explanation below.
> If that's the case, you will have an overhead anyway, the only question being whether it's at the DB level or at the application level.
Inserts and updates do not require referential integrity checking if you know that the reference in question is valid in advance. Common cases are references to rows you create in the same transaction or rows you know will not be deleted.
If you actually want to delete something that may be referred to elsewhere then checking is appropriate of course, and in many applications such checking is necessary in advance so you have some idea whether something can be deleted (and if not why not). That type of check may not be race free of course, hence "some idea".
> the only question being whether it's at the DB level or at the application level.
It is not a binary situation like that. With the rise of 'n-tier' systems that are ever so popular today, there are often multiple DB levels. The question is not so much if it should go into the end user application – pretty much everyone will say definitely not there – but at which DB level it should it go in. That is less clear, and where you will get mixed responses.
Note that this was written in 2016 in the context of a mysql-centric project. You will not find an "unchanging strong opinion that foreign keys should not be used" outside that context.
I haven't kept up with mysql enough to know if there are still good reasons to avoid foreign keys. I just stick with postgresql.
You don’t need foreign keys, no. But you do need referential integrity and foreign keys implemented and enforced by the database is usually the easiest.
If you’re disciplined it’s not difficult to implement yourself in the application, but the challenge is if folks can access the database directly. If so, good luck, since it’s inevitable that they end up making modifications and break integrity.
If one insists on not having foreign keys (and there are decent reasons to not have them), I would really really suggest not allowing direct access to the database and enforce whatever application be the main or only client to the database.
Foreign keys are wholly optional if you’re the kind of person who thinks it’d be cool to add “manually repair millions of rows” and/or “lose half a day of data; cry” to the todo list for a 3AM emergency deployment rollback.
The main advantage is it's WAY faster if you are writing a lot.
The big disadvantage of foreign keys is they do verify integrity. That means making lookups on every write/update which can be very costly especially as the model becomes more complex.
If you have a read-heavy application with low levels of writes then by all means put in foreign keys to your heart's content. But if you are at a point where you have millions or billions of records in multiple tables, they simply aren't feasible.
But you pay the cost in checking it in the application, as GP said. If so, it simply is moving the cost from db to application layer. Is there a reason the checks can be implemented more efficiently in the application than the DB can?
Referential integrity problems usually happen due to missing deletes, improper deletes, or references that should be cleared.
The overhead of checking for the existence of referred-to records in ordinary inserts and updates in application code is unnecessary in most cases, and that is where the problem is. Either you have to check to have any idea what is going on, because your key values are being supplied from an outside source, or you should be able to write your application so that it does not insert random references into your database.
If you actually need to delete a row that might be referred to, the best thing to do is not to do that, because you will need application level checks to make the reason why you cannot delete something visible in any case. 'Delete failed because the record is referred to somewhere' is usually an inadequate explanation. The application should probably check so that delete isn't even presented as an option in cases like that.
> If you actually need to delete a row that might be referred to, the best thing to do is not to do that, because you will need application level checks to make the reason why you cannot delete something visible in any case. 'Delete failed because the record is referred to somewhere' is usually an inadequate explanation. The application should probably check so that delete isn't even presented as an option in cases like that.
I feel like this belongs to the same strategy as duplicating form-validation on frontend/backend. The frontend validations can't be trusted (they can be skipped over with e.g. curl POST), so backend validation must be done. But you choose duplicate it to the frontend for user-convenience / better reporting / faster feedback loop. The backend remains the source of truth on validations.
The same between database and application; the database is much more likely to be correct when enforcing basic data constraints and referential integrity. The application can do it, it's just a lot more awkward because it's also juggling other things and has a higher-level view of the data (and the only real way to check you didn't screw up is to make your testcase do exactly the same thing... but be correct about it -- no one else is going to tell you your dataset got fucked. Also true in an RDBMS, but there it's trivial to verify by eye, and there's only one place to check per relationship). Thus in my world-view, the database must validate, and the application can choose to duplicate validation for user-convenience / better reporting. The database remains the source of truth on validations. As an optimization, you remove the database validations, but at your own risk.
And then in a multi-app, single db world, then you really can't trust the application (validations can be skipped), so even that optimization is likely illegal. Or you do many-apps *-> single-api -> db, and maintain the optimization at the cost of pretty much completely dropping the flexibility of having an RDBMS in the first place
At my first job they used mongodb for no reason other than it was the fad at the moment, with the big data and so on.
There were often crashes in the application because our records followed several different schemas, because of bugs in the application that were later fixed, behaviour that got changed, the ORM being replaced with hand-written code that was much faster…
Basically, what happens with overconfident developers and a CTO who think they are very good developers and aren't.
Your argument, if I understand it correctly, is that the constraints can be implemented in the application layer more efficiently - either in dev time or CPU time - than it can be in the DB layer. Why is that so, and under what circumstances?
This is your typical false economy trading a very small performance gain for strongly degraded data integrity.
If the indexes are too much to ask you're basically saying that the referred-to foreign records are never looked up (they have no key index!) and/or that the foreign records are never gathered up for the referring table. If that's the case the problem isn't having superfluous indexes, it's that you have superfluous data in your database.
> a very small performance gain for strongly degraded data integrity.
It's not very small. Doing a lookup on every write can be hugely detrimental to performance especially with large numbers of writes.
> If the indexes are too much to ask you're basically saying that the referred to foreign records are never looked up (they have no key index!)
That does not follow. The lookup for foreign records is not free so avoiding doing it when you don't need to will gain faster performance vs doing it all the time. Indexes make lookups faster, they don't make them free.
Further, you have to consider the impact of locks on such a system. Writes to a table that references another will lock the referenced rows in the second table while the write is in flight. So if I wanted to update that referenced record in any way, that task now gets blocked until the data integrity check finishes.
In MSSQL, that lock is held until the end of the transaction.
Batch processing is also WAY faster (like 1-2 orders of magnitude), but for whatever reason people don't talk about it much (cynically, I guess because it's too straightforward and old-hat). As far as I can tell, Rails doesn't have multithreading/async, so the OP is presumably not batching requests. That seems like a much better place to start than giving up referential integrity.
We are batch processing. I don't know why you assume we aren't. FK constraints are simply too much burden for us. I'm not the OP, however, my company has experienced the performance headaches of FKs.
I didn't assume anything about you; I said OP uses a framework that (as far as I know) makes it difficult to batch requests, so they're probably not batching requests. And they mention doing a `before_create` hook, which again suggests they're operating on individual models. If you're already batching and still running into a wall, then yeah you might need to remove constraint validation or shard. But batching alone can probably take most people far past the scale they'll ever see.
> What would be the advantage of implementing it in the application?
It is always implemented in the application. The question is about whether you do it in your own application or use someone else's application.
The disadvantage of doing it in your application is that you have to do the work that someone else has probably already done, and are likely not as great of a programmer as that other person, thus more likely to screw it up.
It's basically like saying you're ok with dangling pointers or orphaned records, or that you're awesome enough not to need these "slow", annoying safety features.
That's all fine, but we know how this ends up in reality.
Checking that child records that can exist only if there is a parent record are deleted when the parent is deleted and that the foreign key you're using in a record actually exists in the pointed-to table, for the price of a single index you very likely need anyway, is hardly a problem.
You don't need to cascade deletes or anything, that's a red herring, as are migration issues, where you can turn off referential integrity while the process is performed.
Using an intermediate client would go against many of the claimed performance benefits of not using the FK. One would also likely lose the ACI from ACID. Replicating those features in the intermediate client sure is risky; it would require significant expertise. I'd imagine the implementation would leverage transactions, which would be 'round-trip' to the database server and thus the lock contention would still exist and also be substantially worse (due to the round-trip latency).
That is great, and I noticed it’s something I have always assumed should be the case - no direct access to the database outside your application / ORM. Because then you open a can of worms.
Doesn't help much though. It only takes not wrapping related inserts in a transaction in one place in your code. I know zero frameworks/ORMs that can help you with that.
I usually wrap everything (including reads) in a transaction at middleware level. Not perfect, has its downsides but for many projects it's perfectly ok.
Of course it's not a real substitute for foreign keys, but definitely better than nothing.
Ahh the old days of LAMP when the M was mysql and foreign keys were just a dream...
When you reduce complexity and take off the safeguards things get faster! Cock that foot gun and hope that it doesn't go off!
Can you do what the author suggests? You sure can, and we did it for a long time with MySQL. Should you? It depends on your team, how in tune they are with working with databases, SQL, etc...
There was a funny NoSql video that said something to the effect of, "if you are willing to sacrifice everything for speed, just pipe all your data to /dev/null."
If you are willing to forego 3rd normal form and just duplicate data like mad, and have a million columns like a giant Excel sheet, then you really don’t need foreign keys.
MyISAM I think was the table type that did NOT support FK constraints and was blazing fast. InnoDB had FK constraints.
You would fully normalize, use proper joins etc. But without FK constraints you could end up with orphan data... Not the worst thing in the world, depending on how the joins in your system were structured.
Care and diligence were the order of the day. You had to know your schema and make sure you were doing the RIGHT thing at all points in the stack.
Large scale MySQL databases I've worked on typically do not use foreign keys. The payoff is higher write performance. They're implicit based on table/column naming and relationships defined in code, e.g. Rails associations or process/operation classes.
With a certain level of team maturity and thoughtful reviews, this has rarely been an issue. Sometimes there are orphaned rows (from incomplete/buggy writes), which also get cleaned up, typically by the jobs that prune data that has gone 'out of the retention window'.
Depends on your definition of large-scale, but I managed a 120K QPS MySQL 8.x cluster that heavily used FKs. After heavy optimization and parameter tuning, I was also able to halve the instance size while improving query performance.
I’m a big fan of FKs. They stop you from doing stupid shit, and if the time ever truly comes that you have to drop them, you can do so.
I have seen many people running MySQL make that claim about lots of things... But what I have never seen is a MySQL database for business data without major issues.
I work on a legacy MySQL database that has 100+ moderately-related tables with zero foreign key constraints.
Someone wanted to change the name of a particular object, so I did a little investigation and found that approximately two dozen tables were storing this name-as-key, with at least half a dozen different column names. So essentially we have to scan every row in the entire database to robustly maintain referential integrity, and I vetoed the largely cosmetic change. Granted, this isn't 100% a FK constraint issue, but I imagine if the original DBA knew what FKs were, I wouldn't be in this mess.
On the other hand, I have implemented prototypes that clearly had way too many FK constraints, which leads to a bloated schema of unnecessary tables with 1-2 columns each, and makes object lifecycles a spaghetti nightmare.
Having worked in a db with lots of missing FKs, it becomes very difficult to figure out relationships. This makes it difficult to find out what tables need to be updated when there's a lot of them. In this particular company, the only thing holding it together was institutional knowledge.
I worked for a company where we would not use features that depended on foreign keys (cascading deletes, etc.), but we would define the foreign keys 100% of the time. We ran all of our staging environments with the keys, but when it came to prod our DBA would remove them, among other optimizations. I've not done this at other places that didn't have a dedicated DBA or two, though.
Implementations with skin in the game are still innovating and improving the 'tried and true' relational database model, IMO. PostgreSQL, MS SQL, MySQL, etc. consistently churn out new features that actually make sense, like FK constraint support for sharded/partitioned tables relevant to this discussion.
If you work on a large scale system that doesn't include foreign key constraints, try running some queries that should never return results because that would expose data integrity problems.
If your system is hooked up to a data warehouse you can run queries there, too.
I bet you can find all sorts of weird edge-case records doing this.
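A minimal sketch of such a query, with hypothetical orders/users tables; if integrity has held, it returns nothing:
-- Orphaned children: orders pointing at a user that no longer exists.
SELECT o.id
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE o.user_id IS NOT NULL
  AND u.id IS NULL;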
On that note, has anybody figured out a nice way to combine foreign keys and soft deletion (that is, a deleted_at column)?
Soft deletion is occasionally useful, but losing foreign keys for it is a big pain.
EDIT: got a bunch of responses, thanks! To be clear, the issue I have in mind is e.g. you want to have a foreign key that makes sure the “singular” side of a one-to-many relationship isn’t soft-deleted (on delete restrict) as long as it has anything in the “many” side. Both sides using soft-deletion.
By soft deletion, do you mean you would want to ever re-instate the record and make it valid again, PK/FK constraints and all? Or tracking deleted records? For the latter I use audit tables + triggers to track the changing values.
At a basic level you could duplicate the entire record into the audit table on every action, e.g. the audit table would look like `audit_id | record_id | user/process_id | action (insert, update, delete) | timestamp | ...<record rows>`.
You can optimize it to not duplicating column values unless necessary. On inserts you only need the metadata of the action. On updates, the old value of columns with changes goes into the audit table. On deletes the whole record goes into the audit table.
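A rough sketch of that trigger-based approach in PostgreSQL, with hypothetical names (orders, orders_audit); for brevity it stores the whole old row on updates rather than only the changed columns:
CREATE TABLE orders_audit (
    audit_id   bigserial   PRIMARY KEY,
    record_id  bigint      NOT NULL,
    action     text        NOT NULL,               -- 'INSERT' | 'UPDATE' | 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now(),
    old_row    jsonb                               -- populated on UPDATE/DELETE
);

CREATE FUNCTION orders_audit_fn() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO orders_audit (record_id, action) VALUES (NEW.id, TG_OP);
    ELSE
        INSERT INTO orders_audit (record_id, action, old_row)
        VALUES (OLD.id, TG_OP, to_jsonb(OLD));
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit_trg
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION orders_audit_fn();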
Just brainstorming - perhaps one way to do that would be a multi-column FK towards the row ID combined with an is_deleted column, which would mean that the FK constraint also enforces updating the related rows if something is deleted.
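A sketch of that idea with hypothetical tables: the parent exposes (id, is_deleted) as a referenceable key and live children are pinned to the non-deleted version, so soft-deleting a parent that still has children fails much like ON DELETE RESTRICT would:
CREATE TABLE parents (
    id         bigint  PRIMARY KEY,
    is_deleted boolean NOT NULL DEFAULT false,
    UNIQUE (id, is_deleted)                        -- composite target for the FK
);

CREATE TABLE children (
    id                bigint  PRIMARY KEY,
    parent_id         bigint  NOT NULL,
    parent_is_deleted boolean NOT NULL DEFAULT false CHECK (NOT parent_is_deleted),
    FOREIGN KEY (parent_id, parent_is_deleted) REFERENCES parents (id, is_deleted)
    -- UPDATE parents SET is_deleted = true ... now errors while live children still reference it
);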
The `deleted_at` column approach to soft deletion has enough downsides that I would choose to just move "deleted" records into a different table.
Having a separate table for deleted records means that
- FK references to the main table will just work
- the main table can be kept clean.
- it avoids the class of problems where a user or the application treats soft-deleted records as real records, because they weren't aware of the `deleted_at` column.
I see at least two possible approaches (each with their tradeoffs, neither great):
1. For each relevant foreign key, have an additional column without a constraint where the foreign key value can be copied to before deletion. (This requires that the original foreign key be nullable.)
2. Make soft deletion universal so that foreign keys can remain and any associated rows will still exist.
I assumed cube2222 meant where soft-delete wasn't universal: so you might really delete the parent row but want to soft-delete the child rows. Though I would suggest this is a bad idea: if you want to soft-delete child rows you should enable it on parent rows too.
E.g. you want to have a foreign key that makes sure the “singular” side of a one-to-many relationship isn’t soft-deleted as long as it has anything in the “many” side. Both sides using soft-deletion.
A better post than I expected. The only thing I'd add is that foreign keys can actually improve read times because the optimiser knows it can safely skip certain joins e.g. if you have inner equi-joins between tables a, b and c
a join b join c
If there is an FK from a to b, and likewise from b to c, and you don't use anything in b, then the optimiser can rewrite this to
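The simplest form of that join elimination, sketched with hypothetical tables where a.b_id is a NOT NULL foreign key to b(id) and no columns of b are selected (SQL Server, for example, can do this for inner joins when the constraint is trusted):
-- Original: the join to b exists only to follow the FK.
SELECT a.x FROM a JOIN b ON a.b_id = b.id;
-- The FK guarantees every a row matches exactly one b row, so the planner can run:
SELECT a.x FROM a;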
Yes, I remember using foreign key relationships to optimize symmetric hash join back when I worked on an OLAP db. The idea was that for a 1:1 relationship you can immediately discard both sides of a joined tuple, while for a 1:many relationship you can immediately discard the "many" side of a joined tuple.
I wonder whether there’s a measurable performance difference if instead of calling delete two times he just used the second delete with “on delete cascade”?
Whenever I read posts like this I know, despite all the caveats at the start of the article, a large number of people will take away "we shouldn't have any foreign keys because they impact performance".
I suggest that anyone thinking that has a look at a database where there are no foreign keys.
It was a nightmare. The article seems to basically be saying "but it's hard!". Well, I'd rather put down the extra effort so that I don't have corrupt data.
You know what's really hard? Fixing corrupt data.
The data is the most important thing in a database.
Despite the title, this article is about dropping foreign key constraints, not about dropping foreign keys altogether (which is possible, see below).
You can drop foreign key constraints and enforce referential integrity in your code, but from a logical standpoint your foo.barId column is still a foreign key.
Actually getting rid of foreign keys on the other hand can be done in a data store that allows nested structures, like e.g. a document based store. The price to pay is data duplication and denormalisation (and the related complexity you will have to deal with in all write operations), the pros are ease and speed of retrieval.
Novice Question: How can you have a relational database at all without foreign keys?
In other words, if your tables aren't referencing other tables, how do you perform JOINS for your queries? From my limited experience, they seem pretty essential for most databases. Or is the article just talking about foreign key CONSTRAINTS and cascading UPDATES and DELETES, rather than literally talking about the values stored in foreign key columns?
If anyone can explain, I would greatly appreciate it.
Foreign keys are really just constraints - the row referred to by the FK must exist, and the database will ensure that. That is the basic idea. Cascading, etc are just useful features on top.
While FKs and JOINs often go hand in hand, JOINs actually do not require foreign keys or any other constraint. You can join any table and column that you want regardless of FK or index or whatever (although you have to be careful about performance).
I see foreign keys the same way I see types. You may not need them, but they catch a lot of mistakes and give you more confidence in the integrity of your system.
And then, don't just drop the constraints because you read an article about how FK constraints don't work with a distributed relational model, take a careful look at the docs of whatever DB engine you're using. PostgreSQL 12 for example supports foreign keys with either side being a partitioned table.
Is 3rd normal form not something used in design any more? Obviously I dropped off more than 20 years ago, but I distinctly remember that design saving our product's evolution over 10 years or more - really, a working early-90s-designed system in the 00s.
Referential integrity saves you so many headaches. I am dealing with a 500+ table database with no foreign keys at all. This has allowed so many problems to creep in. Writing queries is pure suffering.
The performance gain from dropping foreign keys doesn’t hold water. You still have to do the referential checks in the application or in the ORM code. Unless you meant dropping referential checks.
# Find correct ID.
obj.some_id = run_query("select some_id from tbl where x=?", param)
Lots of variation on that, but the user_id and some_id here are pretty much guaranteed to be accurate when implemented correctly. The biggest potential issue might be race conditions with deletes on the parent ID, but just having soft deletes sufficiently alleviates that (potential) small issue.
deals = sql("select * from current_deals ...");
// Show the user the deals.
// User takes his sweet time to ponder over the deals.
// Meanwhile, the deals are gone.
order = new_order()
order.deal_id = deals[5].id
order.insert()
// User calls in to complain never receiving the deal.
// Ok, need to verify the deal still exists.
sql("select deal from current_deals where id = ? for update", deals[5].id)
... insert ...
sql("commit")
That second select .. for update is referential checking in code.
The only time I could see it really impacting performance is during bulk operations. But that's where making FK constraints deferred and wrapping the bulk actions in a transaction makes more sense.
We had some missing FK constraints, along with missing cleanup in code for a given table. Some customers had many millions of orphaned rows, while only thousands of live rows.
My rule of thumb has been: enable them strictly in DEV and INT environments, disable in PROD. They can catch schema discrepancies, but can impede ingestion rates.
Also some referential errors are sort of ok in PROD, as long as it's only about not dropping user data, which can be dealt with later on (INT gets reset with PROD user data from a backup each week; it also helps in the restore plan: FKs are enabled, errors are caught, then data gets pruned heavily).
If referential integrity is a business-level bug, then of course we should enable them.
To me that seems like an invitation for catastrophic bugs or problems to creep through and surprise you in prod, since the lower envs are running a different configuration.
I suppose it's entirely dependent on the type of data you're working with though.
Whatever you do, always have your dev/test environment identical to production.
Have a load balancer or replication in prod? You must have the same when you test.
Otherwise, you will have a bad time m'kay.
And don't get me started about removing FK constraints; anybody doing that on purpose is either very, very smart, or very, very ignorant/inexperienced.
>My rule of thumb has been: enable them strictly in DEV and INT environments, disable in PROD
I mean I think the constraints should be on in all environments, but disabling them in prod but not dev seems utterly backwards? Protect your test data from getting corrupted but not your actual customer data?
That sounds like a terrible assumption to stake your business on though, especially for what are really marginal gains for the 99% of companies who don't need the tiny performance boost from not having foreign keys.
How often is data actually DELETEd from production databases? Unless following through with regulatory removal such as GDPR, it’s far better to use an isdeleted flag IMO (so you can un-delete if the action was a mistake).
There's also an impact on inserts; it's not just deletes.
Essentially a foreign key constraint is an on {insert,update,delete} trigger which checks whether the target row exists, so that's a select on the target table. I'm not sure if that's still the case, but I believe that for a long time foreign keys were just implemented as triggers in PostgreSQL.
For a lot of things that has a minimal performance impact and it's not a big deal. For some other things it can really add up.
I assume you mean "performance overhead"? For insert-heavy workloads the difference very much is not "very minimal". In a quick test it's ~100ms vs. ~20ms with two foreign keys which check against a "countries" table (250 rows) – one column for nationality and one for residence.
±a lot because this is a very quick test I ran just twice, but it fits what I measured before when I very much ran into this performance penalty a few years back.
And look, 20ms vs. 100ms is very much fine for a lot of use cases. But it's also not for a lot of others. And it's certainly not very minimal, and with many inserts it does amplify a lot.
That's true, but managing the "soft" referential integrity becomes a big authorization/security headache. How do you prevent/allow "deleted" objects from being accessed indirectly?
I like Django's ORM approach which allows you to easily set baseline filters for your Model Managers, so you can implicitly exclude "deleted" objects from most queries.
Good point. I’m one of those “never-ORM” people, so writing & maintaining the SQL is prone to error when the DB gets complex, and it does introduce mental overhead during dev.