An RDBMS makes sense for most applications. Most applications store data that fits the relational model, and most aren't doing big data or OLAP-style data mining.
Most RDBMSs can do key-value stores very well now. Most applications also care more about consistency than availability, which is the trade-off RDBMSs make (CAP theorem). Many NoSQL data stores choose availability and partition tolerance and sacrifice consistency (i.e., "eventual consistency"). There are a lot of applications where you can't sacrifice consistency: electronic health records, financial records, student records, employee records, etc. You care that the data are accurate and up to date, and you want the system to error if it can't provide that. Wrong answers and "close enough" answers aren't good enough.
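To make the key-value point concrete, here's a rough sketch of treating Postgres as a key-value/document store through a jsonb column. The "kvdemo" database, the "kv" table, and the key names are made up for illustration, and psycopg2 is just one way to drive it:

```python
# A minimal sketch of a key-value "put"/"get" on top of Postgres, using a
# jsonb column. Database, table, and key names are invented for illustration.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(dbname="kvdemo")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kv (
            key   text PRIMARY KEY,
            value jsonb NOT NULL
        )
    """)
    # "put": an upsert keyed on the primary key.
    cur.execute(
        "INSERT INTO kv (key, value) VALUES (%s, %s) "
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
        ("user:42", Json({"name": "alice", "plan": "free"})),
    )
    # "get": a plain primary-key lookup; jsonb comes back as a Python dict.
    cur.execute("SELECT value FROM kv WHERE key = %s", ("user:42",))
    print(cur.fetchone()[0])
conn.close()
```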
Now, if you're running Reddit or Wikipedia or Facebook or HN... do you really care if a user doesn't get the absolute latest version of a document or comment? No, not really. If the content is hours old it's a problem, but it's not a big deal if it's a few minutes out of date. You care more that your users get a version of the document than that they get the latest version of it.
> Most RDBMSs can do key-value stores very well now.
Yep, all of MongoDB is just one bullet point on Postgres's list of features. Anyone spending money on it ought to be hauled before the shareholders and given a talking-to on fiduciary responsibility...
This subthread is about the future, not the past. 8 years ago, PG didn't have json support so your point is moot.
I'm very curious about what people think about the future of Mongo, on its own and particularly in comparison to Postgres. However, every time that comes up, people keep bringing up that Mongo was a buggy piece of crap in some irrelevant past. So what?
> This subthread is about the future, not the past. 8 years ago, PG didn't have json support so your point is moot.
8 years ago PG did have replication though, so not sure why it not having feature X 8 years ago makes my point moot.
People keep bringing up that it was a buggy piece of crap because it's the icing on the cake and pretty much the one thing you never want your database to be, past or present. Not that software configured by default to eat your data and not persist it can be called a database, mind you.
I don't see this as a problem. It takes years for any software project to mature, a DBMS even more so. I'm sure I could go back to the 1980s and find game-breaking bugs in the original POSTGRES. It has taken MongoDB years to approach maturity.
Of course I would prefer Postgres when I can use it, and I can generally use it basically all the time, but NoSQL still has its use cases.
Synchronous replication was added in 9.1 and much improved in 9.6? pglogical[0] works pretty well for me under 10, but I have no production experience with bdr[1].
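For anyone curious, here's roughly what the knobs look like on a >= 9.6 primary. This is only a sketch: the standby name "standby1" is made up, and it assumes a streaming standby is already attached and you're connected as a superuser:

```python
# Sketch: require a named standby to confirm commits, then reload the config.
# The settings shown are real Postgres GUCs; the standby name is hypothetical.
import psycopg2

conn = psycopg2.connect(dbname="postgres")
conn.autocommit = True  # ALTER SYSTEM can't run inside a transaction block
cur = conn.cursor()

# Wait for at least one named standby on every commit.
cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)'")
# remote_apply (new in 9.6) waits until the standby has applied the WAL,
# so a read on the standby right after commit sees the new row.
cur.execute("ALTER SYSTEM SET synchronous_commit = 'remote_apply'")
cur.execute("SELECT pg_reload_conf()")

# Sanity check: who is streaming, and are they marked sync?
cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
print(cur.fetchall())
conn.close()
```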
IMO, sure but it's far from seamless. (I also looked at pg's quorum commits, but the same applies.)
In general, Postgres was not designed at its core for a distributed world. Even now, replication feels like an afterthought in the grand scheme of things, and sharding is nonexistent without extensions.
Wikipedia predates NoSQL. It runs on PHP + MySQL because that's what was most popular back in 2001, and they have no interest in completely rewriting their entire stack just to use Cassandra or MongoDB. That doesn't mean a NoSQL data store wouldn't work extremely well for the type of application that Wikipedia is.
> Now, if you're running Reddit or Wikipedia or Facebook or HN... do you really care if a user doesn't get the absolute latest version of a document or comment?
I mean... do you? I often come back a few minutes after posting to add something I forgot or rephrase something for clarity. I hate when I am tweaking a Reddit comment a couple times during a period of high server load and I get served an old version of the comment and end up losing something I added in a previous edit.
With something like Wikipedia it would be quite frustrating to lose revisions.
Obviously it is what it is, I can't change their codebase, and I'm sure it's necessary as currently engineered, but is there really no other way to cluster their data except "one big table"? Maybe shard subreddits to specific servers, a la Hyperdex?
But yeah, most places that Mongo is applied aren't exactly Facebook or Reddit either, in terms of total data throughput.
Oh, it will certainly come up, but it's not going to break Reddit if you get an old version of a comment as long as it's eventually consistent. Nobody is going to die, and nobody is going to lose any money.
Data stores like Cassandra and MongoDB don't lose revisions. That's not the kind of consistency we're talking about. CAP consistency is just getting the most recent version. You won't lose data -- data loss is a bug, not expected behavior, just like any other data store -- you just won't always get the most recent version of it. And, keep in mind, when we talk about eventual consistency here we generally mean "consistent on all nodes within a few minutes, but we're not blocking reads to write this data." It's not going to take hours.
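If it helps, that trade-off is something you dial in per operation rather than all-or-nothing. A rough pymongo sketch, where the host, replica set, database, and collection names are all made up:

```python
# Sketch of tunable consistency in MongoDB: the same collection read/written
# with loose vs. strict settings. Names below are invented for illustration.
from pymongo import MongoClient, ReadPreference, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://replica-set-host:27017/?replicaSet=rs0")
db = client.get_database("forum")

# Loose: ack once the primary has the write, and allow reads from secondaries
# that may be a little behind -- this is the "eventual consistency" case.
loose = db.get_collection(
    "posts",
    write_concern=WriteConcern(w=1),
    read_preference=ReadPreference.SECONDARY_PREFERRED,
)
loose.insert_one({"_id": "comment:1", "body": "first draft"})

# Strict: wait for a majority of nodes to ack the write and only read
# majority-acked data -- closer to the C in CAP, at the cost of latency and
# of refusing writes when a majority of nodes is unreachable.
strict = db.get_collection(
    "posts",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)
strict.replace_one({"_id": "comment:1"}, {"_id": "comment:1", "body": "edited"})
print(strict.find_one({"_id": "comment:1"}))
```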
That said, if you find you get an old version of your own comment, I'd be more willing to believe it's the fact that your request failed with a 503 error or otherwise timed out as much as it was a data store problem. Next time it happens, wait 5 minutes and try again.
> is there really no other way to cluster their data except "one big table"? Maybe like shard subreddits to specific servers ala Hyperdex?
The whole point of MongoDB or Cassandra is that you can get shards without all the headache that RDBMSs usually put you through. You configure your sharding function and let the system do the rest. You don't have to connect to the right shard or anything of the sort, which some RDBMSs do (or did, it's been a while since I've looked) require with sharding.
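Roughly, "configure your sharding function" amounts to a couple of admin commands against the router; after that the app just writes and reads as usual. A sketch via pymongo, where the database, collection, and shard key are invented and a mongos with shards already added is assumed:

```python
# Sketch: enable sharding and pick a hashed shard key, then use the collection
# normally. enableSharding/shardCollection are real mongos commands; the
# "forum.comments" namespace and the subreddit key are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-router:27017")

client.admin.command("enableSharding", "forum")
client.admin.command("shardCollection", "forum.comments",
                     key={"subreddit": "hashed"})

# From here the application only talks to the router; chunk placement,
# balancing, and routing to the right shard happen behind the scenes.
comments = client.forum.comments
comments.insert_one({"subreddit": "databases", "body": "shards are neat"})
print(comments.count_documents({"subreddit": "databases"}))
```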
Reddit has their code and architecture posted; though it's out of date now, it makes clear that it's basically just two big tables:
Why? An RDBMS has never been the best option for any application I have created and I have created standard business applications as well as consumer applications.
You don't seem to know this, but no traditional RDBMSs actually provide CAP consistency; for that they would have to use at least two-phase commit or something, and they don't. So they are all noCAP databases. Electronic health or financial records would be way safer in a proper eventually consistent database, like orders of magnitude safer, but everyone just takes the risk, with some insurance at best to cover the losses.
EDIT: If you downvote, please explain why. You can't disagree with the truth.
> If the transaction committed was a Transact-SQL distributed transaction, COMMIT TRANSACTION triggers MS DTC to use a two-phase commit protocol to commit all of the servers involved in the transaction. If a local transaction spans two or more databases on the same instance of the Database Engine, the instance uses an internal two-phase commit to commit all of the databases involved in the transaction.
I'm only versed in SQL-Server but I'm pretty sure other RDBMS vendors provide similar functionality.
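For anyone who hasn't seen it spelled out, the shape of two-phase commit is simple: nobody commits until everybody has promised they can. A toy sketch, purely in-memory, with invented participant names; real DTC/XA adds durable logs and crash recovery on top of this:

```python
# Toy two-phase commit coordinator. This only illustrates the protocol's
# structure, not a real DTC/XA implementation.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):                      # phase 1: vote
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):                       # phase 2: commit
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    if all(p.prepare() for p in participants):   # phase 1: collect votes
        for p in participants:                    # phase 2: everyone voted yes
            p.commit()
        return "committed"
    for p in participants:                        # any "no" vote aborts everyone
        p.rollback()
    return "rolled back"


nodes = [Participant("orders-db"), Participant("payments-db", can_commit=False)]
print(two_phase_commit(nodes))   # -> "rolled back"
```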
In certain environments, it is better to fail than to have data that isn't immediately consistent. Finance and healthcare are two such environments. Availability is not always paramount.
They only guarantee consistency as long as you don't use them over a network, i.e., communications with the database are always reliable. But once you do use them over a network, the CAP theorem comes in and forces you to either use something like two-phase commit or make no promises of consistency. Which is the opposite of what his post implied, namely that there is CAP consistency with those databases. But there never was!
Although I kind of got used to RDBMS crowd not understanding consistency, it's just another technology cult.
Which ones? Traditional mainstream RDBMSs, like MySQL and Postgres, don't use a two-phase commit protocol. Obviously the new distributed ones do it properly, but we are not talking about them.
> You don't seem to know this, but no traditional RDBMSs actually provide CAP consistency, for that they would have to use at least two-phase commit or something, but they don't.
At the single server level (which is how I think others here are interpreting your comment)? No, they all do, with the exception of some configurations of MySQL (especially older editions, which is why it's often maligned by DBAs). That's what transaction logs do. They're literally a write-ahead log (WAL). You commit a transaction, and the DB first obtains an exclusive lock on the affected rows (or page, or table). Any other transaction attempting to read or update those rows will be blocked (with exceptions). It then writes the change to the transaction log and flushes it to disk. Then it writes the changes to the database file and flushes those to disk. Then it returns the results of the query to the user. Many RDBMSs let you control how strict the locking is and the degree to which the data are isolated during a transaction.
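To see that blocking behaviour from the application side, here's a small sketch against Postgres with psycopg2. The "bank" database and "accounts" table are hypothetical: the second connection can't read the row the first one is updating; it waits or times out rather than returning a half-applied value.

```python
# Sketch: an uncommitted UPDATE holds an exclusive row lock, so a competing
# SELECT ... FOR UPDATE blocks instead of seeing in-flight data.
import psycopg2

a = psycopg2.connect(dbname="bank")
b = psycopg2.connect(dbname="bank")

with a.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    # a's transaction is still open: the row stays exclusively locked.

with b.cursor() as cur:
    cur.execute("SET lock_timeout = '2s'")
    try:
        # Blocks behind a's lock; errors out rather than reading partial data.
        cur.execute("SELECT balance FROM accounts WHERE id = 1 FOR UPDATE")
    except psycopg2.errors.LockNotAvailable:
        b.rollback()

a.commit()   # WAL flushed, lock released; b could now see the new balance
b.close()
a.close()
```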
At the distributed network server level? Then I guess I kind of agree with you, sure. RDBMSs let you "get around" the problems of distributed scaling by not letting you do it easily. SQL servers often only have master/slave or publisher/subscriber setups, or otherwise partition the data between instances with sharding. There's no need for Raft- or Paxos-type algorithms because they don't attempt to implement a true multi-master environment. There's either a fixed overall master, or each server is the deterministic master of its own little world, so you avoid consistency problems with distributed data. However, in doing so you sacrifice availability, since if a shard goes down so does all that data, or if the master is busy then you can't always submit queries to the slaves. Replication is used for redundancy, not scaling or load balancing. The solution RDBMSs had was sharding + master/slave replication for redundancy, which can get messy fast and has issues like hot spots, limited queries, or variable performance. It's just a lot harder to do than it feels like it should be, and with storage as cheap as it is it feels like a waste of effort.
That said, some RDBMSs do allow you to use multi-master, bidirectional, or peer-to-peer replication, but most of those configurations basically warn you that you're sacrificing consistency by doing it, and all of them that I've seen are a huge pain in the ass that makes shard + replicate look like child's play. They also have schema requirements that make life difficult, and they're somewhat notorious for being difficult both to administer and to develop for. You have to design the whole thing from the ground up to work with this type of replication, it still feels like a house of cards, and it's this exact level of pain in the ass that encouraged the partitioning- and availability-focused NoSQL data stores.
However... most applications don't need that kind of scaling. They don't need a database in every time zone for single-millisecond response times globally. They don't have the users to demand it, or don't have the quantity of data to require it, or have other requirements that make a traditional RDBMS desirable because they can't accept a system that allows for out-of-date data (which is where the PACELC theorem kicks in, because NoSQL typically doesn't have locking like an RDBMS does to mitigate this particular problem).
Actually, the trend with NewSQL is toward providing CAP consistency, even with multi-master replication.
Google’s Spanner is a good example of that: by using TrueTime for transaction timestamps, together with an MVCC implementation, they are able to provide consistency while also being good enough on the other metrics.
Some NewSQL implementations copy that concept, but unless you run GPS clocks yourself, you’ll get slightly worse results.
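The core trick, "commit wait," is easy to sketch: stamp the transaction with the top of the clock's uncertainty interval and don't acknowledge until the clock has definitely passed it, so timestamp order matches real-time order across machines. A toy version follows; the uncertainty value is invented, and TrueTime keeps it to a few milliseconds where commodity NTP gives you much more (hence the "slightly worse results"):

```python
# Toy commit-wait sketch. Not Spanner's API -- just the idea behind using a
# bounded-uncertainty clock as the transaction ordering mechanism.
import time

CLOCK_UNCERTAINTY = 0.007  # seconds of assumed clock error (invented number)

def now_interval():
    t = time.time()
    return (t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY)  # [earliest, latest]

def commit(apply_writes):
    _, latest = now_interval()
    commit_ts = latest              # certainly at or after the true current time
    apply_writes(commit_ts)
    while now_interval()[0] < commit_ts:   # "commit wait": sit out the uncertainty
        time.sleep(0.001)
    return commit_ts                # only now is it safe to acknowledge

print(commit(lambda ts: print("writes applied at", ts)))
```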