SQL the language is in vogue again. Not full ANSI SQL, but derivatives and half-implemented equivalents, e.g. CQL, HiveQL, Spark SQL, Presto. That is the type this article refers to (AMPLab are the original creators of Spark).
But SQL databases themselves are definitely less popular, mainly because (a) data grows beyond their capability, (b) data becomes more streaming/real-time and (c) microservices push data into smaller, disparate databases. The one big vertically scaled database at the hub of everything is definitely disappearing.
> The one big vertically scaled database at the hub of everything is definitely disappearing.
Maybe in Startupville, CA. But I think you're forgetting that it's a big world out there and there are lots of systems that are built on vertically scaled relational engines that practically print money. The vast majority of companies out there do not have Twitter-scale engineering problems to solve.
I work exclusively in Fortune 500 size enterprises.
And the big EDW that you used to find powering everything has been broken up over the years into unintegrated silos, e.g. ERP, Web, Salesforce, Payroll etc. The big trend now is to reintegrate all this data and do analytics on it. That requires (a) major ETL work between completely different schemas, then (b) your data science/analytics work. In semi-real time.
This article is referring to this type of workload, since it is Spark's bread and butter. You land the data in HDFS, use Spark SQL to run ETL/analytics jobs, and then output the results as a single enterprise view for reporting, marketing etc. And yes, this is identical to what Twitter's analytics team would be doing.
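A rough sketch of that "different schemas in, one enterprise view out" step, in case it helps. This is not Spark: Python's built-in sqlite3 stands in for the SQL engine so the snippet runs anywhere, and the table and column names (crm_leads, erp_customers, enterprise_customers) are made up for illustration.

```python
import sqlite3

def build_enterprise_view(conn):
    """Map two differently shaped source tables onto one customer view."""
    conn.executescript("""
        -- Hypothetical source schemas, standing in for landed extracts.
        CREATE TABLE crm_leads (lead_id TEXT, full_name TEXT, email TEXT);
        CREATE TABLE erp_customers (cust_no INTEGER, first TEXT, last TEXT, mail TEXT);

        INSERT INTO crm_leads VALUES ('L-1', 'Ada Lovelace', 'ADA@example.com');
        INSERT INTO erp_customers VALUES (42, 'Ada', 'Lovelace', 'ada@example.com');
        INSERT INTO erp_customers VALUES (43, 'Alan', 'Turing', 'alan@example.com');

        -- ETL step: normalise both schemas to (email, name, source).
        CREATE VIEW enterprise_customers AS
        SELECT lower(email) AS email, full_name AS name, 'crm' AS source
        FROM crm_leads
        UNION ALL
        SELECT lower(mail), first || ' ' || last, 'erp'
        FROM erp_customers;
    """)
    # Analytics step: how many source systems know each customer?
    return conn.execute("""
        SELECT email, COUNT(DISTINCT source) AS systems
        FROM enterprise_customers
        GROUP BY email
        ORDER BY email
    """).fetchall()

rows = build_enterprise_view(sqlite3.connect(":memory:"))
```

The point is only that both the schema mapping and the analytics query are plain SQL; the storage underneath could be files on HDFS rather than a relational database.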
With cloud tools from Azure, IBM and Amazon, this sort of analytics is going to become much more commonplace. All using SQL the language but not SQL the database.
The enterprise I work for won't touch cloud with a 10 foot pole, and I know this because we literally got told to quit asking about it. :)
So yes, even we are building out a pretty beefy internal Hadoop cluster, so I would never say that it will be all-relational-all-the-time. But my point was more that there will be copious amounts of SSAS cubes and Oracle warehouses for the foreseeable future. They work great for their use cases and they have well known problems with well known solutions. Doing what Twitter's team is doing when you aren't Twitter might not be the best idea for everyone, after all.
In our case, we use Teradata for our work and it's quite capable of handling very large workloads, and thus we currently have no plans to spin it down in favor of the new hotness. (Even though the new Hadoop cluster positively dwarfs our TD appliance.) I'd say we have a mixture of both on the horizon, if only because our DBAs are less than cooperative about Java UDFs, so Hadoop is the easiest way for us to do complex processing against our fairly large data set.
For EDW, yes. You might see smaller federated data marts or even separately managed relational dbs all over the place. But for OLTP systems for the vast majority of enterprises out there, the vertical single instance big hunk database is still big dawg.
> separately managed relational dbs all over the place
On the BI side, this is overwhelmingly the outcome for large companies. The business units get silo'ed, they build their fiefdoms, a consolidation project gets kicked off and fails, rinse, repeat. Even if the consolidation succeeds, it takes extremely strong leadership to keep it from devolving right back to silos. The tech is not the cause of this problem, so I don't foresee it being the solution to it either.
How real-time are the analytics with these implementations? When I think ETL I think daily cron jobs. Have there been advances in this space which would let me instantaneously see a lead created in Salesforce in these new reports?
We have a POC running where we stream web hit data onto HDFS in near real time (several seconds of latency perhaps). There's no reason to think you couldn't do it with other streams of data as well.
edit: Not sure about Salesforce specifically, sorry if this is too far off topic.
The company I work for is building a near-real-time setup (web real time, not real real time; a second or so delayed) using AWS SQS and Redshift with a custom message consumer. If you keep the message consumer as stateless as possible, it's super scalable and reliable.
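To illustrate what "stateless" means here, a minimal sketch. It makes no real SQS or Redshift calls: the message fields (id, url, ts) and the load_rows callback are assumptions for illustration, and a plain list stands in for the warehouse load.

```python
import json

def message_to_row(body):
    """Pure function: one message body in, one warehouse row out."""
    event = json.loads(body)
    return (event["id"], event["url"], event["ts"])

def handle_batch(bodies, load_rows):
    """Convert a batch and hand it to the sink; keeps nothing between calls."""
    rows = [message_to_row(b) for b in bodies]
    load_rows(rows)
    return len(rows)

# Usage with an in-memory sink standing in for the warehouse load.
sink = []
batch = [
    json.dumps({"id": "m1", "url": "/home", "ts": 1}),
    json.dumps({"id": "m2", "url": "/pricing", "ts": 2}),
]
handled = handle_batch(batch, sink.extend)
```

Because nothing is remembered between batches, you can run as many copies of the consumer as the queue needs, and a crashed one can simply be restarted. A queue that delivers a message more than once would need deduplication on the message id at the sink, which this sketch leaves out.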
Those derivatives, apart from not really being SQL (SQL-ish at best), only target an incredibly small subset of what SQL is.
When they say "we implement a subset of SQL" they really mean "we have a language similar to what used to be a small subset of SQL-92". And let me stress that '92: that is from the days of Windows 3.1. It cannot be compared to today's Modern SQL (http://use-the-index-luke.com/blog/2015-02/modern-sql) found in modern relational databases like PostgreSQL.
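One example of what falls outside SQL-92: recursive common table expressions, which arrived with SQL:1999. A small sketch, run through Python's built-in sqlite3 only so it is self-contained; the staff table is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (name TEXT, manager TEXT);
    INSERT INTO staff VALUES ('ceo', NULL), ('vp', 'ceo'), ('dev', 'vp');
""")

# Walk the reporting chain from the top down; SQL-92 has no way to
# express this without knowing the depth in advance.
chain = conn.execute("""
    WITH RECURSIVE reports(name, depth) AS (
        SELECT name, 0 FROM staff WHERE manager IS NULL
        UNION ALL
        SELECT s.name, r.depth + 1
        FROM staff s JOIN reports r ON s.manager = r.name
    )
    SELECT name, depth FROM reports ORDER BY depth
""").fetchall()
```

Whether a given SQL-on-Hadoop dialect accepts a query like this depends on the engine and version, so check before relying on it.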
I wouldn't say that... as much as they disgust me, Oracle isn't going anywhere... and neither are MS-SQL, MySQL/MariaDB, PostgreSQL, SQLite, Firebird, DB2, and a bunch of others.
SQL allows very generalized tools to be used to query data, aside from whatever application front ends get developed. This is simply harder with other database engines. Not to mention that it's a simple, good-enough tool in many use cases. Not everyone has Twitter/Facebook/Google-sized problems. Most applications can do just fine scaling up. And often too much importance is placed on hitting five nines of availability.
I really like NoSQL databases, flexible schemaless systems and horizontal scaling. That said, you can go a long way with SQL, it's just you can't solve some problems on multiple systems, and you can't solve some problems on a single system... depends on your needs.
I definitely am not suggesting they are disappearing. Only becoming less popular.
And I wouldn't group SQLite in that list. I am assuming it has already come from nowhere to be the most popular database in use today (courtesy of iOS/Android?), and I believe it is part of an emerging trend of microdatabases.
What is happening is that SQL is becoming more popular but over the top of NoSQL systems e.g. HDFS, Cassandra.
Which is pretty nice... beyond the language of SQL, relational databases that support full ACID compliance are important for a lot of scenarios. Usually you have to sacrifice a portion of that in order to scale horizontally, or when a single system can no longer keep up.
There's been enormous progress in database reasoning over the past decade or so, as more companies have needed to reach the scale of Facebook, Twitter, Google, Amazon and others... Not every instance needs that level of scale, but many businesses are also trying to reach very high levels of availability, which is a related problem with similar solutions.