I edit a database newsletter – https://dbweekly.com/ – so I tend to keep an eye out for new releases, what's coming along, and whatnot. I thought I'd share a few more things that have jumped out at me recently in case anyone's in the mood for spelunking.
1. QuestDB – https://questdb.io/ – is a performance-focused, open-source time-series database that uses SQL. It makes heavy use of SIMD and vectorization for the performance end of things.
2. GridDB - https://griddb.net/en/ - is an in-memory NoSQL time-series database (there's a theme lately with these!) out of Toshiba that was recently boasting 5M writes per second and 60M reads per second on a 20-node cluster.
3. MeiliSearch - https://github.com/meilisearch/MeiliSearch – not exactly a database but basically an Elastic-esque search server written in Rust. Seems to have really taken off.
4. Dolt – https://github.com/liquidata-inc/dolt – bills itself as a 'Git for data'. It's relational, speaks SQL, but has version control on everything.
TerminusDB, KVRocks, and ImmuDB also get honorable mentions.
Luke from TerminusDB here. Thanks for the honorable mention! Definitely exciting times in database land and great to be in the same list as all these projects.
We are focused on the revision control aspects of Terminus - trying to make the lives of data-intensive (ML etc.) teams a little easier. We use a delta encoding approach to updates, like source control systems such as git, and provide the whole suite of revision control features: branch, merge, squash, rollback, blame, and time-travel. The idea is to provide continuous integration for the data layer.
Has anybody tried using Dolt? It looks quite young.
I think this idea is really valuable, but I usually see it implemented as a time-series extension on top of Postgres or MySQL, like SQLAlchemy-Continuum or TimescaleDB. I.e. you can get most of the useful git-like time-travel semantics (modulo schema migrations) out of time-series data with a separate transaction history table.
I'm curious what Dolt's performance profile looks like (i.e. how reads and writes scale vs "branch"-count, row-count, and how they handle indices across schema migrations), since the aforementioned solutions on Postgres are building on an already-very-performant core.
edit: TerminusDB also looks very cool, although it's approaching this problem from a graph-database angle rather than a relational one. Their architecture (prolog server on top of a rust persistence core) also seems super fascinating, I'd love to read more on how they chose it.
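For reference, a minimal sketch of the transaction-history-table approach I mean (plain Postgres; table and column names are made up):

```sql
-- Append-only history of every row version; "time travel" is then just a
-- range predicate on valid_from / valid_to.
CREATE TABLE accounts_history (
    account_id bigint      NOT NULL,
    balance    numeric     NOT NULL,
    valid_from timestamptz NOT NULL DEFAULT now(),
    valid_to   timestamptz NOT NULL DEFAULT 'infinity',
    PRIMARY KEY (account_id, valid_from)
);

-- The "current" table is just the rows that haven't been superseded yet.
CREATE VIEW accounts AS
SELECT account_id, balance
FROM accounts_history
WHERE valid_to = 'infinity';

-- The time-travel read: the state of the world as of a given instant.
SELECT account_id, balance
FROM accounts_history
WHERE valid_from <= timestamptz '2020-01-01'
  AND valid_to   >  timestamptz '2020-01-01';
```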
You are indeed correct, we are quite young. We are also focused at this stage on the data distribution use case (i.e. use Dolt instead of putting a CSV or JSON file on GitHub). So, we haven't spent much time on query performance or scale. The current model is that the database and all of its history have to fit on the user's hard drive. The format is easily distributed (i.e. put content-addressed chunks on different servers, optimize for fewer network hops) but that's not how it works today.
That being said, we think architecturally we can eventually get pretty close to parity with other RDBMS on the read path. We will be slower on the write path given the need to build the Merkle DAG on writes.
Long way to go though; we only launched open source last August. On a suite of ~6M MySQL correctness benchmarks, we currently execute 3-4X more slowly. These aren't large data sets either, so we suspect we'll run into some non-linearities in performance. This is just normal SQL. We haven't really tested the limits of how many branches we can handle or how long branch creation or merge takes at scale - not because we don't want to, but because it's not the use case we're focused on.
If you are willing to take the journey with us, we'll work with you to get Dolt where it needs to be (if it can, reads should be fine, writes will be more expensive). But we're just at the beginning so there will be bumps along the road. If that's not for you, we completely understand.
1) `SELECT * FROM state_populations AS OF 'master'` to get the repo as it existed at a particular timestamp / ref name
2) `SELECT * FROM dolt_diff_state_populations WHERE ...` to examine to/from column value pairs between any two commits you choose.
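To be a bit more concrete, the diff table exposes from/to value pairs plus commit metadata, so (treating the commit hash and column names below as placeholders) you can write something like:

```sql
-- Population changes between an older commit and the tip of master.
SELECT from_population, to_population, from_commit, to_commit
FROM dolt_diff_state_populations
WHERE from_commit = 'abcdef123'   -- any commit hash, tag, or branch name
  AND to_commit   = 'master'
  AND from_population <> to_population;
```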
I work as a SWE at a large AI consultancy. We've been experimenting with "git for data" products for a while, and we've been trying to get rid of them (notably Pachyderm) for at least two years.
What they all share is a) awful performance, and b) bad design.
Git semantics ("branch", "merge", "commit") are not well suited to data, because merging dataframes and creating "branches" often leads to misunderstandings and delays. Time travel is very nice to have, but it's often the case that you would like to consume your input datasets at different points in time in the same repository (unless you do one dataset per repository, but then, what's the point?).
Performance is bad because all updates need to go through some kind of coordination mechanism (etcd, ZooKeeper, or Raft directly). In a single-instance scenario, you often end up flooding it or needing additional memory to cope with the load. However, you could deliver high throughput and high availability by using proper sharding and distributing updates to specific masters (like you would do in any actor-based architecture).
As a replacement, we're now using a custom event-sourcing framework on top of AWS S3/Azure blob. It's faster, more reliable, and most importantly, better designed.
We're building a very literal "git for data" product, Dolt. (doltdb.com). I'm very curious about this criticism:
> Git semantics ("branch", "merge", "commit") are not well suited to data, because merging dataframes and creating "branches" often leads to misunderstandings and delays.
Can you give a concrete example of what you mean? I'm wondering if this is a failing of the tool you're using or the model itself?
> Time travel is very nice to have, but it's often the case that you would like to consume your input datasets at different points in time in the same repository
Dolt supports doing exactly this. See my reply to a child comment below.
> Awful performance
It's not obvious to me why this needs to be true in the general case, unless it's caused by b) bad design. Are you mostly talking about write performance, or are reads also a problem?
I don't want you to take it personally, but from my point of view Dolt falls in the same bucket as DVC, Pachyderm, and some others. The shortcomings don't come from the tools themselves; it's a fundamental issue with the design.
First, branching and merging. In git, branching allows you to make uncoordinated parallel progress for the price of a reconciliation step. In a datastore, you want the exact opposite: a single, consistent, available source of truth. Having different branches of the same dataset brings more confusion while solving zero problems.
Then, commits. In git, a commit represents a snapshot of the entire state of your repository. This is particularly attractive because it guarantees that your code will build no matter what kind of update follows (harmless: editing a README; severely destructive: removing a src folder). In a datastore, this is nice but unnecessary. As I mentioned in this thread, datasets move at different speeds, and attaching a new hash to something that didn't change doesn't add value. However, I have to admit I failed to mention earlier that datasets are often unrelated and not relational. This would be worth reconsidering if they were, of course. Most of the time, a dataset is represented as a single dataframe (or a single collection of dataframes).
There are some points where git semantics make sense: immutability of commits, linearizability within branches. Both are extremely important if you want to enable reproducibility of your pipeline. These are traits coming from event sourcing.
Reproducibility is also claimed by DVC and Pachyderm, but their issue here is more a problem of trying to do too many things at once and not managing to do them right. Running code within Pachyderm pipelines was a recipe for disaster and the first thing we got rid of.
As for performance, the write side is where it matters, because it needs to be coordinated. Reads are almost never an issue with good caching. In any case, it should be robust enough to fill the gap between CSV files sent to S3 and a full Kafka cluster, e.g. not complaining at a few TB. To my knowledge, the only multi-leader datastore suitable for storing sharded datasets as a continuous log is Kafka.
That's very interesting. This is why we think (Luke from TerminusDB again) designing your own full-featured graph DB is the best approach to the problem - you can work from the ground up and ensure performance is satisfactory for the operations you want to deliver. I don't agree that Git semantics are not well suited to data, but you do have to be very careful about how you advance.
I was reading through these source code branching patterns and you can easily imagine (if design and performance are right) how they would apply to data - but as the author says, 'branching is easy, merging is harder.'
Was discussing your point with a colleague, and the comment about time travel is very interesting. Could you give a little more detail on that point? With us, everyone can be talking about a different time point in a branch whenever they like, and it has nothing to do with head. Two clients can look at different time points in our database and neither has to know about the other at all.
Also, for TerminusDB we don't use a multimaster coordination mechanism - we actually use the same sort of git approach.
First, I need to say that I never tried TerminusDB, so I can't claim having a strong opinion on your approach :)
Back to time travel. One of the most common architectures when dealing with AI/ML/optimisation is to design your application as a mesh of small, deterministic steps (or scripts) that read input data and output results. As you would expect, the output of one step is reusable by another one. Example:
Script A reads Sales data from source S and Weather data from source W, writing its result to A. Script B reads data from source A and Calendar data from C, writing its result to B. In this example, we want to be able to do two things: 1) run a previous version of A with S and W from two weeks ago and assert that the result it produces now is exactly identical to the one it produced at the time; 2) run a _newer_ version of A with S and W from two weeks ago and compare its result to the one it previously produced. Of course, in the real world, S, W, and C progress at different speeds: new sales could be inserted by the minute, but the weather data would likely change by the day. So you need a system that allows you to read S@2fabbg and W@4c4490 while being in the same "repository". That's why git semantics are not a good fit: you need to have only one "branch" to ensure consistency and limit misunderstandings, but you want to "commit" datasets in the same repository at different paces. For that purpose, event sourcing is much better :) (BTW, git at its core is basically event sourcing.)
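To make that concrete, here's a rough sketch of the event-sourcing shape (hypothetical schema): each dataset appends immutable versions at its own pace, and each run of script A simply records the exact versions of S and W it consumed.

```sql
-- Each dataset commits new immutable versions independently of the others.
CREATE TABLE dataset_versions (
    dataset      text        NOT NULL,  -- 'sales', 'weather', 'calendar', ...
    version      text        NOT NULL,  -- content hash, e.g. '2fabbg'
    committed_at timestamptz NOT NULL,
    payload_uri  text        NOT NULL,  -- where the actual data lives (S3, blob store, ...)
    PRIMARY KEY (dataset, version)
);

-- A run of script A pins its inputs, which gives you reproducibility (1)
-- and old-inputs-new-code comparisons (2) without any branching.
CREATE TABLE runs (
    run_id          text        PRIMARY KEY,
    script          text        NOT NULL,  -- 'A'; its code version is tracked in git
    sales_version   text        NOT NULL,  -- S@...
    weather_version text        NOT NULL,  -- W@...
    output_uri      text        NOT NULL,
    ran_at          timestamptz NOT NULL
);
```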
Kafka's architecture is actually the best solution.
Very interesting - thanks for the additional detail. Will have to think about how we might best represent in Terminus. We did a bunch of work for retailers in exactly the situation you describe.
In short, something very similar to https://www.snowflake.com/. An event-based system sitting on top of S3, indexed with PostgreSQL, InfluxDB, or anything else.
> you can get most of the useful git-like time-travel semantics (modulo schema migrations) out of time-series data with a separate transaction history table
Well, not really. Dolt isn't just time travel. If all you want is time travel (or append-only data), you can do that with Postgres or MySQL pretty easily and get great performance. What Dolt brings to the table is actual Git semantics: a real commit graph you can branch, merge, fork, and clone. You can inspect the historical values of each cell and examine the author and commit message for each change. Two people can clone a database, both insert / update / delete rows, and then merge each other's changes. If they touched the same rows, they'll have a merge conflict they have to resolve. If you find an error in data somebody has published in the Dolt format, you can fork it, correct the data, and submit them a PR to merge back, just like you do with code on GitHub. It's a database built for collaboration from the ground up. Instead of giving each person you want to make edits specific grants on particular tables / subsets of the data, you just let them clone it, then submit a PR that you can examine and approve. Or if they disagree with your policies, they can fork the data and still merge in updates from your master source as necessary. Git demonstrated what a powerful model the commit graph is for code. Dolt brings it to data, with SQL built in.
To answer your question about indexes across schema migrations, indexes are versioned with the schema of the table. This means they can be efficiently synced during a pull operation, but it means that previous versions of the data don't have indexes on them. We're considering adding the ability to add an index to the data from the beginning of the commit history as a rebase operation, but haven't implemented that yet.
Heyo, thank you for responding! I totally agree that the commit, PR, merge semantics are powerful and valuable, and you can’t get that easily with existing databases.
I guess in my imaginary perfect world, you don’t need to use commit, PR, merge semantics to make normal edits. You can use existing bitemporal/transaction history ideas in online updates, appends, and deletes, and then you have a higher-level abstraction for offline(-only?) branching and merging.
I guess what I’m saying is that I don’t totally buy the idea that you need git “all the way down”, especially if it gets in the way of performance. But maybe I’m just used to “the old way”, and I’ll cringe reading this in 20 years. :)
Major kudos for this project, it’s super impressive. I’m excited to see how it grows!
I have a hunch that some form of multi-temporalism, or perhaps more fancily some kind of multi-lattice-ism, is the secret model underlying all of these.
But putting that probably-wrong-because-I'm-a-dilettante wankery to one side: a serious advantage of bitemporalism is that most layfolk understand how clocks and calendars work. This is less true of git commits.
I think I agree with you. Git uses logical ordering rather than temporal ordering, and their concept of a primary key is rather nebulous.
We've been able to train up "regular folk" on bitemporalism at our company. The distinction between valid and transaction history takes a little while, but it sticks. Git has two or three such conceptual sticking points (staging vs committing, branching & merging, remote vs local).
FWIW we are absolutely working on modes of using the product that "just work" like a normal database without any explicit commits / merges. In that case commits would happen automatically (maybe one for each write transaction?) and you'd get all the history "for free." We aren't sure how useful that is, though. Definitely looking for customers interested in this use case to guide us on their needs.
I work at an insurance company, where it's extremely important that we have a way of persisting (1) what happened when [transaction history], and (2) how our opinions of what happened when changed over time [valid history]. Basically, you can think of (2) as a materialized view of (1). But it all lives in one database.
There is much business value to be extracted out of this idea, because most companies (1) do not have the ability to cheaply (i.e. ad-hoc) introspect this way (2) would gain a lot of value from this ability. My hunch is that most companies out there are interested in answering that set of questions (referred to as bitemporal data, in the literature).
This is the space that TimescaleDB seems to be competing in, although we don't use them (we use our own extension of Postgres, that's only a few hundred lines of PL-pgSQL, plus some helper libraries in Go/Python).
From my perspective, git-style semantics would be very powerful as a layer above an existing bitemporal system. We've implemented similar systems, where records get a semver-style version in their bitemporal key, so we can see how different producer logic versions minted different records over time. It'd be really cool to have branching versions, and to be able to rebase them -- this is something our system doesn't currently support.
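For anyone who hasn't seen the pattern, a stripped-down sketch of what such a bitemporal key looks like (hypothetical column names, not our actual schema):

```sql
-- Valid time: when the fact holds in the real world.
-- Transaction time: when we recorded (and later superseded) our opinion of it.
CREATE TABLE coverage_amounts (
    policy_id     text        NOT NULL,
    amount        numeric     NOT NULL,
    valid_from    date        NOT NULL,
    valid_to      date        NOT NULL,
    recorded_from timestamptz NOT NULL,
    recorded_to   timestamptz NOT NULL,  -- 'infinity' until a correction lands
    producer_ver  text        NOT NULL   -- semver of the logic that minted this row
);

-- "What did we believe on 2019-06-01 about coverage in force on 2019-03-15?"
SELECT amount
FROM coverage_amounts
WHERE policy_id = 'P-123'
  AND valid_from    <= date '2019-03-15'           AND valid_to    > date '2019-03-15'
  AND recorded_from <= timestamptz '2019-06-01'    AND recorded_to > timestamptz '2019-06-01';
```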
Anyways, hope our data point is useful for you all. Happy to share more if you have questions about anything.
Thanks for this - really interesting (Luke from Terminus here). We have engaged plenty with some reinsurance folk around some query use cases (recursive stuff to better understand risk), but that was before we did a lot of the git-like stuff. Can see how this might be powerful on a historical view in insurance.
Have you heard of patch-theory-based VCS systems (Pijul is unstable, I heard Darcs is slower)? Instead of a chain of independent states (commits), each state is a set of patches (some of which depend on others). Pijul stores an internal structure that's more complex than "a series of chars/lines".
This eliminates complex rebasing topologies, and Git not recognizing that old and rebased commits are the same. But I'm not sure if it works well in practice. And it doesn't extend to SQL data sets yet.
I don't know if you're interested in this or not, but I just wanted to mention it.
Depends what you mean by "work" and "large," I guess. We have repos in the 250 GB range that definitely push the limits of functionality, but everything still works (pull, push, merge, clone, etc., and queries return the right results, if slowly).
Thanks! Luke from TerminusDB here. We think a graph is the best way to approach a graphy problem!
We like Prolog for querying, constraint checking, and user interaction. Rust is great for low-level data manipulation. Prolog is a superpower - very simple but very powerful - but it's quite far removed from the hardware and uses abstractions not under our control, so it's not good at nitty-gritty bit manipulation. We like Rust as a low-level memory-safe language (and it has a great community).
There have been a lot of interesting projects released as alternatives to Elasticsearch and Solr like Sonic, Tantivy, Toshi, Xapiand, Typesense, Blast, etc.
Yes, the latest version of Terminus uses JSON-LD, which is a serialisation of RDF. This enables us to store queries in a database and provides a schema documentation for our datalog query language.
Not really a database - but Parallax[0] is a data warehouse, written in Rust, geared at private/sensitive data - it allows you to perform data anonymisation at query time.
Furthermore it exposes a privacy-as-code framework allowing you to version control and be very granular about how different entities interact with your datasets.
I'm on the core team - if you have any questions I would be more than happy to answer.
Since you mentioned Dolt, I will add Noms [1], which is the original db that Dolt is based on. The folks that created Noms are working now on Replicache [2]. I started using Noms in my project and the offline first tech is really wonderful; it allows me to do some cool stuff like syncing clients only when they come online.
Yep, you got it. We should link these from the front page really :-) The main problem is they're just an "added extra" and we can't always guarantee the formatting is 100% due to our focus on getting the emails just right. It tends to be OK 99% of the time though.
I'm surprised that it has taken so long for time series databases to take off. I remember in 2015 I wrote an application to wedge time-series features into MySQL because the only option I could find for a native db was paid.
In general, a time-series database makes architectural decisions and focuses on capabilities for time-series data that lead to orders of magnitude higher performance and a better developer experience.
For example, TimescaleDB:
- Auto-partitions data into 100s-10,000s of chunks, resulting in faster inserts and queries, yet provides the Hypertable abstraction that allows you to treat the data as if it lives in a single table (eg full SQL, secondary indexes, etc)
- Provides Continuous Aggregates, which automatically calculate the results of a query in the background and materialize the results [1]
- Supports native compression, using a variety of best-in-class algorithms based on datatype (eg delta-of-delta encoding, Simple-8b RLE, XOR-based compression (aka "Gorilla" compression), dictionary compression) [2]
- Adds other capabilities like interpolation, LOCF, data retention policies, etc, that are necessary for time-series data
- Scales up on a single node, or out across multiple nodes
There's more, but I'll stop because I don't want this to sound like marketing.
You are welcome to implement this yourself (and our docs explain how if you'd like), but most find it easier to use the database we already provide for free[3]. :-)
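For a flavour of the first two bullets above, it looks roughly like this (table and column names made up; see the docs for the exact continuous-aggregate syntax, which has changed across versions):

```sql
-- A plain table becomes a hypertable, auto-partitioned into chunks by time.
CREATE TABLE conditions (
    time        timestamptz      NOT NULL,
    device_id   text             NOT NULL,
    temperature double precision
);
SELECT create_hypertable('conditions', 'time');

-- A continuous aggregate: the hourly rollup is maintained in the background
-- as new rows arrive, but you query it like any other view.
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       device_id,
       avg(temperature) AS avg_temp
FROM conditions
GROUP BY bucket, device_id;
```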
I tend to think of them as databases where rows are an aggregate count of events over a period of time.
An overly simplistic example being an API that captures upvotes. Instead of storing the individual upvotes, you hold incoming requests in a queue and only write out a count of upvotes over, say, 1 minute. That way, if you want to get a count over a larger period of time, you're setting a ceiling on the number of records involved in your operation. If you have 1-minute resolution and you're looking over a year of data, you're adding up at most 60 × 24 × 366 (~527k) records for each post. This is really handy for analytics data where you don't necessarily care about the individual records.
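In SQL terms, a minimal sketch of that rollup (hypothetical names):

```sql
-- One row per post per minute instead of one row per individual upvote.
CREATE TABLE upvotes_per_minute (
    post_id bigint      NOT NULL,
    minute  timestamptz NOT NULL,  -- truncated to the minute by the writer
    upvotes integer     NOT NULL,
    PRIMARY KEY (post_id, minute)
);

-- A year-long total now touches at most 60 * 24 * 366 rows for this post.
SELECT sum(upvotes)
FROM upvotes_per_minute
WHERE post_id = 42
  AND minute >= now() - interval '1 year';
```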
The project I was specifically referencing was one that captured data streams from sensors like accelerometers and thermometers, which is inherently time-series data, because it's literally just (timestamp, sensor-reading). But to make use of that, you need a library of tools that understand that the underlying data is a time-series to do things like smoothing data, or highlight key events. For example, a torque spike had a "signature" which involved looking at the difference over time periods at a particular resolution. But a temperature spike would look different. Etc.
I see, thanks for the explanation. Is aggregation in general a required feature of a time-series database? Or specifically the ability to calculate aggregates continuously over changing (incoming?) data?
Do time series databases store the raw underlying data for the aggregates or just enough to calculate the next window?
Time-series databases themselves are usually key/value or relational.
The "time-series" features come from the focus around time as a primary component of the data (or some kind of increasing numeric value). Features like automatic partitioning and sharding, better compression, aggregating into time buckets, extracting and managing date values, smoothing missing values in queries, etc.
Redshift does a lot of that. Is there something unique to a "time series" database that Redshift does not do? Is it the inclusion of some ETL capabilities? I'm not sure what makes a database truly time-series. It sounds like there is some compromise made that requires a time dimension. I'm not sure what that compromise would be beyond the general case of a columnar database.
Nothing unique. Any database can store "time-series" data. If you look at InfluxDB (key/value) or TimescaleDB (relational), they offer things like high precision time values, higher compression algorithms that work better with timestamps, utilities to make it easier to query for time (buckets, timezones, gap-filling), automatic rollups and aggregations, data deletions based on time, etc.
You can do all that yourself on Redshift but they just offer more built-in functionality for it.
Exactly what mani says. For many people, Redshift, or just plain Postgres, is good enough. But if you find yourself wrestling with poor performance, slow queries, high costs, or just an annoying developer experience, I'd encourage you to look at time-series databases (eg TimescaleDB, where I work).
You could have made a timeseries db. There's nothing magic about it. It's just a collection of engineering tradeoffs where you throw away relational features and indexes on various fields, instead focusing on performance for the huge number of events coming into an append only system (you can update but it's super rare).
To be clear, TimescaleDB does NOT throw away any indexing capabilities or ACID features. Its only limitation is that UNIQUE constraints need to include the time column. Also, its query optimizations are far from trivial to implement.
If you build it on a modern relational data warehouse then you don't have to throw away any of that. That's more of how specific databases (like InfluxDB) choose to handle it but not necessarily a component of time-series itself.
Yes absolutely. However if you are aggressive in getting performance then you're likely to do things like batching the data and storing compressed blobs in the db and the rest of the row is essentially just metadata to the blob. Once you do this, relations on the data are basically thrown away because you can no longer do any joins on the blob data.
You can choose to not do this but then get ready to eat mediocre performance.
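I.e. something like this, where each row is just metadata pointing into an opaque compressed batch (sketch, hypothetical names):

```sql
-- Each row holds a compressed batch of readings; only the metadata is queryable.
CREATE TABLE sensor_batches (
    sensor_id text        NOT NULL,
    start_ts  timestamptz NOT NULL,
    end_ts    timestamptz NOT NULL,
    n_points  integer     NOT NULL,
    payload   bytea       NOT NULL,  -- compressed (timestamp, value) pairs
    PRIMARY KEY (sensor_id, start_ts)
);

-- Scans can prune on the metadata columns, but joins or filters on the
-- individual readings are gone: you have to decompress payload client-side.
SELECT payload
FROM sensor_batches
WHERE sensor_id = 'torque-7'
  AND start_ts >= timestamptz '2020-01-01'
  AND end_ts   <  timestamptz '2020-02-01';
```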
Sure but that's because a "time-series" database is trying to get functionality that isn't already part of the underlying data store, for example if you're using RocksDB underneath and want to avoid slow scans.
You don't need to do that with a relational columnstore like Redshift, BigQuery, MemSQL, Vertica, Kdb+, or others. They're designed for massive aggregations with rows stored in highly compressed column segments that support updates and joins. And they're much faster than any custom compression scheme on top of another slower data store.
There's also the in-between options like OpenTSDB and Pinalytics that use Hbase/Bigtable. That's a sorted key/value datastore but it still applies compression on the row segments so you can leverage that in scans without a custom compression scheme on top.
>And they'll be much faster than any custom compression scheme on top of another data store.
As usual, it depends on the situation and what's being measured. For example, if you are able to batch a lot, then write performance of a blob will be line rate. If you need an ACK on each event then that's not possible and indeed a column store will be better.
> OpenTSDB
AFAICT, OpenTSDB compacts columns which is similar to what I've described.
This is proprietary so there's no visibility into what they do here.
FWIW, KairosDB is a time-series db built on Cassandra that uses a row width of 1,814,400,000 milliseconds (three weeks) to sidestep the data management overhead via bucketing. Not a binary blob, but certainly not the straightforward data layout that someone asking 'why can't I just store data in a normal database' might expect.
All relational columnstores use batch inserts and updates. In fact that's the recommended way to write columnstore data and SQL already supports various batch insert methods. The modern databases also use a row-store representation as a buffer for any single-row mutations before merging the columnstore segments. [1]
OpenTSDB compacts columns inside a column-family. This is because Hbase/cassandra/dynamo key/value stores support arbitrary columns for each row and store a verbose identifier for the column/cell/version matrix. It's a custom data format to save disk space but the underlying compaction still refers to HBase LSM-tree compaction. [2] The rows are still stored individually and compression is enabled on HDFS.
But yes, it's all highly dependent on the situation and it's the same fundamental mechanics of batching and compressing data. My point is that it's better to just use a database that is built with this architecture and can provide greater performance, usability and full querying flexibility instead of bolting it on top of a less functional data store.
Almost all of our load operations are a merge. Our ETL layer takes care of that. The vacuum then deals with re-sorting rows and freeing blocks.
I have never worked with something that bills itself as a "time-series" database. Does that involve pushing elements of the ETL layer down into the database? Or is it just optimizing for access by a time dimension?
Since you mentioned QuestDB ... I'm a former InfluxDB person, now at QuestDB. I'll say this about them: Their performance is off the charts good. Influx performance has been, frankly, headed in the other direction. I was looking at the NYC Taxi data the other day with QuestDB and I ran a sum over 1.6B rows of taxi data in under 150ms. 'select sum(passenger_count) from trips;'
Just came to say that I recently discovered your newsletter and am very glad for it. The database space has been seeing so much movement lately that your newsletter is the missing piece for staying on top of it.
What do you think about Aito and the predictive databases?
We believe that the predictive databases can be a thing both in the database and the ML space, which is quite exciting.
I also feel that the 'smart databases' like Aito or BayesDB form an interesting database category, which will emerge as an everyday software component once the technology matures. According to our internal benchmarks, it is already mature enough, accuracy- and performance-wise, for many applications.
MeiliSearch is exactly not an Elastic-like search server. It should work out of the box for its intended use case (fast typo-tolerant search), and it supports basically nothing else: no complex queries, no aggregations, no big datasets, no clusters, etc.
I’ve always felt it strange that in almost every job I’ve had databases have been one of the most important pieces of the architecture but the least debated. I’ve spent hours debating languages and frameworks, but databases always come down to whatever we have a license for/what others at the company are using. Engineering teams will always say they make sure to use the right tool for the job, but no one ever talks about if it’s right to keep using the same database for a new product.
My experience is that places with DBAs tend to keep much tighter control over database design than those that do without. I've worked at several places where developers did not touch the database schema. They couldn't even propose changes; instead, you'd give the DBA team the requirements and they would give you a solution.
Materialize is neat, but there are other database systems that refresh at least some materialized views on the fly, while being smart about not rebuilding the entire view every time. See for example, Oracle, where FAST REFRESH ON COMMIT does most of what Materialize is advertised as doing, at least for views which that feature can support (restriction list here: https://stackoverflow.com/questions/49578932/materialized-vi... ). Mind you, this comes with Oracle's extremely hefty price tag, so I'm not sure I'd recommend it to anyone who isn't already stuck with Oracle, but it is technical precedent.
It would be interesting to compare notes, and see what Materialize does better.
The main differences you should expect to see are generality and performance.
Generality, in that there are fewer limitations on what you can express. Oracle (and most RDBMSs) build their Incremental View Maintenance on their existing execution infrastructure, and are limited by the queries whose update rules they can fit in that infrastructure. We don't have that limitation, and are able to build dataflows that update arbitrary SQL92 at the moment. Outer joins with correlated subqueries in the join constraint; fine.
Performance, in that we have the ability to specialize computation for incremental maintenance in a way that RDBMSs are less well equipped to do. For example, if you want to maintain a MIN or MAX query, it seems Oracle will do this quickly only for insert-only workloads; on retractions it re-evaluates the whole group. Materialize maintains a per-group aggregation tree, the sort of thing that previously led to a 10,000x throughput increase for TPCH Query 15 [0]. Generally, we'll build and maintain a few more indexes for you (automatically), burning a bit more memory but ensuring low latencies.
As far as I know, Timescale's materialized views are for join-free aggregates. Systems like Druid were join-free and are starting to introduce limited forms. KSQLdb has the same look and feel, but a. is only eventually consistent and b. round-trips everything through Kafka. Again, all AFAIK and could certainly change moment by moment.
Obviously we aren't allowed to benchmark against Oracle, but you can evaluate our stuff and let everyone know. So that's one difference.
Materialize sounds like it has reinvented Michael Stonebraker's StreamSQL [1] and SAP's Continuous Computation Language (CCL) [2] which was created as part of a StreamSQL competitor named Coral8 and lives on in an enterprise product now named SAP HANA Smart Data Streaming. This space has gone by many names, Streaming SQL, Complex Event Processing (CEP), and Event Streaming.
I think the Continuous Computation Language (CCL) name captures the essence of these systems: data flows through the computation/query.
These systems have always had promise but none have found anything but niche adoption. The two most popular use cases seem to be ETL-like dataflows and OLAP style Window queries incrementally updated with streaming data (e.g. computations over stock tick data joined with multiple data sources).
The projects you've mentioned are attempts to address stream processing needs with a SQL-like language. That is fundamentally different from providing incremental view maintenance of actual SQL using streaming techniques (what Materialize does).
If you want to maintain the results of a SQL query with a correlated subquery, StreamSQL in Aurora did not do that (full query decorrelation is relatively recent, afaiu). I have no idea what TIBCO's current implementation does.
If you want to maintain the results of a SQL query containing a WITH RECURSIVE fragment, you can do this in differential dataflow today (and in time, Materialize). I'm pretty sure you have no chance of doing this in StreamSQL or CCL or CQL or BEAM or ...
The important difference is that lots of people do actually want to maintain their SQL queries, and are not satisfied with "SQL inspired" languages that are insert-only (Aurora), or require windows on joins, or only cover the easy cases.
> The projects you've mentioned are attempts to address stream processing needs with a SQL-like language. That is fundamentally different from providing incremental view maintenance of actual SQL using streaming techniques (what Materialize does).
With all due respect, CREATE SINK and CREATE SOURCE are SQL-like. I would argue that the pipeline created from the set of SINKs and SOURCEs is the key concept to grasp for developers new to your platform. The purity of the PostgreSQL CREATE MATERIALIZED VIEW syntax and other PG/SQL constructs seems like a minor selling point, in my (very narrowly informed) opinion. I hope I'm wrong.
Our difference of opinion involves marketing language and perceived differentiators. There are some important use cases for continuous SQL queries over Kafka-like data streams that remain unaddressed (as far as I know). I hope Materialize gains traction where others have failed to do so. If PG/SQL compatibility was the only thing holding back this style of solution then kudos to you and your team for recognizing it. Good luck (honestly).
Just as a point of clarification: timescale materialized views don't allow joins during materialization. But you can join materialized views arbitrarily at query time. (TimescaleDB engineer here)
I'm a PM @ Materialize. Regarding the questions from the OP's link:
#1 -- Postgres is built in C, while Timely/Differential (which underpin Materialize) are built in Rust. Materialize could be a Postgres extension at some point, but for now we want to control the stack (to more easily do cross-dataflow optimization, etc)
#2 -- We are absolutely interested in exposing an API to enable new data sources (and data sinks!). It currently takes a bit of time to understand some basic Timely Dataflow concepts, but we intend on documenting this and opening things up. We're also trying to understand user requirements around things like durability guarantees and UPDATE/DELETEs (Feel free to email me or hit us up in our public Slack as well).
About a year ago Vitess merged in support for what we call VReplication ( https://vitess.io/docs/reference/features/vreplication/ ). The initial demo was a custom materialized view that did calculations and rollups on data in real-ish time (it uses the MySQL binlog). Right now the limiting factor keeping us from calling them true materialized views is that configuring them is done out of band using the Vitess control plane, but soon we will be merging in the features to allow standard DDL to configure and manage them!
I'm not sure exactly how fast FAST REFRESH ON COMMIT is, but we at materialize are very fast, and getting faster. Once data is in materialized we can process streams to multiple downstream views with millisecond-level latencies. I'm working on improving our benchmarking capabilities so that we can provide less qualified answers to the question of "how fast is materialized?"
Much more interesting than our speed, though, in my opinion, is the fact that you can use materialized as the place where you do joins _across_ databases and file formats. It's particularly interesting in a microservices environment, where you may have a postgres db that was set up by one team and a mysql db that was set up by another team and the only thing you care about is doing some join or aggregate across the two: with materialized (and debezium) you can just stream the data into materialized and have continuously up to date views across services. Combine this with a graphql or rest postgres api layer (like hasura) and a large amount of CRUD code -- entire CRUD services that I've worked on in the past -- just disappears.
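A sketch of what that looks like (broker, topic, and schema-registry addresses are placeholders, and the exact CREATE SOURCE options vary by version):

```sql
-- Debezium-fed Kafka topics: one from the Postgres service, one from the MySQL service.
CREATE SOURCE pg_customers
FROM KAFKA BROKER 'kafka:9092' TOPIC 'pg.public.customers'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

CREATE SOURCE mysql_orders
FROM KAFKA BROKER 'kafka:9092' TOPIC 'mysql.shop.orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

-- A continuously maintained join across the two services.
CREATE MATERIALIZED VIEW orders_per_customer AS
SELECT c.id AS customer_id, c.name, count(*) AS order_count
FROM pg_customers c
JOIN mysql_orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;
```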
The OP briefly alludes to the big difference: differential dataflow.
My understanding is that it means that the database doesn't update the view simply because a datum arrived, but rather only if the datum will change it. That avoids an enormous amount of churn and makes materialized views fantastically cheaper on average.
Edit: The original thread here on HN has lots of detailed answers from the Materialize folks, plus some gushing fandom from that jacques_chester dweeb: https://news.ycombinator.com/item?id=22359769
I think a 5-second summary of the difference is that Materialize is Oracle FAST REFRESH for almost* all of SQL, as opposed to just those very limited cases where Oracle's FAST REFRESH works.
Oracle's docs have entire features in order to help you diagnose whether your views are eligible for fast or synchronous refresh[1]. In Materialize, everything you can express in SQL is fast refreshed. No caveats.
Indexed views in SQL server behave similarly (though not as performant as Oracle I believe). They also have some constraints on application (schema bound, data types, nesting).
I'm working on a Redis adapter for DynamoDB - Dynamo is really a distributed superset of Redis, and most of the data structures that Redis has scale effectively to the distributed hash table + B-Tree-like system that Dynamo offers. Having a well-known and understood API like Redis is a boon for Dynamo, whose API is much more low-level and esoteric.
The Go library is in beta, working on a server that's wire compatible with Redis.
It would be super complicated, but technically feasible. DynamoDB has a change log called streams that would technically work, with a lot of aggregation, even with the distributed nature. But I would think about it only as a late stage and very expensive feature.
I really wish one of the existing db technologies, Firebird, got a shot in the arm. It has both embedded and server modes, which makes it unique as far as I know. The database is a single file which, with Firebird's "careful write" methodology, remains consistent at all times; because it has MVCC, you can make a backup at any time, and even a file copy of the database file with open transactions should not be corrupted. The installer size comes in under 10 MB. It's being actively improved and is open source with a very liberal licence, but sadly it only gets a tiny fraction of the attention that SQLite, Postgres, etc. receive.
My understanding of TileDB is that it is 100% client-side. There is no server. In a sense it’s like handling ORC or Parquet or even SQLite files on S3 (except TileDB's are fancy R-trees), with a delta-lake-like manifest file for transactions too.
I think in the future there’s going to be a sine-wave of smart-clients consuming S3 cleverly, and then smartness growing in the S3 interface so constraints and indices and things happen in the storage again, and back and forth...
This is a good description, except that TileDB (the open source client) is not transactional but eventually consistent at least for S3 and other object stores.
I like your point about consuming S3 cleverly, it's often difficult to get good out of the box performance from S3 so abstracting that to the degree possible is good for end-users. The cloud vendors though are always one or two steps ahead of companies that build upon their services. AWS Redshift for instance already can pre-index objects stored on S3 to accelerate queries at the storage layer. It's difficult as a third party vendor to compete with that.
This is a very interesting development that I'd like to learn more about. Whenever I've played around with writing databases (just as toy projects) I've always done so using RocksDB or something similar as a backend. This "thick client" model, though, seems to have a lot of potential benefits, most notably no need to worry about disk space or volumes (so say goodbye to a bunch of config parameters) and no need for a tiered storage setup or S3 migration tools (already accomplished!). Not ideal for most use cases but intriguing for some!
There are a lot of issues with S3 though: latency, poor performance for small reads / writes, timeouts, API rate limits, API costs, and consistency issues poorly understood by third-party developers.
A "thick-client" also doesn't perform well unless that client is located on a node in the same region. I think as with everything it works well in some cases and not well in others.
It's not so difficult if you control the data. Snowflake offers a relational datawarehouse built on EC2/S3 (and now other clouds) with its own column-oriented data format (a hybrid called PAX). It can seek to the right columns and rows by getting the exact byte ranges from an S3 object.
I'm frankly shocked by how well these Clojure Datalog DBs work. Both the data-modelling side (datoms) and the query side are extremely expressive and well integrated into Clojure itself as well.
The principle is (from my perspective): what would a database (API) look like if both data changing over time and structural flexibility were first-class concepts? The result is almost mindbogglingly simple.
Datalog is already such a powerful language in and of itself and I really wonder why it is still such a niche thing.
Don't get me wrong. SQL is great. It is there for a reason. And it is and should be "the default" for most use-cases. But Clojure Datalog DBs have a fairly clear and large advantage, especially when data is temporal and/or graph-like.
I support FoundationDB's approach to databases, which is basically: provide a consistent, distributed, and ordered key-value store, and then you can build whatever type of database you need on top of it, whether that's RDBMS, document, graph, etc.
With that said, CouchDB 4.0 (on FDB) is going to be killer. Master-Master replication between clients and server with PouchDB is phenomenal when you remove the complicated eventual consistency on the server side.
And as a plug, I'm building a multi-tenant/multi-application database on top of it.
This is why I’m inherently skeptical of any database technology that isn't built on top of Postgres :)
PG mostly skipped the whole time-series database trend until Timescale showed up. Still waiting for the graph-db and git-db extensions!
I think there’s been so much progress at the lowest level, that new SQL-databases might be kind of foolhardy to not build on top of PG at this point, especially if their main claim to fame is essentially a data type and some fancy indexing.
I hope PG ends up more like Linux, and less like C++ or Java. :)
Software is advancing so fast. Interesting to constantly reconsider the things I consider myself ahead of the curve on vs behind the curve on. Prisma looks great, so I've updated my 'I want functional dbs, not ORMs' post: https://github.com/ericbets/erics-designs/blob/master/funcdb...
I'd recommend checking out PostgREST for this (if you're using Postgres). We used this approach in my previous startup quite successfully.
We also have plans at Supabase[1] to make Postgres databases semi-ephemeral. You'll be able to spin them up/down using a restful API. The schemas/types will be accessible via a restful API: https://supabase.github.io/pg-api/. Not quite as simple as SQLite, but a good alternative.
> What I have yet to see but always secretly wanted, however, is a database that natively supports incremental updates to materialized views. Yep, that’s right: Materialize listens for changes in the data sources that you specify and updates your views as those sources change.
This is precisely one of the features that make ClickHouse shine.
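For example, a rough ClickHouse sketch (names made up): the view below is updated incrementally as rows are inserted into the source table.

```sql
-- Raw events land here.
CREATE TABLE events (
    event_time DateTime,
    user_id    UInt64
) ENGINE = MergeTree()
ORDER BY event_time;

-- Updated on every insert into events; SummingMergeTree folds the
-- per-insert partial counts together during background merges.
CREATE MATERIALIZED VIEW events_per_minute
ENGINE = SummingMergeTree()
ORDER BY minute
AS SELECT
    toStartOfMinute(event_time) AS minute,
    count() AS events
FROM events
GROUP BY minute;
```

(When reading you'd still typically `SELECT minute, sum(events) ... GROUP BY minute`, since background merges are eventual.)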
Does anybody know of a good educational resource on software/best practices that is kept up to date? Ideally something that does not include the latest bleeding edge, but things that are battle-hardened or getting there. Something that includes open source and commercial software would be ideal.
It's basically a 200 line document database in Python that's backed by SQLite. I need to store a bunch of scraping data from a script but don't want a huge database or the hassle of making SQLite tables.
Goatfish is perfect because it stores free-form objects (JSON, basically), only it lets you index by arbitrary keys in them and get fast queries that way.
It's pretty simple, but a very nice compromise between the simplicity of in-memory dicts and the reliability of SQLite.
>What I’m really hoping for is the emergence of extremely “hackable,” resolutely non-monolithic DBs that provide a plugin interface for highly use-case-specific data types,
Interesting notes but I feel like the db itself has been commoditised and the battle is elsewhere now.
So anyone building a database engine today will find out that to make it sustainable they also need an ecosystem on top of it: tooling, community, paid support, active devs, consultants (for which they may have no runway).
Finally I find anything that calls itself a database and uses S3 as a backend a bit ridiculous. S3 has eventual consistency so you can’t do the operations that differentiate a database from a file system.
I've actually used materialized views in production on SQL Server, and they're great when they work, but they have far too many limitations. Most of these make sense, but some of the join limitations are internal constraints of the engine, not a fundamental limit. This prevents their use for a wide range of scenarios where they would be useful...
When I tried to use auto incremental materialized views in SQL Server (about 10 years ago) - they had so many limitations that they were virtually useless.
I now use Postgres, and this is the number 1 feature that's missing for me. There have been a few attempts that have unfortunately not made it into the core. I don't understand why this critical capability is not even on the radar of the core developers - maybe this publicity from Materialize will raise its priority.
I think that GraphQL adapters like Hasura and goke are also an important innovation. For small MVP projects you can create a GraphQL API to query your database directly from the frontend, which reduces development time by a factor of 2 at least.
I'll just throw in a note for a new product, AWS's QLDB. It's a managed product, built on tech AWS first used internally, that combines a replicated, immutable, versioned document database with ACID transactions and an immutable, provable history of every modification. There's some streaming support and a subset of SQL on the back end.
Something this focused should have a few applications where bit-level auditability matters, e.g. financial records, chains of events, etc. Of course it comes with some tradeoffs vs a relational or kv db.
I wonder if there would be room for a self-hosted clone?
I love these kinds of posts. They're targeted towards what people are finding interesting and they're highly tech related. It's a great way to find new technology.
Also, I'd like to add one database to the list (I've been working there for 3 weeks now): TriplyDB [0]. It is making linked data easier.
Linked data is useful for when people of different organizations want a shared schema.
In many commercial applications one wouldn't want this, as data is the valuable part of a company. However, scientific communities, certain government agencies and other organizations -- that I don't yet know about -- do want this.
I think the coolest application of linked data is how the bio-informatics/biology community utilizes it [1, 2]. The reason I found out at all is because one person at Triply works to see if a similar thing can be achieved with psychology. It might make conducting meta-studies a bit easier.
I read the HN discussions on linked data and agree with both the naysayers (it's awkward and too idealistic [4]) and the yaysayers (it's awesome). The thing is:
1. Linked data is open, open as in open source; the URI [3] is baked into its design.
2. While the 'API'/triple/RDF format can be awkward, anyone can quite easily understand it. The cool thing is: this includes non-programmers.
3. It's geared towards collaboration. In fact, when reading between the lines, I'd argue it's really good for collaboration between a big heterogeneous group of people.
Disclaimer: this is my own opinion, Triply does not know I'm posting this and I don't care ;-) I simply think it's an interesting way of thinking about data.
[3] I still don't know the difference between a URI and URL.
[4] I think back in the day, linked data idealists would say that all data should be linked to interconnect all the knowledge. I'm more pragmatic and simply wonder: in which socio-technological context is linked data simply more useful than other formats? My current very tentative answer is those 3 points.
I recently had to deal with some RDF data expressed as N-triples. So, naturally I loaded it into a proper triplestore and embraced the whole W3C RDF universe and started teaching myself SPARQL, right? Nah, instead I just translated that shit into CSV and loaded it into Postgres, then post-processed it into a relational schema that I can actually understand and query with a language that other people at my company also know. The RDF was just a poorly specified way of communicating that schema to me, along with a data format that's no better than a 3-column CSV. Great stuff from the Semantic Web folks here, real powerful technology.
Edit: Also, to answer your question, the difference between a URL (Uniform Resource Locator) and a URI (Uniform Resource Identifier) is that the URL actually points to something, an object at a particular ___location, and you can paste it into your web browser to view that something. A URI just uses a URL-like scheme to represent identifiers, such that your ___domain and directory structure provide a kind of namespace that you control. But as long as it follows the format, your URI can contain literally anything, it doesn't have to be human-readable or resolve to anything in a web browser. It might as well be mycompany_1231241542345.
> ...I just translated that shit into CSV and loaded it into Postgres, then post-processed it into a relational schema that I can actually understand and query with a language that other people at my company also know. The RDF was just a poorly specified way of communicating that schema to me, along with a data format that's no better than a 3-column CSV. Great stuff from the Semantic Web folks here, real powerful technology.
I'm not sure what point you're trying to make - that is exactly what RDF is for! It's not an implementation technology, it's purely an interchange format. You should not be using a fully general triple store unless you really have no idea what the RDF you work with is going to look like.
SPARQL is the same deal; it's good for exposing queryable endpoints to the outside world, but if your queries are going to a well-defined relational database with a fixed schema and the like, you should just translate the SPARQL to SQL queries and execute those.
Well I’m glad to hear that my solution was sane, but I just don’t see what technological innovation was contributed by the RDF. The file was canonical N-triples, AKA a CSV with space instead of comma. The predicates establish relationships between subjects and objects, but those relationships could be one to one, one to many, or many to many. Should a given predicate be modeled relationally as a column, or with some kind of join table? I have no idea from the RDF. Say, those objects, are the types I see attached to the values the only types that could possibly appear for that predicate? Who knows! Sure, the data has been interchanged, but the “format” is so generic that it’s useless. Why not just give me a 3-column CSV and tell me to figure it out on my own, rather than pretend that RDF provided some improvement?
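For the curious, the whole exercise is roughly this (table, predicate, and column names are made up):

```sql
-- Canonical N-triples really is just a 3-column table.
CREATE TABLE triples (
    subject   text NOT NULL,
    predicate text NOT NULL,
    object    text NOT NULL
);
-- \copy triples FROM 'dump.tsv'   -- after translating the N-triples to TSV

-- Pivot a few known predicates into a relational shape. Note the guess baked
-- in here: that each predicate is single-valued per subject, which is exactly
-- the kind of thing the RDF alone doesn't tell you.
CREATE TABLE people AS
SELECT subject AS person_id,
       max(object) FILTER (WHERE predicate = 'http://example.org/name')      AS name,
       max(object) FILTER (WHERE predicate = 'http://example.org/birthDate') AS birth_date
FROM triples
GROUP BY subject;
```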
Thanks for your comment on it by the way. I'm still in the phase of gathering what everyone thinks of it. I've noticed that RDF seems a bit polarizing. I have the suspicion that people who feel neutral about it also don't feel the need to chime in.
Ah, one has to resolve, the other doesn't have to. Thanks for the explanation.
It seems you're proving my point: you didn't need to collaborate outside of your company. So you picked the path of least resistance and that's totally what I would do too.
But what if you work together with a plethora of academic institutions and then decide that you want to keep options open such that other disciplines can connect to you and you can connect to them, automatically?
You could create a consortium and create a shared Postgres schema (among other things), since everyone knows it. Or you could just put your linked data online with a web page, no consortium needed. Anyone who wants to link to you, they can. And if they publish their data, then by no effort of your own, your data is enriched as well.
I view linked data as a DSL of some sorts. DSLs are amazing, except if you try to force fit them into something they shouldn't be force fitted into. You are giving an argument that one should not force fit it within one organization.
And I agree with that since that's not where linked data shines. Just like SQL doesn't shine at making a raytracing engine, but that doesn't prevent anyone [1] ;-)
That's my current view anyway (again, 3 weeks in, I mostly dealt with NodeJS and ReactJS issues at the moment).
Also, a lot of SPARQL looks like SQL (to my new eyes anyway). Here's a tutorial on it (for a basic feel, watch until episode 5 -- takes about an hour, or watch just the first episode to get a feel): https://www.youtube.com/watch?v=nbUYrs_wWto&list=PLaa8QYrMzX...
You can certainly store ML models in TileDB. I've been working on an example notebook showing how to store an OpenCV model in TileDB; it's not quite finished yet but will be published soon.
Using Python as an example language, TileDB-Py offers simple functions, like "from_numpy" [1], for basic and simple models which we can write directly. For more advanced (and more common) use cases of complex models, it's easy to create an array to store the model and all associated metadata. TileDB's array metadata and time travel capabilities even allow you to see how your model changes over time as you update it.
The article title was linkbait: "you" and "should" are linkbait tropes, especially when combined. The site guidelines call for rewriting such titles. See https://news.ycombinator.com/newsguidelines.html.
When we change titles, we always try to use representative language from the article. I replaced it with a phrase from the article where it says what it's about in a neutral way. I added a hyphen to "recently minted" because it seems to me that grammar requires it. However, if we worship different grammar gods I am happy to let yours hold sway. The hyphen is now toast.
This is bog-standard HN moderation. I dare say the reason you "really really hate it" is probably because you notice the cases you dislike and weight them much more heavily: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.... Meanwhile the ones you don't find objectionable pass over relatively unnoticed. That is bog-standard HN perception bias.
If we didn't meticulously try to keep titles linkbait-free and less-misleading (more hyphens, sorry!), HN's front page would be completely different, and since the front page is the most important thing here, the entire forum would soon be completely different. That requires editing titles all the time. petercooper, who posted elsewhere in this thread, has an HN title-change tracker (oops, I hyphened again!) which is quite nifty. It doesn't show all of them, though, and it can't tell the difference between moderator edits and submitter edits.
There is a power shortage in my neighborhood and my laptop battery is about to expire, so if you don't see a link momentarily, or if I don't respond anywhere on the site for a few hours, it's because shelter-in-place (uh-oh!) prevents me from finding somewhere else to plug in.
I apologize for the strength of my language. There are several words I used in that comment that I should not have. It's far too easy to forget in the moment that there are real people making these decisions who may very well read what you say. I understand where you're coming from with the decision and will try to select more appropriate titles in the future. And will be much more mindful about how I register complaints.
Aw, thanks. I should have been nicer too. I was making myself chuckle while I wrote some of that, which is one way to mitigate the tedium of writing similar things over and over, but I fear that it comes across as a bit mean sometimes. By the way: nice article!
I hope you're doing alright with the shelter-in-place. Losing power sucks.
If it goes on too long, and someone has a car nearby, you can use that as a power source. (Be sure to keep the engine running though, otherwise you'll end up with two objects that need a power source...)
I was always super curious what your browser extension for modding HN is like. If there's such a huge difference between phone vs laptop, it sounds like the extension is doing much of the work. It's a neat power tool.
I'd just like to briefly comment on hyphens in compound adjectives.
It's usually not correct to add a hyphen between an adjective and an adverb that modifies it. The case of "recently minted database technologies" matches this rule. Since "linkbait-free" and "title-change" are single adjectives made up of several words each, it's correct to use the hyphen.
However, in the case of "less-misleading", the hyphen is generally incorrect as "less" is an adverb. The exception would be if it precedes an uncountable noun, where "less" could conceivably modify the noun. Example: "less misleading critique" would be ambiguous, so using a hyphen would be appropriate.
InfoWorld also had an article recently about 9 'offbeat' databases to check out if you want to go even further: https://www.infoworld.com/article/3533410/9-offbeat-database...
Exciting times in the database space!