We have a new business system we are trying to build that has a lot of very granular auditing and versioning requirements throughout, and this is where I started looking.
After a while of screwing with this kind of approach, I realized that the number of different types of objects involved was going to cause an explosion in the number of related auditing and versioning tables. Keeping track of not only who changed something but also the exact facts they changed (such that you could revert as needed) requires a lot of additional boilerplate in each instance.
So, I made a pretty big gamble and went all-in on an event-sourced abstraction where everything is written to 1 gigantic append-only log of events. Turns out, this was worth it in our case. As long as everything has a timestamp and you have a way to refer to specific log entries (i.e. by offset), you basically have a complete solution. The hard parts are handled by a JSON serializer and gzip. We use in-memory indexes that are simply long/long dictionaries that map an object identity to a physical log offset.
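For illustration only, the same shape can be expressed as a single Postgres table (the actual implementation described above is an in-memory dictionary over a gzipped JSON log, not a database):

```sql
-- One gigantic append-only log of events, addressed by offset.
CREATE TABLE event_log (
    log_offset bigserial PRIMARY KEY,             -- stands in for the physical file offset
    object_id  bigint      NOT NULL,              -- object identity
    ts         timestamptz NOT NULL DEFAULT now(),
    payload    jsonb       NOT NULL               -- serialized event / state
);

-- The long/long "identity -> latest offset" lookup is just an index plus DISTINCT ON.
CREATE INDEX ON event_log (object_id, log_offset DESC);

SELECT DISTINCT ON (object_id) *
FROM event_log
ORDER BY object_id, log_offset DESC;              -- current state of every object
```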
The only downside with what I have proposed is that there is no garbage collection or cleanup possible. At this time, this is actually a highly-desirable property of the system. Disk space is cheap. Losing 3 weeks of work to an accidental button click is not.
> The only downside with what I have proposed is that there is no garbage collection or cleanup possible.
If you are using Kafka, you can use log compaction[0] to achieve this. It's also possible to delete individual entries with tombstone records, assuming your objects have a "key" attribute.
In MySQL parlance, I think it is pretty common to run a replicated server where you save a copy of the binlogs, which are the replication messages that the primary sends to the secondary to keep it in sync. That supposedly lets you reconstruct every state the DB has been in. It depends on implementation details that I haven't checked, though: it doesn't follow automatically from replication. But I know of some sites that do it that way.
There is also the concept of "purely functional data structures", used for example by Happstack. Those are data structures where you never mutate anything, but instead do updates by allocating new nodes and pointers, so an update takes O(log n) operations, since you have to reallocate the path to the root of a B-tree-like structure instead of just overwriting a cell. In principle that should still not be too bad if the amount of data isn't huge. For whatever reason, though, Happstack's performance is apparently disappointing even when you take its design into account.
This is precisely what https://github.com/sirixdb/sirix does. A resource in a database is stored in a huge persistent structure of index pages.
The main index is a trie, which indexes revision numbers. The leaf nodes of this trie are "RevisionRootPages". Under each RevisionRootPage another trie indexes the main data. Data is addressed through dense, unique and stable 64-bit int nodeKeys. Furthermore, the user-defined secondary indexes are currently also stored as further tries under a RevisionRootPage.
The last layer of inner pages in a trie adds references to a predefined maximum number of data page fragments. The copy-on-write architecture does not simply copy whole data pages; what gets copied depends on the versioning algorithm. The default is a sliding snapshot algorithm, which copies changed/inserted/deleted nodes plus nodes that fall out of a predefined window (usually the window is small, as the page fragments have to be read from random locations in parallel to reconstruct a full page). This reduces the amount of data to store for each new revision. The inner pages of the trie (as well as the data pages) are not page-aligned, thus they might be small. Furthermore, they are compressed before being written to persistent storage.
Currently, it offers a single read-write transaction on a resource plus read-only transactions without any locks.
Those logs, though, are at the lower DB level of abstraction (using them requires getting out of the database proper).
What the parent is referring to is the actual database schema itself. It’s a giant versioned log essentially.
Presumably there is some sort of ‘key frame’ however, or you’d need to go all the way back however many years to start rebuilding the object's current state?
> Presumably there is some sort of ‘key frame’ however, or you’d need to go all the way back however many years to start rebuilding the object's current state?
Correct. My criterion for making the snapshot feature a priority is log recovery taking longer than 5 minutes. I cannot see that occurring anytime soon based on current figures.
As posted in another comment, that's basically what https://github.com/sirixdb/sirix supports, along with easily reconstructing former revisions of a JSON document (or any node therein), sophisticated secondary (also versioned) indexes, querying with JSONiq, and optimizations at query compile time for joins and aggregates...
In our case we opted for three types of secondary user-defined indexes (besides the main document trie index, which finds unique, stable nodes by their 64-bit ID), namely
1. field indexes
2. path indexes
3. content-and-structure indexes which index paths and their typed values
Furthermore, a path summary keeps track of all distinct paths in the JSON resource.
I like your event sourcing approach. I was recently looking into YugabyteDB change data capture and it quickly became apparent that CDC can be used for exactly that. I wrote about it here: https://gruchalski.com/posts/2022-02-23-yugabytedb-change-da....
One problem with the CDC approach is not having a way to monitor DDL operations. However, at least in any Postgres-compatible database with extension support, it’s pretty easy to either use pg_audit or write a rudimentary extension for DDL streaming.
I've done this before on smaller scales. By writing a set of table-valued functions that expose the event data in relational form "as of the date given by param X", and views on top of that where X is set to getdate(), you get a very ergonomic way to consume "current" and "historical" data. Materialized views can make querying these more natural relations very cheap if your engine supports them.
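A minimal sketch of that layering in Postgres syntax (the getdate() above suggests SQL Server, but the shape is the same; the table and column names here are made up):

```sql
-- Hypothetical append-only event table: one row per change to a member.
CREATE TABLE member_events (
    member_id bigint      NOT NULL,
    ts        timestamptz NOT NULL DEFAULT now(),
    record    jsonb       NOT NULL
);

-- Table-valued function: the latest state of every member at or before a given time.
CREATE FUNCTION members_as_of(as_of timestamptz)
RETURNS TABLE (member_id bigint, record jsonb) AS $$
    SELECT DISTINCT ON (e.member_id) e.member_id, e.record
    FROM member_events e
    WHERE e.ts <= as_of
    ORDER BY e.member_id, e.ts DESC;
$$ LANGUAGE sql STABLE;

-- "Current" data is just the function evaluated at now().
CREATE VIEW members_current AS SELECT * FROM members_as_of(now());
```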
Can you explain how you "go back" to an earlier state? I assume your event stream writes into a database, and that database is then used by the front end. Do you replay the entire stream up until date/time x into a separate database to restore or check out the state up to then?
Usually you make heavy use of SELECT DISTINCT ON (…) FROM … ORDER BY ts DESC, and you can add WHERE ts < ts_0 to recreate the state as of ts_0. Plus liberal use of MATERIALIZED VIEWs.
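For instance, continuing the hypothetical member_events table sketched above, the materialized-view variant could look like:

```sql
-- Materialize "state as of ts_0" once, then query it cheaply afterwards.
CREATE MATERIALIZED VIEW members_at_ts0 AS
SELECT DISTINCT ON (member_id) *
FROM member_events
WHERE ts < timestamptz '2022-03-01'   -- ts_0
ORDER BY member_id, ts DESC;
```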
I am currently investigating synchronous replication of each compressed log batch to a 2nd host. We are totally fine with availability issues (i.e. system down because log replication broken). Availability problems are far preferable to consistency and other nasty bugs.
Love the detail and approach here, though (as mentioned at the bottom of the article) this shouldn't be used at very large scale... the estimate of about 1k transactions per second or less seems about right. For any larger scale you want to look at something like pg_audit, which can scale much further and be used with other extensions to ensure you have a full audit log that can't be modified (when dealing with compliance).
I've battled this problem many times over my career in many different ways - this approach looks very solid to me. I particularly like the way the schema design incorporates a record ID, old record ID, updated record and old record.
You do end up storing a lot of extra data, but it's very thorough - it's not going to miss anything, and figuring out what happened later on should be very easy.
I really like this approach. However, I have a use case where the ID of the application user who made the operation must be saved in the audit table, and unfortunately I cannot see how to do that with a pure SQL solution.
we do that by setting a local variable before the query and then reading that in the triggers:
```
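-- SET LOCAL scopes the setting to the current transaction, so the audit trigger
-- fired by the UPDATE below can read it back via current_setting('foo.who_id').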
SET LOCAL foo.who_id = 'some-uuid';
UPDATE table
SET ...
```
```
-- function to make getting the setting easier
DROP FUNCTION IF EXISTS get_who_id (text);
CREATE OR REPLACE FUNCTION get_who_id (default_value text DEFAULT null::text) RETURNS text AS $get_who_id$
DECLARE
    who_id text;
BEGIN
    BEGIN
        who_id := current_setting('foo.who_id');
    EXCEPTION
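        -- SQLSTATE 42704 (undefined_object): 'foo.who_id' was never set in this transaction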
        WHEN SQLSTATE '42704' THEN
            RETURN default_value;
    END;
    IF (length(who_id) = 0) THEN
        RETURN default_value;
    END IF;
    RETURN who_id;
END
$get_who_id$ LANGUAGE plpgsql VOLATILE;
```
```
CREATE OR REPLACE FUNCTION some_table_audit () RETURNS TRIGGER AS $some_table_audit$
DECLARE
    who_id text;
BEGIN
    who_id := get_who_id(null::text);
    ...
```
Identifiers changed, but hopefully will give you the idea.
I've experimented with a very similar solution, and it felt a bit dirty, but so far it seems to be working just fine. I have made an integration for auditing with sqlmodel which I intend to share, but it is kind of rough and I was a bit stuck trying to clean it up. The idea is that you add e.g. a HeroHistory model derived from HeroBase and a HistoryMixin that creates the triggers and relationships to have an audit log of the Hero table.
Are you at all worried about the size of the audit table? Keeping a JSON copy of both the old and the new data is going to add up pretty quickly for fast-changing tables.
I've built audit systems in the past which track just the new values, with the idea being that you can replay the audit log for a particular record from the beginning to reconstruct its state. I'm not convinced that's a better approach than yours here though, just interested in your thinking.
> Are you at all worried about the size of the audit table? Keeping a JSON copy of both the old and the new data is going to add up pretty quickly for fast-changing tables.
That's def a valid concern if you're auditing a high-frequency insert/update table, or have your entire database under audit. If you do have those needs, a system that logs outside of Postgres (like pgaudit) would be a better fit.
In my experience most startups and mid-size enterprises sprinkle in auditing around the sensitive parts of the DB, like `account` or `access_control` tables, where write load isn't much of a concern.
> I've built audit systems in the past which track just the new values
Yeah, that solution works great! The one place it breaks down is if you apply auditing to an existing table and need to be able to recover the records as they existed when auditing was enabled (the initial state is not snapshotted).
We toyed with another approach to avoid having to track old_record: first check whether the `record_id` exists in the audit table, and insert a row with the OLD values and a `SNAPSHOT` `operation` if it does not.
Even though that query/check was operating on an index, the performance overhead was higher than we were comfortable with.
Storage is pretty cheap these days so we opted to optimize for write throughput + reduced memory usage rather than any considerations wrt disk.
> We toyed with another approach to avoid having to track old_record: first check whether the `record_id` exists in the audit table, and insert a row with the OLD values and a `SNAPSHOT` `operation` if it does not.
> Even though that query/check was operating on an index, the performance overhead was higher than we were comfortable with.
Was this done inside the trigger function? Couldn't you just insert the SNAPSHOT values first before applying the trigger? Seems wasteful to me to track both NEW and OLD for all subsequent operations just to ensure the initial value is covered.
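A hedged sketch of that ordering (the audit table and its columns here are assumed, loosely following the article's record_version shape, and `members` is a stand-in table):

```sql
-- 1. One-time backfill: snapshot every existing row into the audit table...
INSERT INTO audit.record_version (record_id, table_name, op, ts, record)
SELECT m.id, 'members', 'SNAPSHOT', now(), to_jsonb(m)
FROM public.members m;

-- 2. ...and only then attach the audit trigger, so later entries need only NEW
--    (trigger definition elided; see the article for the real one).
```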
I've also used an update trigger to track only changed columns, inspired by this post[0], which wraps a comparison function in a new subtract operator for JSONB, similar to the built-in subtract operator for HSTORE, which removes any key/value pairs that haven't changed.
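A minimal sketch of such a comparison helper (the function name is made up; the linked post goes further and wraps something similar in a custom `-` operator):

```sql
-- Keep only the keys of new_row whose values differ from old_row.
CREATE OR REPLACE FUNCTION jsonb_changed_keys(new_row jsonb, old_row jsonb)
RETURNS jsonb AS $$
    SELECT COALESCE(jsonb_object_agg(key, value), '{}'::jsonb)
    FROM jsonb_each(new_row)
    WHERE new_row -> key IS DISTINCT FROM old_row -> key;
$$ LANGUAGE sql IMMUTABLE;

-- Inside an UPDATE trigger: jsonb_changed_keys(to_jsonb(NEW), to_jsonb(OLD)) gives the delta.
```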
Building a bit off Simon's question above: I'm curious if any thought was given to using JSON over JSONB. The trade-off of really fast ingest and storage could favor JSON over JSONB, which has to do some conversion on write; and if there are any indexes on the JSONB columns, that could really dramatically slow down throughput.
> I'm curious if any thought was given to using JSON over JSONB
For sure! There is a conversion penalty that is paid incrementally (at write time). For paying that penalty, jsonb gets you reduced storage size and dramatically faster column value extraction with `->` or `->>` if you end up having to filter the audit table by something other than the `record_id`
It is a tradeoff though and depending on the specifics of your use-case JSON could be a better choice.
First of all, this looks super awesome, and at least an order of magnitude better than our home-grown solution (grown over 25+ years on a DB that is now mariadb).
How well does this work when you audit multiple tables with foreign key relationships between them?
If the user references a customer and you want to find out which customer the user belonged to at time X (but the customer might have been deleted in the meantime), you have to trace through the history of multiple tables to find answers. More so if you have n2m relationships...
It sounds doable, but not trivial to get right for more complicated models.
---
Update: to clarify, I want to be able to present a somewhat human-readable history of an object without doing much table-specific logic.
If an author can publish many books, you typically have an n2m mapping table, like book_authors. If you want to add newly published books to the "author" history view, you have to determine that you need to include parts of the history of the book_authors table. I don't think you can do that through a join though? Trying to think my way through this...
Maybe this is easier to deal with if you can assume each table has a non-composite primary key and include that primary key in the history table; then you can do joins more easily on the history?
All the data would be available, and joining it would be possible with some jsonb gymnastics, but it's optimized more for compliance-style auditing than for rapid lookups with complex queries.
For that scenario, replicating audit data to an OLAP system may be a better fit.
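As a rough illustration of those jsonb gymnastics (the audit table and column names here are assumed, not taken from the article): which customer did a given user belong to at time X?

```sql
-- Latest audit image of user 42 at/before time X, joined to the latest image
-- of the customer it pointed to at that time.
SELECT u.record ->> 'name' AS user_name,
       c.record ->> 'name' AS customer_name
FROM audit.record_version u
LEFT JOIN LATERAL (
    SELECT cv.record
    FROM audit.record_version cv
    WHERE cv.table_name = 'customers'
      AND cv.record ->> 'id' = u.record ->> 'customer_id'
      AND cv.ts <= timestamptz '2022-03-01'   -- time X
    ORDER BY cv.ts DESC
    LIMIT 1
) c ON true
WHERE u.table_name = 'users'
  AND u.record ->> 'id' = '42'
  AND u.ts <= timestamptz '2022-03-01'        -- time X
ORDER BY u.ts DESC
LIMIT 1;
```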
Really nice post! A possible addition could be showing the diff as its own field. This seems like a sane approach: https://stackoverflow.com/a/36043269
But how could you go about not storing the diff but still being able to select it?
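One non-authoritative option: compute the diff on read instead of storing it, e.g. in a view that calls a diff helper such as the jsonb_changed_keys sketch above or the JSONB subtract function from the linked answer (names and columns assumed):

```sql
-- Diff derived at query time from the stored old/new row images.
CREATE VIEW audit.record_version_with_diff AS
SELECT rv.*,
       jsonb_changed_keys(rv.record, rv.old_record) AS diff
FROM audit.record_version rv;
```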
this is great! I've done something similar many times, but the ability to do it on arbitrary tables is really nice.
If you have the HSTORE extension, you can also use it to get a diff of the old and new: `HSTORE(OLD) - HSTORE(NEW)` and vice versa, very nice to easily have the deltas.
I'm not completely clear on the reason for both record_id and old_record_id - is this so that you can capture instances where one of the columns that makes up the primary key for the row is itself modified for some reason?
Being able to see when the primary key changes using `record_id` and `old_record_id` is a perk, but we also wanted to make sure a snapshot of the existing data was available when auditing is enabled on a table that already has data in it. See comment ^ for more color
Long story short: it is avoidable if you're okay with paying the penalty of querying the audit table by `record_id` before every write. We decided to use more disk and keep the higher write throughput
At 42Floors, we experienced a similar need for auditing. Initially we took an approach similar to the one described in the post. The audit tables began to be incorporated into every aspect of the product, which over time caused them to become quite large and ended up causing a lot of downtime. Our initial implementation only audited what was inserted by our Rails application; if a change was made by another application or manually, it wouldn’t appear in the table.
Over the course of a few months we extracted the system and built a standalone application that today has become [changebase.io](https://changebase.io/). We now use CDC (change data capture) to capture all events in the database, plus a metadata table to capture the context of why the change occurred and who made it. The changes are then structured to be easily queryable.
Initially we encoded the data as JSON, but over time we have opted for encoding the data in the database’s native format. This creates a more complex query system, but at the end of the day we don’t have to worry about type conversions, and querying is done in the same format as the database that was replicated.
Teams have benefited from the system by being able to recover data that was accidentally or maliciously deleted. Notifications have also been used for high-value datasets where someone on a sales team wants to know immediately about a change in the system.
This looks like it works pretty well to just 'dump' data changes and, if need be, to visualize them somewhere, but the data is unstructured, and long-term there may no longer be any relationship to older data.
Which may be fine for your use case, don't change anything if it works for you.
In my use case, I also have a requirement that a user needs to be able to revert changes - an undo function, for edits and a 'deleted' state. I've been looking into Temporal Tables for that, where every table has a sibling 'history' table with a pair of timestamps - valid from and valid to. This allows you to query the database - including related rows in other tables - at a certain point in time, and to revert to a previous state without losing any intermediate steps (copy rows at timestamp X, insert as the 'current' version).
This is a built-in feature in a lot of modern enterprisey databases, but I'm constrained to SQLite. But using triggers I've been able to make it work in a POC.
Whether it'll work in my 'real' application is yet to be determined, I've got a lot of work to do still - I need to get rid of my ORM and redo my data access to something more manual, unfortunately.
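For what it's worth, a minimal SQLite sketch of that trigger-based POC might look like this (table and column names are made up; an AFTER INSERT trigger to open the first history row would be analogous):

```sql
CREATE TABLE book (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);

CREATE TABLE book_history (
    id         INTEGER NOT NULL,
    title      TEXT NOT NULL,
    valid_from TEXT NOT NULL,                                -- ISO-8601 timestamps
    valid_to   TEXT NOT NULL DEFAULT '9999-12-31T00:00:00Z'  -- "still current"
);

-- On every update: close the open history row, then open a new one.
CREATE TRIGGER book_history_update AFTER UPDATE ON book
BEGIN
    UPDATE book_history
       SET valid_to = strftime('%Y-%m-%dT%H:%M:%fZ', 'now')
     WHERE id = OLD.id
       AND valid_to = '9999-12-31T00:00:00Z';
    INSERT INTO book_history (id, title, valid_from)
    VALUES (NEW.id, NEW.title, strftime('%Y-%m-%dT%H:%M:%fZ', 'now'));
END;
```

Querying "the book as of time X" is then just WHERE valid_from <= X AND X < valid_to, which can be joined across the history tables of related rows.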
Ok, my SQL skills might be a bit lacking, but from the article:
> select audit.enable_tracking('public.members');
So you can actually have functions with side effects in a SELECT statement? I guess nothing prevents it, so it is allowed. But somehow I had the impression that SELECT statements don't change anything.
For even more fun, try "SELECT pg_cancel_backend(pid) from pg_stat_activity".
(DON'T ACTUALLY DO THIS on anything other than a personal test db as it will kill all the connections it has permission to kill)
Relatedly, Postgres has a number of different volatility options for functions so you can declare whether there are side effects (https://www.postgresql.org/docs/14/xfunc-volatility.html). These can become very important in some cases, giving the optimizer the freedom to shine.
Yep, you can, though a lot of systems that should know better also assume they don't have side effects. For example, PostgREST GETs happen in read-only transactions, which would cause a problem if you wanted to use a system like this to keep an audit log of hits to a certain endpoint or whatever.
I think you can use a standalone VALUES statement to avoid side effects? Those aren't well supported even though they're part of the SQL spec, but Postgres does have them.
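For example, this returns rows without reading or writing any table:

```sql
-- A bare VALUES list is a complete query in its own right in Postgres.
VALUES (1, 'one'), (2, 'two');
```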
I'd argue that's still a lot of work to manually do. However, great work and detail, thanks a lot :-)
I'm working on a database system[1] in my spare time which automatically retains all revisions and assigns revision timestamps during commits (a single timestamp in the RevisionRootPage). Furthermore, it is tamper-proof, and the whole storage can be verified by comparing a single UberPage hash, as in ZFS.
Basically it is a persistent trie-based revision index (plus document and secondary indexes) mapped to durable storage, a simple log-structured append-only file. A second file tracks revision offsets to provide binary search over an in-memory map of timestamps. As the root of the tree is atomically swapped, it does not need a WAL, which would basically be another data file and can be omitted in this case.
Besides versioning the data itself in a binary encoding similar to BSON it tracks changes and writes simple JSON diff files for each new revision.
Furthermore, the data pages are not simply copied on write; a sliding snapshot algorithm makes sure that mainly only changed records have to be written. Before the page fragments are written to durable storage they are compressed, and in the future they might be encrypted.
You might like what we're doing with Splitgraph. Our command line tool (sgr) installs an audit log into Postgres to track changes [0]. Then `sgr commit` can write these changes to delta-compressed objects [1], where each object is a columnar fragment of data, addressable by the LTHash of rows added/deleted by the fragment, and attached to metadata describing its index [2].
I haven't explored sirix before, but at first glance it looks like we have some similar ideas — thanks for sharing, I'm excited to learn more, especially about its application of ZFS.
In order to provide fast audits of subtrees, the system optionally stores a Merkle hash tree (a hash in each node) and automatically updates the hashes of all ancestors during updates.
Every Google hit for "UberPage" seems to be your writing. Is there standard ZFS terminology, findable in the ZFS documentation, for what you're referring to?
In our case it's the tree root of the index, which is always written after all descendant pages have been written when a new revision is committed (during a postorder traversal of the new pages).
In ZFS the UberPage is called the uberblock. We borrowed some of the concepts, such as adding checksums to the parent pages instead of the pages themselves. In ZFS they are blocks :-)
True. In general I could add storing relational data as well, but currently I'm entirely focused on JSON and auto-indexing for secondary indexes, as well as higher-order function support in Brackit.
Of course we'd need more man-power as I'm more or less the only one working on the core in my spare time (since 2012).
Moshe mainly works on the clients and a frontend. A new frontend based on SolidJS is in the works showing the history and diffs as in
This is a great technique; we had it set up very similarly at a previous startup and it worked wonders. Thanks for sharing a generic way of doing it, @oliverrice!
Is there something like Datomic for Postgres? Having the team learn Clojure and Datalog, and deploying Datomic, was a huge pain for us. Not a road we recommend, since all we really needed was an immutable, append-only log of all transactions.
XTDB (https://xtdb.com), while not directly for Postgres, is at least not reliant on devs knowing Clojure. It's very close to Datomic in terms of approach. They expose SQL, Java and HTTP APIs.
This used to exist in Postgres: it held all history for all time and you could rewind, etc. The reality is it was a huge pain to maintain and no one ever really used it, so it got pulled out.
Nowadays it's far better to look at tooling outside and around Postgres to give you that. Under the covers the WAL is effectively this, so with tools like backrest you can maintain backups and WAL forever, then have the ability to restore to an old point as you see fit. Managed service providers often give this out of the box, though only for a limited period of time.
I did a similar thing with SQL Server using XML. I created views that would convert the XML values back to the correct SQL data types when querying the audit table. The view also unions the audit data with the current data, so that you could add the trigger at any time and the view would return the current data (the audit table only stores updates and deletes). One nice addition was saving the current user ID with all audit records, which I was able to get in the trigger by adding it to the DB connection using CONTEXT_INFO().
I am afraid data audit is not just about data changes. The real challenge is how to audit data reads that people should not be performing.
Since no redo logs are generated for reads, CDC cannot help in this case. It has to rely on audit trails, SQL capture and caching, etc. But that's costly.
That should be captured by security, right? I.e. if someone has access to data the assumption is they are able to read it. A security audit can be done not at the grain of SELECT statements but at the grain of changes to provisioning. Fundamentally, you have an issue even if someone did not actually read what they should not have, but could. SELECT auditing can be useful for performance tuning, but I would not classify it as "audit", rather monitoring.
> if someone has access to data the assumption is they are able to read it.
This is not the case in most production environments, for two major reasons:
1. Row-level security is expensive and hard to manage, so in most cases users are granted the SELECT privilege at the table level. But indeed, that doesn't mean they should have access to all the rows in that table.
2. Power users should not have production accounts all the time. The production accounts should only be activated when power users have to perform the changes during the production change window.
So financial institutions usually generate two kinds of reports for audit/regulation purposes:
1. A report on privileges and roles, which tells whether a user should have those privileges and roles. This is a periodic, static report.
2. A report on the actions a user has performed, to see whether these actions are part of the changes. This is a dynamic report, for which we have to capture the SELECT statements.
I can see the point about expensive row-level security. I cannot see how it is harder to manage than to monitor, though. Designing around risk acceptance of manual detective controls rather than preventative ones does not make this approach generally applicable or best practice, just a feasible one under certain conditions.
I’d +1 the usage of pgAudit when higher throughput is needed. If you’re on AWS, RDS can automatically ship these logs to S3 where you can then run Athena queries against them.