You might as well timestamp it (changelog.com)
307 points by mot2ba on April 24, 2021 | 198 comments



There is a downside which I've experienced: if you want a triple-state boolean (null, false, true) then having a boolean column allows for that while a timestamp-as-boolean column does not (you lose the "null" value because that equals `false` in timestamp-as-boolean).

Having a distinction between `null` and `false` can be handy for values that are optional or have a dynamic default. If it's `null` you know it is not explicitly set and could use a fallback value. If it's false you know it's explicitly set as false.

A simple use-case for this is when a user can leave the field blank. This is impossible to model with only a timestamp-as-boolean.

Another use-case is dynamic defaults or fallbacks, e.g. `hidden` of a folder where if `hidden` is nil, you fall back to the parent folder's value.
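
For illustration, a minimal SQL sketch of that fallback pattern, assuming a hypothetical folders table (all names made up):

    -- hidden is a nullable boolean; NULL means "inherit from the parent"
    CREATE TABLE folders (
        id        integer PRIMARY KEY,
        parent_id integer REFERENCES folders (id),
        hidden    boolean  -- NULL = not explicitly set
    );

    -- Resolve the effective value one level up; NULL falls through
    -- (a full ancestor walk would need a recursive query):
    SELECT f.id, COALESCE(f.hidden, p.hidden, false) AS effective_hidden
    FROM folders f
    LEFT JOIN folders p ON p.id = f.parent_id;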

TL;DR: a boolean column actually has 3 states, a timestamp only has 2. The article makes a big deal of there being no nuance here, claiming a timestamp is simply superior. I disagree: because you go from 3 states to 2, there are cases where you'd want a boolean instead of a timestamp. Ironically, OP missed this nuance (or they'll pull a no-true-Scotsman).


What you describe calls for an enum, not a nullable boolean, which is just another way of passing a hardcoded magic value carrying an implicit meaning.

> a boolean column actually has 3 states, a timestamp only has 2.

Going with your logic a timestamp has billions of states, you just have to arbitrarily assign special meanings to certain dates that won't ever be used. Just like using null as another state, I wouldn't call that a good idea, though.


Your statement is false. Many use cases call for optional bools. The great thing is there is no implicit meaning - it's True, False or Unset. It is true that nulls can be abused but in this case, far more elegant than an enum.


There is no way for you to make sure that `null` was "set" intentionally and not due to a bug or other kind of failure.

Meanwhile if you see a `None` value you know for a fact that it was set by your software and if you actually encounter a `null` you know that something went horribly wrong.

Strong type systems and a few overheads in favor of better bug detection/prevention are popular for a reason.


There are many application states in which `null` could mean "not got the data yet" and/or "there is no data".

Just use an enum, it's much more expressive.
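
For instance, a minimal Postgres sketch (type, table and column names are made up):

    -- A hypothetical enum replacing a nullable boolean:
    CREATE TYPE tri_state AS ENUM ('unset', 'false', 'true');

    CREATE TABLE answers (
        id   serial PRIMARY KEY,
        flag tri_state NOT NULL DEFAULT 'unset'
    );

    -- Every state is a regular value; no NULL special-casing:
    SELECT count(*) FROM answers WHERE flag = 'unset';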


Could you supply an example where an enum wouldn't do as well? How is a bool with null, false, true different from an enum with null, false, true?

Edit: grammar


Logically they're the same, they're isomorphic. It's just about ergonomics, readability, etc.


If you go with nullable Boolean in the database schema, do you give it a type of nullable boolean in the code?

In which case null would either mean the value on the record in the database is null, or that for whatever reason that parameter was not read from the database.

You will need two types of null.

All of this goes away with an enum or similar solution.

You can decompose the nullable boolean in the database when you read it, but again your task would be easier if you treat it like an enum.


I use null or undefined in JavaScript as the initial state for most values for this reason; it's explicit in that it's unset.

Ditto for database design too


Tri-state booleans are also very ugly. What happens if you suddenly need a 4th state?

Use enums (or any equivalent) for states that are non-boolean.


Null is a signifier that a value isn't known. Yes, if you're using null to mean something else you should use another type, but nullable booleans (and nullable fields in general) can be extremely useful. For instance, if you're collecting information, but won't know everything at creation time, the unknown values are null.


> For instance, if you're collecting information, but won't know everything at creation time, the unknown values are null.

I find the idea of 'creating unknown values' to be self-contradictory. I find it much more logical to define a PartialFoo containing only the parts we know up-front, and fill in the rest later using a function PartialFoo -> Foo.

If we need multiple steps, we can put some Optional fields in the PartialFoo, to avoid lots of intermediate types. This is still annoying, but it keeps all of the 'initialisation headaches' separate from the Foo type itself; so code dealing with Foo doesn't have to care about null checks, missing fields, etc.


> If we need multiple steps, we can put some Optional fields in the PartialFoo, to avoid lots of intermediate types.

You just described the purpose of SQL's NULL.


If you are using a nullable bool, you do not need a fourth state. True, False, Unset. Very useful in a lot of cases. It's actually not ugly but very elegant.


I don't think it's elegant at all. Especially in SQL dialects where you need to check null state separately from false. So much bug potential...


With an enum you’d have to check the unset state separately as well.


He meant that in some SQL dialects you cannot check FIELD = NULL; you have to use FIELD IS (NOT) NULL (like in Oracle). Therefore you cannot use FIELD inside an IN clause either.
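
For anyone who hasn't hit this before, a quick sketch of the three-valued logic (Postgres syntax, but the semantics are standard SQL):

    SELECT * FROM t WHERE flag = NULL;   -- comparison yields UNKNOWN: no rows, ever
    SELECT * FROM t WHERE flag IS NULL;  -- the correct way to match NULL rows

    -- NULL inside IN / NOT IN is the classic trap:
    SELECT * FROM t WHERE flag IN (true, NULL);      -- never matches the NULL rows
    SELECT * FROM t WHERE flag NOT IN (true, NULL);  -- matches nothing at all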


But with a (NOT NULL) enum/integer/string or however you are representing it, every state is a regular case. Not a special case. Null nearly always needs some special treatment.


The obvious solution is to encode it like this: `null` is `null`, `true` is the timestamp when it was set, and `false` is a timestamp somewhere in the future. Now t < now() maps nicely to the value of your boolean.

You just have to pick your `false` timestamp somewhere far into the future, let's say something arbitrary like 03:14:07 on Tuesday, 19 January 2038. The software won't be around for that long anyway, so it will never be a problem...


Hehe, that's creative. But I'd say - if you're making it that complex, just use a boolean column and a timestamp column separately instead, if you really need the timestamp, and make it explicit and simple to understand.


And this is why legacy code is such a nightmare. Compound unnecessary optimizations like this over a decade and it’s impossible to understand the data in the DB or how to safely refactor the code without breaking some weird corner case that was handled by using magic values instead of structured data.


Better use something so far in the past that the software certainly didn't exist yet, like 3 January 1970. In fact, you could use all of, say, the '70s to encode lots of flags... *gets excited*


Store your timestamps in a long int or string.

Positive = timestamp

-1 = false

0 = null


I can’t tell if you’re being ironic or not


Yeah it's mostly a reference to Y2K and Y2038 [1], using representations that seem clever and work now, but will lead to bugs in the future because nobody thinks their software will be around for that long.

[1] https://en.wikipedia.org/wiki/Year_2038_problem

> The latest time since 1 January 1970 that can be stored using a signed 32-bit integer is 03:14:07 on Tuesday, 19 January 2038

> MySQL database's built-in functions like UNIX_TIMESTAMP() will return 0 after 03:14:07 UTC on 19 January 2038


OP missed that nuance, and so will a lot of other developers you work with. Relying on a difference between null and false is one of those things that seems clever when you first do it, but then three years later your project is scattered with comments like "// be very careful checking this, it must be false here, null means something else" after you or some other dev you're working with confuses falsy for false.


I've been in software development for 27 years, and I've never seen a case where a boolean column or variable being null was anything but a terrible, confusing, fragile, hack.

Remember that in some languages the behavior of null is weird, and can be false, or can be treated as 0.

    $ node
    Welcome to Node.js v14.16.0.
    Type ".help" for more information.
    > null + null
    0
    >
    > 0 == false
    true
    > 0 === false
    false
    >


What has the ridiculousness of Javascript got to do with a database schema?


What good is a database schema if it doesn't have data and isn't used by any applications?


There are languages with nice optional types, which map nicely to SQL null. Not Javascript though, and I wouldn't want to use it server side.


Long time readers of The Daily WTF know that the canonical tri-state boolean is { True, False, FileNotFound } [1]

[1] https://thedailywtf.com/articles/What_Is_Truth_0x3f_


"triple state boolean", what's next, a double precision fp32?


Also known as *bool, "Maybe Bool", "Boolean?", "Optional<Boolean>" etc.


Well, in browsers there's `HTMLMediaElement.canPlayType()` which returns one of the following strings:

- "probably"

- "maybe"

- ""

Reference: https://developer.mozilla.org/en-US/docs/Web/API/HTMLMediaEl...


My goodness. I remember having to explain this to my manager. They really wanted recording + playback in our platform. For as many browsers as possible. Even mobile (iOS, I'm looking daggers at you). In 2015.

AFAIR I just said fuck it and made playback as permissive as possible (i.e., only prevent media playback if canPlayType returned ""). I don't know how advisable that is but the bug got off my back anyway.

I dunno what makes this so difficult, why we can't get at least a definite "yes" even in 2021.


That's like my washing machine which has a Normal setting, a Standard setting, a Wash setting...


Do you mean as in "If the washing machine has a Wash setting, what does it do if it is not set to Wash?"


And what's the difference between normal and standard?


Wat


yeah, but if you really need to store 3 states, why not make 1 first-class concept instead of awkwardly combining two?


Because a nullable bool has literally the same semantics as Maybe<Bool>. Why invent it again but in your app this time?


Long Boolean! Yes, no, don’t know, weekdays only.


If boolean comes from George Boole, then maybe we should call triple state values "Łukasiewiczan", it would make for a funny typing exercise :P https://en.wikipedia.org/wiki/Jan_%C5%81ukasiewicz


The use cases the blog post is discussing aren't tri-state bools though. Either the user did or did not sign in. Either the version was or was not published. Null and false are equivalent here.


If the timestamp is a creation date after your system went live, which it is here, you can always store 0 for false.


-1 can be useful, too, especially if you've already assigned a certain meaning to 0.

Javascript developers are used to certain functions returning -1 if there's no match, so -1 shouldn't feel strange as long as it's well documented.


I wonder why they decided to use -1. Why not undefined or NaN or even null? The -1 produces bugs which go unnoticed for a long time.


On the other hand, there's strcmp()/strncmp(). I can never wrap my head around the fact that strcmp(s1, s2) == 0 (or even better, !strcmp(s1, s2)) means s1 is equal to s2.


If you look at strcmp like "the difference when you compare the strings for sorting", it makes complete sense. If the first mismatch of strcmp(a, b) is smaller than 0, you put a before b, otherwise you put b before a.

The fact that an equality check can be made by running a comparison function is useful, but that's not all the method does.

For other methods in much C code, a common mindset is that a method returning a value will return the error code, with error code details in errno. The error code for success is 0, which is fitting of course.

Had C implemented booleans, this problem would never have been a problem, because if(int) wouldn't have been a legal expression; but sadly booleans are implemented as integers in the language instead. I strongly dislike languages that allow implicit casting from integers and such to a boolean; if(var!=0) is much more readable because of the explicit boolean expression.


But then it's empty string vs null value all over again.

Besides, it's about an improvement to booleans, not about adding one more optional value (which would likely lead to more optional values, and definitely falls outside the boolean field).


I'm struggling to think of an example following the OP's example - discrete events that may not have occurred yet - where you'd need to differentiate between a certain `false` and an uncertain `null`.


A question like "Do you have any allergies?" probably requires ternary logic: yes/no/unset

Discrete events are usually more binary by nature: a thing either happened or it didn't.

That said, if it's possible for an event to un-happen, you're back in ternary-land: there's now a distinction between un-set and false which may be important to capture.

There's a reason why relational databases use ternary logic when most of the rest of the computing world uses binary logic.

You might argue that you could just create a brand new event, but now you've almost assuredly changed the grain of your table and goofed up the primary key. Your nice normalized table is now a dumb, non-performant endless event log: good luck with indexing that table and tuning those SELECT queries.


The OP does not talk about events; OP talks about state and when it was set. Nowhere in the article does OP restrict themselves to discrete events that may or may not have occurred.


I'm not understanding the distinction. The only thing you get out of this schema is knowing what happened (implicitly defined by the field), whether it's happened yet, and when it happened. That feels like an event to me.

Fields where there isn't a discrete event don't work. E.g. is_dog_owner can become adopted_dog_at, but is_dog_lover can't become loved_dogs_at.

I'd actually even argue that this is not storing state directly. You derive state from knowing an event has occurred in the past: deleted_at (event) => is_deleted (state).


Why not store "loved_dogs_at"? A potential use case could be to send some coupons to anyone who became a "new dog lover" within the last 2 weeks.


Well the article uses the example of swapping `is_published` for `published_at` — with a boolean you could have:

* NULL — never published (e.g. draft)

* true — live now

* false — previously live but explicitly unpublished

Which you miss if it's just a date. Similarly he talks about `is_signed_in` — NULL/true/false lets you model the case where a user has never signed in (e.g. an admin created your account but you've never used it) but NULL/timestamp misses this.


In my case the null state was "not yet processed" i.e. a decision pending


Perl got this right decades ago with its 'undefined' status for unset variables, so you can tell the difference between false and undef


That makes me think more of Javascript, which distinguishes between undefined values (usually means “not set”) and null values.

You’re correct that a Perl scalar can always be set to undef, which is the Perl name for null. But that’s not really unique to Perl. For instance, while a Java boolean can’t be null, a Java Boolean can be.


That depends on the language. In Clojure and ClojureScript, for example, distinguishing between nil and false is not a problem at all.


I'd assume most languages don't have a problem with distinguishing between nil and false. The article explicitly maps nil to false, thus I don't see the relevance of your comment.


The initial state is going to be false before any user input. If the user leaves a field blank, then that still counts as input; just run a check such as 'if blank then null else timestamp' on the insert.


0 for never set, timestamp for set time, negate the timestamp for unset time


dates between October 4th and 15th 1582 for special values


You can use negative values for your alternative states.


Also, from the point of view of query optimisation this is a really bad idea. Usually you DO actually care about the size of fields in SQL databases, because something like BOOLEAN is usually stored as a single byte (or a bit in a bitfield) vs 4 or even 8 bytes in the case of a timestamp. This not only multiplies disk usage by at least 4 times, but also makes ALL indexes using this field way bigger. Also boolean indexes can be compressed (or stored as bitmaps), while timestamp indexes contain lots of unique values, so they can't be. This is also the reason why serial IDs are way better than UUIDs for internal IDs.


> This is also the reason why serial IDs are way better than UUIDs for internal IDs.

There are three core problems with that:

  a) Serial IDs are a nightmare for database merges, clustering or anything like that
  b) Serial IDs won't scale
  c) Serial IDs require management, whilst UUIDs can be produced anywhere (in DB, in frontend etc)
There is the KSUID[1] if people want a time-sortable thing that is near-enough to a UUID.

[1] https://github.com/segmentio/ksuid


You can do both. Use the UUIDs where you need the scalability and serial where you don’t.


> Serial IDs won't scale

You mean scaling into several machines? Yes, they do scale. Nothing requires that the values are always increasing and have no holes, so you can slice and cluster them at will (and many DBMS do exactly that).


I'm confused? Serial means "always increasing and having no holes" -- it's one thing after the other. What you're arguing for sounds like just IDs, not serial IDs.


Well, by your definition what people use as serial IDs isn't serial.

Databases have the concept of sequences, that are closer to your definition, but make no promises about holes (they normally don't generate holes by themselves, but there is no way to guarantee you won't lose numbers upon usage). It is common to use sequences to feed serial IDs, but not all DBMS do that and it's not a requirement in any way.


IMO if you’re going this route, may as well just implement a snowflake clone and get the best of all worlds.


Re: Snowflake clone I would humbly invite you to read the note in the KSUID Github, namely:

  *To fit into a 64-bit number space, Snowflake IDs and its derivatives require coordination to avoid collisions, which significantly increases the deployment complexity and operational burden.*
Therefore KSUID remains the best option.


I don't necessarily agree with your conclusion, but I will admit you can't put KSUID and Snowflake in a room and declare either of them the winner.

I would argue that doubling your key-space in the case of KSUIDs may have just as much of an impact as coordinating node ids in the case of snowflake (and in fact the snowflake technique only runs into coordination problems when you're at pretty extreme scale).


I think this is still workable with an expression/functional index, where the indexed expression is "happened_at IS NOT NULL".
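
A sketch of both options in Postgres (table and column names hypothetical):

    -- Expression index on the derived boolean, as suggested above:
    CREATE INDEX t_happened_idx ON t ((happened_at IS NOT NULL));

    -- Or a partial index, the usual Postgres idiom when one side is queried most:
    CREATE INDEX t_pending_idx ON t (some_col) WHERE happened_at IS NULL;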


Seems like a whole lot of extra trouble to be honest. What happens when you need to create a composite index?


You can do indexes that are composites of plain field values and/or expressions! At least in Postgres.

As for the trouble, I agree, it's a bit more involved.


Storing both boolean and timestamp with the correct constraints seems like a better solution to me.
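
A sketch of what "correct constraints" could mean here (Postgres syntax, names hypothetical):

    CREATE TABLE posts (
        id           serial PRIMARY KEY,
        is_published boolean NOT NULL DEFAULT false,
        published_at timestamptz,
        -- keep the flag and the timestamp from drifting apart:
        CHECK (is_published = (published_at IS NOT NULL))
    );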


This is the typical solution for MySQL. There are indexable virtual generated columns, but they don't seem to have been around long enough to be battle-tested.

Using just the nullable timestamp only lets you index on one of the boolean states: an index on (ts, col2) works for WHERE ts IS NULL ORDER BY col2, but WHERE ts IS NOT NULL ORDER BY col2 doesn't work well (without using generated boolean, etc).
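
A sketch of the generated-column workaround, assuming MySQL 5.7+ (names hypothetical; BOOLEAN here is MySQL's TINYINT(1)):

    -- Materialize the boolean so it can anchor a composite index:
    ALTER TABLE t
      ADD COLUMN is_done BOOLEAN GENERATED ALWAYS AS (ts IS NOT NULL) STORED,
      ADD INDEX idx_done_col2 (is_done, col2);

    -- Both states can now use the index:
    SELECT * FROM t WHERE is_done = 1 ORDER BY col2;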


Theoretically, this should compress quite well. In practice, I don't know how much compression db engines apply.


This is diametrically opposed to the advice to never store booleans, but rather store enums.

Experience shows that the initial assumption of two states (false, true) often requires a third, or even fourth, fifth state etc. added down the road. (Business-logic states like "reserved", "pending", "in progress", "confirmed", "processed", etc.)

As long as these states are mutually exclusive, it's far more elegant to add another enum value rather than new fields.

So no, don't timestamp it. Stick to enumerated values rather than booleans, which will generally be of far greater benefit.

If you need to store a log of actions, then create that explicitly. Otherwise, it seems pretty silly and arbitrary to have booleans record their timestamp but not strings, integers, etc.

Edit in response to comments below: of course there are times when you need values that aren't mutually exclusive, so obviously you add another column. It's just that you very often do add another state that is mutually exclusive, and so using an enum keeps your data cleaner, more intuitive, and prevents accidental invalid combinations of booleans as well.


> As long as these states are mutually exclusive, it's far more elegant to add another enum value rather than new fields.

In my experience these states are often not mutually exclusive. Boolean encoding is a superset of enum encoding.

Additionally, enums in many programming languages are often over-strict and easy to use in a non-forwards-compatible way.

Any advice that starts with "never" or "always" is suspect advice IMO. Study your ___domain, and decide whether you want to lock yourself into a mutually-exclusive state space and deal with consumers who may consume it in a way that prevents adding new states. Sometimes enums make sense, sometimes they don't.


I have often found that multiple booleans are better than enums.

    processed: true
    confirmed: false
    in progress: false
    pending: false
    reserved: true
It lets you capture more complicated state, such as above where this was processed but never confirmed, but still resulted in a reservation. Maybe you have an admin portal that lets you create reservations without going through the confirmation process, and now the data can capture the difference.

But my actual preferred variant, if the tech stack can support it, is to have things like `confirmations` have their own tables, so you can have a `confirmedByUserId` as well as a confirmation timestamp.

That way you can instead have something like

    computed_processed: boolean(has a related entry in processed table)
    computed_confirmed: boolean(has a related entry in confirmation table)
   etc


The most obvious problem with that is that it allows representing impossible states, like in_progress and processed both being true.


It's a tradeoff. If you have an explicit finite state machine in which your model can only ever be in one state, then obviously an enum has its benefits.

In order to avoid impossible states, I'll rely on something like hooks or triggers to make sure that setting one value also updates all dependent values.

In other words, setting processed to true will trigger something which always sets in_progress and pending to false.

It's not at all uncommon that I will have chosen an enum for a situation in which I thought I was dealing with a finite state machine, only to unearth new ___domain contingencies that made me realize that I actually need a more nuanced and flexible model. This is why I prefer booleans, particularly the computed/derived booleans when possible (as I mentioned in my original comment). I'd rather have a model capable of capturing the actual complexity of the state than trying to force a finite state machine that might result in a loss of information, even if that means that I'm forced to rely on declarative hooks/triggers to ensure data integrity.


Yep. This is basic data model design.


Correct, it is basic data model design in that it is only for handling basic data models. When you are dealing with complicated data models with many contingencies, you aren't always able to have a data model incapable of representing invalid states. Sometimes it's necessary to rely on application logic/stored procedures/index constraints, etc, in order to enforce valid state.


This is feasible if your program is the only one ever reading (let alone writing) the data. When you add a separate frontend and stringly-typed back-and-forth communication (JSON), it becomes a huge headache for whichever platform(s) are not the gatekeeper of the data. They have to a) depend on your rules being modelled in a translatable way, and b) find a rational user experience for handling the inevitable inconsistencies.


> stringly-typed back-and-forth communication (JSON)

JSON, while it has a very limited set of built-in types and (outside of add-ons like JSON-schema) no type definition mechanism, isn’t “stringly-typed”.


If you have multiple applications accessing a database, you can have data integrity maintained using something like triggers in the actual DB, as well as constraints.


My experience is that status fields often end up being a bad design.

Status is information you display, but it's not the actual data. Status fields that pack multiple pieces of information and features into a single value create confusing and complex logic for no good reason.

Booleans are way better to store data, and each one can bear a single different concern.


I think it can also go into an enum/timestamp pair in a state-transitions table.


I don't like this. Yes, you can alias the true/false fact to null/non-null datetime value, but this is missing the point of ___domain modeling.

The immediate impact of this decision is probably negligible as long as you did not need to store a nullable boolean fact, as opposed to a non-nullable boolean fact.

The broader impact of this decision is that you have endorsed a policy of assuming how things will be used in the future and are not interested in a 100% authentic modeling of the problem ___domain anymore. In a larger team, these "well wouldn't it be nice if..." design decisions are extremely subjective and can beg many further questions that wind up being distracting.

Discipline becomes very important as the complexity of your software project increases. It is easy to collapse the whole house of cards over little incremental things like this. You have to have a stricter policy across the entire team of saying things like "booleans go in as booleans, if you want who, when, why, those are 3 new facts next to the boolean".


The other thing I don't like about this is that it's great at tracking the switch to "true" (the timestamp value), but when you want to go back to "false", you have to wipe out the timestamp, and you then have no idea when it was unpublished or unhidden or whatever. Now you need a second column to track that as I see it, and you start getting into weird territory. IMO it's better to keep the Boolean, and just introduce timestamp columns for the specific things you want to track.


If you need to maintain a proper audit trail for every state change, you need a separate table for that.

Any trick you play with a fixed number of scalar columns will only let you access the timestamp of the last change of the same type. This won't be particularly useful when there's an edit war among moderators who unpublish and republish the same thing over and over.


Yeah, that's true, I would prefer tracking things separately, which is another reason I don't particularly like the proposed idea.


I get that there’s a YAGNI aspect to it, but I don’t buy the argument that a timestamp adds complexity over a boolean.


This all provokes confusion over 2 different types of facts:

A) Knowledge of when a specific event occurred.

B) If something is true or not.

Using cases like "logged_in_at" as the example for why booleans shouldn't be used is essentially a strawman argument.

There are many situations in which a boolean fact does not occur in the time ___domain or have any meaningful time value. Knowledge of certain facts in certain problem domains can be viewed as timeless even if they did come into being at a discrete point in time. For example, regulatory facts that govern entire industries. You probably never care when a specific regulation started to matter for a situation, just whether it does or not. All these timestamps would do is confuse downstream users and bloat extracts of data.

The biggest problem of all is this statement:

> Storing timestamps instead of booleans, however, is one of those things I can go out on a limb and say it doesn’t really depend all that much. You might as well timestamp it. There are plenty of times in my career when I’ve stored a boolean and later wished I’d had a timestamp. There are zero times when I’ve stored a timestamp and regretted that decision.

There is nuance to this problem. A and B are both perfectly valid cases and each have their own representations that make the most sense. The discipline is in identifying these cases appropriately and using the correct tool for the job.


I think you're reading OP a little too harshly. That's another type of a strawman argument.

Of course there are nuances and exceptions to every generalization. OP seems to be saying that most boolean flags we care about on a day-to-day basis can benefit from being placed in a time ___domain. If you're aware of an exception, you can just point it out. There's no need to dismiss the entire argument.

Returning to your example, you probably do care when a certain regulation started to apply to a certain industry, because factories built and products sold before that date may not comply, and may not even be required to comply, with said regulation. You probably also care when it stops applying. Laws are not rules of nature; they change all the time and often come with expiry dates. If you store every regulation with a few timestamps, it's going to be pretty straightforward to find out which entities need to be in compliance but currently aren't, etc.

Besides, the cost of keeping a few unnecessary integer columns around in your database is often negligible compared to the cost of updating the schema and notifying everyone who uses your API several years down the road when you realize that you need it after all. Disk is cheap. RAM is cheap. CPU is cheap. Updating enterprise software is not.


- "Wanna get laid?"

- "Jan 23, 2020"

Sure, I mean. It's Javascript after all.


> are not interested in a 100% authentic modeling of the problem

A model is an approximation. There is no such thing as "100% authentic modeling" for anything non-trivial, and insisting on it grows models that aren't particularly useful. They might _seem_ simple at first glance because of their "purity", but a) they're not _actually_ 100% accurate, and b) are usually very fragile on revision, becoming, ironically, quite complex.

> "booleans go in as booleans, if you want who, when, why, those are 3 new facts next to the boolean".

...you have now replicated the information 3 times in a way that includes risk of going out of sync. Congratulations.


A model turns into a gamble if you throw your hands up and claim that it's impossible to have a 100% authentic model of a ___domain.

If you truly believe this to be the case, then you have not tread far enough into the forest of SQL, 3NF, BCNF, and the relational calculus. It is possible to use math to prove that a problem ___domain is modeled appropriately. With SQL and views, you can construct extremely high-order representations of domains that would otherwise be viewed as pure magic by any outside onlookers. The only way any of this becomes possible is if you have solid foundations and the courage to produce exceptionally clean models.

Yes, you will definitely screw it up a few times. We started over 4-5 times. Plan to iterate. Start with your ___domain modeling. You can do this shit in excel. No one gets too salty when you have to throw away a spreadsheet.


> A model turns into a gamble if you throw your hands up and claim that its impossible to have a 100% authentic model of a ___domain.

...no it doesn't. I've deleted the claim that you don't know what a model is from the previous comment, but now it's clear: you don't know what a model is.

> If you truly believe this to be the case, then you have not tread far enough into the forest of SQL, 3NF, BCNF, and the relational calculus.

...you're mixing models with mathematical theories in there, and none of these are "100% authentic models" of anything. Unless by authentic you mean "not copied", I suppose.

> It is possible to use math to prove that a problem ___domain is modeled appropriately.

...appropriately doesn't mean 100% accurate (which is _I think_ the word you were looking for). That's not how models work.

> The only way any of this becomes possible is if you have solid foundations and the courage to produce exceptionally clean models.

"Clean" doesn't imply "accurate" at all. It does sometimes help with being useful or easy to implement, but it usually _sacrifices_ accuracy. Like the two main and often contrary properties of models are "accuracy" and "usefulness." That's, like, first five pages of any epistemology handbook.

> You can do this shit in excel.

If you start in excel, you ain't getting anywhere. And this is not, by the way, a dunk on Excel.


Missing the point of ___domain modeling? I think you’re projecting a scenario or past experience forward on this because there is nothing about this that goes against “classical ___domain modeling”

At the end of the day, as far as computers go, what is actually the difference between (TRUE, FALSE) and (MMDDYYTTTTTT/Null), assuming your environment and storage engine support it?

You’re encoding far more meaning by using a date than using TRUE. Instead of two columns you now have one.

You can have a rock solid system without any room for misinterpretation and confusion while also using this approach. I do it all the time.


Since you mention ___domain issues, I’ll also note that all the examples given are tracking system events, not ___domain events. For ___domain events, a system timestamp is usually incorrect.

The date something occurs in the real-world can be different from when the data entry was done. If you blindly convert Booleans into timestamps without that differentiation, you’ll end up with misleading data.


If one does this, please be appropriate about semantics/naming. I.e. don't put a timestamp in a variable called `isTermsAccepted`; rename it to `termsAcceptedAt`.

Generally accept that timestamps and booleans are not the same, but the truth value can be derived from the timestamp.
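
In SQL the derived truth value is a one-liner, e.g. against a hypothetical users table:

    SELECT id, terms_accepted_at IS NOT NULL AS is_terms_accepted FROM users;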


> truth value can be derived from the timestamp

Python used to disagree with you: https://lwn.net/Articles/590299/ (and the bug report with discussion spanning a couple of years: https://bugs.python.org/issue13936)


Yeah, glad this was fixed

> If I had to do it over again I would definitely never make a time value "falsy".


More effective still is bitemporalism, or even just unitemporalism.

Let's do a unitemporal table. Instead of `published_at`, you retain the `is_published` boolean field. On the row you have a `valid_time` timestamp range; alternatively `valid_began` and `valid_ended` timestamps if your database doesn't do ranges.

The range shows the time during which the fact is true. At creation you set `[now, Infinity)` to indicate that it is true as of the entry. When it becomes false you change the row to `[then, now)`. Outside of that range, the record is false.

Notably this lets you encode the switching back and forth of a value over time with no ambiguity about when something began or ceased to be true. More importantly, it's not limited to bools. Any row can be turned into a unitemporal or bitemporal form. If, as others are rightfully suggesting, you should favour enums, not a problem. Strings? Numbers? Complex types? Embedded XML? All fine in the eyes of temporal tables.

Some databases even include SQL:2011 temporal table support for "application time" and "system time". I expect whenever it lands in PostgreSQL it'll reach a far wider audience here at HN.
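
A rough Postgres sketch of the unitemporal table described above (names illustrative; the exclusion constraint needs the btree_gist extension):

    CREATE TABLE post_publication (
        post_id      integer   NOT NULL,
        is_published boolean   NOT NULL,
        valid_time   tstzrange NOT NULL,
        EXCLUDE USING gist (post_id WITH =, valid_time WITH &&)  -- no overlaps
    );

    -- True as of now, open-ended:
    INSERT INTO post_publication VALUES (1, true, tstzrange(now(), NULL));

    -- Close the range when the fact stops holding:
    UPDATE post_publication
    SET valid_time = tstzrange(lower(valid_time), now())
    WHERE post_id = 1 AND upper_inf(valid_time);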


This is right. I'm a big fan of this sort of embedded audit metadata wherever it makes sense.

I do wonder when doing stuff like this though, if this really shouldn't be something that the database gives you for free.

I read a few years ago about 'fact based' event stream style databases which store your data as a stream of time ordered ops that can later serve as an audit log, but can be used for even more powerful things such as backups at any point in time, debugging at any point in time, etc.

For practical reasons (i.e. just picking a standard postgres setup to get stuff done) I've never dug into any of these systems or played around with them. Anyone know what the latest and greatest is here? Is there anything I can install on top of postgres to give me this functionality today?


I think Datomic[1] works like that

Here's the relevant quote: "Understand how and when changes were made. Datomic stores all history, and lets you query against any point in time. Learn More"

[1] https://www.datomic.com/


This is often called "event sourcing". The other commenter mentions Datomic, which probably is the 'latest and greatest' form of event sourcing (I've never used it).

https://vvvvalvalval.github.io/posts/2018-11-12-datomic-even...

If you want something built on postgres, I don't have any specific recommendations, but you can build a simple event sourcing system yourself. I worked somewhere that used event sourcing on postgres, and the core log was basically just a table with aggregate IDs, event sequence numbers, and a JSONB column for the event payload. I recommend starting with just one part of your application if you're going to adopt event sourcing. You will quickly find that there are a lot of new considerations and pitfalls that you don't have with traditional RDBMS usage. Overall, event sourcing is hard to get right, so you should consider the trade-offs carefully. There are easier ways to get audit logs, for example: https://www.pgaudit.org/

I will also recommend this as a way to start to understand the tricky aspects of event sourcing: https://leanpub.com/esversioning
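
For reference, the core log described above can be as small as this (Postgres; all names hypothetical):

    CREATE TABLE events (
        aggregate_id uuid        NOT NULL,
        sequence_no  bigint      NOT NULL,
        recorded_at  timestamptz NOT NULL DEFAULT now(),
        payload      jsonb       NOT NULL,
        PRIMARY KEY (aggregate_id, sequence_no)  -- rejects conflicting writers
    );

    -- Rebuild an aggregate by replaying its history in order:
    SELECT payload FROM events WHERE aggregate_id = $1 ORDER BY sequence_no;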


I’ve been following the progress of Dolt [1] which is a SQL database that works like git. This would give you modification history in a similar way to git. That’s different from recording when events happened, though (and changing your mind about when they happened), so you’ll still need timestamps for that.

[1] https://github.com/dolthub/dolt


Indeed, I do this pretty much all the time too.

One common'ish example not mentioned in the article is storing whether or not a user is active. Storing "is_active" as a boolean makes sense but switching that to "deactivated_at" gives you so much more information.


last_active_at also makes sense, especially if you'd like some flexibility in deciding what to do with users who have been inactive for a certain amount of time.


I've yet to find a case where using a timestamp over a boolean hasn't been the better option.

This is because turning a boolean on is an event so it'll always have a timestamp.

Sometimes it's useful to know when this event happened.

The only exception I can think of is if for some reason you're trying to save on bytes, which in this day and age, especially for web applications, is practically never the case.


I guess I’m a little confused, as this article seems to be speaking to programmers who are doing stuff that I’ve never done and in programming paradigms I’ve never used, so I’m definitely not in the target audience.

But.. if I’m understanding the proposal correctly, this only gives you a timestamp if the value is ‘true’, and not if it’s ‘false’. Is that correct?

Is there a reason why we care about when a boolean is turned on but not when it’s turned off? Why would we not store the boolean and the “timestamp of the last change” as separate values, so we can track the timestamp of changes in both directions, if that’s a thing we care about?


That's how I usually do it (deleted bit, updated timestamp and also a created timestamp as that's sometimes info we need). In cases that need an audit trail I'll add a log line as well.


It’s a good question, but if you care about the audit trail, then perhaps event sourcing is a better choice.

So yes, this only assumes you care about when the on state happened and you don’t care about the history.


Event sourcing is a possible solution for audit trails but comes with additional complexity.

If you just want to add audits of what changed when, then you can also get there with a separate audit table that just stores the timeframe, the changed fields with new values, and who/what made the change. Some frameworks support this out of the box or with a small library.


I think it depends very much on the data being modelled.

Certain situations might call for a timestamp instead of a boolean, especially if it is a value that is only ever turned on once and never turned off, possibly `user_deactivated_at`; I do prefer having a bit field and a separate timestamp for things that can flip; and for a lot of use cases it is good to just have a full event stream implementation where you can construct the state at any point in time and you get events data combined with the timestamps.


Correct. But to track both directions, I'd use two timestamp columns, is_active and is_inactive, with a check constraint that only one of them can be non-null. Otherwise someone is likely to omit updating the “timestamp of the last change” in a hurry.
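
A sketch of such a constraint, using clearer hypothetical column names (activated_at/deactivated_at):

    CREATE TABLE users (
        id             serial PRIMARY KEY,
        activated_at   timestamptz,
        deactivated_at timestamptz,
        -- exactly one of the two timestamps must be set at a time:
        CHECK ((activated_at IS NULL) <> (deactivated_at IS NULL))
    );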


> This is because turning a boolean on is an event so it'll always have a timestamp.

Turning it off too but we don't have a timestamp :/.


I suppose you could store negative timestamps to represent boolean false values, but that will inevitably lead to bugs where a piece of code mistakenly compares a negative timestamp with zero.


Every change you can make in a database becomes an event by this definition. It's just as useful to know when every change happens in your database, and by whom.

Which is why we have "audit" logs. Usually online logs that can recover every change over the entire history of a database, every row having a versioned history of what changed by whom. By keeping it separate, not only does it make ___domain modelling and intuitive use more logical, it makes primary query performance dramatically better.

And if logically you want to treat them as events, they should be in a chronological events table by themselves, not as an overloaded nullable field.

To distill what I've said above, as politely as possible I will say that if you model a boolean as a timestamp, you are covering up for other much larger problems.


As mentioned in other comments, there's a bunch of drawbacks with this approach, but there's a more general term "boolean blindness" [1], which is usually applied to programs (not databases). I find it useful to avoid using booleans where more descriptive types make sense (and can be used), as well as not throwing information away when it may be needed still, but to a reasonable extent.

[1] https://www.cs.cmu.edu/~15150/previous-semesters/2012-spring...


Indeed, this is a bit of wisdom I first encountered when playing with Django and seeing others do it. You can still see it in a couple of the many soft-delete packages available on PyPI, in the form of deleted or deleted_at fields with DateTimeFields. (Though admittedly, I'm pretty out of date on Django these days.)

(Though it is worth noting that you sort-of get this for free if you implement a scheme with 'revisioned' data, or a database that simply has that as a feature. But, it's still useful to just have this additional bit of information handy, if nothing else.)


I've had experience with soft delete in Laravel and I'm not fond of it, because we had a separate dashboard with manual SQL reports, without an ORM to magically filter out all the deleted_at rows. And it gets tedious to remember to filter out the rows with deleted_at timestamps. I would like to see soft delete implemented differently, say with a second trashcan table for each active table, e.g. users_archive, where the delete operation moves the row there.


The reason why deleted_at is probably done this way is due to foreign key constraints. Especially for a users table, there are probably a whole lot of things that point to the user's table. A cascading deletion could wipe out quite a bit of unintended data.


Or you could have views with `deleted_at` filtered out:

https://learnsql.com/blog/sql-view/
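
E.g. a minimal sketch (hypothetical names); reports then query the view and never see soft-deleted rows:

    CREATE VIEW active_users AS
    SELECT * FROM users WHERE deleted_at IS NULL;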


Oh, good point. Maybe this could be built into ORMs.


I first picked it up from Rails many years ago. Now almost every SQL table I create has created_at, updated_at and possibly deleted_at or archived_at fields. Many frameworks or ORMs have support for automatically setting updated_at. And it might even be done at SQL level but I’m not 100% sure on that.


I’ve tried that approach of moving users to a “deleted_users” table but ran into the problem of losing the foreign key references to any records pointing at the user.


If you're going this route, it's hard to understand why you wouldn't just store all the information you want explicitly in a string.

"true [timestamp]" "false [timestamp]" "unset [timestamp]"

That's more information than described in the article and it's easier for future you to understand what's going on, without implicit assumptions on the meaning of an undefined variable. Furthermore, you can keep a complete record of all status changes if that's what you want:

"false [timestamp] true [timestamp] false [timestamp] true [timestamp]"


This makes querying harder. You cannot use “is not null” or “= true” for where clauses with this. You’ll need to parse the string value.

When going the explicit route I would recommend an is_archived (nullable) bool column combined with an archived_at timestamp column.


If you're using sqlite, for instance, you can call the substr function (I was sticking with the article's constraint to not add a new column).

There is one thing in favor of storing the full history in a single string - you might not query the full history very often, and you can keep a lot of information around without adding another table. I've occasionally stored the full history of objects (short notes mostly) in an sqlite database as a string in json form. Pretty convenient if you're keeping it there just in case you want to go back in time and don't make a large number of changes.


At this point just create an archived_resources table where you store the resource id, timestamp and user id.


I would not do this unless there is a use case. For analysis purposes you can always read the whole change history from audit logs. The solution also only gives you the timestamp when something was set, but not when it was unset.


In my experience I would even say having a boolean field in your table is a bad idea, with rare exceptions.

e.g.:

- You may want a proper state transition rather than 4 booleans each representing one state

- You may want to normalize the boolean with other metadata (timestamp, as OP suggests, and author) into a separate table


Sitting in the data science seat, downstream from the application development and trying to gain insights on production data, I completely second this approach. It just gives you more to go on and more ways to validate whether something unexpected is going on.


In this case, why not go straight to append-only databases where every entry has a timestamp?

That will be _the_ audit log in your database.


This was my thought too.

Timestamps are a useful trick. But also one that allows you to postpone what the ___domain is really asking for: to store a log of events.

Maybe even as primary source (aka event sourced).


This reminds me of tricks in JavaScript such as using !! to convert values to Boolean: it's clever, “idiomatic” and saves a few characters, but it's not self-explanatory to someone not familiar with the idioms.


Fortunately `published_at == null` is much more intuitive than using `!!` to typecast.


I'm not really down with assigning meaning to NULL. NULL means "unknown", full stop. This is why SQL doesn't like if you compare to NULL using equality, because nothing "equals" NULL.

Similarly wouldn't I want to know when a true value became false? This post seemed very strange in that regard.

Count me in as storing a boolean as a boolean (or as mentioned elsewhere, an enumeration) and if I need auditing on that, then I will also implement proper auditing columns and/or tables to suit my needs.


The point still stands outside of Javascript, though. Unless you're dealing with very large databases, retrieving a boolean or doing a null check on a timestamp are pretty close operations, especially if you use some kind of frontend to render the data.

Storing a boolean expression ("is published", "marked as read", etc.) as a timestamp can still be valuable. It just happens to be entirely equivalent in Javascript, but in normal, typed languages, the same practice can be used to prepare yourself for debugging a broken application or database later.

I don't think this is a practice that you should just universally apply everywhere, but it's worth considering in a lot of cases where people generally tend to use booleans.


> NULL means "unknown", full stop.

Hum... Null means whatever the data design says it means. We are talking about mathematics here, not religion. Rules don't come written in stone from the heavens.

Using it as "not applicable" is even way more common than "unknown".


This seems to only really work in languages that allow null variables/timestamps. I wouldn't really want to have to do comparators to the default value of a timestamp.


The author is talking about databases, not programming languages.

I do see a different issue, though: The article indeed seems to make no distinction between an absent value and a default timestamp of 0. That limits your database to more or less "now". You cannot really store things about the past. Someone might take such a pattern and fixate it into some kind of library. If then someone else tries to store data from 1970, things can get ... interesting.


That does assume your database only allows unix-style timestamps. MariaDB, for example, has "datetime", supporting dates between the year 1000 and 9999, distinct from null/zero.

Unfortunately, datetime takes 8 bytes vs the 4 for a timestamp.


That is a very neat and smart improvement!


> ...default value of a timestamp.

Assuming the timestamp represent a change of state in a contemporary application, I would expect 1970-01-01 0:00:00Z (UNIX epoch +0 seconds) to be unambiguous enough (But that's definitely an engineering constraint to maintain awareness of)


Languages with Option or Maybe types instead of null will also work fine. So it works in every language except Go?


Yep. And I assume you could use Time.IsZero() for Go https://stackoverflow.com/a/36234533


PostgreSQL has computed columns. Creating one for every field that returns its `IS NOT NULL`-ness would accommodate such programming languages.
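
A sketch of that approach, assuming Postgres 12+ (which supports only STORED generated columns; names hypothetical):

    ALTER TABLE posts
      ADD COLUMN is_published boolean
        GENERATED ALWAYS AS (published_at IS NOT NULL) STORED;

    -- Clients without a null concept can read a plain boolean:
    SELECT id, is_published FROM posts;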


I have about the same sentiment in regard to personal notekeeping, and do in general care about preserving metadata. Many times I have consulted the date when I created a note or, say, an entry in the password manager—or when I last changed it. That may inform my decision on what to do with the note next, or at least allows me to contemplate how much is not done in the passing years and how much is yet to not do.

‘Remember The Milk’ and Evernote make it pretty nice and easy by keeping the dates and some other info. (Though of course there's a gotcha that RTM's Android app forgets to implement the display of this metadata.) Not that I recommend these apps currently, especially Evernote.

Well, after migrating to Org-mode I have a persistent itch caused by the fact that Org doesn't have modification times for outline items, and implementing them in Emacs is a pain. That's one downside of not separating the view from the model. But the creation time is easy to add, in case someone wonders.

Similarly, I love having the archive of deleted notes and completed todos: once in a while I need to figure out what the hell I did to some particular items, or I change my mind on some edits. And on bulk moving or copying, I like to keep record of what I moved from where. (cough unlike HN ahem.)


I don’t like NULLs. Nullables always get back at you in ways you would never expect.

That’s why I won’t use a tip like that.


The problem can be managed somewhat depending on the language and tooling. I have used this advice for archived_at or deleted_at before. In the database it’s a NULLable timestamp column. In application code I would map this field to an Instant? (Kotlin) and a computed isArchived val that checks for presence of the archivedAt field.


An exception I encountered recently was needing a ternary-state boolean: a set of flags where true and false indicate the outcome and undefined indicates "not yet processed". I suppose some sort of 0 timestamp could work but... no, I think I'll stick with booleans.


TL;DR: don't use a boolean; leave the field empty (NULL) and set a timestamp when it should be true.

To be honest a TL;DR isn't actually needed; the post is very concise, straight to the point, and convincing.

As for the languages I use most often to query DBs (TS, JS, PHP, Python), I don't see any downside. Evaluating whether a variable is empty or not, or its type, is not "bad" compared to evaluating whether a variable is true or false. Even in TypeScript in strict mode, evaluating whether a variable with type number is empty will result in validly typed code, without any noticeable difference compared to evaluating a variable with type boolean.


This makes me worry about clock skew and asynchronous code in general. For example, a timestamp might lead you to believe things happened in a different order than they did because the timestamping process isn't atomic.


It's a very neat idea that I'll consider in the future however I have some concerns about it.

One thing is about the database design. I remember a lecturer proclaiming that models with many nullable relations are:

- bad design

- performance risk

- might mess with indexing

I haven't verified this knowledge in many years, so I'm not sure if that point still stands; also, in the spirit of not optimizing prematurely, this might not be an issue.

The other thing is the introduction of (needless) complexity to the system. It allows unwise decisions which would not be possible if the flag remained a simple boolean, and as such goes against the KISS design principle.


Yes, nullable relations are harder to work with, especially with ON DELETE CASCADE and checking if another table contains a row based on a nullable column. However, this is about booleans, so they weren't used in relations to begin with. The indexing in this case could be a bit less performant but probably negligible.


The OP doesn't talk about *nullable relations*, which I agree are a sign of bad design.


Converting any non-boolean type into a boolean is always a lossy compression that will often need to be reversed later. If/else was invented so that you don't need to perform this compression at the variable level.

I also don't like nullable fields in my databases. Anything nullable is representable as some form of coproduct - and can therefore be represented as a relationship to an entity with a property that can be joined to for terms of definition.


As the storage cost is very low, why not use both? Not trying to be obtuse, but I genuinely typically have separate columns in my schema design for booleans and the timestamps of these events flipping from false to true (published, edited). Come to think of it: These events warrant saving them to a separate table altogether; booleans represent a current state - there may be 1:n events like multiple edits.


  let published_at = new Date()

  if (published_at) console.log("it's true!")
  if (!published_at) console.log("it's false!")
As a FE dev I haven't had a workplace with a codebase allowing the above for at least 5 years. No one even asks "shall we use JS or TS?". Strictly enforced static typing all over. It's not that I like it; just no one asks me.


What's wrong with this? Checking if a variable is defined/null is not exactly uncommon?


In JS, if (!variable) ... coerces the type to boolean, so anything falsy (0, null, undefined, empty string, etc...) will become true. For example if you have a timestamp of 0, it will be counted as false (but is defined and definitely not null)


I understand that. But how would strictly enforced typing solve this? Especially when using TypeScript. An incorrect value being given isn’t going to be caught by the compiler.


But these are two very different use cases and just using one for another won't work in many scenarios or will make thing less efficient. It is much, much better to leave the boolean intact for the reasons other people explained and just add another column with the date if you actually need it. This will give you more flexibility while keeping things in order.


Am I missing something, or does this assume I never care about the timestamp for turning a value off?


Yeah, I don't get that either. This only seems to work in very specific cases, like where the boolean starts off false, becomes true and can never be false again. So like "user viewed homepage". That seems like a tiny subset of what Booleans are used for but it's presented as being the one, obvious use case.


Some of us have an audit log that tells when a record has changed. Others just use event sourcing.


except when it is actually a nullable Boolean and there is a big difference between the 3 states


When reading this I miss an HN feature "upvote as a falsehood to be aware of".


I wrote a Django model field that does this: https://gist.github.com/gavinwahl/17c07335c8dd1b832911


To quote the author: "it depends"

If when something happens needs to fold into your business logic then by all means go for it.

If you're more so doing it as an audit, then logging out the event with its context is going to be more useful.


> To quote the author: "it depends"

Where do you see this? To quote the article I'm reading: "it doesn’t really depend"


Not op but in the first paragraph he links to another post of his that is all about saying "it depends" often

>https://changelog.com/posts/good-reason-experienced-devs-say...

I'm assuming op is referencing that

(I agree with you here though that in this case he makes a very good argument for "it doesn't really depend")


Note that this doesn’t keep historical values.

E.g.: if we convert a ‘synced’ boolean to a ‘synced_at’ timestamp and the sync status changes often, we’re only storing the most recent sync timestamp.

In some cases this might be insufficient.


You might as well version it. Timestamps are relative if different machines set them. Versions are primitive (i.e. atomic) and always absolute, and usually the best way to treat concurrency.


... but then please also update your privacy policy that you are storing this additional data about your users!

The only data that cannot be leaked or stolen is what you do not store in the first place.


Is there any concept in database engines of an audit trail which logs the previous value and a timestamp? I've seen the concept using triggers, but never seen an engine-native solution.


Maybe you're thinking of "event sourcing"?


You can implement an append-only database, where each record is a snapshot of the latest version and a timestamp of when it was updated.


You probably should be storing a vector or stack of timestamps. Empty vector/stack is still falsey, and now you have a surface level history of when things changed.


I find timestamps as boolean state flags a sort of gateway into event-driven data architecture land. It becomes very enticing to evolve in that direction.


Why just booleans though? If you have any configuration in database and you don't store when and who changed it - you're doing it wrong.


So we're trying to come up with reasons why this is bad advice? uhhhhhh ... something something far future something Y10K problem something.


What about all the overhead in testing that using a timestamp instead of a boolean will introduce? Especially if you end up not needing it.


One important downside: Data protection. This approach of "store it now in case you might need it later" is in direct violation of the principle of data minimisation in GDPR.


I think that's only the case when the data can be used to (help) identify a specific person.


No, not exactly. It is the case when this data is related to a single person. E.g. "has this person subscribed to my newsletter" vs "when did this person subscribe to my newsletter".


GDPR only applies to personal data though? Like you can't store gender info "in case it's useful later". I really don't see how a timestamp can be used in that way.


Well, since gender is mutable now, I suppose you can't store the date the gender changed "just in case". But yeah, unless you're dealing with people, this doesn't apply.


I don’t see the benefit. This is what changelogs are for.


I love the Drake meme they embedded for sharing the post!



