Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS, so once a drive is full, does the file get sent to another one via some service? Also, ZFS seems an odd choice, since deletions (especially of small files) on a drive that's more than 80% full are crazy slow.
Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast for parallel workloads. If you want to delete a large directory subtree quickly on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:
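Something along these lines (a sketch, not a canonical incantation; the path is a placeholder and it assumes GNU parallel is installed):

```sh
# Unlink the files in parallel, then remove the emptied directory tree.
# -j"$(nproc)" runs one worker per CPU; -m batches many filenames per rm call.
find /path/to/dir -type f -print0 | parallel -0 -m -j"$(nproc)" rm -f
rm -r /path/to/dir
```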
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds, when at the rate his naive rm was running it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster than using find and parallel to run many rm processes, since it would eliminate the execve overhead and likely some directory lookup overhead too.
For something like Fastmail, which has many users, unlinking should already be happening in parallel, so unlinking on ZFS will not be slow for them.
By the way, that 80% figure has not been true for more than a decade. You are referring to the best-fit allocator being used to minimize external fragmentation under low-space conditions. The new figure is 96%, and it is controlled by metaslab_df_free_pct in metaslab.c.
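On Linux with a reasonably recent OpenZFS, you can check the current value through the module parameter interface; the parameter holds the remaining free-space percentage, so the default of 4 corresponds to the 96% figure:

```sh
# Read the threshold at which ZFS switches from first-fit to best-fit
# allocation. A value of 4 means the switch happens at 96% full.
cat /sys/module/zfs/parameters/metaslab_df_free_pct
```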
Modification operations become slow when you are at or above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend's pool was below the 96% threshold when he was suffering from a slow rm -r; he just had a directory subtree with a large number of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here, and I was around when the 80% to 96% change was made.
I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I had specified the correct directory, as I couldn't believe how quickly it ran. Super cool.
Unlinking gets done asynchronously on weekends by Cyrus, using the `cyr_expire` tool. Right now it only runs one unlinking process at a time on the whole machine, due to historical ext4 issues ... but maybe we should revisit that now that we're on ZFS and NVMe. Thanks for the reminder.
I doubt that this will make us as fast as ext4 at unlinking files in a single thread, but it should narrow the gap somewhat. It also should make many other common operations slightly faster.
I had looked into range lock overhead years ago, but when I saw that the majority of the time spent entering range locks went to an “unavoidable” memory allocation, I did not feel that making the operations outside the memory allocation faster would make much difference, so I set it aside. I imagine many others profiling the code came to the same conclusion. Now that the memory allocation overhead will soon be gone, additional profiling might yield further improvements. :)
Yeah, we're only using ZFS replication for logs; we're using the Cyrus replication for emails because it has other sanity checks and data model consistency enforcement which is really valuable.
(And both are async. We'd need something like drbd for real synchronous replication)
For now, the "file storage" product is a Node tree in MySQL, with content stored in a content-addressed blob store, which is some custom crap I wrote 15 years ago that is still going strong because it's so simple there's not much to go wrong.
We do plan to eventually move the blob storage into Cyrus as well though, because then we'd have a single replication and backup system rather than needing separate logic to maintain the blob store.
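(For anyone unfamiliar with the term, here's a minimal sketch of the content-addressing idea, assuming a plain on-disk store; it is not the actual FastMail implementation. Blobs are named by a hash of their bytes, so identical content is stored once and the name doubles as an integrity check.)

```sh
# Illustrative sketch only, not FastMail's code: store a blob under a name
# derived from the SHA-256 of its contents, and fetch it back by that hash.
put_blob() {
    hash=$(sha256sum "$1" | cut -d' ' -f1)
    prefix=$(printf '%s' "$hash" | cut -c1-2)   # fan out to avoid one huge directory
    mkdir -p "store/$prefix"
    cp "$1" "store/$prefix/$hash"
    echo "$hash"    # the address a caller would record (e.g. in the node tree)
}

get_blob() {
    prefix=$(printf '%s' "$1" | cut -c1-2)
    cat "store/$prefix/$1"
}
```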