Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS, so once a drive is full, does the file get sent to another one via some service? Also, ZFS seems an odd choice, since deletions (especially of small files) on a drive that's more than 80% full are crazy slow.
Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast for parallel workloads. If you want to delete a large directory subtree quickly on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:
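Something along these lines (a sketch, not a canonical incantation; the path is a placeholder and it assumes GNU parallel is installed):

```sh
# Unlink the files in parallel, then remove the emptied directory tree.
# -j"$(nproc)" runs one worker per CPU; -m batches many filenames per rm call.
find /path/to/dir -type f -print0 | parallel -0 -m -j"$(nproc)" rm -f
rm -r /path/to/dir
```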
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds, when at the rate his naive rm was running it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster than using find and parallel to run many rm processes, since it would eliminate the execve overhead and likely some directory lookup overhead too.
For something like Fastmail, which has many users, unlinking should already be happening in parallel, so unlinking on ZFS will not be slow for them.
By the way, that 80% figure has not been true for more than a decade. You are referring to the best-fit allocator being used to minimize external fragmentation under low-space conditions. The new figure is 96%, and it is controlled by metaslab_df_free_pct in metaslab.c.
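On Linux with a reasonably recent OpenZFS, you can check the current value through the module parameter interface; the parameter holds the remaining free-space percentage, so the default of 4 corresponds to the 96% figure:

```sh
# Read the threshold at which ZFS switches from first-fit to best-fit
# allocation. A value of 4 means the switch happens at 96% full.
cat /sys/module/zfs/parameters/metaslab_df_free_pct
```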
Modification operations become slow when you are at or above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend's pool was below the 96% threshold when he was suffering from a slow rm -r; he just had a directory subtree with a large number of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here, and I was around when the 80% to 96% change was made.
I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I had specified the correct directory, as I couldn't believe how quickly it ran. Super cool.
Unlinking gets done asynchronously on weekends by Cyrus, using the `cyr_expire` tool. Right now it only runs one unlinking process at a time on the whole machine, due to historical ext4 issues ... but maybe we should revisit that now that we're on ZFS and NVMe. Thanks for the reminder.
I doubt that this will make us as fast as ext4 at unlinking files in a single thread, but it should narrow the gap somewhat. It also should make many other common operations slightly faster.
I had looked into range lock overhead years ago, but when I saw that the majority of the time spent entering range locks went to an “unavoidable” memory allocation, I did not feel that making the operations outside the memory allocation faster would make much difference, so I set it aside. I imagine many others profiling the code came to the same conclusion. Now that the memory allocation overhead will soon be gone, additional profiling might yield further improvements. :)
Yeah, we're only using ZFS replication for logs; we're using the Cyrus replication for emails because it has other sanity checks and data model consistency enforcement which is really valuable.
(And both are async. We'd need something like drbd for real synchronous replication)
For now, the "file storage" product is a Node tree in MySQL, with content stored in a content-addressed blob store, which is some custom crap I wrote 15 years ago that is still going strong because it's so simple there's not much to go wrong.
We do plan to eventually move the blob storage into Cyrus as well though, because then we'd have a single replication and backup system rather than needing separate logic to maintain the blob store.
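(For anyone unfamiliar with the term, here's a minimal sketch of the content-addressing idea, assuming a plain on-disk store; it is not the actual FastMail implementation. Blobs are named by a hash of their bytes, so identical content is stored once and the name doubles as an integrity check.)

```sh
# Illustrative sketch only, not FastMail's code: store a blob under a name
# derived from the SHA-256 of its contents, and fetch it back by that hash.
put_blob() {
    hash=$(sha256sum "$1" | cut -d' ' -f1)
    prefix=$(printf '%s' "$hash" | cut -c1-2)   # fan out to avoid one huge directory
    mkdir -p "store/$prefix"
    cp "$1" "store/$prefix/$hash"
    echo "$hash"    # the address a caller would record (e.g. in the node tree)
}

get_blob() {
    prefix=$(printf '%s' "$1" | cut -c1-2)
    cat "store/$prefix/$1"
}
```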