ZFS will outscale ext4 in parallel workloads with ease. XFS will often scale bet...

jeltz · on Jan 14, 2025

Not on most database workloads. There zfs does not scale very well.

ryao · on Jan 14, 2025

Percona and many others who benchmarked this properly would disagree with you. Percona found that ext4 and ZFS performed similarly when given identical hardware (with proper tuning of ZFS):

https://www.percona.com/blog/mysql-zfs-performance-update/

In this older comparison where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it when tuning was done and a L2ARC was added:

https://www.percona.com/blog/about-zfs-performance/

This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.

That said, ZFS has a killer feature that ext4 and XFS do not have, which is low cost replication. You can snapshot and send/recv without affecting system performance very much, so even in situations where ZFS is not at the top in every benchmark such as being on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.

LtdJorge · on Jan 15, 2025

There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, specially on high speed NVMe.

The article provides great insight into optimizing ZFS, but using an EBS volume as store (with pretty poor IOPS) and then giving the NVMe as metadata cache only for ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we store set XFS with metadata and log to a RAMFS it will beat ZFS :)

ryao · on Jan 15, 2025

L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in memory Adaptative Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.

If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at price point that XFS cannot match is competition, not cheating.

ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:

https://www.percona.com/blog/zfs-for-mongodb-backups/

Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?

LtdJorge · on Jan 17, 2025

No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes. The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

About the inherent advantages of ZFS like send/recv, I have nothing to say. I know how good they are. It's one reason I use ZFS.

> If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do.

What does proper testing here mean? And what does "if you scale it" mean? Genuinely. From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting. What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

Edit: P4800x, actually. The flash disk are D5-P5530.

ryao · on Jan 17, 2025

> No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes.

That makes sense.

> The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

it is a balancing act. It is a feature ZFS has that XFS does not, but it is ridiculous to use a device that can fit the entire database as L2ARC, since in that case, you can just use that device directly and keeping it as a cache for ZFS does not make for a fair or realistic comparison. Fast devices that can be used with tiered storage are generally too small to be used as main storage, since if you could use them as main storage, you would.

With the caveat that the higher tier should be too small to be used as main storage, you can get a huge boost from being able to use it as cache in tiered storage, and that is why ZFS has L2ARC.

> What does proper testing here mean? And what does "if you scale it" mean?

Let me preface my answer by saying that doing good benchmarks is often hard, so I can't give a simple answer here. However, I can give a long answer.

First, small databases that can fit entirely in RAM cache (be it the database's own userland cache or a kernel cache) are pointless to benchmark. In general, anything can run that well (since it is really running out of RAM as you pointed out). The database needs to be significantly larger than RAM.

Second, when it comes to using tiered storage, the purpose of doing tiering is that the faster tier is either too small or too expensive to use for the entire database. If the database size is small enough that it is inexpensive to use the higher tier for general storage, then a test where ZFS gets the higher tiered storage for use as cache is neither fair nor realistic. Thus, we need to scale the database to a larger size such that the higher tier being only usable as cache is a realistic scenario. This is what I had in mind when I said "if you scale it".

Third, we need to test workloads that are representative of real things. This part is hard and the last time I did it was 2015 (I had previously said 2016, but upon recollection, I realized it was likely 2015). When I did, I used a proprietary workload simulator that was provided by my job. It might have been from SPEC, but I am not sure.

Fourth, we need to tune things properly. I wrote the following documentation years ago describing correct tuning for ZFS:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

At the time I wrote that, I omitted that tuning the I/O elevator can also improve performance, since there is no one size fits all advice for how to do it. Here is some documentation for that which someone else wrote:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

If you are using SSDs, you could probably just get away with setting each of the maximum asynchronous queue depth limits to something like 64 (or even 256) and benchmark that.

> From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting.

In 2015 when I did database benchmarks, ZFS and XFS were given equal hardware. The hardware was a fairly beefy EC2 instance with 4x high end SSDs. MD RAID 0 was used under XFS while ZFS was given the devices in what was effectively a RAID 0 configuration. With proper tuning (what I described earlier in this reply), I was able to achieve 85% of XFS performance in that configuration. This was considered a win due to the previously stated reason of performance under database backups. ZFS has since had performance improvements done, which would probably narrow the gap. It now uses B-Trees internally to do operations faster and also now has redundant_metadata=most, which was added for database workloads.

Anyway, on equal hardware in a general performance comparison, I would expect ZFS to lose to XFS, but not by much. ZFS' ability to use tiered storage and do low overhead backups is what would put it ahead.

> What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

You need to have a database whose size is so big that optane storage is not practical to use for main storage. Then you need to setup ZFS with Optane storage as L2ARC. You can give regular flash drives to ZFS and XFS on MD RAID in a comparable configuration (RAID 0 to make life easier, although in practice you probably want to use RAID 10). You will want to follow best practices for tuning the database and filesystems (although from what I know, XFS has remarkably few knobs). You could give XFS the optane devices to use for metadata and its journal for fairness, although I do not expect it to help XFS enough. In this situation, ZFS should win on performance.

You would need to pick a database for this. One option would be PostgreSQL, which is probably the main open source database that people would scale to such levels. The pgbench tool likely could be used for benchmarking.

https://www.postgresql.org/docs/current/pgbench.html

You would need to pick a scaling factor that will make the database big enough and do a workload simulating a large number of clients (what is large is open to interpretation).

Finally, I probably should add that the default script used by pgbench probably is not very realistic for a database workload. A real database will have a good proportion of reads from select queries (at least 50%) while the script that is being used does a write mostly workload. It probably should be changed. How is probably an exercise best left for a reader. That is not the answer you probably want to hear, but I did say earlier in this reply that doing proper benchmarks is hard, and I do not know offhand how to adjust the script to be more representative of real workloads. That said, there is definite utility in benchmarking write mostly workloads too, although that utility is probably more applicable for the database developers than as a way to determine which of two filesystems is better for running the database.

LtdJorge · on Jan 19, 2025

Thanks for the long post. Sorry for the nerd snipe, it might or might not have been intentional :D

I agree with what you said. I'll test what you provided, first with fio and then with Postgres (was also my choice beforehand) with a TPC-E benchmark. If I remember, I'll let you know. Postgres on ZFS is specially difficult to be sure about just from theory around the internet, there's too much contradiction or outdated info.

menaerus · on Jan 15, 2025

Refuting the "it doesn't scale" argument with a data from a blog that showcases a single workload (TPC-C) with 200G+10tables dataset (small to medium) at 2vCPU (wtf) machine with 16 connections (no thread pool so overprovisioned) is not quite a definition of a scale at all. It's a lost experiment if anything.

ryao · on Jan 15, 2025

The guy did not have any data to justify his claims of not scaling. Percona’s data says otherwise. If you don’t like how they got their data, then I advise you to do your own benchmarks.

jeltz · on Jan 15, 2025

It is based on data from internal benchmarks. Zfs is fine for database workloads but scales worse than Xfs based on my personal experience. It is unpublished benchmarks and I do not have access to any farm to win a discussion on the internet.

ryao · on Jan 15, 2025

I did internal benchmarks at ClusterHQ in 2016. Those benchmarks showed that a tuned ZFS FS of the time had 85% the performance of XFS on equal hardware (a beefy EC2 instance with 4 SSDs, with XFS using MD RAID 0), but it was considered a win for ZFS because of the performance difference when running backups. L2ARC was not considered since the underlying storage was already SSD based and there was nothing faster, but in practice, you often can use it with a faster tier of storage and that puts ZFS ahead even without considering the substantial performance dips of backups.

menaerus · on Jan 15, 2025

I don't have anything to like or not to like. I'm not a user of ZFS filesystem. I'm just dismissing your invalid argumentation. Percona's data is nothing about the scale for reasons I already mentioned.

ryao · on Jan 15, 2025

The argument he made was invalid without data to back it up. I at least cited something. The remarks on the performance when backups are made and the benefits of L2ARC were really the most important points, and are far from invalid.

bayindirh · on Jan 14, 2025

No doubt. I want to reiterate my point. Citing myself:

> "I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is." (emphasis mine)

We are no strangers to filesystems. I personally benchmarked a ZFS7320 extensively, writing a characterization report, plus we have a ZFS7420 for a very long time, complete with separate log SSDs for read and write on every box.

However, ZFS is not saturation proof, plus is nowhere near a Lustre cluster performance wise, when scaled.

What kills ZFS and BTRFS on desktop systems are write performance, esp. on heavy workloads like system updates. If I need a desktop server (performance-wise), I'd configure it accordingly and use these, but I'd never use BTRFS or ZFS on a single root disk due to their overhead, to reiterate myself thrice.

ryao · on Jan 14, 2025

I am generally happy with the write performance of ZFS. I have not noticed slow system updates on ZFS (although I run Gentoo, so slow is relative here). In what ways is the write performance bad?

I am one of the OpenZFS contributors (although I am less active as late). If you bring some deficiency to my attention, there is a chance I might spend the time needed to improve upon it.

By the way, ZFS limits the outstanding IO queue depth to try to keep latencies down as a type of QoS, but you can tune it to allow larger IO queue depths, which should improve write performance. If your issue is related to that, it is an area that is known to be able to use improvement in certain situations:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

bayindirh · on Jan 15, 2025

What I see with CoW filesystems is, when you force the FS to sync a lot (like apt does to keep immunity against power losses to a maximum), the write performance slouches visibly. This also means that when you're writing a lot of small files with a lot of processes and flood the FS with syncs, you get the same slouching, making everything slower in the process. This effect is better controlled in simpler filesystems, namely XFS and EXT4. This is why I keep backups elsewhere and keep my single disk rootfs on "simple" filesystems.

I'll be installing a 2 disk OpenZFS RAID1 volume on a SBC for high value files soon-ish, and I might be doing some tests on that when it's up. Honestly, I don't expect stellar performance since I'll be already putting it on constrained hardware, but let you know if I experience anything that doesn't feel right.

Thanks for the doc links, I'll be devouring them when my volume is up and running.

Where do you prefer your (bug and other) reports? GitHub? E-mail? IP over Avian Carriers?

ryao · on Jan 15, 2025

Heavy synchronous IO from incredibly frequent fsync is a weak point. You can make it better using SLOG devices. I realize what I am about to say is not what you want to hear, but any application doing excessive fsync operations is probably doing things wrong. This is a view that you will find prevalent among all filesystem developers (i.e. the ext4 and XFS guys will have this view too). That is because all filesystems run significantly faster when fsync() is used sparingly.

In the case of APT, it should install all of the files and then call sync() once. This is equivalent of calling fsync on every file like APT currently does, but aggregates it for efficiency. The reason APT does not use sync() is probably a portability thing, because the standard does not require sync() to be blocking, but on Linux it is:

https://www.man7.org/linux/man-pages/man2/sync.2.html

From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package. Thus it does not really matter for power loss protection if you are using fsync() on all files or sync() once for all files, since what must happen next to fix it is the same. However, from a performance perspective, it really does matter.

That said, slow fsync performance generally is not an issue for desktop workloads because they rarely ever use fsync. APT is the main exception. You are the first to complain about APT performance in years as far as I know (there were fixes to improve APT performance 10 years ago, when its performance was truly horrendous).

You can file bug reports against ZFS here:

https://github.com/openzfs/zfs

I suggest filing a bug report against APT. There is no reason for it to be doing fsync calls on every file it installs in the filesystem. It is inefficient.

bayindirh · on Jan 15, 2025

Actually this was discussed recently [0]. While everybody knows it's not efficient, it's required to keep update process resilient against unwanted shutdowns (like power losses which corrupt the filesystem due to uncommitted work left on the filesystem).

> From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package.

Yes, but at least you have all the files, otherwise you can have 0 length files which can prevent you from booting your system. In this case, your system boots, all files are in place, but some packages are in semi-configured state. Believe me, apt can recover from many nasty corners without any ill effects as long as all files are there. I used to be a tech-lead for a Debian derivative back in the day, so I lived in the trenches in Debian for a long time, so I have seen things.

Again it's decided that the massive sync will stay in place for now, because the risks involved in the wild doesn't justify the performance difference yet. If you prefer to be reckless, there's "eatmydata" and "--force-unsafe-io" options baked in already.

Thanks for the links, I'll let you know if I find something. I just need to build the machine from the parts I have, then I'll be off to the races.

[0]: https://lists.debian.org/debian-devel/2024/12/msg00533.html [warning, long thread]

ryao · on Jan 15, 2025

This email mentions a bunch of operations that are done per file to ensure the file put in the final location always has the correct contents:

https://lists.debian.org/debian-devel/2024/12/msg00540.html

It claims that the fsync is needed to avoid the file appearing at the final location with a zero length after a power loss. This is not true on ZFS.

ZFS puts every filesystem operation into a transaction group that is committed atomically about every 5 seconds by default. On power loss, the transaction group either succeeds or never happens. The result is that even without using fsync, there will never be a zero length file at the final location because the rename being part of a successful transaction group commit implies that the earlier writes also were part of a successful transaction group commit.

The result is that you can use --force-unsafe-io with dpkg on ZFS, things will run faster and there should be no issues for power loss recovery as far as zero length files go.

The following email mentions that sync() had been used at one point but caused problems when flash drives were connected, so it was dropped:

https://lists.debian.org/debian-devel/2024/12/msg00597.html

The timeline is unclear, but I suspect this happened before Linux 2.6.29 introduced syncfs(), which would have addressed that. Unfortunately, it would have had problems for systems with things like a separate /usr mount, which requires the package manager to realize multiple syncfs calls are needed. It sounds like dpkg was calling sync() per file, which is even worse than calling fsync() per file, although it would have ensured that the directory entries for prior files were there following a power loss event.

The email also mentions that fsync is not called on directories. The result is that a power loss event (on any Linux filesystem, not just ZFS) could have the files missing from multiple packages marked as installed in the package database, which is said to use fsync to properly record installations. I find this situation weird since I would use sync() to avoid this, but if they are comfortable having systems have multiple “installed” packages missing files in the filesystem after a power loss, then there is no need to use sync().

gf000 · on Jan 15, 2025

Hi! I am quite a beginner when it comes to file systems. Would this sync effect not be helped by direct IO in ZFS's case?

Also, given that you seem quite knowledgeable of the topic, what is your go-to backup solution?

I initially thought about storing `zfs send` files into backblaze (as backup at a different location), but without recv-ing these, I don't think the usual checksumming works properly. I can checksum the whole before and after updating, but I'm not convinced if this is the best solution.

ryao · on Jan 15, 2025

No, it will not. It would be helped by APT switching to using a single sync/syncfs call after installing all files, which is the performant way to do what it wants on Linux:

https://www.man7.org/linux/man-pages/man2/sync.2.html

ryao · on Jan 15, 2025

After studying the DPKG developers’ reasoning for using fsync excessively, it turns out that there is no need for them to use fsync on a ZFS rootfs. When the rootfs is ZFS, you can use --force-unsafe-io to skip the fsync operations for a speed improvement and there will be no safety issues due to how ZFS is designed.

DPKG will write each file to a temporary location and then rename it to the final location. On ext4, without fsync, when a power loss event occurs, it is possible for the rename to the final location to be done, without any of the writes such that you have a zero length file. On ZFS, the rename being done after the writes means that the rename being done implies the writes were done due to the sequential nature of ZFS’ transaction group commit, so the file will never appear in the final location without the file contents following a power loss event, which is why ZFS does not need the fsync there.