The Linux Filesystem Cache is Braindead

Say you have 2 GiB of RAM and you copy 2 GiB's worth of data from one disk directory to another. That will effectively flush the Linux filesystem cache (the page cache), and you don't even have to be root. This effect is called cache pollution. Anything you do afterwards has to reload the files it needs from disk, which means that the system will always respond slowly after copying large files.

The Linux cache subsystem does not realise that you are copying a set of large files just once, that they would not fit in the cache anyway, and that other often-used files should remain in the cache instead. The tool nocache is supposed to come to the rescue, but it is too brutal: it just turns off any caching for the given operation. As a result, if you copy a large number of small files, it will take forever.

Well-behaved applications should provide an option to use posix_fadvise( POSIX_FADV_NOREUSE ) in order to minimise the impact of large sequential reads. Unfortunately, Linux has long treated flag POSIX_FADV_NOREUSE as a no-op. A patch was proposed years ago, but it seems it never made it. If you want to look yourself, check out kernel source file mm/fadvise.c.
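As an illustration, this is roughly what such an option could do inside an application. It is only a minimal sketch in Python, assuming a Unix system where os.posix_fadvise is available. The function name sha256_with_noreuse and the chunk size are made up for this example. The call is purely advisory, so on a kernel that ignores the flag the code behaves like a normal sequential read.

```python
import hashlib
import os


def sha256_with_noreuse(path, chunk_size=1024 * 1024):
    """Hash a file sequentially, hinting that its data will not be reused.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 mean "the whole file". This is only a hint:
        # the kernel is free to ignore it.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    finally:
        os.close(fd)
    return digest.hexdigest()
```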

Alternatively, an application can call posix_fadvise( POSIX_FADV_DONTNEED ) at regular intervals (before reading too much data at once).
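The following sketch shows one way to do that while copying a file. It is written in Python and assumes a Unix system with os.posix_fadvise and os.fdatasync. The names copy_dropping_cache and _drop, and the 32 MiB interval, are arbitrary choices for this example and not taken from any existing tool. Dirty pages cannot be dropped, so the destination is flushed to disk before each POSIX_FADV_DONTNEED call.

```python
import os


def _drop(src_fd, dst_fd, offset, length):
    """Ask the kernel to evict the given byte range of both files."""
    # Dirty pages cannot be dropped, so write the destination out first.
    os.fdatasync(dst_fd)
    os.posix_fadvise(src_fd, offset, length, os.POSIX_FADV_DONTNEED)
    os.posix_fadvise(dst_fd, offset, length, os.POSIX_FADV_DONTNEED)


def copy_dropping_cache(src, dst, chunk_size=1024 * 1024,
                        drop_every=32 * 1024 * 1024):
    """Copy src to dst, dropping cached pages every drop_every bytes.

    Returns the number of bytes copied. Hypothetical helper for illustration.
    """
    src_fd = os.open(src, os.O_RDONLY)
    try:
        dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            copied = 0   # bytes copied so far
            dropped = 0  # bytes already covered by a DONTNEED call
            while True:
                chunk = os.read(src_fd, chunk_size)
                if not chunk:
                    break
                view = memoryview(chunk)
                while view:  # os.write may write less than requested
                    written = os.write(dst_fd, view)
                    view = view[written:]
                copied += len(chunk)
                if copied - dropped >= drop_every:
                    _drop(src_fd, dst_fd, dropped, copied - dropped)
                    dropped = copied
            if copied > dropped:
                # Drop the tail that did not fill a whole interval.
                _drop(src_fd, dst_fd, dropped, copied - dropped)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return copied
```

Note that POSIX_FADV_DONTNEED evicts the pages no matter who cached them, so this approach also throws away data that another process may have wanted to keep in the cache.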

Some people suggest opening files with O_DIRECT instead, but that is problematic:
 * First of all, Linus Torvalds himself says "There really is no valid reason for EVER using O_DIRECT".
 * A second consideration is that O_DIRECT bypasses caching, while POSIX_FADV_NOREUSE still allows for some caching, which may actually be better, depending on the situation.
 * Yet another aspect is that O_DIRECT does no read-ahead, which can easily kill performance in many simple applications.
 * Furthermore, O_DIRECT is not part of the POSIX standard, and even on Linux some filesystems are not compatible with it. On the other hand, posix_fadvise is standardised by POSIX and should be more portable.
 * Finally, O_DIRECT demands a certain memory alignment, which can be difficult to achieve in a scripting language like Perl.

Cache pollution is especially problematic when you run virtual machines, because a single VM can severely degrade disk performance for the whole host system, including all other VMs. This is why, when using libvirt/KVM/QEMU, it's often best to bypass the host's page cache with QEMU's option cache=none. That option, and the alternative option cache=directsync, both use flag O_DIRECT.

Another way to limit file cache usage is to run the program in a memory-limited cgroup. Unfortunately, such a limit affects all memory usage within the cgroup, and not just the file cache. The only tool I found to painlessly create a temporary cgroup is systemd-run, and even that is not without rough edges. See script background.sh in my tools repository for an attempt to make it user friendly.
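To give an idea of the systemd-run approach, here is a small Python sketch that builds and runs such a command line. It is not what background.sh does. The helper names and the 512M default are invented for this example, and it assumes a cgroup v2 system where the memory controller is available to the user's systemd instance, so that the MemoryMax property takes effect.

```python
import subprocess


def systemd_run_argv(command, memory_max="512M"):
    """Build a systemd-run command line that caps the memory of `command`.

    Hypothetical helper; assumes cgroup v2 with the memory controller
    delegated to the user's systemd instance.
    """
    return [
        "systemd-run",
        "--user",    # use the caller's own systemd instance, no root needed
        "--scope",   # run in the foreground inside a transient scope unit
        "--quiet",   # suppress systemd-run's own informational output
        "--property=MemoryMax=" + memory_max,
        "--",        # everything after this belongs to the command
    ] + list(command)


def run_memory_limited(command, memory_max="512M"):
    """Run `command` under the memory limit and return its exit status."""
    return subprocess.run(systemd_run_argv(command, memory_max)).returncode
```

For example, run_memory_limited(["cp", "-r", "big-dir", "/mnt/backup"], "1G") would run the copy in a temporary scope limited to 1 GiB. The limit covers the page cache charged to that scope, but also the program's own memory, as mentioned above.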

For more information about the Linux cache behaviour, check out rsync bug 9560 drop-cache option.

There is a tool called nocache which may help in some scenarios. However, nocache does not help for very big files, because it only calls posix_fadvise( POSIX_FADV_DONTNEED ) once when closing a file. I have filed a bug report against nocache asking the author to document this limitation.

There is an attempt to mitigate this problem when reading (not so much when writing). See LWN's article "Buffered I/O without page-cache thrashing".

I never saw this cache flushing effect with Microsoft Windows, but I suspect that this issue is not limited to Linux. Please drop me a line if you know what other OSes are affected.

= The Obligatory Rant =

This cache pollution issue is very annoying. It has been there for many, many years, probably since the beginning. I know about it, and I still get caught all the time, because it is not intuitive to keep thinking about cache pollution every time you create a VM or you just copy or compress a bunch of biggish files.

Unfortunately, I do not think that this issue will be fixed any time soon:
 * It is probably not easy to fix, especially now that the Linux kernel has become so complex. Such issues are normally easy to tackle only if the original design already catered for it.
 * Well-paid professionals know about it and implement workarounds for their operations. Therefore, there is no immediate economic pressure.
 * It does not show up in public benchmarks. Or did you ever see a disk cache pollution benchmark comparing different operating systems? This problem is like cooking TV shows: only the result matters, leaving a mess in the kitchen does not count.
 * Throwing hardware at the problem (SSDs) mitigates it. SSDs do not solve the issue, and even make it harder to pinpoint the cause of such a performance hit, as only the very slow HDDs make it obvious. But with a fast SSD, most desktop people will happily live with it.

This is one of Linux's big contradictions. Linus and his maintainers are very fussy about code quality and quickly reject patches even due to trivial matters like source styling/formatting. However, big issues like disk cache pollution, affecting thousands of people every day, are left unresolved for too many years.

The Linux community is actually being dishonest with its users here. You would expect a prominent warning about this issue in some README file, perhaps in the Linux kernel, or in the User's Guide of most Linux distributions: some "known problems" or "caveats" section stating "Warning: disk cache pollution can easily cause severe performance drops". But there is only silence.