Write cache horrors
fsync and fdatasync
A few days ago someone asked, in the ##slackware IRC channel at FreeNode, if there was a simple way to encrypt a small piece of data on a hard drive. Instead of applying encryption to the whole filesystem or other heavyweight measures, I recommended GnuPG. Generating a key pair and using it to encrypt one file is a short and easy process. You only have to be careful when choosing the private key password, and when deleting the original file. When you tell GnuPG to encrypt a file, it creates a new one with the encrypted data, but the original file stays on disk. If you don't want anyone to see those contents, you have to be careful removing that file: a simple file deletion is not enough, because the data blocks remain on disk and can be recovered until they are overwritten.

Fortunately, there's a tool made for this problem, called shred and distributed as part of GNU coreutils. shred takes the file and, by default, overwrites its contents with random data 25 times. Optionally, you can make shred do a final pass overwriting the file contents with zeros and, also optionally, delete the file. The goal is to make the original data unavailable to forensic analysis tools, since it has been shown that the original contents of a hard drive can sometimes be recovered even after they have been overwritten a few times.

Up to here, that's what I knew. However, a user nicknamed fred pointed out a flaw in this process, and I have to thank him for pushing me to find out more about the matter. fred argued that, even if shred calls fsync or fdatasync after each write pass (which it does; it calls fdatasync), the data would only reach the device's write cache, which is normally enabled. When fdatasync returns, the data has not really been written to the disk surface.

As horrible as it sounds, the story is true. As of today, under many operating systems and, for sure, under Linux, fdatasync does not request that the data be flushed from the device's write cache. When it returns, your data may not have reached the disk surface yet, contrary to what, I think, any programmer would expect; the manpage for fsync and fdatasync on my system does mention this issue in the notes section. This is a problem for shred. Even with no power failure involved, after shred has finished its job, all you can really guarantee is that, in the very near future, the data on the disk surface will have been overwritten at least once, at least for small files. I don't know if this holds when the file contents are bigger than the disk cache. Granted, this is a pessimistic view and the data has probably been overwritten more times, but the danger is there, and it is surprising that the kernel doesn't even try to send a flush-cache command to the hard drive.

There are rumours, and I must make it clear that they are only rumours as I could not verify anything, that some drives ignore flush-cache requests in order to look better in benchmarks, which would make the problem even worse. But, at the very least, the kernel should probably ask the disk to flush the write cache before returning from fsync and fdatasync, even if the current behaviour is, strictly speaking, allowed by the POSIX standard. That's no excuse, in my humble opinion. Those two system calls are the only ones POSIX offers if you don't want to rely on system-specific functions, when those even exist.
They are the key to creating applications that can guarantee data consistency. The manpage reads like "these system calls are part of the POSIX standard and the only portable way of making sure your critical data in this file is already on disk before the program continues… (by the way, they may not work at all)". They are of little use if they don't work as expected and flush the write cache.
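Coming back to the IRC question, the sequence I recommended looks roughly like this (the file name and recipient are placeholders; -z and -u are the optional final-zero-pass and delete flags mentioned above):

```
$ gpg --encrypt --recipient friend@example.org secret.txt   # writes secret.txt.gpg; secret.txt stays on disk
$ shred -z -u secret.txt                                    # overwrite with random data, add a zero pass, then delete
```

And it is precisely that shred step whose final passes may still be sitting in the drive's write cache when the command returns.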
I am not alone in thinking that the kernel should flush the write cache in these calls, even if it comes at a performance cost. And our shred problem is nothing compared to the database software out there that calls fsync or fdatasync to try to keep its data consistent. Linux is, after all, used in production servers holding databases with critical data, and you would expect that when someone calls fdatasync, the data is sent to the disk surface and the function doesn't return until the write has been performed. At the moment, I think the only way to request a write cache flush from userspace in Linux is to run hdparm -F /disk/device. This, in turn, seems to call ioctl with some Linux-specific arguments, but it requires root access, and ioctl isn't POSIX either. You can also enable, disable and query the write cache with hdparm -W. Disabling the write cache permanently on a hard drive is not recommended: while it may improve performance under some very specific loads, in general your computer will become much, much slower, and the life of your hard drive will be severely shortened. Still, before using shred, you may want to disable the write cache and re-enable it afterwards, to try to make sure the data really is being overwritten 25 times (a sketch of the exact commands follows the timings below). Let's see an example with a small file, around 4 KiB, called test. First, with the disk write cache enabled:
```
$ time shred test

real    0m0.074s
user    0m0.014s
sys     0m0.017s
```
Then, after disabling the write cache:
```
$ time shred test

real    0m1.327s
user    0m0.011s
sys     0m0.022s
```
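For reference, the cache toggling around runs like these could look as follows. I'm assuming the drive is /dev/sda (adjust for your system), and these commands need root:

```
# hdparm -W /dev/sda     # query whether the write cache is enabled
# hdparm -W 0 /dev/sda   # disable the write cache
# shred test             # the overwrite passes should now reach the platters
# hdparm -W 1 /dev/sda   # re-enable the write cache
# hdparm -F /dev/sda     # alternatively, ask the drive to flush its write cache
```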
The difference between the two runs is dramatic. I don't know how many times the data is actually being overwritten with the write cache enabled, but this test suggests it is not 25. The write cache problem affects security applications like shred, as well as data consistency in other applications. While investigating this topic I reached a thread started on the Linux Kernel Mailing List (LKML) by Jamie Lokier on February 26, 2008, in which he proposes a solution. As I didn't know if there had been any progress in this area, even though the shred test doesn't suggest so, I emailed him directly and asked. In his reply, he says:
As far as I know, there has been no progress on code, but it’s nice to see higher awareness.
This situation, surprising for some of us, makes you give much more importance to UPSs. If you have a server with critical data, it's very important to have a UPS in place, because it lets you at least shut the machine down cleanly, making sure the data sitting in the write cache is written to the disk surface before the computer loses power. If power is lost right after a database system has called fdatasync, you face possible nasty database corruption: the database can't guarantee consistency despite calling fdatasync. If you perform some writes, call fdatasync, perform more writes and then call fdatasync a second time, all in a short period of time, there's no guarantee that the first set of writes is on disk when the second set begins. If this process is interrupted, you may end up with inconsistent data. This introduces the concept of write ordering, which is also very important for journaled filesystems.
Write ordering
The write cache also affects the order in which data is written to your hard drive, and this matters. The cache's whole purpose is to make the hard drive work better. Without a write cache, every time you sent data to the disk it would be written immediately, and the order of the writes would be the order in which you passed the data to the drive. Often this would keep the disk rotating constantly while the head jumps from one position to another. With a write cache in place, the drive can first accumulate an amount of data to be written and, once it has reached a critical mass, or some time has passed without new data arriving, write it to the disk surface in an optimized order, minimizing how much the disk has to rotate and how much the head has to move. This makes the drive work much faster and, as a bonus, last longer.

However, this conflicts with an aspect of journaled filesystems, like Linux's ext3. I'm sorry for any lies I may tell in the following sentences. You can think of a journaled filesystem as writing data in three steps. First, it writes a journal entry describing what it's about to do. Then it does it. Finally, it puts a mark in the journal indicating that it has finished doing what it said it would do. It is very important that those steps are performed in that order, and that no step starts before the previous one has finished. If the process is interrupted, the system can check whether the first journal entry was written completely. If not, it knows no other steps were performed, and it can discard the journal entry. If the entry is complete but there is no confirmation that the work was done, it can redo the work, because all the modifications are described in the journal.
So far we know the kernel doesn't try to flush the write cache in fsync and fdatasync, but does it use anything when working with the journal of some filesystems? Yes: since kernel 2.6.9 this process was cleaned up with the introduction of so-called barriers. A barrier tries to guarantee that a write cannot proceed until all the pending data has been written to the hard drive, which is exactly what the three-step journal sequence needs. So the only remaining question is: are barriers enabled by default in journaled filesystems? The answer is no. As sad and surprising as it sounds, they are not used by default, and the disk write cache can reorder those three critical steps. A few practical factors make it unlikely, fortunately, that an ext3 filesystem gets corrupted during a power loss, which is why filesystem corruption reports are few and far between for ext3; but the danger is there, and there are artificial test programs that can corrupt an ext3 filesystem with a high success rate if it is interrupted while they run. From what I have read, SuSE has been shipping kernels patched to enable barriers by default for some time. Maybe barriers will eventually be enabled by default; as of today, they are not. They can be enabled, though, with a mount option in fstab or passed to the mount command. The exact option depends on the filesystem type; for ext3, it's barrier=1. This is mentioned in a Gentoo guide I came across while researching, and more prominently in the Wikipedia entry for ext3. Enabling barriers is said to cause, in some situations, a 30% reduction in hard drive performance (a comment from Alan Cox in one of the LKML threads). For a desktop or laptop computer holding important personal data you don't want to lose, I'd enable barriers even if that were a constant performance hit. Performance is the main reason features like this are often not enabled by default.
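For instance, an fstab entry enabling barriers for an ext3 root filesystem could look like this (the device and mount point are just examples):

```
/dev/sda1  /  ext3  defaults,barrier=1  1  1
```

To try it without rebooting, you can also remount a mounted filesystem with `mount -o remount,barrier=1 /`.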
Conclusion
Remember to enable barriers on your personal computer, disable the write cache when using shred and, in more critical environments, rely on your laptop battery or a UPS if possible, and check that it works. Your data may not be as safe as you thought because of the device write cache. Here are some of the links I used while researching the write cache problems.
- Proposal for "proper" durable fsync() and fdatasync() --LKML
- [PATCH 0/4] (RESEND) ext3[34] barrier changes --LKML