
Optimizing stable pages


By Jonathan Corbet
December 5, 2012
The term "stable pages" refers to the concept that the system should not modify the data in a page of memory while that page is being written out to its backing store. Much of the time, writing new data to in-flight pages is not actively harmful; it just results in the writing of the newer data sooner than might be expected. But sometimes, modification of in-flight pages can create trouble; examples include hardware where data integrity features are in use, higher-level RAID implementations, or filesystem-implemented compression schemes. In those cases, unexpected data modification can cause checksum failures or, possibly, data corruption.

To avoid these problems, the stable pages feature was merged for the 3.0 development cycle. This relatively simple patch set, by Darrick Wong, ensures that any thread trying to modify an under-writeback page blocks until the pending write operation is complete. It appeared to solve the problem; by blocking inopportune data modifications, potential problems were avoided and everybody would be happy.

Except that not everybody was happy. In early 2012, some users started reporting performance problems associated with stable pages. In retrospect, such reports are not entirely surprising; any change that causes processes to block and wait for asynchronous events is unlikely to make things go faster. In any case, the reported problems were more severe than anybody expected, with multi-second stalls being observed at times. As a result, some users (Google, for example) have added patches to their kernels to disable the feature. The performance costs are too high, and, in the absence of a use case like those described above, there is no real advantage to using stable pages in the first place.

So now Darrick is back with a new patch set aimed at improving this situation. The core idea is simple enough: a new flag (BDI_CAP_STABLE_WRITES) is added to the backing_dev_info structure used to describe a storage device. If that flag is set, the memory management code will enforce stable pages as is done in current kernels. Without the flag, though, attempts to write to a page will not be forced to wait for any current writeback activity. So the flag gives the ability to choose between a slow (but maybe safer) mode and a higher-performance mode.
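
In rough terms, the check in the write path might look something like the sketch below. This is an illustration rather than the actual patch; BDI_CAP_STABLE_WRITES is the flag described above, but the helper names here are made up for clarity.

    #include <linux/backing-dev.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Illustrative helper: does this device require stable pages? */
    static inline bool bdi_requires_stable_writes(struct backing_dev_info *bdi)
    {
            return bdi->capabilities & BDI_CAP_STABLE_WRITES;
    }

    /* Hypothetical hook, called before a page may be re-dirtied. */
    static void maybe_wait_for_stable_page(struct page *page)
    {
            struct address_space *mapping = page_mapping(page);

            if (mapping && bdi_requires_stable_writes(mapping->backing_dev_info))
                    wait_on_page_writeback(page);  /* block until writeout completes */
            /* otherwise the write proceeds even if the page is in flight */
    }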

Much of the discussion around this patch set has focused on just how that flag gets set. One possibility is that the driver for the low-level storage device will turn on stable pages; that can happen, for example, when hardware data integrity features are in use. Filesystem code could also enable stable pages if, for example, it is compressing data transparently as that data is written to disk. Thus far, things work fine: if either the storage device or the filesystem implementation requests stable pages, they will be enforced; otherwise things will run in the faster mode.
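
As a concrete (but hypothetical) illustration of the lower-level side, a block driver that has turned on hardware data integrity could request stable pages by setting the capability bit on its queue's backing_dev_info; the function name below is invented for the example.

    #include <linux/backing-dev.h>
    #include <linux/blkdev.h>

    /* Hypothetical example: a driver using data integrity (DIF/DIX)
     * asks the memory management code to enforce stable pages. */
    static void example_enable_stable_pages(struct request_queue *q)
    {
            q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
    }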

The real question is whether the system administrator should be able to change this setting. Initial versions of the patch gave complete control over stable pages to the user by way of a sysfs attribute, but a number of developers complained about that option. Neil Brown pointed out that, if the flag could change at any time, he could never rely on it within the MD RAID code; stable pages that could disappear without warning at any time might as well not exist at all. So there was little disagreement that users should never be able to turn off the stable-pages flag. That left the question of whether they should be able to enable the feature, even if neither the hardware nor the filesystem needs it, presumably because it would make them feel safer somehow. Darrick had left that capability in, saying:

I dislike the idea that if a program is dirtying pages that are being written out, then I don't really know whether the disk will write the before or after version. If the power goes out before the inevitable second write, how do you know which version you get? Sure would be nice if I could force on stable writes if I'm feeling paranoid.

Once again, the prevailing opinion seemed to be that there is no actual value provided to the user in that case, so there is no point in making the flag user-settable in either direction. As a result, subsequent updates from Darrick took that feature out.

Finally, there was some disagreement over how to handle the ext3 filesystem, which is capable of modifying journal pages during writeback even when stable pages are enabled. Darrick's patch changed the filesystem's behavior in a significant way: if the underlying device indicates that stable pages are needed and the filesystem is to be mounted in data=ordered mode, it will complain and force a read-only mount. The idea was that, now that the kernel could determine that a specific configuration was unsafe, it should refuse to operate in that mode.

At this point, Neil returned to point out that, with this behavior, he would not be able to set the "stable pages required" flag in the MD RAID code. Any system running an ext3 filesystem over an MD volume would break, and he doesn't want to deal with the subsequent bug reports. Neil has requested a variant on the flag whereby the storage level could request stable pages on an optional basis. If stable pages are available, the RAID code can depend on that behavior to avoid copying the data internally. But that code can still work without stable pages (by copying the data, thus stabilizing it) as long as it knows that stable pages are unavailable.

Thus far, no patches adding that feature have appeared; Darrick did, however, post a patch set aimed at simply fixing the ext3 problem. It works by changing the stable page mechanism to not depend on the PG_writeback page flag; instead, it uses a new flag called PG_stable. That allows the journaling layer to mark its pages as being stable without making them look like writeback pages, solving the problem. Comments from developers have pointed out some issues with the patches, not the least of which is that page flags are in extremely short supply. Using a flag to work around a problem with a single, old filesystem may not survive the review process.
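
For illustration only (this is not the posted patch), wiring up a new page flag like PG_stable would follow the usual pattern: add PG_stable to the pageflags enumeration, generate the accessors, have the journaling layer set the bit before submitting a page's buffers, and have the write path wait on that bit rather than on PG_writeback.

    #include <linux/page-flags.h>
    #include <linux/pagemap.h>

    /* Assumes PG_stable has been added to enum pageflags; PAGEFLAG()
     * then generates PageStable()/SetPageStable()/ClearPageStable(). */
    PAGEFLAG(Stable, stable)

    /* Journaling layer: mark a page stable before submitting its buffers. */
    static void mark_page_stable(struct page *page)
    {
            SetPageStable(page);
    }

    /* Write path: wait on the stable bit instead of PG_writeback.
     * The I/O completion path would ClearPageStable() and wake waiters. */
    static void wait_for_page_stable(struct page *page)
    {
            if (PageStable(page))
                    wait_on_page_bit(page, PG_stable);
    }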

The end result is that, while the form of the solution to the stable page performance issue is reasonably clear, there are still a few details to be dealt with. There appears to be enough interest in fixing this problem to get something worked out. Needless to say, that will not happen for the 3.8 development cycle, but having something in place for 3.9 looks like a reasonable goal.


Optimizing stable pages

Posted Dec 6, 2012 8:03 UTC (Thu) by djwong (subscriber, #23506) [Link]

At the moment, I'm trying to figure out if there's a sane way to fix jbd either by backporting what jbd2 does to flush out dirty data prior to committing a transaction, or by finding a way to have jbd set PG_writeback before calling submit_bh() on the file data.

Or by removing fs/ext3/. I suspect that would not be popular, however. ;)

Optimizing stable pages

Posted Dec 6, 2012 14:22 UTC (Thu) by Jonno (subscriber, #49613) [Link]

> Or by removing fs/ext3/.

Honestly, removing fs/ext2, fs/ext3 and fs/jbd is probably the only sane thing to do in the long run, as fs/ext4 and fs/jbd2 support everything they do and are better tested (at least on recent kernels; conservatives still using ext3 tend not to run -rc kernels).

When the "long run" comes is of course up for debate, but I would say "immediately after Greg's next -longterm announcement", giving conservative users a minimum of two years to prepare, while letting the rest of us go on without the baggage.

Optimizing stable pages

Posted Dec 6, 2012 19:56 UTC (Thu) by dlang (guest, #313) [Link]

There are still times when the best filesystem to use is ext2.

One prime example would be the temporary storage on Amazon Cloud machines. If the system crashes, all the data disappears, so there's no value in having a journaling filesystem, and in many cases ext3 and ext4 can have significant overhead compared to ext2.

Optimizing stable pages

Posted Dec 6, 2012 20:58 UTC (Thu) by bjencks (subscriber, #80303) [Link]

If you're truly not worried about data integrity, why not just add all that disk space as swap and use tmpfs? (I haven't tried this; it could be that it actually works terribly, but it seems like it *should* be the optimal solution)

Optimizing stable pages

Posted Dec 6, 2012 21:18 UTC (Thu) by dlang (guest, #313) [Link]

> If you're truly not worried about data integrity, why not just add all that disk space as swap and use tmpfs?

Swap has horrible data locality; depending on how things get swapped out, a single file could end up scattered all over the disk.

In addition, your approach puts the file storage in direct competition with all processes for memory; you may end up swapping out program data because your file storage 'seems' more important.

Disk caching has a similar pressure, but the kernel knows that cache data is cache, and that it can therefore be thrown away if needed. tmpfs data isn't in that category.

Optimizing stable pages

Posted Dec 6, 2012 22:38 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

ext4 can be used without journaling (you need to use tune2fs to set it up). Google added this feature for these use-cases specifically.

We've benchmarked it on Amazon EC2 machines. ext4 without journaling is faster than ext2. There are really no more use cases for ext2/3.

Optimizing stable pages

Posted Dec 6, 2012 23:06 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

ISTM that the other improvements like extents, hashed directory lookups, delayed allocation, and so on might already offset the journal overhead by a good bit.

Also - I haven't tried this though - shouldn't you be able to create an ext4 without a journal while keeping the other ext4 benefits? According to man tune2fs you can even remove the journal with -O^has_journal from an existing FS. The same is probably true for mkfs.ext4.

Optimizing stable pages

Posted Dec 7, 2012 10:22 UTC (Fri) by cesarb (subscriber, #6266) [Link]

But AFAIK, you can use an ext2 or ext3 filesystem with the ext4 filesystem driver, and it will work fine.

IIRC, the default Fedora kernel was configured to always use the ext4 code, even when mounting ext2/ext3 filesystems.

Optimizing stable pages

Posted Dec 8, 2012 22:39 UTC (Sat) by man_ls (guest, #15091) [Link]

> one prime example would be the temporary storage on Amazon Cloud machines. If the system crashes, all the data disappears

That is a common misconception, but it is not true. As this Amazon doc explains, data in the local instance storage is not lost on a reboot. Quoting that page:
However, data on instance store volumes is lost under the following circumstances:
  • Failure of an underlying drive
  • Stopping an Amazon EBS-backed instance
  • Terminating an instance
So it is not guaranteed but it is not ephemeral either: many instance types actually have their root on an instance store. Amazon teaches you to treat it as ephemeral so that users do not rely on it too much. But using ext2 on it is not a good idea unless it is truly ephemeral.

Optimizing stable pages

Posted Dec 8, 2012 23:28 UTC (Sat) by dlang (guest, #313) [Link]

According to the instructor in the class I've been in for the last three days, when an EC2 instance dies, nothing that you ever do will give you access to the data that you stored on the ephemeral drive. This is not EBS storage, this is the instance storage.

Ignoring what they say and just looking at it from a practical point of view:

The odds of any new EC2 instance you fire up starting on the same hardware, and therefore having access to the data, are virtually nonexistent.

If you can't get access to the drive again, journaling is not going to be any good at all.

Add to this the fact that they probably have the hypervisor either do some form of transparent encryption or make it so that reads of blocks you haven't written yet return all zeros (to prevent you from seeing someone else's data), and you have no reason to even try to use a journal on these drives.

EC2 (local) instance storage

Posted Dec 8, 2012 23:56 UTC (Sat) by man_ls (guest, #15091) [Link]

I am not sure what "dies" means in this context. If the instance is stopped or terminated, then the instance storage is lost. If the instance is rebooted then the same instance storage is kept. Usually you reboot machines which "die" (i.e. crash or oops), so you don't lose instance storage.

In short: any new EC2 instance will of course get a new instance storage, but the same instance will get the same instance storage.

I understand your last paragraph even less. Why do transparent encryption? Just use regular filesystem options (i.e. don't use FALLOC_FL_NO_HIDE_STALE) and you are good. I don't get what a journal has to do with it.

Again, keep in mind that many instance types keep their root filesystem on local instance storage. Would you run / without a journal? I would not.

EC2 (local) instance storage

Posted Dec 9, 2012 1:11 UTC (Sun) by dlang (guest, #313) [Link]

If you don't do any sort of encryption, then when a new instance mounts the drives, it would be able to see whatever was written to the drive by the last instance that used it.

I would absolutely run / without a journal if / is on media that I won't be able to access after a shutdown (a ramdisk for example)

I don't remember seeing anything in the AWS management console that would let you reboot an instance; are you talking about rebooting it from inside the instance? If you can do that, you don't need a journal because you can still do a clean shutdown; I don't consider the system to have crashed. I count a crash as being when the system stops without being able to do any cleanup (a kernel hang or power-off on traditional hardware).

EC2 (local) instance storage

Posted Dec 9, 2012 1:18 UTC (Sun) by man_ls (guest, #15091) [Link]

New instances should not see the contents of uninitialized (by them) disk sectors. That is the point of the recent discussion about FALLOC_FL_NO_HIDE_STALE. The kernel will not allow one virtual machine to see the contents of another's disk, or at least that is what I understand.

The AWS console has an option to reboot a machine, between "Terminate" and "Stop". You can also do it programmatically using EC2 commands, e.g. if the machine stops responding.

EC2 (local) instance storage

Posted Dec 9, 2012 1:50 UTC (Sun) by dlang (guest, #313) [Link]

Thanks for the correction about the ability to reboot an instance.

I don't think this is what FALLOC_FL_NO_HIDE_STALE is about. FALLOC_FL_NO_HIDE_STALE is about not zeroing blocks that the filesystem has not allocated before; but if you have a disk with a valid ext4 filesystem on it and plug that disk into another computer, you can just read the filesystem.

When you delete a file, the data remains on the disk and root can go access the raw device and read the data that used to be in a file.

By default, when a filesystem allocates a block to a new file, it zeroes out the data in that block; it's this step that FALLOC_FL_NO_HIDE_STALE lets you skip.

If you really had raw access to the local instance storage without the hypervisor doing something, then you could just mount whatever filesystem the person before you left there. To avoid this, Amazon would need to wipe the disks, and since it takes a long time to write a TB or so of data (even on SSDs), I'm guessing that they do something much easier, like some sort of encryption to make it so that one instance can't see data written by a prior instance.

EC2 (local) instance storage

Posted Dec 9, 2012 12:04 UTC (Sun) by man_ls (guest, #15091) [Link]

When Amazon EC2 creates a new instance, it allocates a new instance storage with its own filesystem. This process includes formatting the filesystem, and sometimes copying files from the AMI (image file) to the new filesystem. So any previous filesystems are erased. It is here that zeroing unallocated blocks from the previous filesystem comes into play, which is what FALLOC_FL_NO_HIDE_STALE would mess up.

I don't know how Amazon (or the hypervisor) prevents access to the raw disk, where unallocated sectors might be found and scavenged even if the filesystem is erased. I guess they do something clever or we would have heard about people reading Zynga's customer database from a stale instance.

EC2 (local) instance storage

Posted Dec 9, 2012 16:01 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Uhm. Nope.

Amazon doesn't care about your filesystem. AMIs are just dumps of block devices - Amazon simply unpacks them onto a suitable disk. You're free to use any filesystem you want (there might be problems with the bootloader, but they are not insurmountable).

You certainly can access the underlying disk device.

EC2 (local) instance storage

Posted Dec 10, 2012 1:07 UTC (Mon) by dlang (guest, #313) [Link]

Amazon doesn't put a filesystem on the device, you do.

> I don't know how Amazon (or the hypervisor) prevents access to the raw disk, where unallocated sectors might be found and scavenged even if the filesystem is erased. I guess they do something clever or we would have heard about people reading Zynga's customer database from a stale instance.

This is exactly what I'm talking about.

There are basically three approaches to doing this without the cooperation of the OS running on the instance (which you don't have):

1. the hypervisor zeros out the entire drive before the hardware is considered available again.

2. the hypervisor encrypts the blocks with a random key for each instance; lose the key and reading the blocks just returns garbage

3. the hypervisor tracks what blocks have been written to and only returns valid data for those blocks.

I would guess #1 or #2, and after thinking about it for a while I would not bet either way.

#1 is simple, but it takes a while (unless the drive has direct support for trim and effectively implements #3 in the drive; SSDs may do this)

#2 is more expensive, but it allows the system to be re-used faster

EC2 (local) instance storage

Posted Dec 10, 2012 2:55 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

They are using #3. Raw device reads on uninitialized areas return zeroes.

EC2 (local) instance storage

Posted Dec 10, 2012 3:27 UTC (Mon) by dlang (guest, #313) [Link]

That eliminates #2, but it could be #1 or #3.

It seems like trying to keep a map of whether each block has been written to would be rather expensive to do at the hypervisor level, particularly if you are talking about large drives.

Good to know that you should get zeros for uninitialized sectors.

EC2 (local) instance storage

Posted Dec 10, 2012 3:30 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

#1 is unlikely because local storage is quite large (4TB on some nodes). It's not hard to keep track of dirtied blocks; they need that to support snapshots on EBS volumes anyway.

EC2 (local) instance storage

Posted Dec 10, 2012 6:18 UTC (Mon) by bjencks (subscriber, #80303) [Link]

Just to be clear, there are two different ways of initializing storage: root filesystems are created from a full disk image that specifies every block, so there are no uninitialized blocks to worry about, while non-root instance storage and fresh EBS volumes are created in a blank state, returning zeros for every block.

It's well documented that fresh EBS volumes keep track of touched blocks; to get full performance on random writes you need to touch every block first. That implies to me that they don't even allocate the block on the back end until it's written to.

Not sure how instance storage initialization works, though.

EC2 (local) instance storage

Posted Dec 10, 2012 6:34 UTC (Mon) by dlang (guest, #313) [Link]

EBS storage is not simple disks; the size flexibility and performance you can get cannot be supported by providing raw access to drives or drive arrays.

As you say, instance local storage is different.

Optimizing stable pages

Posted Dec 9, 2012 1:40 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

We use instance storage on the new SSD-based nodes for very fast PostgreSQL replicated nodes. It indeed survives reboots and oopses.

It does not survive stopping the instance through the Amazon EC2 API.

Optimizing stable pages

Posted Dec 10, 2012 16:51 UTC (Mon) by butlerm (subscriber, #13312) [Link]

Given the incredibly severe performance issues the stable pages feature may incur, it seems like the optimal long-term solution would be to add copy-on-write capability for pages that are under writeout.

Meaning that when a thread attempts to modify such a page, a duplicate physical page is created, the page structure and PTE are updated accordingly, and ownership of the physical page under writeout is transferred to the fs or device doing the writeout, for reclamation when the writeout completes. That would be far superior in most cases to stalling a thread for an arbitrary period, something that is the death of anything resembling real-time response.

It would also be markedly superior to a copy-always policy by the FS or storage layer concerned. The best of both worlds, essentially.

Optimizing stable pages

Posted Dec 10, 2012 17:27 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

CoW using VM tricks is quite often _inferior_ in speed to simple copying.

Optimizing stable pages

Posted Dec 10, 2012 18:11 UTC (Mon) by dlang (guest, #313) [Link]

> CoW using VM tricks is quite often _inferior_ in speed to simple copying.

COW is slower if the copy actually needs to take place, but faster if the copy is never needed.

The question is how likely are you to need to do the copy.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds