
Facebook and the kernel


By Jake Edge
March 26, 2014
2014 LSFMM Summit

As one of the plenary sessions on the first day of the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Btrfs developer Chris Mason presented on how his new employer, Facebook, uses the Linux kernel. He shared some of the eye-opening numbers that demonstrate just how much processing Facebook does using Linux, along with some of the "pain points" the company has with the kernel. Many of those pain points overlap with the user-space database woes presented in an earlier session (and to be discussed further at the Collaboration Summit that follows LSFMM), he said, so he mostly concentrated on other problem areas.

Architecture and statistics

In a brief overview of Facebook's architecture, he noted that there is a web tier that is CPU- and network-bound; it handles requests from users and sends back the replies. Behind that is a memcached tier that caches data, mostly the results of MySQL queries. Those queries are handled by a storage tier that is a collection of several different database systems: MySQL, Hadoop, RocksDB, and others.

[Chris Mason]

Within Facebook, anyone can look at and change the code in its source repositories. The facebook.com site has its code updated twice daily, he said, so the barrier to getting new code into the hands of users is low. Those changes can be fixes or new features.

As an example, he noted the "Look Back" videos, which Facebook created for each user by reviewing all of their posts to the service; they added a huge amount of data and required a lot more network bandwidth. The process of creating and serving all of those videos was the topic of a Facebook engineering blog post. In all, 720 million videos were created, which required an additional 11 petabytes of storage and consumed 450 Gb/second of peak network bandwidth as people viewed the videos. The Look Back feature was conceived, provisioned, and deployed in only 30 days, he said.

The code changes quickly, so when performance or other problems crop up, he and other kernel developers can tell others in the company that "you're doing it wrong". In fact, he said, "they love that". It does mean that he has to come up with concrete suggestions on how to do it right, but Facebook is not unwilling to change its code.

Facebook runs a variety of kernel versions. The "most conservative" hosts run a 2.6.38-based kernel. Others run the 3.2 stable series with roughly 250 patches. Other servers run the 3.10 stable series with around 60 patches. Most of the patches are in the networking and tracing subsystems, with a few memory-management patches as well.

One thing that seemed to surprise Mason was the high failure tolerance that the Facebook production system has. He mentioned the 3.10 pipe race condition that Linus Torvalds fixed. It is a "tiny race", he said, but Facebook was hitting it (and recovering from it) 500 times per day. The architecture of the system is such that it could absorb that kind of failure rate without users noticing anything wrong.

Pain points

Mason asked around within Facebook to try to determine the worst problems the company has with the kernel. In the end, two features were mentioned the most frequently: stable pages and the Completely Fair Queuing (CFQ) I/O scheduler. "I hope we never find those guys", he said with a laugh, since Btrfs implements stable pages. In addition, James Bottomley noted that Facebook already employs another CFQ developer (Jens Axboe).

Another problematic area for Facebook is surprises in buffered I/O latency, especially for append-only database files. Most of the time, those writes complete quickly, but sometimes they are quite slow. He would like to see the kernel avoid latency spikes like that.

He would also like to see kernel-style spinlocks made available to user space. Rik van Riel suggested that perhaps POSIX locks could use adaptive locking, which would spin for a short time and then switch to sleeping if the lock did not become available quickly. The memcached tier has a kind of user-space spinlock, Mason said, but it is "very primitive compared to the kernel".
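
Van Riel's adaptive-locking suggestion has a rough existing analogue in glibc: the non-portable PTHREAD_MUTEX_ADAPTIVE_NP mutex type spins briefly on contention before sleeping on a futex. A minimal sketch follows; this is glibc's approximation, not the kernel-quality spinning Mason asked for:

    #define _GNU_SOURCE
    #include <pthread.h>

    static pthread_mutex_t lock;

    static void setup_adaptive_lock(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* Spin briefly in user space on contention, then sleep on a futex. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }

Once initialized this way, ordinary pthread_mutex_lock()/pthread_mutex_unlock() calls get the spin-then-sleep behavior; the spin count is chosen by the library, not the application.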

Fine-grained I/O priorities are another wish-list item for Facebook (and for the PostgreSQL developers as well). There are always cleaner and compaction threads that need to do I/O, but they should not hold off the higher-priority "foreground" I/O. Mason was asked how the priorities would be specified: per I/O operation or per file range, for example. He was also asked how fine-grained the priorities needed to be. Either way of specifying them would be reasonable, he said, and Facebook really only needs two (or a few) priority levels: low and high.

The subject of ionice was raised again. One of the problems with it as a solution is that it only works with the CFQ scheduler, which Facebook disables. Bottomley suggested making ionice work with all of the schedulers, which Mason said might help. In order to do that, though, Ted Ts'o noted, the writeback daemon would have to understand the ionice settings.
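
For reference, ionice is a thin wrapper around the ioprio_set() system call, which has no glibc wrapper. A hedged sketch of calling it directly; the constants mirror the kernel's linux/ioprio.h definitions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_CLASS_SHIFT  13
    #define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))
    #define IOPRIO_WHO_PROCESS  1
    #define IOPRIO_CLASS_IDLE   3

    int main(int argc, char **argv)
    {
        pid_t pid = (argc > 1) ? atoi(argv[1]) : getpid();

        /* Demote the target to the idle class: it only gets disk time when
         * nothing else wants it. As discussed above, only CFQ honors this. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0) {
            perror("ioprio_set");
            return 1;
        }
        return 0;
    }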

Another problem area is logging. Facebook logs a lot of data, and the logging workloads have to use fadvise() and madvise() to tell the kernel that those pages should not be saved in the page cache. "We should do better than that", he said. Van Riel suggested that the page-replacement patches in recent kernels may make things better. Mason said that Facebook does not mind explicitly telling the kernel which processes are sequentially accessing the log files, but continually calling *advise() seems excessive.
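
A minimal sketch of the kind of *advise() dance Mason was describing, for a hypothetical append-only log writer; the file name, chunk size, and flushing strategy are illustrative, not Facebook's code:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char line[] = "log entry\n";
        int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        off_t start = lseek(fd, 0, SEEK_END);
        if (write(fd, line, strlen(line)) < 0) { perror("write"); return 1; }

        /* DONTNEED only drops clean pages, so flush the range first ... */
        sync_file_range(fd, start, strlen(line),
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
        /* ... then evict it from the page cache. */
        posix_fadvise(fd, start, strlen(line), POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
    }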

Josef Bacik has also been working on a small change to Btrfs to allow rate-limiting buffered I/O. It was easy to do in Btrfs, Mason said, but if the idea pans out, the feature would move elsewhere for more general availability. Jan Kara was concerned that limiting only buffered I/O would be difficult, since there are other kinds of I/O bound for the disk at any given time. Mason agreed, saying that the solution would not be perfect but might help.

Bottomley noted that ionice is an existing API that should be reused to help with these kinds of problems. Similar discussions of using other mechanisms in the past have run aground on "painful arguments about which API is right", he said. Just making balance_dirty_pages() aware of the ionice priority may solve 90% of the problem. Other solutions can be added later.
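
As a toy model only (this is not kernel code, and the scaling factors are invented), Bottomley's suggestion might amount to scaling each writer's dirty-page budget by its I/O priority class inside the balance_dirty_pages() throttling path:

    /* Toy model, not actual kernel code: an idle-class writer would hit
     * the writeback throttle much sooner than a realtime one. */
    enum ioprio_class { IOPRIO_RT, IOPRIO_BE, IOPRIO_IDLE };

    static unsigned long dirty_budget(enum ioprio_class class,
                                      unsigned long global_limit)
    {
        switch (class) {
        case IOPRIO_RT:   return global_limit;      /* full budget */
        case IOPRIO_IDLE: return global_limit / 8;  /* throttle early */
        case IOPRIO_BE:
        default:          return global_limit / 2;  /* middle ground */
        }
    }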

Mason explained that Facebook stores its logs in a large Hadoop database, but the tools for finding problems in those logs are fairly primitive (essentially grep). He said that he would "channel Lennart [Poettering] and Kay [Sievers]" briefly to wish for a way to tag kernel messages. Bottomley's suggestion that Mason bring it up with Linus Torvalds at the next Kernel Summit was met with widespread chuckling.

Danger tier

While 3.10 is fairly close to the latest kernels, Mason would like to run even more recent kernels. To that end, he is creating something he calls the "danger tier". He ported the 60 patches that Facebook currently adds to 3.10.x to the current mainline Git tree and is carving out roughly 1000 machines to test that kernel in the web tier. He will be able to gather lots of performance metrics from those systems.

As a simple example of the kind of data he can gather, he put up a graph of request response times (without any units) gathered over three days. It showed a steady average response time all the way at the bottom, as well as the response times of the ten worst systems. Those not only had large spikes, but their baseline was also roughly twice the average. He can determine which systems those are, ssh in, and try to diagnose what is happening with them.

He said that was just an example. Eventually he will be able to share more detailed information that can be used to try to diagnose problems in newer kernels and get them fixed more quickly. He asked for suggestions of metrics to gather for the future. With that, his session slot expired.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]




Facebook and the kernel

Posted Mar 26, 2014 23:56 UTC (Wed) by ballombe (subscriber, #9523) [Link]

I eagerly wait for the talk: E. Snowden: 'The NSA and the kernel'

Facebook and the kernel

Posted Mar 27, 2014 18:38 UTC (Thu) by jospoortvliet (guest, #33164) [Link]

That must be awesome. I bet they do some interesting stuff there in that massive, secret data center of theirs. Or, centerS :D

Facebook and the kernel

Posted Mar 28, 2014 0:28 UTC (Fri) by Lennie (subscriber, #49641) [Link]

The NSA does seem to use OpenStack and they did give a talk on that, so who knows:

http://www.youtube.com/watch?v=NgahKksMZis

Facebook and the kernel

Posted Mar 27, 2014 3:00 UTC (Thu) by zblaxell (subscriber, #26385) [Link]

When I'm wearing my developer hat, API calls like madvise and fadvise are fun and useful for specialized processes that know what they're doing and how they fit into the rest of the system's workload.

When I'm wearing my administrator hat, I'd really like an easy way to impose such decisions from outside. I'd like to say to the kernel "you should never cache anything this PID or its children write", or "you should cache everything that PID reads so aggressively that you'll end up swapping out other processes to keep it." Ditto for mlock() and maybe even open() with the O_SYNC flag. I'd love to have a version of mlockall() that takes a PID argument and can be inherited over fork() (or a cgroup option that does something similar to the entire cgroup).

Facebook and the kernel

Posted Mar 27, 2014 9:09 UTC (Thu) by pushersk (guest, #95128) [Link]

Facebook and the kernel

Posted Mar 27, 2014 16:53 UTC (Thu) by zblaxell (subscriber, #26385) [Link]

Cute...but LD_PRELOAD hacks don't count. I want to do fadvise to processes after they've started, e.g. when I notice a performance problem, or on a schedule. I want to do mlock from privileged processes to affect unprivileged ones. I suppose I could make an LD_PRELOAD hack to spawn a thread that listens to a socket or something, but for the effort it's probably easier to just modify the kernel to add some new syscalls instead.
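
For readers unfamiliar with the technique being dismissed, a hedged sketch of such an LD_PRELOAD hack: interpose close() and drop the descriptor's pages from the page cache on the way out. This mimics what tools like nocache do, not any specific code from this thread; compile with -shared -fPIC -ldl:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Interposed close(): evict whatever clean pages this descriptor left
     * in the page cache, then call the real close(). Note the limitation
     * zblaxell points out: the policy is fixed at program start. */
    int close(int fd)
    {
        static int (*real_close)(int);

        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        return real_close(fd);
    }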

For rsync and backups I use a cgroup RAM limit, so the rsync gets a small amount of cache for itself but won't flood the entire RAM with its reading. cgroup charges cache memory to the process that read the page, so rsync can read pages from other cgroups' caches but won't remove their cache pages as rsync reads other data. I can also freeze and thaw the backup cgroup when I want it to temporarily stop executing (e.g. because I'm working on the machine).
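
A sketch of that setup using the cgroup-v1 memory controller interface of the era; the group name "backup" and the 256MB limit are illustrative, and the program must run as root:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fputs(val, f);
        return fclose(f);
    }

    int main(void)
    {
        char pid[32];

        /* Create the group (harmless if it already exists). */
        mkdir("/sys/fs/cgroup/memory/backup", 0755);

        /* Cap page cache plus anonymous memory for the whole group. */
        write_str("/sys/fs/cgroup/memory/backup/memory.limit_in_bytes",
                  "268435456");    /* 256 MB, illustrative */

        /* Move the calling process (about to exec rsync, say) into it. */
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        write_str("/sys/fs/cgroup/memory/backup/tasks", pid);

        return 0;
    }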

cgroups aren't enough, though. I can't do a global POSIX_FADV_RANDOM or mlockall(MCL_FUTURE) through a cgroup.

Facebook and the kernel

Posted Mar 27, 2014 17:23 UTC (Thu) by dag- (guest, #30207) [Link]

I agree. Similar to nice and ionice, a way to disable caching for a (running) process would be very useful. A necessity, really.

And it would also be useful for benchmarking purposes, to see what the impact is for given workloads.

Facebook and the kernel

Posted Mar 27, 2014 22:28 UTC (Thu) by giraffedata (guest, #1954) [Link]

a way to disable caching for a (running) process would be very useful.

I believe what the article refers to is something in between. No administrator has to hold the kernel's hand and tell it what to do, but no developer has to make an application give the kernel play-by-play descriptions of what the application is doing so the kernel knows what to do.

The middle ground is the application tells the kernel what it is generally doing, for example, "File X will normally be accessed in a log-type access pattern." The kernel figures out for itself that that means it should not waste cache space on data just written to the file, among other things.

Of course, that's not to say the direct administrator control wouldn't also be great, particularly for working around cases where the kernel doesn't yet have the required intelligence.

Facebook and the kernel

Posted Apr 10, 2014 20:08 UTC (Thu) by dchichkov (guest, #90878) [Link]

I'd love to say a bit more to the kernel: "Stay out of the way of that particular tight-loop user-space process that is running on an isolated CPU. Do not call it, it will call you. Do not try to steal or remap physical memory from it. Do not do NUMA rebalancing, it will cause soft page faults. Try to accelerate and extend RCU grace periods as much as you can. Do not interfere with L1 and L2, and try not to pollute L3. If I'm looking at 'perf top' for that particular isolated CPU, you shouldn't be anywhere near the top for either cache or CPU counters. Nor should you interrupt that process for more than 100,000 CPU cycles."

Facebook and the kernel

Posted Apr 10, 2014 20:21 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

I believe Linux already supports a sizeable slice of that list thanks to mlockall(), sched_setscheduler(), and sched_setaffinity().
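
For illustration, a sketch combining those three calls: pin the task to one (presumably isolated) CPU, give it a realtime scheduling class, and lock its memory. The CPU number and priority are arbitrary choices for the example:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 50 };

        CPU_ZERO(&set);
        CPU_SET(3, &set);                 /* e.g. a CPU set aside via isolcpus= */
        if (sched_setaffinity(0, sizeof(set), &set) < 0)
            perror("sched_setaffinity");

        if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
            perror("sched_setscheduler"); /* needs CAP_SYS_NICE */

        /* Fault in and pin current and future mappings. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
            perror("mlockall");

        /* ... tight loop goes here ... */
        return 0;
    }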

Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 4:02 UTC (Wed) by vomlehn (guest, #45588) [Link]

Adaptive spinlocks in user space are a doable thing; a friend of mine and I did some work on this fifteen years ago. I'm pretty sure I'm forgetting some details, but the approach was something like this:

The question that we looked at was: how long do you spin before you give up and go into the kernel? Our short answer was to assume that spinlocks are only held for a short while, so the biggest thing you want to know is whether the task currently holding the lock is currently running. A process with a lock does two things:

  • It registers a section of memory to hold status of other tasks with which locks might be shared. This list is, of necessity, dynamic.
  • When it acquires the lock, it stores its PID in the lock. As I recall, we trusted processes to be honest in reporting the PID, but we may have taken a somewhat different approach.

If the process acquires the lock on the first try, all is well. If not, you look at the PID stored in the lock and index into the task status memory (actually, you probably don't use the PID per se, since PIDs can be pretty sparse). If the task is running, you loop again. If it is not, you make a system call to wait on the spinlock.

When a task fails to acquire a lock and enters the kernel, the kernel makes a note that there is a task waiting for the lock. When the lock holder is ready to release the lock, it checks to see if something is waiting. If not, it zeros out the lock and goes on. If a task is waiting, it makes a system call to release the lock.
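
A speculative reconstruction of that scheme in C11 might look like the sketch below. The task_status[] array, the dense task IDs, and the spinlock_wait()/spinlock_release()/lock_has_waiters() calls are all hypothetical stand-ins for the interfaces described above:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_SPINS    1000
    #define TASK_RUNNING 1

    /* Kernel-published, read-only status array indexed by dense task ID
     * (hypothetical), plus the hypothetical kernel half of the scheme. */
    extern volatile const char task_status[];
    extern void spinlock_wait(atomic_int *lock);
    extern void spinlock_release(atomic_int *lock);
    extern bool lock_has_waiters(atomic_int *lock);

    void lock_acquire(atomic_int *lock, int my_task_id)
    {
        int spins = 0;

        for (;;) {
            int holder = 0;
            /* Fast path: store our task ID into a free (zero) lock word. */
            if (atomic_compare_exchange_strong(lock, &holder, my_task_id))
                return;

            /* Holder is on a CPU: spinning may pay off, up to a point. */
            if (task_status[holder] == TASK_RUNNING && ++spins < MAX_SPINS)
                continue;

            /* Holder is preempted, or our patience ran out: sleep. */
            spinlock_wait(lock);
            spins = 0;
        }
    }

    void lock_drop(atomic_int *lock)
    {
        if (lock_has_waiters(lock))
            spinlock_release(lock); /* kernel clears the word, wakes a waiter */
        else
            atomic_store(lock, 0);  /* uncontended: just zero it out */
    }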

As you can see, this is highly optimized on the assumption that spinlock contention is low. But, if it's not, you shouldn't be using spinlocks. This approach yielded some amazingly flat curves, however, even for some pretty high contention rates.

Note that checking to see whether a task that is holding a lock is currently running can be augmented if other information is known. You could, possibly, look and see how much longer it has until its scheduler quantum expires, or whether it is currently processing a signal, or anything else that can provide a hint as to whether the task is likely to run until it releases the lock. Likewise, metrics on how long locks are likely to be held can provide heuristics useful in deciding whether more CPU time will be spent spinning or doing a system call. And, on the latter issue, there are tradeoffs of CPU time versus latency.

Lots and lots of things can be done here, even by the intrepid novice.

Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 12:35 UTC (Wed) by intgr (subscriber, #39733) [Link]

> so the biggest thing you want to know is whether the task currently holding the lock is currently running

So the point is to figure out if the owning PID is currently running or preempted? How do you find that out?

If you ask the kernel via /proc or some other means, you've already paid the cost of a syscall or few. Doesn't that mean you already incur just as much overhead as you would using normal synchronization provided by the kernel, like futexes? Or am I missing something?

Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 17:36 UTC (Wed) by vomlehn (guest, #45588) [Link]

Good question. As I said, it has been a while, but as I recall, we didn't actually use PIDs because they are sparse. Instead, we assigned each task a dense unique task ID. Then, on initialization, you request access to a read-only memory region from the kernel that contains an array indexed by task ID. When a task's status changes, such as when it starts running, the appropriate array element is set to indicate that. So, checking on the status of a task is simply a read of a memory location. Though in our implementation the task status just indicated whether the task was running or not, it could be more sophisticated than that.

Note that, from the security standpoint, this scheme does have a slow information leak in that tasks can signal to each other by suspending or resuming execution, and it does allow some information about all processes to leak to all other processes. Some thought needs to be given to this. I think it's not a big deal but I may be wrong. Also, security is one motivation for sparse PIDs. If the dense task ID has limited scope, I think it should not pose much of a security risk.

Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 23:34 UTC (Wed) by valkyrie (guest, #96637) [Link]

As the other contributor mentioned by vomlehn (Hi!), I remember a few more of the details.

We were originally asked to look at a method to tell the kernel "I am the lock holder, and nobody can do anything until I'm done, so run me exclusively." I think we can all agree that was probably the wrong way to look at the problem.

Since the patent application was denied and the text has been freely viewable for years, I don't see any harm in filling in the gaps in the description above. We had it working on FreeBSD and Linux (2.4.18-ish, I think). Our experiments seemed to indicate that it solved the class of problems we set out to work on. The code never made it out of the lab, unless there's a backup sitting in a landfill somewhere. I'm sure an interested party could probably take the design concepts and re-implement this without too much trouble if it is useful.

First, as vomlehn noted, the most important piece of information when trying to decide how to contend (spin, sleep) is the status of the lock holder. Second, if the lock holder is not running, contenders need an efficient method to get out of the way. Third, even though I feel it's implied by the word "efficient," it's worth pointing out that effort must be made to minimize context switches.

As my cohort mentioned, we did trust the lock holder to write its own identifier (index, handle, what-have-you) as the lock value, because we didn't want to expose task list memory. I don't think we wrote the PIDs into the structure, because that could be read by other contenders and used to harass the lock holder.

Our solution required that tasks agree on a shared memory region that would hold:

1) Spinlocks
2) Process status information

It isn't necessary to teach the kernel about user-space spinlocks. Instead, our solution provides primitives that allow a task to inform the scheduler that it isn't interested in running until a memory location it can read contains (or does not contain) a specific value. We provided another primitive that allowed processes to inform the scheduler that it should publish their status changes (running, waiting, dead) in a specified memory location in their address space. In this way, tasks can opt in to sharing certain information in a shared memory block, with all of the protections (or lack thereof) that shared memory usually enjoys. This information could be used for many purposes (DMA, IPC, etc.), but our experiments focused on implementing efficient spinlocks with these primitives. The addresses supplied by the task were converted to kernel space by methods which should be obvious, and stored in the task's structure.

Therefore, a contender can determine the state of the lock holder. If the lock holder is dead, the lock can be safely cleared and acquired atomically. If the lock holder is running, the contender might choose to spin for a time. If the lock holder is waiting, or if it is running but the contender's patience has expired, the contender might suspend itself (via one of our primitives) until the specific location (that is, the spinlock) contains the "available" value. Alternatively, it might choose to suspend itself until the location no longer contains the handle of the current lock holder. Whenever the scheduler would normally wake a process which has suspended itself in this manner, the value at the specified address is checked first. If the value matches the criteria, then the task has a non-zero chance of acquiring the resource (the lock holder has released the lock or a new lock holder has acquired the lock), and the contender is run as normal. If the value does not match the criteria, the task has virtually no chance of acquiring the resource, and is skipped. It may be germane to mention that the scheduler never knows WHY the criteria are interesting. The criteria either match or they don't, and the process is run (or not) accordingly.

One caveat: our prototype addressed neither the case where the permissions on the shared segment are changed, nor the case where the segment is removed. A production implementation might want to consider these cases for security reasons.
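
As an aside, the "suspend until this location contains (or stops containing) a value" primitive described above is close to what Linux futexes now provide. A hedged sketch of the equivalent sleep and wake using the raw futex(2) system call, which has no glibc wrapper:

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Sleep while *addr still holds `val` (e.g. the lock holder's handle).
     * The kernel re-checks the value before sleeping, avoiding lost wakeups. */
    static void wait_while_held(atomic_int *addr, int val)
    {
        while (atomic_load(addr) == val)
            syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
    }

    /* Release the lock word, then wake one contender. */
    static void release_and_wake(atomic_int *addr)
    {
        atomic_store(addr, 0);
        syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }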

Whatever you do, don't ever read the (denied) patent application (US 20040103414 A1). It's incomprehensible and does not in any way resemble the write-up we gave the lawyers. Reading it will only rot your brain and may wake the Great Old One. Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.

