
TCP small queues


By Jonathan Corbet
July 17, 2012
The "bufferbloat" problem is the result of excessive buffering in the network stack; it leads to long latencies and poor reliability in the network as a whole. Fixing it is a matter of buffering less data in each system between any two endpoints—a task that sounds simple, but proves to be more challenging than one might expect. It turns out that buffering can show up in many surprising places in the networking stack; tracking all of these places down and fixing them is not always easy.

A number of bloat-fighting changes have gone into the kernel over the last year. The CoDel queue management algorithm works to prevent packets from building up in router queues over time. At a much lower level, byte queue limits put a cap on the amount of data that can be waiting to go out a specific network interface. Byte queue limits work only at the device queue level, though, while the networking stack has other places—such as the queueing discipline level—where buffering can happen. So there would be value in an implementation that could limit buffering at levels above the device queue.

Eric Dumazet's TCP small queues patch looks like it should be able to fill at least part of that gap. It limits the amount of data that can be queued for transmission by any given socket regardless of where the data is queued, so it shouldn't be fooled by buffers lurking in the queueing, traffic control, or netfilter code. That limit is set by a new sysctl knob found at:

    /proc/sys/net/ipv4/tcp_limit_output_bytes

The default value of this limit is 128KB; it could be set lower on systems where latency is the primary concern.
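
For illustration, the knob can be read and written like any other sysctl file; the short userspace sketch below is not part of the patch, and the 64KB value is just an example of a lower setting for a latency-sensitive host:

    /* Hypothetical example: query the output limit and lower it to 64KB.
     * Writing requires root privileges; error handling is kept minimal. */
    #include <stdio.h>

    int main(void)
    {
        const char *knob = "/proc/sys/net/ipv4/tcp_limit_output_bytes";
        long limit;
        FILE *f = fopen(knob, "r");

        if (!f || fscanf(f, "%ld", &limit) != 1) {
            perror("tcp_limit_output_bytes");
            return 1;
        }
        fclose(f);
        printf("current limit: %ld bytes\n", limit);

        f = fopen(knob, "w");                 /* needs root */
        if (!f || fprintf(f, "%d\n", 65536) < 0) {
            perror("tcp_limit_output_bytes");
            return 1;
        }
        fclose(f);
        return 0;
    }

Setting net.ipv4.tcp_limit_output_bytes in /etc/sysctl.conf would accomplish the same thing persistently across reboots.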

The networking stack already tracks the amount of data waiting to be transmitted through any given socket; that value lives in the sk_wmem_alloc field of struct sock. So applying a limit is relatively easy; tcp_write_xmit() need only look to see if sk_wmem_alloc is above the limit. If that is the case, the socket is marked as being throttled and no more packets are queued.
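
In rough outline, the check on the transmit path might look like the sketch below; the helper name, the TSQ_THROTTLED flag, and the tsq_flags field are written to match the general shape of the patch, but they should be read as illustrative rather than as the actual code:

    /*
     * Illustrative sketch of the limit check made from the transmit path
     * (e.g., from tcp_write_xmit()).  TSQ_THROTTLED and tsq_flags stand in
     * for what the patch adds to struct tcp_sock; the function name is
     * invented for this example.
     */
    #include <net/tcp.h>

    static bool tcp_output_over_limit(struct sock *sk)
    {
        /* sk_wmem_alloc: bytes already queued for transmission on this socket */
        if (atomic_read(&sk->sk_wmem_alloc) > sysctl_tcp_limit_output_bytes) {
            /* Throttle: no more packets are queued for this socket until
             * some of its pending data has been freed. */
            set_bit(TSQ_THROTTLED, &tcp_sk(sk)->tsq_flags);
            return true;
        }
        return false;
    }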

The harder part is figuring out when some space opens up and more packets can be added to the queue. Queue space becomes free when a previously queued packet is freed, so Eric's patch overrides the normal struct sk_buff destructor when an output limit is in effect; the new destructor can check whether it is time to queue more data for the relevant socket. The catch is that this destructor can be called from deep within the network stack with important locks already held, so it cannot queue new data directly; instead, Eric had to add a new tasklet to do the actual job of queuing new packets.
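
A simplified sketch of that arrangement appears below; the function names, the tsq_node linkage, and the locking are illustrative assumptions (the real code must, among other things, take the socket lock before transmitting), not the patch as merged:

    /*
     * Simplified sketch of the completion side.  It assumes the tsq_flags
     * and tsq_node fields that the patch adds to struct tcp_sock; names
     * and locking details are illustrative only.
     */
    #include <linux/interrupt.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <net/tcp.h>

    static LIST_HEAD(tsq_pending);          /* throttled sockets ready for more data */
    static DEFINE_SPINLOCK(tsq_pending_lock);
    static struct tasklet_struct tsq_tasklet;

    /* Replacement write-side skb destructor: it may run with locks held,
     * so it only records that the socket can transmit again and defers
     * the real work to the tasklet. */
    static void tcp_small_queue_wfree(struct sk_buff *skb)
    {
        struct tcp_sock *tp = tcp_sk(skb->sk);
        unsigned long flags;

        if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags)) {
            spin_lock_irqsave(&tsq_pending_lock, flags);
            list_add(&tp->tsq_node, &tsq_pending);
            spin_unlock_irqrestore(&tsq_pending_lock, flags);
            tasklet_schedule(&tsq_tasklet);
        }
        sock_wfree(skb);                    /* usual write-memory accounting */
    }

    /* Runs in softirq context, free of the destructor's locking constraints. */
    static void tsq_tasklet_func(unsigned long data)
    {
        LIST_HEAD(ready);
        struct tcp_sock *tp, *next;
        unsigned long flags;

        spin_lock_irqsave(&tsq_pending_lock, flags);
        list_splice_init(&tsq_pending, &ready);
        spin_unlock_irqrestore(&tsq_pending_lock, flags);

        list_for_each_entry_safe(tp, next, &ready, tsq_node) {
            list_del(&tp->tsq_node);
            /* Re-enter the TCP write path for this socket here; the real
             * code must take the socket lock first (omitted in this sketch). */
        }
    }

    /* One-time setup, e.g. at TCP initialization time. */
    static void tsq_init(void)
    {
        tasklet_init(&tsq_tasklet, tsq_tasklet_func, 0);
    }

Splitting the work this way keeps the destructor cheap and safe to call from any context, while the tasklet pushes out more packets from a context where the necessary locks can be taken.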

It seems that the patch is having the intended result:

Results on my dev machine (tg3 nic) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth. I no longer have 3MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes.

He also ran some tests over a 10Gb link and was able to get full wire speed, even with a relatively small output limit.

There are some outstanding questions, still. For example, Tom Herbert asked about how this mechanism interacts with more complex queuing disciplines; that question will take more time and experimentation to answer. Tom also suggested that the limit could be made dynamic and tied to the lower-level byte queue limits. Still, the patch seems like an obvious win, so it has already been pulled into the net-next tree for the 3.6 kernel. The details can be worked out later, and the feature can always be turned off by default if problems emerge during the 3.6 development cycle.




TCP small queues

Posted Jul 19, 2012 5:45 UTC (Thu) by butlerm (subscriber, #13312) [Link]

This looks great. I have one question though: shouldn't this be made into generic infrastructure that can be used by other transport protocols? I imagine this would be helpful for UDP, SCTP, and DCCP sockets as well. The limit probably doesn't even need to be per protocol.

TCP small queues

Posted Jul 21, 2012 0:18 UTC (Sat) by raven667 (subscriber, #5198) [Link]

It's good that this is being worked on, but I'm concerned that there may be side effects in this approach: small fixed-size buffers followed by some magic auto-tuning buffers. CoDel achieves its robust simplicity by not worrying about buffer sizes at all but instead by tracking latency, because getting an algorithm right to pick an appropriate buffer size per flow proved impossible. It seems like they might be going down a path that is already known not to work, or maybe that experience doesn't apply at this layer of the stack.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds