
Bulk network packet transmission


By Jonathan Corbet
October 8, 2014
One of the keys to good performance on contemporary systems is batching — getting a lot of work done relative to a given fixed cost. If, for example, a lock must be acquired to do one unit of work in a specific subsystem, doing multiple units of work while the lock is held will reduce the overall overhead of the system. Much of the scalability work that has been done in recent years has, in some way, related to increasing batching where possible. Some recent changes in the networking subsystem show that batching can improve performance there as well.

Every time a packet is transmitted over the network, a sequence of operations must be performed. These include acquiring the lock for the queue of outgoing packets, passing a packet to the driver, putting the packet in the device's transmit queue, and telling the device to start transmitting. Some of those operations are inherently per-packet, but others are not. The acquisition of the queue lock could be amortized across multiple packet transmissions, for example, and the act of telling the device to start transmission may be expensive indeed. It can involve hardware operations or, even, on some systems, hypervisor calls.

Often, when there is one packet to transmit, there are others waiting in the queue as well; network traffic can be inherently bursty. So it would make sense for the networking stack to try to split the fixed costs of starting packet transmission across as many packets as possible. Some techniques, such as segmentation offload (wherein the network interface splits large chunks of data into packets), perform that kind of batching. But, in current kernels, if the networking stack has a set of packets ready to go, they will be sent out the slow way, one at a time.

That situation will begin to change in 3.18, when a relatively small set of changes will be merged. Consider the function exported by drivers now to send a packet:

    netdev_tx_t	(*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev);

This function takes the packet pointed to by skb and transmits it via the specified dev. Every call is a standalone operation, with all the associated fixed costs. The initial plan for 3.18 was to specify a new function that drivers could provide:
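As a rough illustration of where those fixed costs come from, a conventional driver's transmit routine might look something like the sketch below. It is not taken from any real driver; the my_dev structure, the doorbell field, and the my_hw_queue_descriptor() helper are hypothetical stand-ins for device-specific details:

    /*
     * Simplified sketch of a conventional transmit routine; my_dev and
     * my_hw_queue_descriptor() are hypothetical, not from a real driver.
     */
    static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct my_dev *priv = netdev_priv(dev);

        my_hw_queue_descriptor(priv, skb);  /* place the packet in the TX ring */
        netdev_sent_queue(dev, skb->len);   /* account for byte queue limits */

        /*
         * Ring the doorbell: an expensive MMIO write (or, on some
         * systems, a hypervisor call) performed once per packet.
         */
        writel(priv->tx_tail, priv->doorbell);

        return NETDEV_TX_OK;
    }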

    void (*ndo_xmit_flush)(struct net_device *dev, u16 queue);

If a driver provided this function, it was indicating to the networking stack that it is prepared for (and can benefit from) batched transmission. In this case, the networking stack could make multiple calls to ndo_start_xmit() to queue packets for transmission; the driver would accept them, but not actually start the transmission operation. At the end of a sequence of such calls, ndo_xmit_flush() would be called to indicate the end; at that point, actual hardware transmission would be started.
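Under that (soon-abandoned) proposal, the stack-side calling sequence would have looked roughly like the following sketch; the dequeue_next_packet() helper and queue_index variable are hypothetical, and nothing like this code exists in the tree today:

    /* Rough sketch of the abandoned ndo_xmit_flush() proposal, shown for
     * contrast only; it was removed before the 3.18 merge window. */
    while ((skb = dequeue_next_packet(q)) != NULL)
        dev->netdev_ops->ndo_start_xmit(skb, dev);  /* queue, but don't start the hardware */

    if (dev->netdev_ops->ndo_xmit_flush)
        dev->netdev_ops->ndo_xmit_flush(dev, queue_index);  /* now start transmission */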

There were concerns, though, that putting another indirect function call into the transmit path would add too much overhead, so this particular function was ripped out almost as soon as it landed in the net-next repository. In its place, the sk_buff structure has gained a new Boolean variable called xmit_more. If that variable is true, then there are more packets coming and the driver can defer starting hardware transmission. This variable takes out the extra function call while making the needed information available to drivers that can make use of it.
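On the driver side, making use of xmit_more amounts to skipping the doorbell write when more packets are known to be on the way. Here is a hedged sketch of how the hypothetical driver above might do it; real drivers must also ring the doorbell if the queue has been stopped, since no further packets will follow in that case:

    static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct my_dev *priv = netdev_priv(dev);
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

        my_hw_queue_descriptor(priv, skb);
        netdev_sent_queue(dev, skb->len);

        /*
         * Only ring the doorbell for the last packet of a burst, or if
         * the queue has been stopped and nothing more will follow.
         */
        if (!skb->xmit_more || netif_xmit_stopped(txq))
            writel(priv->tx_tail, priv->doorbell);

        return NETDEV_TX_OK;
    }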

This mechanism, added by David Miller, makes batching possible. A couple of drivers were fixed to support batching, but David did not change the networking stack to actually do the batching. That work fell to Jesper Dangaard Brouer, whose bulk dequeue support patches have also been merged for 3.18. This work, too, is limited in scope; in particular, it will only work with queuing disciplines that have a single transmit queue.

The change Jesper made is simple enough: in a situation where a packet is being transmitted, the stack will attempt to send out a series of packets together while the queue lock is held. The byte queue limits mechanism is used to put an upper bound on the amount of data that can be in flight at once. Once the limit is hit (or the supply of packets runs out), skb->xmit_more will be set to false and the traffic will be on its way.
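Conceptually, the dequeue side looks something like the sketch below: pull packets from the qdisc while the byte-limit budget lasts, chain them together, and let the transmit path set xmit_more on every packet except the last. This is a greatly simplified approximation, not the actual net/sched/sch_generic.c code:

    /* Greatly simplified sketch of bulk dequeue under a BQL byte budget. */
    static struct sk_buff *bulk_dequeue(struct Qdisc *q, int bytelimit)
    {
        struct sk_buff *head = q->dequeue(q);
        struct sk_buff *tail = head;

        if (!head)
            return NULL;

        bytelimit -= head->len;
        while (bytelimit > 0) {
            struct sk_buff *skb = q->dequeue(q);

            if (!skb)
                break;                  /* the queue ran dry */
            bytelimit -= skb->len;      /* charge against the BQL budget */
            tail->next = skb;           /* chain packets for a single driver pass */
            tail = skb;
        }
        tail->next = NULL;
        /*
         * The transmit path sets xmit_more on every packet in the chain
         * except the last before handing them to the driver.
         */
        return head;
    }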

Eric Dumazet looked at the patch set and realized that things could be taken a bit further: the process of validating packets for transmission could be moved outside of the queue lock entirely, increasing concurrency in the system. The resulting patch had benefits that Eric described as awesome: full 40Gb/sec wire speed, even in the absence of segmentation offload. Needless to say, this patch, too, has been accepted into the net-next tree for the 3.18 merge window.
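In conceptual terms, the ordering change looks like the fragment below. It is a sketch of the idea only (reusing the hypothetical bulk_dequeue() sketch from above); the real code in net/core/dev.c and net/sched/sch_generic.c is rather more involved:

    spin_lock(root_lock);                       /* the qdisc's queue lock */
    skb = bulk_dequeue(q, bytelimit);
    spin_unlock(root_lock);

    /*
     * Checksum setup and software segmentation now happen without the
     * queue lock held, so other CPUs can keep enqueueing packets.
     */
    skb = validate_xmit_skb_list(skb, dev);

    HARD_TX_LOCK(dev, txq, smp_processor_id());
    skb = dev_hard_start_xmit(skb, dev, txq, &ret);
    HARD_TX_UNLOCK(dev, txq);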

All told, the changes are relatively small. But small changes can have big effects when they are applied to the right places. These little changes should help to ensure that the networking stack in the 3.18 release is the fastest yet.

Index entries for this article
Kernel: Networking/Performance



Bulk network packet transmission

Posted Oct 9, 2014 19:37 UTC (Thu) by dlang (guest, #313) [Link]

Don't delay starting transmission of data; that adds latency when it's not needed.

It's better to do an inefficient transmission and let the data queue up; then, the next time through the loop, send all the pending data.

This approach minimizes the latency and avoids all the possible problems that can show up when the start is blocked and never gets unblocked.

It sounds as if this doesn't force the hardware to delay its start of transmission (so the flag can just mean 'there is more data available'), so the patch still sounds like a reasonable implementation, but the way it's being described is wrong.

It will be very interesting to see how this ends up interacting with multiple queue setups. The reason BQL was added in the first place was to prevent large sets of packets from eating up so much time that they caused unacceptable latency for other data. This could aggravate that sort of problem.

Bulk network packet transmission

Posted Oct 10, 2014 7:26 UTC (Fri) by JesperBrouer (guest, #62728) [Link]

Hi dlang,

We have tried hard not to introduce latency when using xmit_more.
* Explained in: http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-t...

Furthermore, qdisc dequeue bulking will ONLY be done for drivers supporting BQL. And I've done extensive measurements for Head-of-Line blocking using the netperf-wrapper tool.

Lots of graphs avail here: http://people.netfilter.org/hawk/qdisc/

--Jesper Brouer

Bulk network packet transmission

Posted Oct 10, 2014 19:26 UTC (Fri) by dlang (guest, #313) [Link]

Your post makes it clear that you are doing the right thing; the article was less clear (it said that when the flag is set, the card delays starting to transmit the data).

The approach of delaying the start of work because it can be done more efficiently later is a bit of a hot button with me. If you have other things to keep the resources busy, it can be a good thing; but if it means keeping a resource idle because the work can be done more efficiently later, it's much more questionable (about the only reason to do so is power efficiency). And deciding to _not_ do work now always adds the problem of how you decide to go ahead and do it later, with the potential bug of never deciding to actually start the work.

Thanks for pointing me at the clarification.

Bulk network packet transmission

Posted Oct 9, 2014 19:54 UTC (Thu) by JesperBrouer (guest, #62728) [Link]

I've written a blogpost on the subject:
http://netoptimizer.blogspot.com/2014/10/unlocked-10gbps-...

Bulk network packet transmission

Posted Oct 10, 2014 18:04 UTC (Fri) by stressinduktion (subscriber, #46452) [Link]

> This work, too, is limited in scope; in particular, it will only work with queuing disciplines that have a single transmit queue.

Multiqueue TX NICs by default set up an mq qdisc scheduler, which establishes a 1:1 qdisc <-> txq mapping. So this one-to-one mapping is quite common.

Bulk network packet transmission

Posted Oct 10, 2014 22:08 UTC (Fri) by jhoblitt (subscriber, #77733) [Link]

Are there any power efficiency gains to be had from batching? Either with current hardware or some future device that could drop into lower power states? 10GBaseT power usage isn't trivial even on a modern CMOS process.

Bulk network packet transmission

Posted Oct 11, 2014 0:21 UTC (Sat) by giraffedata (guest, #1954) [Link]

Are there any power efficiency gains to be had from batching?

That's a different kind of batching - saving up work until you have enough to make it worth incurring the fixed costs. That's definitely not what this work is aimed at.

In fact, "batching" may be a misleading term for this work. It's more like what they call "coalescing" when it happens on disk drive communication links.

Bulk network packet transmission

Posted Oct 11, 2014 3:46 UTC (Sat) by dlang (guest, #313) [Link]

In theory it would be possible: if this sends the packets faster, there would be larger chunks of idle time in which power savings could be used.

But this is not looking at doing that; it's 'just' looking at decreasing the cost of sending packets by paying the overhead of sending data once for several packets instead of paying it for each packet.

This sort of operation can significantly reduce the contention for locks, so the full effects on performance are probably going to be larger than you would think, and some are going to show up as (small) benefits in odd areas that you may not even think to measure.

Bulk network packet transmission

Posted Oct 14, 2014 20:59 UTC (Tue) by marcH (subscriber, #57642) [Link]

How much "burstiness" can this add to the traffic? (Sorry I don't have enough time to read all the references - I really wish I had)

On the one hand, TCP windows and timers are trying hard to pace / smooth traffic to reduce burstiness as much as possible in order to reduce packet losses in switches and optimize utilization.

On the other hand, operating systems are constantly looking at "batching" opportunities like this one to optimize raw throughput.

I feel like this ever-lasting tension provides an infinite source of research problems, papers and prototypes...

Bulk network packet transmission

Posted Oct 14, 2014 22:44 UTC (Tue) by dlang (guest, #313) [Link]

Not a lot; all the packets that would be transmitted are already scheduled to be sent ASAP. Unless the CPU is not able to keep up with the network, this doesn't result in more traffic on the wire; it's just getting there more efficiently.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds