Quantcast

if_start / if_transmit handling and packet ordering - how can I guarantee it?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

if_start / if_transmit handling and packet ordering - how can I guarantee it?

Adrian Chadd-2
Hi all,

I've been pushing my ath 11n driver hard (250Mbit UDP) and I've found
a rather interesting behaviour, at least on my SMP machine.

I'm running single direction UDP iperf tests. This is all on stable/9,
with -head ath/net80211 stacks.

The receiver logs about 40/20000 frames a second as being "out of
order". The net80211 stack and ath software TX aggregation path isn't
reordering frames.

So I reviewed the code again and I wondered - is what I'm doing with
ath_start() and the TX side locking allowing for multiple ath_start()
invocations to occur, where they interleave processing frames? The
short answer is yes.

Then, something else hit me - I wondered whether ath_start() itself is
being preempted by the ath taskqueue (say, doing RX, or TX completion)
- where ath_start() is then called. Ie;

* iperf -> sendto() -> socket layer -> net80211 ieee80211_start() ->
if_transmit -> if_start() -> ath_start()

And then an interrupt coming in, during ath_start() having dequeued a
frame, but before it had placed it on any hardware/software queue:

* ath0 taskqueue -> ath_tx_proc() (tx completion) -> ath_start()

Or, similarly, CPU #0 running iperf, CPU #1 running the ath taskqueue,
and one or the other being preempted by something higher priority (eg
an interrupt coming in) between having removed a frame from the ifnet
queue and it being thrown into the software/hardware queue.

The ath driver locking only holds the TX locks as long as is needed to
do individual TX queue operations (ie, queue a frame to the software /
hardware queue) rather than holding it for the entirety of the TX
path. Because of this, I wonder if preemption is possibly causing
issues.

So I then went grovelling through ixgbe to see what it does with
if_transmit. It holds the TX lock for that particular NIC hardware
queue for the entirety of:

* a TX to hardware operation (which can be an individual frame queue,
or servicing the whole bufring contents);
* completing the frames via _txeof(), until nothing more needs to be done.

So by holding the TX lock for the entirety of the queue operation, it
means that any other entries into that particular TX path for that NIC
queue will stall, waiting for the lock, and thus effectively be
serialised, avoiding my initial issue. Any TX completion which leads
to further transmissions from the bufring (and simultaneous incoming
TX frames) will block each other.

Ok, so now that I've mostly tried to lucidly dump what's going on-
what do people think about holding the locks for (potentially) so
long? I know iwn(4) holds the driver lock for as long as it can for
_everything_, so it avoids this issue. But again, I don't really like
the idea of holding a lock for this long. Does anyone else have any
other ideas?

FWIW - I temporarily converted the ath driver to make ath_start()
enqueue a taskqueue task, which then did all of the TX inside the
taskqueue. This serialised with TX completion and RX, which can also
start TX; and all of my out of order issues went away. I unfortunately
then become very, very susceptible to scheduling latency (hence my
initial post about KTR, SMP, preemption and what looks to be like
Cx/idle/powerd issues. If I run my laptop (Lenovo T60) with all the
power/Cx stuff disabled, I can quite happily sustain 240-250MBit of
UDP without any reordering or packet latency.

Thanks in advance (and phew!)


Adrian
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-wireless
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: if_start / if_transmit handling and packet ordering - how can I guarantee it?

PseudoCylon
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 1 Jun 2012 00:46:02 -0700
> From: Adrian Chadd <[hidden email]>
> Subject: if_start / if_transmit handling and packet ordering - how can
>        I guarantee it?
> To: [hidden email]
> Cc: [hidden email]
> Message-ID:
>        <CAJ-Vmo=[hidden email]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Ok, so now that I've mostly tried to lucidly dump what's going on-
> what do people think about holding the locks for (potentially) so
> long? I know iwn(4) holds the driver lock for as long as it can for
> _everything_, so it avoids this issue. But again, I don't really like
> the idea of holding a lock for this long.

Nether do I. Basically, this is how things go with holding a lock.

if_start()
{
        LOCK();
        for (;;) {
                add_slot();
                if (++queue_counter > MAX)
                        break;
        }
        UNLOCK();
}

_txeof() or usb_bulk_callback()
{
        LOCK();
        clear_slot();
        queue_counter--;
        UNLOCK();
        if_start();
}

When if_start() is called first time, it will loop until slots get
full. This is guaranteed because both functions want the lock, so no
slot will be cleared until if_start() exits.
When _txeof() is called, it frees one slot. Then calls if_start().
After enqueuing one frame, slots get full again. Then, _txeof() frees
one slot, if_start() adds one ...
So, the driver processes one frame at a time. The packet order is
maintained, but it seems wasting memory for queue slots.

> Does anyone else have any
> other ideas?

Maybe...
If we guarantee only one thread/process runs the if_start(), we won't
have to hold the lock for that long. i.e

if_start()
{
        if (!atomic_cmpset(&running, 0, 1))
                return;

        for(;;) {
                if (full) {
                        running = 0;
                        break;
                }
        }
}

> FWIW - I temporarily converted the ath driver to make ath_start()
> enqueue a taskqueue task, which then did all of the TX inside the
> taskqueue.

I have tried the similar thing with run(4), because I thought calling
taskqueue_enqueue() is better than calling if_start() in
usb_bulk_callback(). (if_start() is a big process.) I got extra
bandwidth (forget the actual number).

> I unfortunately
> then become very, very susceptible to scheduling latency

I had to use a private taskqueue instead of shared one to over come
the latency. Other than that, it worked well, at least under 1 ap + 1
sta both use run(4) environment.


AK
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-wireless
To unsubscribe, send any mail to "[hidden email]"
Loading...