Quantcast

Replace bcopy() to update ether_addr

classic Classic list List threaded Threaded
34 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Replace bcopy() to update ether_addr

Dmitry S. Luhtionov
Hi.
I found some overhead code in /src/sys/net/if_ethersubr.c and
/src/sys/netgraph/ng_ether.c

It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
When src and dst are "struct ether_addr*", and ETHER_ADDR_LEN equal 6.
This code call every time, when we send Ethernet packet.
On example, on my machine in invoked nearly 20K per second.

Why we are use bcopy(), to copy only 6 bytes?
Answer - in some architectures we are can not directly copy unaligned data.

I propose this solution.

In file /usr/src/include/net/ethernet.h add this lines:

static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
#if defined(__i386__) || defined(__amd64__)
     *dst = *src;
#else
     bcopy(src, dst, ETHER_ADDR_LEN);
#endif
}

On platform i386 gcc produce like this code:
     leal    -30(%ebp), %eax
     leal    6(%eax), %ecx
     leal    -44(%ebp), %edx
     movl    (%edx), %eax
     movl    %eax, (%ecx)
     movzwl  4(%edx), %eax
     movw    %ax, 4(%ecx)
And clang produce this:
     movl    -48(%ebp), %ecx
     movl    %ecx, -26(%ebp)
     movw    -44(%ebp), %si
     movw    %si, -22(%ebp)


All this variants are much faster, than bcopy()

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Warner Losh

On Aug 20, 2012, at 8:46 AM, Mitya wrote:

> Hi.
> I found some overhead code in /src/sys/net/if_ethersubr.c and /src/sys/netgraph/ng_ether.c
>
> It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
> When src and dst are "struct ether_addr*", and ETHER_ADDR_LEN equal 6.
> This code call every time, when we send Ethernet packet.
> On example, on my machine in invoked nearly 20K per second.
>
> Why we are use bcopy(), to copy only 6 bytes?
> Answer - in some architectures we are can not directly copy unaligned data.

True for shorts, longs, etc.  But why do we need it for ether_addr?  It is a struct that's just an array...  That's always safe.

> I propose this solution.
>
> In file /usr/src/include/net/ethernet.h add this lines:
>
> static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
> #if defined(__i386__) || defined(__amd64__)

Bleck.  that's uber ugly.  We have a define for unaligned vs aligned architectures.  we should use that here.  If we even need this ifdef at all.

Warner

>    *dst = *src;
> #else
>    bcopy(src, dst, ETHER_ADDR_LEN);
> #endif
> }
>
> On platform i386 gcc produce like this code:
>    leal    -30(%ebp), %eax
>    leal    6(%eax), %ecx
>    leal    -44(%ebp), %edx
>    movl    (%edx), %eax
>    movl    %eax, (%ecx)
>    movzwl  4(%edx), %eax
>    movw    %ax, 4(%ecx)
> And clang produce this:
>    movl    -48(%ebp), %ecx
>    movl    %ecx, -26(%ebp)
>    movw    -44(%ebp), %si
>    movw    %si, -22(%ebp)
>
>
> All this variants are much faster, than bcopy()
>
> _______________________________________________
> [hidden email] mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "[hidden email]"

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Wojciech Puchar-5
In reply to this post by Dmitry S. Luhtionov
> #if defined(__i386__) || defined(__amd64__)
>    *dst = *src;
> #else
>    bcopy(src, dst, ETHER_ADDR_LEN);
#else
short *tmp1=((*short)src),*tmp2=((*short)dst);
*tmp2=*tmp1; *(tmp2+1)=*(tmp1+1); *(tmp2+2)=*(tmp1+2);

or use ++.

i think it is always aligned to 2 bytes and this should produce usable
code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Warner Losh

On Aug 20, 2012, at 10:48 AM, Wojciech Puchar wrote:

>> #if defined(__i386__) || defined(__amd64__)
>>   *dst = *src;
>> #else
>>   bcopy(src, dst, ETHER_ADDR_LEN);
> #else
> short *tmp1=((*short)src),*tmp2=((*short)dst);
> *tmp2=*tmp1; *(tmp2+1)=*(tmp1+1); *(tmp2+2)=*(tmp1+2);
>
> or use ++.
>
> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.

We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.

Warner

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Wojciech Puchar-5
>> or use ++.
>>
>> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
>
> We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.
should is the most important word in Your post. what it actually do - i
don't know.
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Warner Losh

On Aug 20, 2012, at 1:17 PM, Wojciech Puchar wrote:

>>> or use ++.
>>>
>>> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
>>
>> We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.
> should is the most important word in Your post. what it actually do - i don't know.

If we are requiring this to be __aligned(2), we should tag it as such to enforce this.

Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...

Warner

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Bakul Shah
In reply to this post by Warner Losh
On Mon, 20 Aug 2012 13:05:51 MDT Warner Losh <[hidden email]>  wrote:

>
> On Aug 20, 2012, at 10:48 AM, Wojciech Puchar wrote:
>
> >> #if defined(__i386__) || defined(__amd64__)
> >>   *dst =3D *src;
> >> #else
> >>   bcopy(src, dst, ETHER_ADDR_LEN);
> > #else
> > short *tmp1=3D((*short)src),*tmp2=3D((*short)dst);
> > *tmp2=3D*tmp1; *(tmp2+1)=3D*(tmp1+1); *(tmp2+2)=3D*(tmp1+2);
> >
> > or use ++.
> >
> > i think it is always aligned to 2 bytes and this should produce usable
> code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
>
> We should tag it as __aligned(2) then, no?  If so, then the compiler
> should generate the code you posted.

Doesn't gcc inline memcpy these days?
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Dmitry S. Luhtionov
In reply to this post by Warner Losh
20.08.2012 22:20, Warner Losh написал:

> On Aug 20, 2012, at 1:17 PM, Wojciech Puchar wrote:
>
>>>> or use ++.
>>>>
>>>> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
>>> We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.
>> should is the most important word in Your post. what it actually do - i don't know.
> If we are requiring this to be __aligned(2), we should tag it as such to enforce this.
>
> Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...
>
> Warner
>
I try some times different algorithms. This is one of thees:
*(u_int32_t *)(dst) = *(u_int32_t *)(src);
*(u_int16_t *)&(dst->octet[4]) = *(u_int16_t *)&(src->octet[4]);

But, internal gcc and clang optimisations are much better, than my attempt.
For aligned platforms (2 bytes aligned) best choice is *dst = *src;
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Wojciech Puchar-5
In reply to this post by Warner Losh
>
> Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...

true.

just to make sure it will be absolutely portable how about

bcopymacaddress(dst,src)

and then define it whatever you find it fastest on any architecture?
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Marius Strobl
In reply to this post by Dmitry S. Luhtionov
On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:

> Hi.
> I found some overhead code in /src/sys/net/if_ethersubr.c and
> /src/sys/netgraph/ng_ether.c
>
> It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
> When src and dst are "struct ether_addr*", and ETHER_ADDR_LEN equal 6.
> This code call every time, when we send Ethernet packet.
> On example, on my machine in invoked nearly 20K per second.
>
> Why we are use bcopy(), to copy only 6 bytes?
> Answer - in some architectures we are can not directly copy unaligned data.
>
> I propose this solution.
>
> In file /usr/src/include/net/ethernet.h add this lines:
>
> static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
> #if defined(__i386__) || defined(__amd64__)
>     *dst = *src;
> #else
>     bcopy(src, dst, ETHER_ADDR_LEN);
> #endif
> }
>
> On platform i386 gcc produce like this code:
>     leal    -30(%ebp), %eax
>     leal    6(%eax), %ecx
>     leal    -44(%ebp), %edx
>     movl    (%edx), %eax
>     movl    %eax, (%ecx)
>     movzwl  4(%edx), %eax
>     movw    %ax, 4(%ecx)
> And clang produce this:
>     movl    -48(%ebp), %ecx
>     movl    %ecx, -26(%ebp)
>     movw    -44(%ebp), %si
>     movw    %si, -22(%ebp)
>
>
> All this variants are much faster, than bcopy()
>

A bit orthogonal to this but also related to the performance
impact of these bcopy() calls, for !__NO_STRICT_ALIGNMENT
architectures these places probably should use memcpy()
instead as bcopy() additionally has to check for overlap
while the former does not. Overlaps unlikely are an issue
in these cases and at least NetBSD apparently has done the
switch to memcpy() 5.5 years ago.

Marius

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Luigi Rizzo-5
On Tue, Aug 21, 2012 at 12:26:30PM +0200, Marius Strobl wrote:
...

> > Why we are use bcopy(), to copy only 6 bytes?
> > Answer - in some architectures we are can not directly copy unaligned data.
> >
> > I propose this solution.
> >
> > In file /usr/src/include/net/ethernet.h add this lines:
> >
> > static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
> > #if defined(__i386__) || defined(__amd64__)
> >     *dst = *src;
> > #else
> >     bcopy(src, dst, ETHER_ADDR_LEN);
> > #endif
> > }
...

> > All this variants are much faster, than bcopy()
> >
>
> A bit orthogonal to this but also related to the performance
> impact of these bcopy() calls, for !__NO_STRICT_ALIGNMENT
> architectures these places probably should use memcpy()
> instead as bcopy() additionally has to check for overlap
> while the former does not. Overlaps unlikely are an issue
> in these cases and at least NetBSD apparently has done the
> switch to memcpy() 5.5 years ago.

even more orthogonal:

I found that copying 8n + (5, 6 or 7) bytes was much much slower than
copying a multiple of 8 bytes. For n=0, 1,2,4,8 bytes are efficient,
other cases are slow (turned into 2 or 3 different writes).

The netmap code uses a pkt_copy routine that does exactly this
rounding, gaining some 10-20ns per packet for small sizes.

cheers
luigi
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Marius Strobl
In reply to this post by Warner Losh
On Mon, Aug 20, 2012 at 01:20:29PM -0600, Warner Losh wrote:

>
> On Aug 20, 2012, at 1:17 PM, Wojciech Puchar wrote:
>
> >>> or use ++.
> >>>
> >>> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
> >>
> >> We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.
> > should is the most important word in Your post. what it actually do - i don't know.
>
> If we are requiring this to be __aligned(2), we should tag it as such to enforce this.
>
> Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...
>

While the __aligned(2) approach certainly works, I've actually rather
mixed experiences on x86 with it as the compiler doesn't necessarily
produce the small and efficient one would expect from code it. Such
a change certainly shouldn't be done just on the assumption that the
compiler has all hints required to produce good code from it but the
resulting asm should be verified across all affected architectures.

Marius

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Dmitry S. Luhtionov
21.08.2012 14:26, Marius Strobl написал:

> On Mon, Aug 20, 2012 at 01:20:29PM -0600, Warner Losh wrote:
>> On Aug 20, 2012, at 1:17 PM, Wojciech Puchar wrote:
>>
>>>>> or use ++.
>>>>>
>>>>> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
>>>> We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.
>>> should is the most important word in Your post. what it actually do - i don't know.
>> If we are requiring this to be __aligned(2), we should tag it as such to enforce this.
>>
>> Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...
>>
> While the __aligned(2) approach certainly works, I've actually rather
> mixed experiences on x86 with it as the compiler doesn't necessarily
> produce the small and efficient one would expect from code it. Such
> a change certainly shouldn't be done just on the assumption that the
> compiler has all hints required to produce good code from it but the
> resulting asm should be verified across all affected architectures.
>
> Marius
>
Yes. I totally agree. That is why I have limited use of this feature
only i386 and amd64 architectures.

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

John Baldwin
In reply to this post by Dmitry S. Luhtionov
On Monday, August 20, 2012 10:46:12 am Mitya wrote:

> Hi.
> I found some overhead code in /src/sys/net/if_ethersubr.c and
> /src/sys/netgraph/ng_ether.c
>
> It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
> When src and dst are "struct ether_addr*", and ETHER_ADDR_LEN equal 6.
> This code call every time, when we send Ethernet packet.
> On example, on my machine in invoked nearly 20K per second.
>
> Why we are use bcopy(), to copy only 6 bytes?
> Answer - in some architectures we are can not directly copy unaligned data.
>
> I propose this solution.
>
> In file /usr/src/include/net/ethernet.h add this lines:
>
> static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
> #if defined(__i386__) || defined(__amd64__)
>      *dst = *src;
> #else
>      bcopy(src, dst, ETHER_ADDR_LEN);
> #endif
> }

Doesn't '*dst = *src' just work on all platforms?

--
John Baldwin
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Warner Losh
In reply to this post by Marius Strobl

On Aug 21, 2012, at 5:26 AM, Marius Strobl wrote:

> On Mon, Aug 20, 2012 at 01:20:29PM -0600, Warner Losh wrote:
>>
>> On Aug 20, 2012, at 1:17 PM, Wojciech Puchar wrote:
>>
>>>>> or use ++.
>>>>>
>>>>> i think it is always aligned to 2 bytes and this should produce usable code on any CPU? should be 6 instructions on MIPS and PPC IMHO.
>>>>
>>>> We should tag it as __aligned(2) then, no?  If so, then the compiler should generate the code you posted.
>>> should is the most important word in Your post. what it actually do - i don't know.
>>
>> If we are requiring this to be __aligned(2), we should tag it as such to enforce this.
>>
>> Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...
>>
>
> While the __aligned(2) approach certainly works, I've actually rather
> mixed experiences on x86 with it as the compiler doesn't necessarily
> produce the small and efficient one would expect from code it. Such
> a change certainly shouldn't be done just on the assumption that the
> compiler has all hints required to produce good code from it but the
> resulting asm should be verified across all affected architectures.

Very true...  I would have thought that went without saying...

Warner_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Warner Losh
In reply to this post by Wojciech Puchar-5

On Aug 21, 2012, at 1:42 AM, Wojciech Puchar wrote:

>>
>> Even without this tagging, the code to do a structure level copy of 6 bytes is going to be tiny...
>
> true.
>
> just to make sure it will be absolutely portable how about
>
> bcopymacaddress(dst,src)
>
> and then define it whatever you find it fastest on any architecture?

How about just changing it to the *dst = *src, compiling it on all architectures and then deciding if the improvement of the code from a hand-tweaked thing is worth that uglification?

Warner

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Adrian Chadd-2
Hi,

What about just creating an ETHER_ADDR_COPY(dst, src) and putting that
in a relevant include file, then hide the ugliness there?

The same benefits will likely appear when copying wifi MAC addresses
to/from headers.

Thanks, I'm glad someone noticed this.




Adrian
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Bruce Evans-2
In reply to this post by Marius Strobl
> On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:
> > Hi.
> > I found some overhead code in /src/sys/net/if_ethersubr.c and
> > /src/sys/netgraph/ng_ether.c
> >
> > It contains strings, like bcopy(src, dst, ETHER_ADDR_LEN);
> > When src and dst are "struct ether_addr*", and ETHER_ADDR_LEN equal 6.

Only ng_ether.c contains such strings.  if_ethersubr.c doesn't exist.
if_ether.c exists, but was converted to use memcpy() 17.25 years ago.

> > This code call every time, when we send Ethernet packet.
> > On example, on my machine in invoked nearly 20K per second.
> >
> > Why we are use bcopy(), to copy only 6 bytes?
> > Answer - in some architectures we are can not directly copy unaligned data.
> >
> > I propose this solution.
> >
> > In file /usr/src/include/net/ethernet.h add this lines:
> >
> > static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
> > #if defined(__i386__) || defined(__amd64__)
> >     *dst = *src;
> > #else
> >     bcopy(src, dst, ETHER_ADDR_LEN);
> > #endif
> > }
> >
> > On platform i386 gcc produce like this code:
> >     leal    -30(%ebp), %eax
> >     leal    6(%eax), %ecx
> >     leal    -44(%ebp), %edx
> >     movl    (%edx), %eax
> >     movl    %eax, (%ecx)
> >     movzwl  4(%edx), %eax
> >     movw    %ax, 4(%ecx)
> > And clang produce this:
> >     movl    -48(%ebp), %ecx
> >     movl    %ecx, -26(%ebp)
> >     movw    -44(%ebp), %si
> >     movw    %si, -22(%ebp)
> >
> > All this variants are much faster, than bcopy()

You mean "as much as 5 nanoseconds faster".  Possibly even 10 nanoseconds
faster.

> A bit orthogonal to this but also related to the performance
> impact of these bcopy() calls, for !__NO_STRICT_ALIGNMENT
> architectures these places probably should use memcpy()
> instead as bcopy() additionally has to check for overlap
> while the former does not. Overlaps unlikely are an issue
> in these cases and at least NetBSD apparently has done the
> switch to memcpy() 5.5 years ago.

This is essentially just a style bug.  FreeBSD switched to memcpy()
17.25 years ago for selected networking copying.  memcpy() is supposed
to be used if and only if compilers can optimize it.  This means that
the size must be fixed and small, and of course that the copies don't
overlap.  Otherwise, compilers can't do any better than call an extern
copying routine, which is memcpy() in practice.  memcpy() was
interntionally left out in FreeBSD until it was added 17.25 years
ago to satisfy the changes to use memcpy() in networking code (since
with -O0, memcpy() won't be inlined and the extern memcpy() gets
used).  Other uses are style bugs, but there are many now :-(.

bcopy() is still the primary interface, and might be faster than
memcpy(), especially for misaligned cases, but in practice it isn't,
except in the kernel on Pentium1 in 1996-1998 where I only optimized
(large) bcopy()s.  Since it has to support overlapping copies it is
inherently slower.

Although compilers can't do better for large copying than call an
extern routine, some compilers bogusly inline it to something like
"rep movsd" on i386, (or worse, to a very large number of loads and
stores).  gcc used to have a very large threshold for inlining
moderately-sized copies and/or for switching between "rep movsd" and
load/store.  It now understands better than ut doesn't understand
memory, and has more reasonable thresholds.   Or rather the thresholds
are more MD.  gcc still makes a mess with some CFLAGS:

% struct foo
% {
% short x;
% struct bar {
% short yy[31];
% } y;
% } s, t;
%
% foo()
% {
% s.y = t.y;
% }

With just -O, gcc-4.2.1 -O on i386 handles this very badly, by
generating 15 misaligned 32-bit load/stores followed by 1 aligned
16-bit load/store.  With -march=<almost anything>, it generates 1
16-bit aligned load-store followed by an aligned "rep movsd" with a
count of 15.  The latter is not too bad.  Similarly for yy[32].  But
for yy[33] it switches to a call to memcpy() even with plain -O.

However, improvements in memory systems and branch prediction since
Pentium1 in 1996-98 mean that optimimizing copying mostly gets
nowhere.  Copying is either from the cache[s], in which case it is
fast (~1 nanosecond per 128 bits for L1), or it is not from the
caches in which case it is slow (~50-200 nanseconds per cache miss).
With 6-byte ethernet addresses, using bcopy() does slow the copying
to considerably below 1 nanosecond per 128 bits (a sloppy estimate
gives 5-10 ns/call), but it's hard for a single call to be made often
enough to make a significant difference.  Someone mentioned 20000
calls.  That's the same as 0 calls: 20000 * 10 nsec = 200 usec =
0.05% of 1 CPU.

If anyone cared about this, then they would use __builtin_memcpy()
instead of memcpy().  (Note that the point of the 17.25-year old
optimization has been defeated for ~10 years by turning off _all_
builtins, which was initially done mainly to kill builtin putc().
(-ffreestanding should have done that.)  So gcc inlines struct
copying for small structs, but never inlines memcpy(), or strlen(),
or other unimportant things.)   I once used the following to
partially fix this, but stopped using it because it made no
observable difference:

% diff -c2 ./sys/systm.h~ ./sys/systm.h
% *** ./sys/systm.h~ Sat Oct 13 12:43:54 2007
% --- ./sys/systm.h Sat Oct 13 12:43:56 2007
% ***************
% *** 188,193 ****
% --- 188,204 ----
%   void bcopy(const void *from, void *to, size_t len) __nonnull(1) __nonnull(2);
%   void bzero(void *buf, size_t len) __nonnull(1);
% + #if 1
% + #define bzero(p, n) ({ \
% + if (__builtin_constant_p(n) && (n) <= 64) \
% + __builtin_memset((p), 0, (n)); \
% + else \
% + (bzero)((p), (n)); \
% + })
% + #endif
%  
%   void *memcpy(void *to, const void *from, size_t len) __nonnull(1) __nonnull(2);
% + #if 1
% + #define memcpy(to, from, n) __builtin_memcpy((to), (from), (n))
% + #endif
%  
%   int copystr(const void * __restrict kfaddr, void * __restrict kdaddr,

In another set of dead patches, I changed lots of memcpy()s in networking
code to __builtin_memcpy()'s.

There is another set of style bugs and micro-pessimizations involving
other mem* functions starting with bzero().  The above avoids the
micro-pessimizations for the usual case of memset() to 0, at a cost
of rewarding the style bug.

Summary: use builtin memcpy() for small copies, and don't try hard to
otherwise optimize this.

Bruce
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Bruce Evans-2
In reply to this post by Luigi Rizzo-5
luigi wrote:

> even more orthogonal:
>
> I found that copying 8n + (5, 6 or 7) bytes was much much slower than
> copying a multiple of 8 bytes. For n=0, 1,2,4,8 bytes are efficient,
> other cases are slow (turned into 2 or 3 different writes).
>
> The netmap code uses a pkt_copy routine that does exactly this
> rounding, gaining some 10-20ns per packet for small sizes.

I don't believe 10-20ns for just the extra bytes.  memcpy() ends up
with a movsb to copy the extra bytes.  This can be slow, but I don't
believe 10-20ns (except on machines running at i486 speeds of course).

% ENTRY(memcpy)
% pushl %edi
% pushl %esi
% movl 12(%esp),%edi
% movl 16(%esp),%esi
% movl 20(%esp),%ecx
% movl %edi,%eax
% shrl $2,%ecx /* copy by 32-bit words */
% cld /* nope, copy forwards */
% rep
% movsl
% movl 20(%esp),%ecx
% andl $3,%ecx /* any bytes left? */

This avoids a branch.  Some optimization manuals say that the branch is
actually better for some machines,

The above 2 instructions have a throughput of 1 per cycle each on
modern x86.  Latency might be 6 cycles.

% rep

Maybe 5-15 cycles throughput.

% movsb

Now hopefully at most 1 cycle/byte.  Some hardware might combine the
bytes as much as possible, so the whole function should use 1 single
"rep movsb" and let the hardware do it all.

% popl %esi
% popl %edi
% ret

Well, it's easy to get a latency of 20 cycles 5-10 ns) and maybe even
a throughput of that.  But all of thus is out of order on modern x86.
The extra cycles for the movsb might not cost at all if nothing accesses
the part of the target that they were written to soon.

With builtin memcpy, 6 bytes would be done using load/store of 4+2 bytes
and thus take the same time as 8 bytes on i386, but on amd64 8 bytes
would be faster.

Bruce
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Replace bcopy() to update ether_addr

Bruce Evans-2
In reply to this post by John Baldwin
jhb wrote:

> On Monday, August 20, 2012 10:46:12 am Mitya wrote:
> > ...
> > I propose this solution.
> >
> > In file /usr/src/include/net/ethernet.h add this lines:
> >
> > static inline void ether_addr_copy(ether_addr* src, ether_addr* dst) {
> > #if defined(__i386__) || defined(__amd64__)
> >      *dst = *src;
> > #else
> >      bcopy(src, dst, ETHER_ADDR_LEN);
> > #endif
> > }
>
> Doesn't '*dst = *src' just work on all platforms?

It does when you have an actual struct, but often in networking code
you only have an array of bytes, and then casting the pointers from
uint8_t * to full object pointers fails iff there is an alignment
problem.

Even on i386, it may be pessimal to use struct copying for 6 bytes,
The bytes should be copied as 4+2 or 2+4 depending on alignment of the
first bytes.  Oops, not even that -- this reminds me of a problem with
penalties for mismatched loads and stores which affects at least
AthlonXP and Athlon64 significantly (10-20 cycle penalty = enough to
copy 80-160 bytes unpenalized).  The 6 should probably be copied as
2+2+2 or even as 1+1+1+1+1+1 to match previous stores and later loads,
also as 2+2+2, etc.  But if the 6 are part of another struct, they
might be accessed as 8+0 or 4+4 as part of copying the full struct.
gcc is not smart about this.

Bruce
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[hidden email]"
12
Loading...