|
Hi all,
I implemented multiqueue support for bpf, and I'd like to present it for review.
This is a Google Summer of Code project. The project goal is to
support multiqueue network interfaces in BPF, and to provide interfaces
for multithreaded packet processing using BPF.
Modern high-performance NICs have multiple receive/send queues and the RSS
feature, which allows packets to be processed concurrently on multiple
processors.
The main purpose of the project is to support this hardware and get the
benefit of parallelism.

This provides the following new APIs:
- queue filter for each bpf descriptor (bpf ioctl)
  - BIOCENAQMASK    Enables multiqueue filter on the descriptor
  - BIOCDISQMASK    Disables multiqueue filter on the descriptor
  - BIOCSTRXQMASK   Set mask bit on specified RX queue
  - BIOCCRRXQMASK   Clear mask bit on specified RX queue
  - BIOCGTRXQMASK   Get mask bit on specified RX queue
  - BIOCSTTXQMASK   Set mask bit on specified TX queue
  - BIOCCRTXQMASK   Clear mask bit on specified TX queue
  - BIOCGTTXQMASK   Get mask bit on specified TX queue
  - BIOCSTOTHERMASK Set mask bit for packets that are not tied to any queue
  - BIOCCROTHERMASK Clear mask bit for packets that are not tied to any queue
  - BIOCGTOTHERMASK Get mask bit for packets that are not tied to any queue

- generic interface for getting hardware queue information from the NIC
  driver (socket ioctl)
  - SIOCGIFQLEN        Get interface RX/TX queue length
  - SIOCGIFRXQAFFINITY Get interface RX queue affinity
  - SIOCGIFTXQAFFINITY Get interface TX queue affinity

Patch for -CURRENT is here; right now it only supports igb(4),
ixgbe(4), mxge(4):
http://www.dokukino.com/mq_bpf_20110813.diff

And below is the performance benchmark:

====
I implemented benchmark programs based on
bpfnull(//depot/projects/zcopybpf/utils/bpfnull/).

test_sqbpf measures bpf throughput on one thread, without using the multiqueue APIs.
http://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/soc2011/mq_bpf/src/tools/regression/bpf/mq_bpf/test_sqbpf/test_sqbpf.c

test_mqbpf is a multithreaded version of test_sqbpf, using the multiqueue APIs.
http://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/soc2011/mq_bpf/src/tools/regression/bpf/mq_bpf/test_mqbpf/test_mqbpf.c

I benchmarked under six conditions:
- benchmark1 only reads bpf, doesn't write packets anywhere
- benchmark2 writes packets to memory (mfs)
- benchmark3 writes packets to hdd (zfs)
- benchmark4 only reads bpf, doesn't write packets anywhere, with zerocopy
- benchmark5 writes packets to memory (mfs), with zerocopy
- benchmark6 writes packets to hdd (zfs), with zerocopy

From the benchmark results, I can say the performance is increased using
mq_bpf on 10GbE, but not on GbE.

* Throughput benchmark
- Test environment
  - FreeBSD node
    CPU: Core i7 X980 (12 threads)
    MB: ASUS P6X58D Premium (Intel X58)
    NIC1: Intel Gigabit ET Dual Port Server Adapter (82576)
    NIC2: Intel Ethernet X520-DA2 Server Adapter (82599)
  - Linux node
    CPU: Core 2 Quad (4 threads)
    MB: GIGABYTE GA-G33-DS3R (Intel G33)
    NIC1: Intel Gigabit ET Dual Port Server Adapter (82576)
    NIC2: Intel Ethernet X520-DA2 Server Adapter (82599)

  iperf was used to generate network traffic, with the following options:
  - Linux node: iperf -c [IP] -i 10 -t 100000 -P12
  - FreeBSD node: iperf -s
  # 12 threads, TCP

  The following sysctl parameter was changed:
  sysctl -w net.bpf.maxbufsize=1048576

- Benchmark1
  Benchmark1 doesn't write packets anywhere, using the following commands:
  ./test_sqbpf -i [interface] -b 1048576
  ./test_mqbpf -i [interface] -b 1048576
  - ixgbe
    test_mqbpf: 5303.09007533333 Mbps
    test_sqbpf: 3959.83021733333 Mbps
  - igb
    test_mqbpf: 916.752133333333 Mbps
    test_sqbpf: 917.597079 Mbps

- Benchmark2
  Benchmark2 writes packets to mfs, using the following commands:
  mdmfs -s 10G md /mnt
  ./test_sqbpf -i [interface] -b 1048576 -w -f /mnt/test
  ./test_mqbpf -i [interface] -b 1048576 -w -f /mnt/test
  - ixgbe
    test_mqbpf: 1061.24890333333 Mbps
    test_sqbpf: 204.779881 Mbps
  - igb
    test_mqbpf: 916.656664666667 Mbps
    test_sqbpf: 914.378636 Mbps

- Benchmark3
  Benchmark3 writes packets to zfs (on HDD), using the following commands:
  ./test_sqbpf -i [interface] -b 1048576 -w -f test
  ./test_mqbpf -i [interface] -b 1048576 -w -f test
  - ixgbe
    test_mqbpf: 119.912253333333 Mbps
    test_sqbpf: 101.195918 Mbps
  - igb
    test_mqbpf: 228.910355333333 Mbps
    test_sqbpf: 199.639093666667 Mbps

- Benchmark4
  Benchmark4 doesn't write packets anywhere, using the following commands, with zerocopy:
  ./test_sqbpf -i [interface] -b 1048576
  ./test_mqbpf -i [interface] -b 1048576
  - ixgbe
    test_mqbpf: 4772.924974 Mbps
    test_sqbpf: 3173.19967133333 Mbps
  - igb
    test_mqbpf: 931.217345 Mbps
    test_sqbpf: 925.965270666667 Mbps

- Benchmark5
  Benchmark5 writes packets to mfs, using the following commands, with zerocopy:
  mdmfs -s 10G md /mnt
  ./test_sqbpf -i [interface] -b 1048576 -w -f /mnt/test
  ./test_mqbpf -i [interface] -b 1048576 -w -f /mnt/test
  - ixgbe
    test_mqbpf: 306.902822333333 Mbps
    test_sqbpf: 317.605016666667 Mbps
  - igb
    test_mqbpf: 729.075349666667 Mbps
    test_sqbpf: 708.987822666667 Mbps

- Benchmark6
  Benchmark6 writes packets to zfs (on HDD), using the following commands, with zerocopy:
  ./test_sqbpf -i [interface] -b 1048576 -w -f test
  ./test_mqbpf -i [interface] -b 1048576 -w -f test
  - ixgbe
    test_mqbpf: 174.016136666667 Mbps
    test_sqbpf: 138.068732666667 Mbps
  - igb
    test_mqbpf: 228.794880333333 Mbps
    test_sqbpf: 229.367386333333 Mbps

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "[hidden email]"
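As a reading aid for the ixgbe numbers above, the relative gain of test_mqbpf over test_sqbpf can be worked out per benchmark. The sketch below is not part of the patch or of the test programs: the `IXGBE_MBPS` table and the `speedup` helper are made up for illustration, and the values are the posted ixgbe results rounded to two decimals.

```python
# ixgbe throughput in Mbps as (test_mqbpf, test_sqbpf), rounded from the post.
IXGBE_MBPS = {
    "benchmark1": (5303.09, 3959.83),
    "benchmark2": (1061.25, 204.78),
    "benchmark3": (119.91, 101.20),
    "benchmark4": (4772.92, 3173.20),
    "benchmark5": (306.90, 317.61),
    "benchmark6": (174.02, 138.07),
}


def speedup(mq_mbps, sq_mbps):
    """Ratio of multiqueue to single-queue throughput (hypothetical helper)."""
    return mq_mbps / sq_mbps


if __name__ == "__main__":
    for name, (mq, sq) in IXGBE_MBPS.items():
        print(f"{name}: test_mqbpf / test_sqbpf = {speedup(mq, sq):.2f}x")
```

With these inputs the ratio is above 1 for every ixgbe benchmark except benchmark5 (mfs with zerocopy), where test_mqbpf comes out slightly below test_sqbpf.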
|
On Aug 16, 2011, at 11:13 AM, Takuya ASADA wrote:
> following sysctl parameter is changed
> sysctl -w net.bpf.maxbufsize=1048576

Thank you for your work! You may want to increase that (4x/8x) and rerun the test, though.
|
On Aug 16, 2011, at 11:50 AM, Vlad Galu wrote:
> Thank you for your work! You may want to increase that (4x/8x) and rerun the test, though.

More, actually. Your current buffer is easily filled.
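To put the buffer-size remark in numbers, here is a rough sketch. It assumes traffic arrives at the full 10 Gbit/s line rate and that the reader does not drain the buffer at all, which is a worst case and not a measurement. The name `fill_time_ms` is made up for this illustration.

```python
LINE_RATE_10GBE = 10e9  # assumed: full 10 Gbit/s, framing overhead ignored


def fill_time_ms(buffer_bytes, link_bits_per_sec):
    """Milliseconds until a buffer of buffer_bytes is full at the given rate."""
    return buffer_bytes * 8 / link_bits_per_sec * 1000.0


if __name__ == "__main__":
    base = 1048576  # the net.bpf.maxbufsize value used in the first run
    for factor in (1, 4, 8):
        size = base * factor
        print(f"{size} bytes: {fill_time_ms(size, LINE_RATE_10GBE):.2f} ms")
```

Under those assumptions a 1048576-byte buffer fills in under a millisecond, and the 4x/8x sizes in a few milliseconds.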
|
2011/8/16 Vlad Galu <[hidden email]>:
> More, actually. Your current buffer is easily filled.

Hi,

I measured performance again with maxbufsize = 268435456 and multiple
CPU configurations; here are the results.
It seems the performance on 10GbE is a bit unstable, not scaling
linearly as CPUs/queues are added.
Maybe it depends on some system parameter, but I haven't figured out
the answer.

Multithreaded BPF performance is higher than single-threaded BPF in
all cases, anyway.

* Test environment
- FreeBSD node
  CPU: Core i7 X980 (12 threads)
  # Tested on 1 core, 2 core, 4 core and 6 core configurations (each
  core has 2 threads using HT)
  MB: ASUS P6X58D Premium (Intel X58)
  NIC: Intel Ethernet X520-DA2 Server Adapter (82599)

- Linux node
  CPU: Core 2 Quad (4 threads)
  MB: GIGABYTE GA-G33-DS3R (Intel G33)
  NIC: Intel Ethernet X520-DA2 Server Adapter (82599)

- iperf
  Linux node: iperf -c [IP] -i 10 -t 100000 -P16
  FreeBSD node: iperf -s
  # 16 threads, TCP
- system parameter
  net.bpf.maxbufsize=268435456
  hw.ixgbe.num_queues=[n queues]

* 2threads, 2queues
- iperf throughput
  iperf only: 8.845Gbps
  test_mqbpf: 5.78Gbps
  test_sqbpf: 6.89Gbps
- test program throughput
  test_mqbpf: 4526.863414 Mbps
  test_sqbpf: 762.452475 Mbps
- received/dropped
  test_mqbpf:
    45315011 packets received (BPF)
    9646958 packets dropped (BPF)
  test_sqbpf:
    56216145 packets received (BPF)
    49765127 packets dropped (BPF)

* 4threads, 4queues
- iperf throughput
  iperf only: 3.03Gbps
  test_mqbpf: 2.49Gbps
  test_sqbpf: 2.57Gbps
- test program throughput
  test_mqbpf: 2420.195051 Mbps
  test_sqbpf: 430.774870 Mbps
- received/dropped
  test_mqbpf:
    19601503 packets received (BPF)
    0 packets dropped (BPF)
  test_sqbpf:
    22803778 packets received (BPF)
    18869653 packets dropped (BPF)

* 8threads, 8queues
- iperf throughput
  iperf only: 5.80Gbps
  test_mqbpf: 4.42Gbps
  test_sqbpf: 4.30Gbps
- test program throughput
  test_mqbpf: 4242.314913 Mbps
  test_sqbpf: 1291.719866 Mbps
- received/dropped
  test_mqbpf:
    34996953 packets received (BPF)
    361947 packets dropped (BPF)
  test_sqbpf:
    35738058 packets received (BPF)
    24749546 packets dropped (BPF)

* 12threads, 12queues
- iperf throughput
  iperf only: 9.31Gbps
  test_mqbpf: 8.06Gbps
  test_sqbpf: 5.67Gbps
- test program throughput
  test_mqbpf: 8089.242472 Mbps
  test_sqbpf: 5754.910665 Mbps
- received/dropped
  test_mqbpf:
    73783957 packets received (BPF)
    9938 packets dropped (BPF)
  test_sqbpf:
    49434479 packets received (BPF)
    0 packets dropped (BPF)
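The received/dropped counters above are easier to compare as a drop percentage. The sketch below is only a reading aid: it assumes the "received" counter includes packets that were later dropped (if the test programs report the counters differently, the ratio would need adjusting), and the `COUNTERS` table and `drop_percent` helper are names invented here, not part of test_mqbpf or test_sqbpf.

```python
# (received, dropped) per run, copied from the results above.
COUNTERS = {
    ("2 queues", "test_mqbpf"): (45315011, 9646958),
    ("2 queues", "test_sqbpf"): (56216145, 49765127),
    ("4 queues", "test_mqbpf"): (19601503, 0),
    ("4 queues", "test_sqbpf"): (22803778, 18869653),
    ("8 queues", "test_mqbpf"): (34996953, 361947),
    ("8 queues", "test_sqbpf"): (35738058, 24749546),
    ("12 queues", "test_mqbpf"): (73783957, 9938),
    ("12 queues", "test_sqbpf"): (49434479, 0),
}


def drop_percent(received, dropped):
    """Dropped packets as a percentage of received packets (hypothetical helper)."""
    if received == 0:
        return 0.0  # nothing received, so nothing to report
    return dropped / received * 100.0


if __name__ == "__main__":
    for (queues, program), (received, dropped) in COUNTERS.items():
        print(f"{queues:>9} {program}: {drop_percent(received, dropped):6.2f}% dropped")
```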
|
Any comments or suggestions?
|
On Aug 19, 2011, at 04:21 , Takuya ASADA wrote:
> Any comments or suggestions?

One comment, one question.

First, I think we should try to integrate this work and then tune it up more. The API is, I think, fine, and performance tuning takes a bit of work.

Second, what are the parameters set on buffers for the drivers? I.e. how many slots do they have in their queues etc.? If the defaults are too small, and often they are, then that's going to hurt your performance.

Best,
George
|
Sorry for the late reply,

> One comment, one question.
>
> First, I think we should try to integrate this work and then tune it up more. The API
> is, I think, fine, and performance tuning takes a bit of work.

Is there a good way (I mean tools or something) to find the bottleneck?

> Second, what are the parameters set on buffers for the drivers? I.e. how many slots
> do they have in their queues etc.? If the defaults are too small, and often they are,
> then that's going to hurt your performance.

That equals the number of descriptors per queue, right?
If I'm correct, it's 2048 descriptors per queue by default, and I used
the default parameters when I ran the benchmarks.

It's on line 290 of
http://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/soc2011/mq_bpf/src/sys/dev/ixgbe/ixgbe.c&REV=2

and line 105 of
http://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/soc2011/mq_bpf/src/sys/dev/ixgbe/ixgbe.h&REV=2
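To give a feel for what 2048 descriptors per queue means at 10GbE, here is a back-of-the-envelope sketch. It assumes full-size frames with a 1500-byte payload plus 38 bytes of Ethernet overhead on the wire (header, FCS, preamble and inter-frame gap), traffic at line rate, and no descriptors being recycled while the ring fills. That is a worst case for illustration, not a description of how the driver behaves, and the helper names are hypothetical.

```python
WIRE_OVERHEAD_BYTES = 38  # assumed: header 14 + FCS 4 + preamble/SFD 8 + IFG 12


def packets_per_second(link_bits_per_sec, payload_bytes):
    """Frames per second at line rate for the given payload size."""
    return link_bits_per_sec / ((payload_bytes + WIRE_OVERHEAD_BYTES) * 8)


def ring_fill_time_ms(descriptors, pps):
    """Milliseconds until `descriptors` slots are used up at `pps` packets/s."""
    return descriptors / pps * 1000.0


if __name__ == "__main__":
    pps = packets_per_second(10e9, 1500)
    print(f"line rate: {pps:.0f} packets/s")
    # All traffic landing on one queue versus spread evenly over 12 queues.
    print(f"1 queue:   {ring_fill_time_ms(2048, pps):.2f} ms to fill 2048 slots")
    print(f"12 queues: {ring_fill_time_ms(2048, pps / 12):.2f} ms per queue")
```

Smaller packets raise the packet rate and shorten these times accordingly, so the figures only bound the large-packet case.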
|
Hi,
My previous mail was probably skipped or forgotten, so I'd like to bring this up again.
# This is the original post of this thread, in case you don't remember what this is about:
http://lists.freebsd.org/pipermail/freebsd-net/2011-August/029585.html

George said "I think we should try to integrate this work and then tune it up more." in a previous mail, so I'd like to merge this now.
Is there any additional work required before merging, or is it fine as is?
